October 14, 2024

Handling Null Values in PySpark: Using dropna() and fillna() methods


In this article, I will go over two different methods for handling null values in PySpark:

  • dropna() -> used for dropping the null values
  • fillna() -> used for filling up the null values

Importing Data

Let’s quickly import a custom dataset and dive right into the two methods.

df = spark.read.csv(
    './data/custom_data/HandlingNulls.csv',
    header=True,
    inferSchema=True
)
>>> df.show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
|   null|  28|      null|  null|
+-------+----+----------+------+

As we can see,

  • There are some rows with one null value and some with more than one null value.
  • Also, there are null values in the string columns as well as numeric columns.

Dropping Null Values

As discussed earlier, we can use the dropna() method to drop null values. This method takes three parameters:

  • how (str, optional): If “any” is selected, PySpark will drop records where at least one of the fields is null. In the case of “all”, only the records where all fields are null will be removed. By default, PySpark uses the “any” mode.
  • thresh (int, optional): default None. If set, PySpark ignores the how parameter and drops only the records with fewer than thresh non-null values.
  • subset (str, tuple or list, optional): the column names that dropna() will use to make its decision; other columns are ignored.

Let’s go over the different parameters and use-cases for the dropna() method.

# Default
>>> df.dropna().show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
+-------+---+----------+------+

# Dropping rows with null values in "Name" column
>>> df.dropna(subset='Name').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
+-------+----+----------+------+

# Dropping rows with null values in "Name" and "Salary" columns
>>> df.dropna(subset=['Name', 'Salary']).show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
+-------+----+----------+------+

# Keep rows with at least 2 non-null values
>>> df.dropna(thresh=2).show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
+-------+----+----------+------+

Filling Null Values

We can use the fillna() method for filling null values. It takes two parameters.

  • value (int, float, string, bool or dict): value to replace null values with. If the value is a dict, then the subset is ignored and the value must be a mapping from column name (string) to replacement value. The replacement value must be an int, float, boolean, or string.
  • subset (str, tuple or list, optional): an optional list of column names to consider. Columns specified in the subset that do not have a matching data type are ignored. For example, if the value is a string and the subset contains a non-string column, then the non-string column is simply ignored.

Let’s look at some examples.

# Default -> String Replacement
>>> df.fillna('NA').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|     NA|  32|         5| 50000|
|     NA|  28|      null|  null|
+-------+----+----------+------+

# Default -> Numeric Replacement
>>> df.fillna(0).show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet|  0|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar|  0|         0| 70000|
|   null| 32|         5| 50000|
|   null| 28|         0|     0|
+-------+---+----------+------+

# Chain fillna() calls to fill string and numeric columns with different values
>>> df.fillna('Missing Value').fillna(0).show()
+-------------+---+----------+------+
|         Name|Age|Experience|Salary|
+-------------+---+----------+------+
|        Preet|  0|         5| 95000|
|       Prapti| 30|         3| 85000|
|        Kajal| 50|         0| 20000|
|      Kaushik| 52|        23|100000|
|        Viraj| 31|         3| 50000|
|        Sagar|  0|         0| 70000|
|Missing Value| 32|         5| 50000|
|Missing Value| 28|         0|     0|
+-------------+---+----------+------+

# Fill null values in the Age column with 10 and in the Salary column with 10000
>>> df.fillna(10, 'Age').fillna(10000, 'Salary').show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet| 10|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar| 10|      null| 70000|
|   null| 32|         5| 50000|
|   null| 28|      null| 10000|
+-------+---+----------+------+

# Same result using a dict mapping column names to fill values
>>> df.fillna({
...     'Age': 0,
...     'Salary': 10000
... }).show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet|  0|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar|  0|      null| 70000|
|   null| 32|         5| 50000|
|   null| 28|      null| 10000|
+-------+---+----------+------+

# Fill null values in the Salary column with the column's average

>>> # Let's find the average of Salary column first
>>> avg_salary = df.agg({'Salary': 'avg'}).collect()[0][0]

>>> # Let's use this to fill the null values in Salary column
>>> df.fillna(avg_salary, 'Salary').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
|   null|  28|      null| 67142|
+-------+----+----------+------+

We can also use user-defined functions to replace null values, but more on that in another article.


Hopefully, this article helps you understand how you can handle null values in PySpark.

