October 14, 2024

Handling Null Values in PySpark: Using dropna() and fillna() methods


In this article, I will go over two different methods for handling null values in PySpark:

  • dropna() -> used for dropping the null values
  • fillna() -> used for filling up the null values

Importing Data

Let’s quickly import a custom dataset and dive right into the two methods.

df = spark.read.csv(
    './data/custom_data/HandlingNulls.csv',
    header=True,
    inferSchema=True
)
>>> df.show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
|   null|  28|      null|  null|
+-------+----+----------+------+

As we can see,

  • There are some rows with one null value and some with more than one null value.
  • Also, there are null values in the string columns as well as numeric columns.

Dropping Null Values

As discussed earlier, we can use the dropna() method to drop null values. This method takes three parameters:

  • how (str, optional): If “any” is selected, PySpark will drop records where at least one of the fields is null. In the case of “all”, only the records where all fields are null will be removed. By default, PySpark uses the “any” mode.
  • thresh (int, optional): default None. If set, PySpark ignores the how parameter and drops only the records with fewer than thresh non-null values.
  • subset (str, tuple or list, optional): the column names that dropna() will use to make its decision; other columns are ignored.

Let’s go over the different parameters and use-cases for the dropna() method.

# Default
>>> df.dropna().show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
+-------+---+----------+------+

# Dropping rows with null values in "Name" column
>>> df.dropna(subset='Name').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
+-------+----+----------+------+

# Dropping rows with null values in "Name" and "Salary" columns
>>> df.dropna(subset=['Name', 'Salary']).show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
+-------+----+----------+------+

# Keep rows with at least 2 non-null values
>>> df.dropna(thresh=2).show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
+-------+----+----------+------+

Filling Null Values

We can use the fillna() method for filling null values. It takes two parameters.

  • value (int, float, string, bool or dict): value to replace null values with. If the value is a dict, then the subset is ignored and the value must be a mapping from column name (string) to replacement value. The replacement value must be an int, float, boolean, or string.
  • subset (str, tuple or list, optional): an optional list of column names to consider. Columns specified in the subset that do not have a matching data type are ignored. For example, if the value is a string and the subset contains a non-string column, then the non-string column is simply ignored.

Let’s look at some examples.

# Default -> String Replacement
>>> df.fillna('NA').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|     NA|  32|         5| 50000|
|     NA|  28|      null|  null|
+-------+----+----------+------+

# Default -> Numeric Replacement
>>> df.fillna(0).show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet|  0|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar|  0|         0| 70000|
|   null| 32|         5| 50000|
|   null| 28|         0|     0|
+-------+---+----------+------+

# Chain fillna() calls to fill string and numeric columns with different values
>>> df.fillna('Missing Value').fillna(0).show()
+-------------+---+----------+------+
|         Name|Age|Experience|Salary|
+-------------+---+----------+------+
|        Preet|  0|         5| 95000|
|       Prapti| 30|         3| 85000|
|        Kajal| 50|         0| 20000|
|      Kaushik| 52|        23|100000|
|        Viraj| 31|         3| 50000|
|        Sagar|  0|         0| 70000|
|Missing Value| 32|         5| 50000|
|Missing Value| 28|         0|     0|
+-------------+---+----------+------+

# Fill null values in the Age column with 10 and in the Salary column with 10000
>>> df.fillna(10, 'Age').fillna(10000, 'Salary').show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet| 10|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar| 10|      null| 70000|
|   null| 32|         5| 50000|
|   null| 28|      null| 10000|
+-------+---+----------+------+

# Same result using a dict mapping column names to fill values
>>> df.fillna({
...     'Age': 0,
...     'Salary': 10000
... }).show()
+-------+---+----------+------+
|   Name|Age|Experience|Salary|
+-------+---+----------+------+
|  Preet|  0|         5| 95000|
| Prapti| 30|         3| 85000|
|  Kajal| 50|         0| 20000|
|Kaushik| 52|        23|100000|
|  Viraj| 31|         3| 50000|
|  Sagar|  0|      null| 70000|
|   null| 32|         5| 50000|
|   null| 28|      null| 10000|
+-------+---+----------+------+

# Fill null values in the Salary column with the column's average

>>> # Let's find the average of Salary column first
>>> avg_salary = df.agg({'Salary': 'avg'}).collect()[0][0]

>>> # Let's use this to fill the null values in Salary column
>>> df.fillna(avg_salary, 'Salary').show()
+-------+----+----------+------+
|   Name| Age|Experience|Salary|
+-------+----+----------+------+
|  Preet|null|         5| 95000|
| Prapti|  30|         3| 85000|
|  Kajal|  50|         0| 20000|
|Kaushik|  52|        23|100000|
|  Viraj|  31|         3| 50000|
|  Sagar|null|      null| 70000|
|   null|  32|         5| 50000|
|   null|  28|      null| 67142|
+-------+----+----------+------+

We can also use user-defined functions to replace null values, but more on that in another article.


Hopefully, this article helps you understand how you can handle null values in PySpark.

