October 16, 2024

SimpleImputer: Replacing null values for Machine Learning Projects


Facing a dataset with missing values is common in almost any project. Let’s take a look at how we can tackle such a situation using the SimpleImputer class provided by Scikit-Learn.


Looking at the data

For this article, I will be using the famous California House Price Dataset. Let’s take a look at the data and some of its properties.

>>> housing_data.head()
   longitude  latitude  housing_median_age   ...  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0   ...       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0   ...      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0   ...       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0   ...       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0   ...       259.0         3.8462            342200.0         NEAR BAY

>>> housing_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
  • There are a total of 20,640 rows and 10 columns.
  • All the columns except ocean_proximity are of numeric data type.
  • The total_bedrooms column contains some null values (20,433 non-null out of 20,640).

Missing Data

Having missing data in a dataset is pretty common in practical projects; there is no such thing as perfect data. When we encounter missing values, we could:

  • Get rid of the corresponding row.
  • Get rid of the entire column.
  • Replace the null values with some value (zero, mean, median, etc.)

The first two options are pretty straightforward but discard data you might want to keep, so they may not be the solution you opt for. For replacing the null values, we could write the code manually, or we could use a handy class provided by Scikit-Learn: SimpleImputer.
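The three options above map directly onto pandas operations. Here is a minimal sketch on a tiny hypothetical frame (the column names mimic the housing data, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (hypothetical values).
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Option 1: get rid of rows that contain any null value.
drop_rows = df.dropna(axis=0)

# Option 2: get rid of the entire column that contains nulls.
drop_col = df.drop(columns=["total_bedrooms"])

# Option 3: replace the nulls with some value, e.g. the column median.
filled = df.fillna({"total_bedrooms": df["total_bedrooms"].median()})
```

SimpleImputer automates option 3 and, crucially, remembers the fill values so they can be reused on new data.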


General Syntax of SimpleImputer

Let’s look at the parameters, attributes, and methods available for SimpleImputer.
You can find the complete documentation here.

Parameters for SimpleImputer

missing_values : int, float, str, np.nan or None, default=np.nan
    The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy : str, default='mean'
    The imputation strategy:
    – "mean": replace missing values using the mean along each column. Can only be used with numeric data.
    – "median": replace missing values using the median along each column. Can only be used with numeric data.
    – "most_frequent": replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
    – "constant": replace missing values with fill_value. Can be used with strings or numeric data.

fill_value : str or numerical value, default=None
    When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left at the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
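To make the last two parameters concrete, here is a quick sketch of the "constant" and "most_frequent" strategies on tiny made-up inputs:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# strategy='constant' with an explicit fill_value for numeric data.
num_imputer = SimpleImputer(strategy="constant", fill_value=0)
X_num = [[1.0, np.nan], [np.nan, 3.0]]
Xn = num_imputer.fit_transform(X_num)  # every null becomes 0

# strategy='most_frequent' also works on string/object data.
cat_imputer = SimpleImputer(strategy="most_frequent")
X_cat = np.array([["NEAR BAY"], [np.nan], ["NEAR BAY"], ["INLAND"]],
                 dtype=object)
Xc = cat_imputer.fit_transform(X_cat)  # the null becomes 'NEAR BAY'
```

Note that "most_frequent" is the only strategy (besides "constant") that you could apply to a column like ocean_proximity.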

Attributes for SimpleImputer

statistics_ : ndarray of shape (n_features,)
    The imputation fill value for each feature.

n_features_in_ : int
    Number of features seen during fit.

feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Defined only when X has feature names that are all strings.

Methods for SimpleImputer

fit(X)                  Fit the imputer on X.
fit_transform(X)        Fit to data, then transform it.
get_params()            Get parameters for this estimator.
inverse_transform(X)    Convert the data back to the original representation.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Impute all missing values in X.
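One method worth a note: inverse_transform() can only restore the original nulls if the imputer was created with add_indicator=True, because the extra indicator columns are what record where the nulls used to be. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# add_indicator=True appends a missing-indicator column per feature
# that had nulls at fit time; inverse_transform relies on it.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imputer.fit_transform(X)        # imputed values + indicator column
X_back = imputer.inverse_transform(Xt)  # the null comes back as np.nan
```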

Implementing SimpleImputer

Using SimpleImputer can be broken down into three steps:

  • Create a SimpleImputer instance with the appropriate arguments.
  • Fit the instance to the desired data.
  • Transform the data.
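Before applying this to the housing data, the three steps can be sketched on a tiny made-up array (using the median strategy here, just for variety):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 40.0]])

# Step 1: create the instance.
imputer = SimpleImputer(strategy="median")

# Step 2: fit it to the data (learns the per-column medians).
imputer.fit(X)

# Step 3: transform the data (nulls replaced by the learned medians).
X_imputed = imputer.transform(X)
```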

To keep this article simple, we will impute only the numeric columns, so let’s drop the one categorical column first.

housing_data_numeric = housing_data.drop(labels=['ocean_proximity'], axis=1)
>>> housing_data_numeric.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Now let’s define a SimpleImputer instance that will replace the missing values with the mean of each column.

# Importing the class
from sklearn.impute import SimpleImputer

# Defining SimpleImputer
imputer = SimpleImputer(strategy='mean')

# Fitting the SimpleImputer
imputer.fit(housing_data_numeric)

Before we proceed, let’s quickly compare the statistics_ of our instance with the actual column means to confirm that it is correct.

>>> imputer.statistics_
array([-1.19569704e+02,  3.56318614e+01,  2.86394864e+01,  2.63576308e+03,
        5.37870553e+02,  1.42547674e+03,  4.99539680e+02,  3.87067100e+00,
        2.06855817e+05])

>>> housing_data_numeric.mean()
longitude               -119.569704
latitude                  35.631861
housing_median_age        28.639486
total_rooms             2635.763081
total_bedrooms           537.870553
population              1425.476744
households               499.539680
median_income              3.870671
median_house_value    206855.816909
dtype: float64

Okay, so the mean values match, which is a good sign.


Let’s now transform our data using the SimpleImputer instance we defined.

imputed_data = imputer.transform(housing_data_numeric)
>>> imputed_data
array([[-1.2223e+02,  3.7880e+01,  4.1000e+01, ...,  1.2600e+02,
         8.3252e+00,  4.5260e+05],
       [-1.2222e+02,  3.7860e+01,  2.1000e+01, ...,  1.1380e+03,
         8.3014e+00,  3.5850e+05],
       [-1.2224e+02,  3.7850e+01,  5.2000e+01, ...,  1.7700e+02,
         7.2574e+00,  3.5210e+05],
       ...,
       [-1.2122e+02,  3.9430e+01,  1.7000e+01, ...,  4.3300e+02,
         1.7000e+00,  9.2300e+04],
       [-1.2132e+02,  3.9430e+01,  1.8000e+01, ...,  3.4900e+02,
         1.8672e+00,  8.4700e+04],
       [-1.2124e+02,  3.9370e+01,  1.6000e+01, ...,  5.3000e+02,
         2.3886e+00,  8.9400e+04]])

transform() returns the data as a NumPy array (or a sparse matrix for sparse input). To view the result in a more presentable way, let’s convert the output back to a DataFrame.


import pandas as pd

imputed_df = pd.DataFrame(
    imputed_data,
    columns=housing_data_numeric.columns,
    index=housing_data_numeric.index,
)
>>> imputed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Okay, we can see that there are no null values in our total_bedrooms column.


We can also fit and transform in one step by calling the fit_transform() method directly on our data.

imputed_data = imputer.fit_transform(housing_data_numeric)
>>> imputed_data
array([[-1.2223e+02,  3.7880e+01,  4.1000e+01, ...,  1.2600e+02,
         8.3252e+00,  4.5260e+05],
       [-1.2222e+02,  3.7860e+01,  2.1000e+01, ...,  1.1380e+03,
         8.3014e+00,  3.5850e+05],
       [-1.2224e+02,  3.7850e+01,  5.2000e+01, ...,  1.7700e+02,
         7.2574e+00,  3.5210e+05],
       ...,
       [-1.2122e+02,  3.9430e+01,  1.7000e+01, ...,  4.3300e+02,
         1.7000e+00,  9.2300e+04],
       [-1.2132e+02,  3.9430e+01,  1.8000e+01, ...,  3.4900e+02,
         1.8672e+00,  8.4700e+04],
       [-1.2124e+02,  3.9370e+01,  1.6000e+01, ...,  5.3000e+02,
         2.3886e+00,  8.9400e+04]])

Choosing the correct strategy

You can argue about which strategy is better, mean or median. We can always run a Randomized Search or Grid Search to see which strategy leads to better performance. But as a general rule, the mean is affected by outliers whereas the median is not, so if your data has many outliers, the median is the better option.
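Treating the strategy as just another hyperparameter is easy with a Pipeline. Here is a minimal sketch on a small synthetic regression problem (the data and the LinearRegression model are stand-ins, not part of the housing example):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Tiny synthetic regression problem with some nulls (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of the entries

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("model", LinearRegression()),
])

# Search over the imputation strategy like any other hyperparameter.
search = GridSearchCV(
    pipeline,
    param_grid={"imputer__strategy": ["mean", "median"]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Putting the imputer inside the Pipeline also ensures the fill values are learned only from the training folds, avoiding leakage during cross-validation.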


