Facing a dataset with missing values is common in almost any project. Let’s take a look at how we can tackle such a situation using the SimpleImputer class provided by Scikit-Learn.
Looking at the data
For this article, I will be using the famous California House Price Dataset. Let’s take a look at the data and some of its properties.
>>> housing_data.head()
longitude latitude housing_median_age ... households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 ... 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 ... 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 ... 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 ... 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 ... 259.0 3.8462 342200.0 NEAR BAY
>>> housing_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
- There are a total of 20,640 rows and 10 columns.
- All the columns except ocean_proximity are of numeric data type.
- The total_bedrooms column contains some null values (20,433 non-null out of 20,640).
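A quick way to confirm the last observation is to count nulls per column with isnull().sum(). A minimal sketch on a toy frame (the values here are made up for illustration, not the actual housing data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure: one column has a missing entry
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Number of null values in each column
print(df.isnull().sum())
```

Running the same call on housing_data would report 207 nulls in total_bedrooms (20,640 − 20,433).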
Missing Data
Having missing data is pretty common in practical projects; there is no such thing as perfect data. When we encounter missing values in our data, we can:
- Get rid of the corresponding row.
- Get rid of the entire column.
- Replace the null values with some value (zero, mean, median, etc.)
The first two options are pretty straightforward but might not be the solution you would opt for. For replacing the null values, we could write code to handle it manually, or we could use the handy SimpleImputer class provided by Scikit-Learn.
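All three options above can be sketched in plain pandas on a toy frame (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

dropped_rows = df.dropna(axis=0)   # option 1: drop rows containing nulls
dropped_cols = df.dropna(axis=1)   # option 2: drop columns containing nulls
filled = df.fillna(df.mean())      # option 3: fill each column's nulls with its mean
```

SimpleImputer essentially automates option 3, while also remembering the fill values so they can be reused on new data.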
General Syntax of SimpleImputer
Let’s look at the parameters, attributes, and methods available for SimpleImputer.
You can find the complete documentation here.
Parameters | Description |
---|---|
missing_values | int, float, str, np.nan or None, default=np.nan The placeholder for the missing values. All occurrences of missing_values will be imputed. |
strategy | str, default=’mean’ The imputation strategy: – If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data. – If “median”, then replace missing values using the median along each column. Can only be used with numeric data. – If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned. – If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data. |
fill_value | str or numerical value, default=None When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types. |
Attributes | Description |
---|---|
statistics_ | ndarray of shape (n_features,) The imputation fill value for each feature. |
n_features_in_ | int Number of features seen during fit. |
feature_names_in_ | ndarray of shape (n_features_in_,) Names of features seen during fit. Defined only when X has feature names that are all strings. |
Methods | Description |
---|---|
fit(X) | Fit the imputer on X. |
fit_transform(X) | Fit to data, then transform it. |
get_params() | Get parameters for this estimator. |
inverse_transform(X) | Convert the data back to the original representation. |
set_params(**params) | Set the parameters of this estimator. |
transform(X) | Impute all missing values in X. |
Implementing SimpleImputer
Using SimpleImputer can be broken down into three steps:
- Create a SimpleImputer instance with the appropriate arguments.
- Fit the instance to the desired data.
- Transform the data.
For the simplicity of this article, we will impute only the numeric columns. So let’s remove the one categorical column first.
housing_data_numeric = housing_data.drop(labels=['ocean_proximity'], axis=1)
>>> housing_data_numeric.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
Now let’s define a SimpleImputer instance that will replace the missing values with the mean of their column.
# Importing the class
from sklearn.impute import SimpleImputer
# Defining SimpleImputer
imputer = SimpleImputer(strategy='mean')
# Fitting the SimpleImputer
imputer.fit(housing_data_numeric)
Before we proceed, let’s quickly compare the statistics_ of our instance with the actual means of the columns to confirm that they are correct.
>>> imputer.statistics_
array([-1.19569704e+02, 3.56318614e+01, 2.86394864e+01, 2.63576308e+03,
5.37870553e+02, 1.42547674e+03, 4.99539680e+02, 3.87067100e+00,
2.06855817e+05])
>>> housing_data_numeric.mean()
longitude -119.569704
latitude 35.631861
housing_median_age 28.639486
total_rooms 2635.763081
total_bedrooms 537.870553
population 1425.476744
households 499.539680
median_income 3.870671
median_house_value 206855.816909
dtype: float64
Okay, so both sets of mean values match, which is a good sign.
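Instead of eyeballing the two outputs, the same check can be done programmatically with np.allclose. A sketch on a toy frame (made-up values, not the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, 6.0]})
imputer = SimpleImputer(strategy='mean').fit(df)

# statistics_ should equal the per-column means computed by pandas
# (both skip NaNs when averaging)
assert np.allclose(imputer.statistics_, df.mean().to_numpy())
```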
Let’s now transform our data using the SimpleImputer instance we defined.
imputed_data = imputer.transform(housing_data_numeric)
>>> imputed_data
array([[-1.2223e+02, 3.7880e+01, 4.1000e+01, ..., 1.2600e+02,
8.3252e+00, 4.5260e+05],
[-1.2222e+02, 3.7860e+01, 2.1000e+01, ..., 1.1380e+03,
8.3014e+00, 3.5850e+05],
[-1.2224e+02, 3.7850e+01, 5.2000e+01, ..., 1.7700e+02,
7.2574e+00, 3.5210e+05],
...,
[-1.2122e+02, 3.9430e+01, 1.7000e+01, ..., 4.3300e+02,
1.7000e+00, 9.2300e+04],
[-1.2132e+02, 3.9430e+01, 1.8000e+01, ..., 3.4900e+02,
1.8672e+00, 8.4700e+04],
[-1.2124e+02, 3.9370e+01, 1.6000e+01, ..., 5.3000e+02,
2.3886e+00, 8.9400e+04]])
The transform() method returns the data as a NumPy array (or a sparse matrix). To view the result in a more readable form, let’s convert the output to a DataFrame.
import pandas as pd

imputed_df = pd.DataFrame(
    imputed_data,
    columns=housing_data_numeric.columns,
    index=housing_data_numeric.index
)
>>> imputed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20640 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
Okay, we can see that there are no null values left in the total_bedrooms column.
We can also fit and transform the data in a single step using the fit_transform() method.
imputed_data = imputer.fit_transform(housing_data_numeric)
>>> imputed_data
array([[-1.2223e+02, 3.7880e+01, 4.1000e+01, ..., 1.2600e+02,
8.3252e+00, 4.5260e+05],
[-1.2222e+02, 3.7860e+01, 2.1000e+01, ..., 1.1380e+03,
8.3014e+00, 3.5850e+05],
[-1.2224e+02, 3.7850e+01, 5.2000e+01, ..., 1.7700e+02,
7.2574e+00, 3.5210e+05],
...,
[-1.2122e+02, 3.9430e+01, 1.7000e+01, ..., 4.3300e+02,
1.7000e+00, 9.2300e+04],
[-1.2132e+02, 3.9430e+01, 1.8000e+01, ..., 3.4900e+02,
1.8672e+00, 8.4700e+04],
[-1.2124e+02, 3.9370e+01, 1.6000e+01, ..., 5.3000e+02,
2.3886e+00, 8.9400e+04]])
Choosing the correct strategy
You can argue about which strategy is better, mean or median. We can always run a Randomized Search or Grid Search to see which strategy leads to better performance. As a general rule, though, the mean is sensitive to outliers whereas the median is not. So if your data contains many outliers, the median is usually the safer choice.
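The grid-search idea can be sketched by treating the imputation strategy as a hyperparameter inside a Pipeline. The data below is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with ~10% of entries knocked out
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("model", LinearRegression()),
])

# Let cross-validation pick the better imputation strategy
search = GridSearchCV(pipe, {"imputer__strategy": ["mean", "median"]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the imputer lives inside the pipeline, its statistics are recomputed on each training fold, avoiding leakage from the validation data.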