October 16, 2024

Evaluating Regression Models: Improving your model’s efficiency


In this article, I will go over the various evaluation metrics available for a regression model, along with the advantages and disadvantages of each. Please note, this article isn’t about the in-depth mathematics behind these metrics; instead, it focuses on their application.

The evaluation metrics which we will cover are:

  • R Squared / Adjusted R Squared
  • Mean Squared Error (MSE) / Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)

Preparing our data

Before we jump into the metrics, let’s quickly import our data, do a little cleanup, and fit the data to a linear regression model. We will then evaluate that model’s performance using the metrics mentioned above.

# Importing Libraries
from sklearn.datasets import fetch_california_housing

# Importing Data
X, y = fetch_california_housing(as_frame=True, return_X_y=True)

# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Pre-processing the data
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Creating pre-process pipeline
preprocessing = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler()),
])

X_train_processed = preprocessing.fit_transform(X_train)

# Simple Linear Regression
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train_processed, y_train)

# Predictions
y_pred = lin_reg.predict(X_train_processed)

The above code:

  • Fetches the data
  • Splits the data into training and test set
  • Pre-processes the data
  • Fits the data into a linear regression model
  • Predicts the result using the linear regression model

Please Note: This article focuses only on the evaluation metrics, not on the steps above. At the time of writing, I also have a separate article about splitting the data.
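One detail worth calling out in the code above: the pipeline is fitted on the training split only, and the held-out test split should later be passed through `transform` (not `fit_transform`) before scoring. The sketch below illustrates that pattern; it uses a small synthetic dataset as a stand-in for the housing data so it runs without downloading anything, and the variable names mirror the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 200 samples, 3 features, a known linear signal
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2)

preprocessing = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scale', StandardScaler()),
])

# fit_transform on the training split only...
X_train_processed = preprocessing.fit_transform(X_train)
# ...then transform (without re-fitting) on the test split
X_test_processed = preprocessing.transform(X_test)

lin_reg = LinearRegression().fit(X_train_processed, y_train)
y_test_pred = lin_reg.predict(X_test_processed)
print(y_test_pred.shape)  # (40,)
```

Re-fitting the scaler on the test split would leak test-set statistics into the evaluation, which is why `transform` alone is used there.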


Evaluating the model

R Square/Adjusted R Square

  • R Square measures how much of the variability in the dependent variable can be explained by the model; it’s the square of the correlation coefficient
    • It’s a good measure of how well the model fits the dependent variable
    • It doesn’t take the overfitting problem into consideration
    • The best possible score is 1
  • Adjusted R Square penalises additional independent variables added to the model and adjusts the metric to guard against overfitting
from sklearn.metrics import r2_score

predicted_r2_score = r2_score(y_train, y_pred)
print(f'R2 score of predicted values: {predicted_r2_score}')

Output:

R2 score of predicted values: 0.6125511913966952

As we can see, about 61% of the variability in the dependent variable can be explained by the model.
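Scikit-learn does not ship an adjusted R Square function, but it follows directly from R Square, the number of samples n, and the number of independent variables p. A minimal sketch, using small made-up values purely for illustration:

```python
from sklearn.metrics import r2_score

# Hypothetical toy data: 5 samples, predictions from a model with 2 predictors
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]
n, p = 5, 2  # number of samples, number of independent variables

r2 = r2_score(y_true, y_pred)
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'R2: {r2:.4f}, Adjusted R2: {adj_r2:.4f}')
```

Adjusted R Square is always at most R Square, and the gap widens as more predictors are added relative to the sample size, which is exactly the penalisation described above.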


Mean Squared Error (MSE) / Root Mean Squared Error (RMSE)

  • Mean Squared Error
    • It is the sum of squared prediction errors (the actual output minus the predicted output) divided by the number of data points
    • It gives an absolute number indicating how far the predicted results deviate from the actual values
    • On its own it doesn’t provide much insight, but it is a good metric for comparing different models
    • It penalises large prediction errors more heavily
  • Root Mean Squared Error
    • It’s the square root of MSE
    • It is more commonly used than MSE because it is in the same units as the target variable
from sklearn.metrics import mean_squared_error
import numpy as np

predicted_mse = mean_squared_error(y_train, y_pred)
print(f'Predicted MSE: {predicted_mse}')

predicted_rmse = np.sqrt(predicted_mse)
print(f'Predicted RMSE: {predicted_rmse}')

Output:

Predicted MSE: 0.5179331255246699
Predicted RMSE: 0.7196757085831575

Mean Absolute Error (MAE)

  • It is similar to MSE. The only difference is that it takes the mean of the absolute values of the errors instead of the squared errors
  • Compared to MSE or RMSE, it is a more direct representation of the average error
  • It treats all errors the same, making it less sensitive to outliers than MSE
from sklearn.metrics import mean_absolute_error

predicted_mae = mean_absolute_error(y_train, y_pred)
print(f'Predicted MAE: {predicted_mae}')

Output:

Predicted MAE: 0.5286283596581934

Conclusion

There are other evaluation metrics as well, such as explained variance, max error, and root mean squared log error. I have only discussed the ones above since they are the most common and widely used.
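For completeness, the metrics just mentioned are also available in scikit-learn. A quick sketch on small made-up values (RMSLE is computed here as the square root of `mean_squared_log_error`, since the dedicated helper only exists in newer scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, max_error,
                             mean_squared_log_error)

# Hypothetical toy values for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

print(explained_variance_score(y_true, y_pred))
print(max_error(y_true, y_pred))  # worst single error: 1.0
# RMSLE; note it requires non-negative values
print(np.sqrt(mean_squared_log_error(y_true, y_pred)))
```

Each of these answers a slightly different question: explained variance ignores systematic bias in the predictions, max error reports the single worst-case miss, and RMSLE penalises relative (percentage-like) errors rather than absolute ones.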

