October 15, 2024

RandomizedSearchCV: An automated way of improving your model’s performance

To get the best performance out of a model, we need to fine-tune it with different values of its hyperparameters. Doing this by hand can be a daunting task. Fortunately, Scikit-Learn provides tools that help us do that. The general idea is to try out multiple values (either from a given set of values or from a range of values), compare the scores for those values, and then choose the one with the best score.

Preparing Data

Fetch the data

We use fetch_openml to get the MNIST dataset.

import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml(name='mnist_784', 
                     version=1)

X = mnist['data']
y = mnist['target'].astype(np.uint8)

Splitting the dataset into Training and Test dataset

The MNIST dataset is already arranged into training and test portions: the first 60000 rows are the training set (already shuffled) and the remaining 10000 rows are the test set.

X_train = X[:60000]
X_test = X[60000:]
y_train = y[:60000]
y_test = y[60000:]

LinearSVC Model as our base model

Implementing LinearSVC Model and checking its performance

Importing Libraries:

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

Initializing the model, fitting it on our data, and checking its accuracy on the training set:

lin_svc = LinearSVC()
lin_svc.fit(X_train, y_train)

y_pred = lin_svc.predict(X_train)
acc_scr_lin_svc = accuracy_score(y_train, y_pred)
print(f'LinearSVC Accuracy Score: {round(acc_scr_lin_svc*100, 2)}%')

Output:

LinearSVC Accuracy Score: 86.81%

This baseline accuracy (measured on the training set) is not that bad, but it leaves enough room to see an improvement when we fine-tune the model. So we will proceed with the LinearSVC model.

Please note: the purpose of this article is not to find the optimal solution, but to go through the exercise of fine-tuning the model.

Before we proceed to the fine-tuning, let’s quickly create a pipeline that scales the data before fitting the model.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('lin_svc', LinearSVC())
])

RandomizedSearchCV

The idea behind RandomizedSearchCV is:

  • We provide a set or range of values for each hyperparameter (param_distributions)
  • The search randomly samples a fixed number of combinations (n_iter, 10 by default) and computes a cross-validated score for each sampled combination
  • It then selects the combination which has the highest score
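
The difference between trying every combination and sampling a few can be sketched in plain Python. This is only an illustration of the idea, not how Scikit-Learn implements it; the grid uses the same values as the param_distribution built later in this article, with shortened key names.

```python
import itertools
import random

# Illustrative grid (same values as the article's param_distribution,
# key names shortened for this sketch).
grid = {
    "C": list(range(1, 10)),
    "penalty": ["l2"],
    "tol": [1e-2, 1e-3, 1e-4, 1e-5],
}

# Every possible combination: what an exhaustive grid search would try.
keys = sorted(grid)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*(grid[k] for k in keys))]

# A random search evaluates only a fixed-size random sample of them.
n_iter = 10                 # RandomizedSearchCV's default number of candidates
rng = random.Random(0)      # seeded so the sample is repeatable
sampled = rng.sample(all_combos, n_iter)

print(f"{len(all_combos)} possible combinations, {len(sampled)} sampled")
# prints: 36 possible combinations, 10 sampled
```

This is why the search output further down reports 10 candidates rather than 36.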

Link to the original documentation

Understanding the hyperparameters of the model first

To create a param_distribution, we first need to understand the hyperparameters of the model.

LinearSVC hyperparameters:

A quick look at the documentation for LinearSVC shows that it provides quite a lot of hyperparameters, but let’s focus on some of them here:

  • penalty: {‘l1’, ‘l2’}
  • tol: float, default=1e-4
  • C: float, default=1.0

Creating the param_distribution for our RandomizedSearchCV

A param_distribution is either a dictionary or a list of dictionaries. Each dictionary maps a parameter name to the values (or the distribution) to sample from. Let’s start with a single dictionary, wrapped in a list so that more dictionaries can be added later.

To get the keys for the dictionary, we can use get_params().keys() on our classifier.

clf.get_params().keys()

Output:

dict_keys(['memory', 'steps', 'verbose', 'scaler', 'lin_svc', 'scaler__copy', 'scaler__with_mean', 'scaler__with_std', 'lin_svc__C', 'lin_svc__class_weight', 'lin_svc__dual', 'lin_svc__fit_intercept', 'lin_svc__intercept_scaling', 'lin_svc__loss', 'lin_svc__max_iter', 'lin_svc__multi_class', 'lin_svc__penalty', 'lin_svc__random_state', 'lin_svc__tol', 'lin_svc__verbose'])
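
The keys with a double underscore follow the pattern `<step name>__<parameter>`, which is how a pipeline addresses the parameters of one of its steps. A minimal sketch of that naming, using set_params on the same pipeline (the values 5 and 1e-3 are arbitrary, chosen only for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('lin_svc', LinearSVC())
])

# 'lin_svc__C' means: parameter C of the step named 'lin_svc'.
clf.set_params(lin_svc__C=5, lin_svc__tol=1e-3)

print(clf.named_steps['lin_svc'].C)    # 5
print(clf.named_steps['lin_svc'].tol)  # 0.001
```

RandomizedSearchCV sets the sampled values on the pipeline through these same names, so the dictionary keys must match them exactly.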

Creating the param_distribution list:

param_distribution = [
    {
        'lin_svc__C': [x for x in range(1, 10)],
        'lin_svc__penalty': ['l2'],
        'lin_svc__tol': [1e-2, 1e-3, 1e-4, 1e-5]
    }
]
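
The list above contains only discrete values. param_distributions also accepts objects with an rvs method, such as scipy.stats distributions, which lets the search draw from a continuous range. The sketch below is a variation on the article's dictionary, not a replacement for it: the name param_distribution_continuous and the range 1e-2 to 1e2 for C are assumptions made for illustration. ParameterSampler is used here only to preview what would be drawn.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Hypothetical variant: C is drawn from a log-uniform distribution
# (range chosen for illustration) instead of the integers 1..9.
param_distribution_continuous = {
    'lin_svc__C': loguniform(1e-2, 1e2),
    'lin_svc__penalty': ['l2'],
    'lin_svc__tol': [1e-2, 1e-3, 1e-4, 1e-5],
}

# Preview five candidates; random_state makes the draw repeatable.
samples = list(ParameterSampler(param_distribution_continuous,
                                n_iter=5, random_state=42))
for params in samples:
    print(params)
```

A log-uniform distribution is a common choice for regularization strengths such as C, because it spreads the samples evenly across orders of magnitude.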

Implementing RandomizedSearchCV:

Initializing:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(estimator=clf,
                                  param_distributions=param_distribution,
                                  cv=3,
                                  verbose=2)

Fitting RandomizedSearchCV on a subset of the training set to reduce the run time:

X_train_subset = X_train[:1000]
y_train_subset = y_train[:1000]

random_search.fit(X_train_subset, y_train_subset)

Output:

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.0001; total time=   1.1s
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.0001; total time=   1.0s
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.0001; total time=   0.8s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.6s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.3s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.1s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.7s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.7s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.6s
[CV] END lin_svc__C=9, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=9, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=9, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.3s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.3s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.6s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.2s
[CV] END lin_svc__C=1, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.1s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.7s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.7s
[CV] END lin_svc__C=4, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.5s
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.7s
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.6s
[CV] END lin_svc__C=6, lin_svc__penalty=l2, lin_svc__tol=0.001; total time=   0.5s
[CV] END lin_svc__C=5, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=5, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.4s
[CV] END lin_svc__C=5, lin_svc__penalty=l2, lin_svc__tol=0.01; total time=   0.3s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.5s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.3s
[CV] END lin_svc__C=2, lin_svc__penalty=l2, lin_svc__tol=1e-05; total time=   1.1s

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                             ('lin_svc', LinearSVC())]),
                   param_distributions=[{'lin_svc__C': [1, 2, 3, 4, 5, 6, 7, 8,
                                                        9],
                                         'lin_svc__penalty': ['l2'],
                                         'lin_svc__tol': [0.01, 0.001, 0.0001,
                                                          1e-05]}],
                   verbose=2)

Let’s take a look at the best estimator:

random_search.best_estimator_

Output:

Pipeline(steps=[('scaler', StandardScaler()),
                ('lin_svc', LinearSVC(C=1, tol=1e-05))])
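
Besides best_estimator_, a fitted search also exposes best_params_, best_score_ (the mean cross-validated score of the winner), and cv_results_ (the scores of every sampled candidate). The sketch below shows how to read them. To stay small and self-contained it makes several assumptions that differ from the article: it uses the built-in load_digits dataset instead of MNIST, only 4 candidates, a fixed random_state, and a larger max_iter to help the solver converge.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Small stand-in dataset (8x8 digits), not the article's MNIST data.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:300], y_small[:300]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lin_svc', LinearSVC(max_iter=5000))  # max_iter raised for this sketch
])

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions={
        'lin_svc__C': list(range(1, 10)),
        'lin_svc__tol': [1e-2, 1e-3, 1e-4, 1e-5],
    },
    n_iter=4,         # number of sampled candidates
    cv=3,
    random_state=0,   # makes the sampled candidates repeatable
)
search.fit(X_small, y_small)

print(search.best_params_)            # winning combination
print(round(search.best_score_, 3))   # its mean cross-validated accuracy

# Mean cross-validated score of every candidate that was tried.
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Comparing the candidates in cv_results_ shows how close the runners-up were, which helps judge whether the winning values matter much.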

Using the best estimator, we will now fit the full training set and see its performance:

random_search.best_estimator_.fit(X_train, y_train)

Let’s now check the accuracy of the best estimator on the training set:

y_pred = random_search.best_estimator_.predict(X_train)
rnd_src_acc_src = accuracy_score(y_train, y_pred)
print(f'Rnd Src Accuracy Score: {round(rnd_src_acc_src*100, 2)}%')

Output:

Rnd Src Accuracy Score: 92.11%

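As we can see, the training accuracy increased to 92.11% with only a small search over a few hyperparameters. Note that the tuned pipeline also includes a StandardScaler, which the baseline did not, so part of the gain likely comes from scaling the data. We could widen the search or tune other parameters as well.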
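
Both scores in this article are computed on the data the model was trained on, which tends to be optimistic. A held-out check gives a fairer estimate. The sketch below illustrates the pattern under stated assumptions: it uses the built-in load_digits dataset with its own train/test split as a stand-in, and a larger max_iter to help convergence. With the article's data, the same two accuracy_score calls would be applied to (X_train, y_train) and (X_test, y_test).

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with a stratified hold-out split (not the MNIST split).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Same hyperparameters as the best estimator found above;
# max_iter is raised only for this sketch.
model = Pipeline([
    ('scaler', StandardScaler()),
    ('lin_svc', LinearSVC(C=1, tol=1e-5, max_iter=5000))
])
model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f'Train accuracy: {round(train_acc*100, 2)}%')
print(f'Test accuracy:  {round(test_acc*100, 2)}%')
```

A large gap between the two numbers would suggest overfitting, which a training-set score alone cannot reveal.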

Conclusion

Fine-tuning a model is an important step in your Machine Learning process. Before fine-tuning, we should first narrow the candidates down to a few models using performance metrics, and then fine-tune those to gain a performance boost.

