October 14, 2024

Bagging and Pasting: Ensemble Learning using Scikit-Learn


One way to get a diverse set of classifiers for ensemble learning is to use very different training algorithms. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e. the most frequent prediction, as in hard voting) for classification, or the average for regression.
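As a minimal sketch of the aggregation step (the three predictors and their votes below are made up for illustration), a hard-voting ensemble simply takes the column-wise mode of the individual predictions:

import numpy as np

# Votes of three hypothetical predictors on two instances (rows = predictors)
predictions = np.array([
    [0, 1],
    [1, 1],
    [0, 1],
])

# Majority vote per instance: the most frequent class label in each column
ensemble_pred = np.array([np.bincount(col).argmax() for col in predictions.T])
print(ensemble_pred)  # [0 1]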

Predictors can all be trained in parallel, on different CPU cores. Similarly, predictions can be made in parallel. This is one of the reasons why bagging and pasting scale so well.


Getting the data

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
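The snippets below follow the original flow and train on the full set X, y. If you would rather keep a held-out test set to compare against the out-of-bag score later, a standard split would look like this (the X_train/X_test names are my own and are not used in the rest of the post):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)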

Bagging / Pasting in Scikit-Learn

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator in older scikit-learn versions (< 1.2)
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1
)

Let’s walk through the code:

  • Our base estimator is a DecisionTreeClassifier (estimator; called base_estimator in older scikit-learn versions)
  • We train 500 DecisionTreeClassifiers (n_estimators)
  • Each DecisionTreeClassifier is trained on 100 training instances (max_samples)
  • Sampling is performed with replacement (bootstrap=True)
    • If we want to perform pasting instead, we can set bootstrap to False
  • We use all the available CPU cores (n_jobs=-1)
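The snippet above only constructs the ensemble. A minimal usage sketch, fitting it on the moons data from earlier and predicting a few instances, looks like this:

# Train all 500 trees (in parallel, as configured) and predict the first 5 instances
bag_clf.fit(X, y)
print(bag_clf.predict(X[:5]))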

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting. But the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced. Overall, bagging often results in better models.
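If you want to verify this on your own data rather than take it on faith, one option (a sketch, not a definitive benchmark) is to cross-validate the two sampling modes and compare the mean scores:

from sklearn.model_selection import cross_val_score

# Identical ensembles except for the sampling mode: True = bagging, False = pasting
for bootstrap in (True, False):
    clf = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=500,
        max_samples=100,
        bootstrap=bootstrap,
        n_jobs=-1
    )
    scores = cross_val_score(clf, X, y, cv=5)
    print("bagging" if bootstrap else "pasting", scores.mean())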


Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. The probability that a particular instance is never picked in those m draws is (1 - 1/m)^m, which approaches 1/e ≈ 0.37 as m grows, so only about 63% of the training instances are sampled on average for each predictor. The remaining ~37% that are never sampled for a given predictor are called out-of-bag (oob) instances (note that they are not the same 37% for every predictor).
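A quick numeric sanity check of that figure (with m = 1000, matching the dataset size above):

# Probability that a given instance is drawn at least once in m draws with replacement
m = 1000
print(1 - (1 - 1/m) ** m)   # ≈ 0.632, i.e. about 63%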

Since a predictor never sees its oob instances during training, it can be evaluated on them without the need for a separate validation set. We can then evaluate the ensemble itself by averaging the out-of-bag evaluations of all predictors.

In Scikit-Learn, we can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The resulting evaluation score is available through the oob_score_ attribute.

bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator in older scikit-learn versions (< 1.2)
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)

Output:
>>> print(bag_clf.oob_score_)
0.854

For comparison, here is the accuracy measured on the same data the ensemble was trained on (an optimistic estimate, unlike the oob score, which is computed only on instances each predictor never saw during training):

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X)
accuracy = accuracy_score(y, y_pred)
print(accuracy)

Output:
>>> print(accuracy)
0.874
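Since the base DecisionTreeClassifier can estimate class probabilities (it has a predict_proba() method), the oob evaluation also stores a class-probability estimate for each training instance, available through the oob_decision_function_ attribute:

# Out-of-bag class probability estimates for the first three training instances
print(bag_clf.oob_decision_function_[:3])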

Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Feature sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for the features instead of the training instances. Thus, each predictor will be trained on a random subset of the input features.

This is very useful when we are dealing with high-dimensional inputs. Sampling both training instances and features is called the Random Patches method. Keeping all the training instances but sampling features is called the Random Subspaces method.
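For example, the two methods can be configured roughly like this (a sketch; the fractions are arbitrary, and since the moons data only has two features the feature sampling here is purely illustrative; it pays off with high-dimensional inputs):

# Random Subspaces: keep every training instance, sample only the features
random_subspaces_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0, bootstrap=False,           # all instances, no resampling
    max_features=0.5, bootstrap_features=True,  # random subset of the features
    n_jobs=-1
)

# Random Patches: sample both the training instances and the features
random_patches_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.75, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    n_jobs=-1
)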

