October 15, 2024

train_test_split(): making data splitting easier in your ML pipeline


Splitting the data into train and test sets is one of the first steps in a Machine Learning workflow. While training our algorithm, we keep aside a part of our data (the test data) and train only on the train data. We do this so we can evaluate the final model on data that is completely new to it, which avoids any bias that may creep in while training the model. In some cases, we divide our data into train, validation, and test sets. The validation set is used to validate the model before testing it on the test set.
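A common way to get all three sets is simply to call train_test_split twice: first to carve off the test set, then to split the remainder into train and validation sets. The sizes below (0.2 and 0.25, giving a 60/20/20 split) are illustrative choices, not values from this article's later examples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A small toy dataset: 20 rows, 2 feature columns, balanced binary labels
x = np.arange(1, 41).reshape(20, 2)
y = np.array([0, 1] * 10)

# First split off the test set (20% of the data)...
x_temp, x_test, y_temp, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# ...then split the remainder into train and validation.
# 0.25 of the remaining 80% is 20% of the original data.
x_train, x_val, y_train, y_val = train_test_split(x_temp, y_temp, test_size=0.25, random_state=0)

print(len(x_train), len(x_val), len(x_test))  # 12 4 4
```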

Preparing the data

Before we proceed with splitting the data, let’s quickly prepare some data. For ease of this article, we’ll generate a small toy dataset with NumPy rather than loading a real dataset from a CSV file.

# Importing Library
import numpy as np

# Defining Data
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

Output:

x
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16],
       [17, 18],
       [19, 20],
       [21, 22],
       [23, 24]])

y
array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

We can see that our data has 12 rows and 2 columns.

train_test_split()

The general syntax is:

train_test_split(*arrays, test_size, train_size, random_state, shuffle, stratify)

Here’s the link for the official documentation.

Let’s take a deeper dive into train_test_split()
Parameters:

*arrays : sequence of indexables with the same length / shape[0]
    Allowed inputs are lists, NumPy arrays, SciPy sparse matrices, or pandas
    DataFrames.

test_size : float or int, default=None (optional)
    – If float, should be between 0.0 and 1.0 and represent the proportion of
      the dataset to include in the test split.
    – If int, represents the absolute number of test samples.
    – If None, the value is set to the complement of the train size. If
      train_size is also None, it will be set to 0.25.

train_size : float or int, default=None (optional)
    – If float, should be between 0.0 and 1.0 and represent the proportion of
      the dataset to include in the train split.
    – If int, represents the absolute number of train samples.
    – If None, the value is automatically set to the complement of the test
      size.

random_state : int, RandomState instance or None, default=None (optional)
    Controls the shuffling applied to the data before applying the split.
    Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=True (optional)
    Whether or not to shuffle the data before splitting. If shuffle=False,
    then stratify must be None.

stratify : array-like, default=None (optional)
    If not None, data is split in a stratified fashion, using this as the
    class labels.

Returns:

splitting : list, length = 2 * len(arrays)
    A list containing the train-test split of the inputs, returned in a fixed
    order: X_train, X_test, y_train, y_test.

Now let’s look at train_test_split in action without any of the optional parameters.

# Importing Library
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)

Output:

x_train
array([[21, 22],
       [ 7,  8],
       [11, 12],
       [19, 20],
       [ 5,  6],
       [17, 18],
       [ 1,  2],
       [23, 24],
       [ 3,  4]])

x_test
array([[13, 14],
       [ 9, 10],
       [15, 16]])

y_train
array([1, 0, 0, 0, 1, 1, 0, 0, 1])

y_test
array([0, 1, 1])
  • Running the above code will produce a different train/test split every time, but sometimes we need to reproduce our results. In such cases we can use the random_state parameter.
    • We can set random_state to any non-negative integer.
  • Our original data had 12 rows, and we got a training sample of 9 rows and a test sample of 3 rows. That’s because the default test_size is 0.25, i.e. 25% of the rows go to the test set. This ratio works well for many applications, but it can be tweaked using the test_size or train_size parameters.
    • If we provide both test_size and train_size and their sum exceeds 1 (or 100% of the rows), it will cause an error.
    • If an integer value is provided, the resulting set will have exactly that many rows.
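The integer form of these parameters can be sketched as follows; the train_size of 8 and test_size of 4 here are illustrative values, chosen so that they sum to our 12 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Integer sizes request an exact number of rows rather than a fraction
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=8, test_size=4, random_state=0
)

print(x_train.shape, x_test.shape)  # (8, 2) (4, 2)
```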

Let’s now add a random_state parameter and set a test size of 4 rows

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4, test_size=4)

Output:

x_train
array([[17, 18],
       [ 5,  6],
       [23, 24],
       [ 1,  2],
       [ 3,  4],
       [11, 12],
       [15, 16],
       [21, 22]])

x_test
array([[ 7,  8],
       [ 9, 10],
       [13, 14],
       [19, 20]])

y_train
array([1, 1, 0, 0, 1, 0, 1, 1])

y_test
array([0, 1, 0, 0])

With this change, we will get the same result every time we run the code. Also, we can see that the test set now has only 4 rows.

The samples of datasets are shuffled randomly (unless shuffle is set to False, then the dataset is not shuffled) and then split into the training and test sets according to the size you defined.
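To see what disabling the shuffle looks like, here is a small sketch: with shuffle=False the split is a simple cut, with the first rows going to the train set and the last rows to the test set, in the original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# No shuffling: the test set is simply the last 4 rows of the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4, shuffle=False)

print(x_test)
# [[17 18]
#  [19 20]
#  [21 22]
#  [23 24]]
```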

You can see that y has six zeros and six ones. However, the test set has three zeros out of four items. If you want to (approximately) keep the proportion of y values through the training and test sets, then pass stratify=y. This will enable stratified splitting:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4, test_size=4, stratify=y)

Output:

x_train
array([[21, 22],
       [ 1,  2],
       [15, 16],
       [13, 14],
       [17, 18],
       [19, 20],
       [23, 24],
       [ 3,  4]])

x_test
array([[11, 12],
       [ 7,  8],
       [ 5,  6],
       [ 9, 10]])

y_train
array([1, 0, 1, 0, 1, 0, 0, 1])

y_test
array([0, 0, 1, 1])

Now we can see that y_train has 50% 1s and y_test also has 50% 1s. Stratified splits are desirable in some cases, like when you’re classifying an imbalanced dataset, i.e. a dataset with a significant difference in the number of samples belonging to each class. Compare this with our earlier non-stratified split, where y_train had 62.5% 1s and y_test only 25%.

Please Note: train_test_split will throw an error if the least populated class in y has only 1 member; the minimum number of members for any class cannot be less than 2. If your target values are continuous, you will need to bin them and use the bins for the stratified split.
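A quick sketch of this failure mode: here the label 2 is added to only one row, so a stratified split cannot place that class in both sets and scikit-learn raises a ValueError.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
# Class 2 appears only once, so it cannot appear in both train and test sets
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 2])

try:
    train_test_split(x, y, stratify=y)
except ValueError as e:
    print(e)  # message about the least populated class having only 1 member
```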

There are other methods like StratifiedShuffleSplit, which can be used to split the data into stratified splits.
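As a sketch of that alternative API: StratifiedShuffleSplit yields index arrays rather than the split data itself, which makes it easy to generate several stratified splits in a loop (the n_splits=1 and test_size=4 values below are illustrative).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Each iteration yields (train_indices, test_indices) with class
# proportions preserved in both halves
sss = StratifiedShuffleSplit(n_splits=1, test_size=4, random_state=4)
for train_idx, test_idx in sss.split(x, y):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

print(sorted(y_test))  # [0, 0, 1, 1] — the 50/50 class ratio is preserved
```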

Conclusion

Hopefully, this article gave you the basic knowledge you need to split the data in your Machine Learning pipeline.
