Examples of splitting a dataset into train and test with Scikit-learn

by Alex

If you’re splitting a dataset into data for training and testing, there are a few things to keep in mind. This article discusses three best practices to consider when partitioning a dataset, approaches to the related problems, and their practical implementation in Python. For our examples, we will use the train_test_split function from the Scikit-learn library, which is very useful for splitting datasets whether or not you use Scikit-learn for your other machine learning tasks. You could, of course, perform such splits some other way (perhaps using NumPy alone), but Scikit-learn includes features that make it a bit easier.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.67,
                                                    random_state=42)

You may have used this module for splitting data in the past, but overlooked some details.

Random row shuffling

The first thing to pay attention to is: are your instances shuffled? You should shuffle unless there is a reason not to (for example, the rows represent time intervals). In particular, we should make sure our instances are not ordered by class, which can introduce unwanted bias into the model. For example, look at how the iris dataset arranges its instances when loaded:

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset classes: {iris.target}")
Dataset classes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

If such a dataset, three classes with an equal number of instances in each, is divided into two samples, 2/3 for training and 1/3 for testing, the resulting subsets will have zero overlap of class labels: the test set would contain only a class the model never saw during training. This is obviously unacceptable when training a model for class prediction. Fortunately, the train_test_split function shuffles the data by default (you can override this by setting the shuffle parameter to False).
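To see the problem concretely, here is a small sketch (my own illustration, not from the original text) of what happens if we disable shuffling on the sorted iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# shuffle=False keeps the original (sorted) row order,
# so the split simply cuts the dataset at the 2/3 mark.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, shuffle=False)

print("Classes in training:", np.unique(y_train))  # [0 1]
print("Classes in testing: ", np.unique(y_test))   # [2]
```

The training set never sees class 2, and the test set contains nothing else, which is exactly the failure mode described above.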

  • Both the feature vector and the target vector (X and y) must be passed into the function.
  • You should set the random_state argument for reproducibility.
  • You must also define either train_size or test_size, but specifying both is unnecessary. If you do set both explicitly, they should add up to 1.
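As a quick illustration of the last point (a sketch of my own), setting only test_size makes train_size the complement automatically:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows

# Only test_size is given; the training share defaults to the complement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(len(X_train), len(X_test))  # 100 50
```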

We can verify that the classes are now shuffled:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.67,
                                                    random_state=42)
print(f"Classes in y_train:\n{y_train}")
print(f"Classes in y_test:\n{y_test}")
Classes in y_train:
[1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1 2 0 1 2 0 2 2 1 1 2 1 0 1 2 0 0 1 1 0 2
0 0 1 1 2 1 2 2 1 0 0 2 2 0 0 0 1 2 0 2 2 0 1 1 2 1 2 0 2 1 2 1 1 1 0 1 1
0 1 2 2 0 1 2 2 0 2 0 1 2 2 1 2 1 1 2 2 0 1 2 0 1 2]
Classes in y_test:
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
0 0 0 2 1 1 0 0 1 2 2 1 2]

Stratification (even distribution) of classes

The second consideration is this: are the classes evenly distributed between the training and testing datasets?

import numpy as np

print(f"Number of rows in y_train by class: {np.bincount(y_train)}")
print(f"Number of rows in y_test by class: {np.bincount(y_test)}")
Number of rows in y_train by class: [31 35 34]
Number of rows in y_test by class: [19 15 16]

This is not an even split. The key question is whether our algorithm gets an equal opportunity to learn the features of each class, and then to be evaluated on an equal number of instances of each class. While this is especially important for small datasets, it is worth paying attention to in general. We can preserve the class proportions when splitting into training and testing datasets by using the stratify parameter of the train_test_split function; here we stratify according to the class distribution in y.

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.67,
                                                    random_state=42,
                                                    stratify=y)
print(f"Number of rows in y_train by class: {np.bincount(y_train)}")
print(f"Number of rows in y_test by class: {np.bincount(y_test)}")
Number of rows in y_train by class: [34 33 33]
Number of rows in y_test by class: [16 17 17]

This looks better, and the counts tell us that the split is as even as possible.
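To confirm this numerically, here is a small check (my own addition) that the stratified split preserves the overall class proportions of y in both subsets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.67,
                                                    random_state=42,
                                                    stratify=y)

# Class frequencies in the full dataset and in each split;
# with stratify=y they should agree closely.
full = np.bincount(y) / len(y)
train = np.bincount(y_train) / len(y_train)
test = np.bincount(y_test) / len(y_test)
print(full, train, test, sep="\n")
```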

An additional split

The third consideration relates to the test data (validation sampling). Does it make sense to have only one test dataset for our task, or should we prepare two such sets: one to validate our models during fine-tuning, and another as a final dataset for comparing models and choosing the best one?

If we define two such sets, the second sample is held back until all assumptions are checked, all hyperparameters are tuned, and all models are trained for maximum performance. It is then shown to the models only once, as the last step of our experiments.

If you want separate datasets for testing and validation, creating them with train_test_split is easy. We split the entire dataset once to set aside the training sample, then split the remaining data into datasets for testing and validation. Below, using the digits dataset, we allocate 70% for training and temporarily assign the remainder to testing, remembering to apply the practices described above.

from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=42,
                                                    stratify=y)
print(f"Number of rows in y_train by class: {np.bincount(y_train)}")
print(f"Number of rows in y_test by class: {np.bincount(y_test)}")
Number of rows in y_train by class: [124 127 124 128 127 127 127 125 122 126]
Number of rows in y_test by class: [54 55 53 55 54 55 54 54 52 54]

Note that the classes are stratified in the resulting sets. Next, we split the test dataset again:

X_test, X_val, y_test, y_val = train_test_split(X_test, y_test,
                                                train_size=0.5,
                                                random_state=42,
                                                stratify=y_test)
print(f"Number of rows in y_test by class: {np.bincount(y_test)}")
print(f"Number of rows in y_val by class: {np.bincount(y_val)}")
Number of rows in y_test by class: [27 27 27 27 27 28 27 27 26 27]
Number of rows in y_val by class: [27 28 26 28 27 27 27 27 26 27]

Note the stratification of classes across all three datasets, which is as even as possible. You are now ready to train, validate, and test as many machine learning models as you see fit for your data. One more tip: you might consider using cross-validation instead of a simple train/test or train/validation/test strategy. We’ll look at cross-validation next time.
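As a brief preview of that topic, here is a minimal sketch (my own, with a logistic regression model chosen purely for illustration) of Scikit-learn's cross_val_score:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the test set,
# so every row is used for both training and evaluation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)        # five accuracy values, one per fold
print(scores.mean())
```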
