Train, Test and Validation Sets in Machine Learning
What are Train, Test and Validation sets?
When building a machine learning model, once you have prepared your data, you split it into three parts: train, validation and test sets. Each step of model development needs its own dataset. What are these sets, and why do we need them?
- Training set: This is the largest of the three parts. It is the set used to train the model: the model parameters are learned from the patterns in this data.
- Validation set: Training a machine learning model is an iterative process. We train multiple models by trying different combinations of hyperparameters, then evaluate the performance of each model on the validation set. The validation set is therefore used for hyperparameter tuning, or for selecting the best model out of several candidates.
- Test set: Once hyperparameter tuning is done, we select the best model, the one with the best-performing combination of hyperparameters. We then measure the performance of that model on the test set, which was not used for training or tuning.
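The roles of the three sets can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data, the simple nearest-neighbour model, the helper names knn_predict and accuracy, and the candidate values of k are all made up for the example and do not come from any particular library.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Toy data (assumed): the label is 1 when x > 0.5, with about 10% of labels flipped as noise.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

# 70/20/10 split; the points were generated in random order, so slicing is enough here.
train, valid, test = points[:210], points[210:270], points[270:]

# Hyperparameter tuning: keep the k that scores best on the validation set.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train, valid, k))

# Final estimate on the test set, which played no part in training or tuning.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

The test set is touched only once, after best_k has been fixed, so its accuracy is not inflated by the tuning step.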
How do you split your data into train, test and validation sets?
1. Random selection
Random selection, as the name implies, assigns each row of the data to one of the three sets at random.
We can use the train_valid_test_split function from the fast_ml library's model_development module to do that, as follows:
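from fast_ml.model_development import train_valid_test_split

# Replace the placeholders with your DataFrame and the name of its target column.
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    '---dataframe---', target='--target_variable---',
    # ... the remaining arguments are cut off in the source
)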
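To show what a random split does under the hood, here is a minimal library-free sketch. The function name random_split, its parameters and the default sizes are assumptions made for this illustration; it is not how fast_ml implements its own function.

```python
import random

def random_split(rows, train_size=0.8, valid_size=0.1, seed=42):
    """Shuffle a copy of rows and cut it into train, validation and test lists."""
    shuffled = list(rows)                  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_size)
    n_valid = round(len(shuffled) * valid_size)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains becomes the test set
    return train, valid, test

rows = list(range(100))
train, valid, test = random_split(rows)
print(len(train), len(valid), len(test))  # 80 10 10
```

Every row ends up in exactly one of the three lists, so no example leaks from the training set into the validation or test set.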
2. Split using custom code
The other way is to write custom code that ensures the dataset is divided fairly, so that each set contains a similar mix of data (for example, similar proportions of each class) and none of the sets is biased.
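One common way to get a similar mix in each set is a stratified split, which splits each class separately and then combines the pieces. The sketch below is one possible version of such custom code; the function name stratified_split, the label_of argument and the toy data are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_size=0.7, valid_size=0.2, seed=0):
    """Split rows so that each set keeps roughly the same class proportions."""
    # Group the rows by class label.
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    # Shuffle and cut each class on its own, then pool the pieces.
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(len(members) * train_size)
        n_valid = round(len(members) * valid_size)
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]
    return train, valid, test

# Imbalanced toy data: 90 rows of class "a" and 10 rows of class "b".
rows = [(i, "a") for i in range(90)] + [(i, "b") for i in range(90, 100)]
train, valid, test = stratified_split(rows, label_of=lambda row: row[1])

for name, part in (("train", train), ("valid", valid), ("test", test)):
    n_b = sum(1 for _, label in part if label == "b")
    print(name, len(part), n_b)
```

With this toy data each set keeps the 10% share of class "b", which a purely random split of such a small, imbalanced dataset would not guarantee.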
What is the train-validation-test split ratio?
Typically, the data is split between training and test sets in an 80:20 ratio.
Thus, 20% of the data is set aside for testing purposes. The ratio changes with the size of the data: when the dataset is very large, a 90:10 split is also common, with the held-out set representing 10% of the data.
When we have a validation set as well, we mostly go for a 70:20:10 ratio, with 70% for the training set, 20% for the validation set and 10% for the test set.
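As a quick worked example, a ratio can be turned into row counts with integer arithmetic. The helper name split_sizes and the choice to give any rounding remainder to the test set are assumptions made for this sketch.

```python
def split_sizes(n_rows, ratios=(70, 20, 10)):
    """Turn a train/validation/test ratio into row counts that sum to n_rows."""
    total = sum(ratios)
    n_train = n_rows * ratios[0] // total
    n_valid = n_rows * ratios[1] // total
    n_test = n_rows - n_train - n_valid  # the remainder goes to the test set
    return n_train, n_valid, n_test

print(split_sizes(1000))               # (700, 200, 100)
print(split_sizes(1000, (80, 0, 20)))  # (800, 0, 200)
```

Computing the last count as a remainder guarantees that the three sizes always add up to the number of rows, even when the ratio does not divide it evenly.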