
Interview Question: What is the difference between a train set, a validation set, and a test set in machine learning?

Crystal X
2 min read · Nov 17, 2022


In supervised machine learning, a model is trained on labelled data and then used to make predictions on data it has not seen. Kaggle is a good website for practising machine learning competition questions because, for the most part, each competition comes with a train dataset and a test dataset.

The train dataset must first be cleaned and pre-processed, with the dependent variable (the target) separated from the independent variables (the features). It is then split into a training set and a validation set by holding out a small portion of the train dataset for validation purposes.
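The split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `split_train_validation`, the 20% validation fraction, the fixed seed, and the toy rows are all made up for the example. In practice a library routine such as scikit-learn's `train_test_split` is commonly used for the same job.

```python
import random


def split_train_validation(rows, target_index, validation_fraction=0.2, seed=0):
    """Shuffle the rows, hold out a validation portion, and separate
    the features (independent variables) from the target (dependent variable).

    Hypothetical helper written for this example only.
    """
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = rows[:]             # copy so the caller's list is not modified
    rng.shuffle(shuffled)

    n_val = int(len(shuffled) * validation_fraction)
    val_rows, train_rows = shuffled[:n_val], shuffled[n_val:]

    def separate(subset):
        # Every column except target_index is a feature.
        X = [[v for i, v in enumerate(row) if i != target_index] for row in subset]
        y = [row[target_index] for row in subset]
        return X, y

    X_train, y_train = separate(train_rows)
    X_val, y_val = separate(val_rows)
    return X_train, y_train, X_val, y_val


# Toy labelled data: two feature columns followed by a 0/1 target column.
rows = [[i, i * 2, i % 2] for i in range(10)]
X_train, y_train, X_val, y_val = split_train_validation(rows, target_index=2)
print(len(X_train), len(X_val))
```

Shuffling before splitting is a design choice that suits data with no time ordering; for time series, a chronological split is usually the safer assumption.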

The training set is the sample of data that the model is fitted to. Once fitted, the model makes predictions on the validation set.

The validation set is the sample of data used to evaluate the model fit on the training dataset whilst tuning the hyperparameters. The hyperparameters of the model can be fine-tuned to achieve the most accurate predictions on the validation set. Once the user is happy with the accuracy of the predictions made on the validation set, they can then make predictions on the test set.
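As a rough sketch of that tuning loop, the example below tries several values of one hyperparameter and keeps the one that scores best on the validation set. Everything in it is an assumption made for illustration: the model is a tiny hand-written k-nearest-neighbours classifier, the hyperparameter is `k`, and the toy data (including one deliberately noisy training label) is invented.

```python
from collections import Counter


def knn_predict(X_train, y_train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(
        zip(X_train, y_train),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(X_train, y_train, X_eval, y_eval, k):
    """Fraction of evaluation points whose predicted label is correct."""
    correct = sum(
        knn_predict(X_train, y_train, x, k) == y for x, y in zip(X_eval, y_eval)
    )
    return correct / len(y_eval)


def tune_k(X_train, y_train, X_val, y_val, candidates):
    """Score each candidate k on the validation set and return the best one."""
    scores = {k: accuracy(X_train, y_train, X_val, y_val, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate with the top score
    return best_k, scores


# Toy training data; the point 2.4 carries a noisy label on purpose.
X_train = [[0.0], [1.0], [2.0], [2.4], [3.0], [7.0], [8.0], [9.0], [10.0]]
y_train = [0, 0, 0, 1, 0, 1, 1, 1, 1]
X_val = [[1.5], [2.5], [7.5], [8.5]]
y_val = [0, 0, 1, 1]

best_k, scores = tune_k(X_train, y_train, X_val, y_val, candidates=[1, 3, 5])
print(best_k, scores)

# Only after tuning is finished are predictions made on the test set.
X_test = [[0.5], [9.5]]
test_predictions = [knn_predict(X_train, y_train, x, best_k) for x in X_test]
print(test_predictions)
```

With `k=1` the noisy training point misleads one validation prediction, so a larger `k` wins here. The test set is touched once, at the very end, and takes no part in choosing `k`.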

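The test dataset is used to provide an unbiased evaluation of the final model fit on the training dataset. Because it plays no part in training or hyperparameter tuning, it indicates how the model is likely to perform on new data.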



Written by Crystal X

I have over five decades of experience in the world of work, in fast food, the military, business, non-profits, and the healthcare sector.
