As I approach the final lessons of Udacity’s Introduction to Machine Learning course, I have reached the lesson on cross validation. I have to say that cross validation is one area where I am weak, so I am looking forward to studying this technique and honing my skills.
Simply put, cross validation is a statistical method used to estimate the skill of machine learning models. It is commonly used in machine learning to compare and select a model for a given predictive modelling problem because it is easy to understand, easy to implement, and produces skill estimates that generally have lower bias than other available methods.
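To make that concrete, here is a minimal sketch (not code from the course) of using cross validation to estimate and compare model skill. It assumes scikit-learn and its bundled iris dataset, which the course does not necessarily use:

```python
# A minimal sketch of estimating model skill with cross validation,
# assuming scikit-learn and its bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare two candidate models by their cross-validated accuracy.
for model in (SVC(kernel="linear"), DecisionTreeClassifier(random_state=42)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross validation
    print(type(model).__name__, round(scores.mean(), 3))
```

Each model is trained and scored five times on different splits of the data, and the mean score gives a more stable estimate of its skill than any single split would.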
To avoid overfitting, the dataset is broken down into two sections: the training data and the validation data (also called test data). The estimator is fit on the training data, while the validation data is held back and used to make predictions. The diagram below, taken from the sklearn website, is a typical example of how cross validation works in model training:
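As a concrete illustration of that split, here is a minimal sketch, again assuming scikit-learn and the iris dataset (my own example, not from the course): the estimator is fit on the training portion only and then scored on the held-out validation portion.

```python
# A minimal sketch of a train/validation split, assuming scikit-learn
# and its bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as the validation (test) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)          # fit on the training data only
print(clf.score(X_test, y_test))   # evaluate on the unseen validation data
```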
In this post I intend to discuss perhaps the easiest cross validation technique in the sklearn…