Use scipy’s Kolmogorov-Smirnov test to determine if two columns are from the same sample
Have you ever worked on a dataset where the model reaches a high level of accuracy on the training set but not on the test set? I recently came across such a problem in a Kaggle community competition. I tried a number of algorithms and feature selection techniques, but I still could not get a test set accuracy anywhere near the accuracy I achieved when I made predictions on the validation set. The competition I entered, which shows this wide disparity between training and test accuracy, can be found here: https://www.kaggle.com/competitions/99-dapt-sao-ih-hotel-booking
I was not sure exactly what to do about this conundrum, so I searched the internet to see if I could find an answer (and I didn’t ask ChatGPT either, I might add). One thing I discovered is that in machine learning projects the test set sometimes does not come from the same distribution as the training set, and this causes problems. I certainly found this to be the case here, because the number of unique values in the training set was different from the number of unique values in the test set. I tried to accommodate this by aligning some of the columns of the training and test sets, but it did not have a great impact on the accuracy, only improving it a little…
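To make the check described above concrete, here is a minimal sketch of how the two sets could be compared column by column. It counts unique values with pandas and, for numeric columns, runs scipy's two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`). The helper name `compare_columns`, the 0.05 threshold, and the synthetic data are my own assumptions for illustration; they are not taken from the competition.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def compare_columns(train, test, alpha=0.05):
    """Hypothetical helper: compare the columns shared by two DataFrames.

    Reports the number of unique values on each side and, for numeric
    columns, the p-value of the two-sample Kolmogorov-Smirnov test.
    alpha is an assumed significance threshold, not a recommendation.
    """
    rows = []
    for col in train.columns.intersection(test.columns):
        row = {
            "column": col,
            "train_unique": train[col].nunique(),
            "test_unique": test[col].nunique(),
            "ks_pvalue": np.nan,  # stays NaN for non-numeric columns
        }
        both_numeric = (
            pd.api.types.is_numeric_dtype(train[col])
            and pd.api.types.is_numeric_dtype(test[col])
        )
        if both_numeric:
            result = ks_2samp(train[col].dropna(), test[col].dropna())
            row["ks_pvalue"] = result.pvalue
        rows.append(row)

    report = pd.DataFrame(rows).set_index("column")
    # A small p-value suggests the two columns follow different distributions.
    report["differs"] = report["ks_pvalue"] < alpha
    return report


# Synthetic stand-in data (an assumption): one column drawn from the same
# distribution in both sets, one column shifted in the test set.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "same": rng.normal(0, 1, 500),
    "shifted": rng.normal(0, 1, 500),
})
test = pd.DataFrame({
    "same": rng.normal(0, 1, 500),
    "shifted": rng.normal(2, 1, 500),
})

report = compare_columns(train, test)
print(report)
```

With real data, `train` and `test` would be the two competition files loaded with `pd.read_csv`. A column flagged in `differs` is only a hint that the two sets were sampled differently, so it is worth inspecting before deciding how to handle it.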