Count stickers with Kaggle
I almost didn’t get around to entering Kaggle’s monthly competition because I have been busy studying other things, like statistics. I did, however, set my statistics studies aside to enter the Season 5, Episode 1 Playground competition before the cutoff date.
The objective of this competition is to predict the number of stickers sold in various stores in various countries over a period of time.
I wrote the script in Python in a Jupyter notebook and saved it to my Kaggle account.
The first thing that I did after creating the Jupyter notebook in Kaggle was to import the libraries that I would need to execute it and make predictions on the number of stickers sold. The libraries that I imported are:-
- pandas to create dataframes and process data,
- NumPy to create arrays and perform numerical computations,
- os to navigate Kaggle’s file system and locate the competition’s input files,
- scikit-learn (sklearn) to provide machine learning functionality,
- pylab and SciPy to draw QQ plots,
- Matplotlib to visualise the data, and
- seaborn to create statistical visualisations of the data.
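The import cell might look something like this sketch; the exact list is an assumption based on the libraries named above:

```python
import os
import numpy as np
import pandas as pd

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import pylab
import seaborn as sns
import scipy.stats as stats

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
```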
I used pandas to read the CSV files from the Kaggle competition directory into three dataframes: train, test, and submission:-
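A sketch of the loading step. On Kaggle the files would be read from the competition directory (the path is an assumption, e.g. `pd.read_csv("/kaggle/input/playground-series-s5e1/train.csv")`); here a tiny in-memory CSV with the competition's column layout stands in for the real file:

```python
from io import StringIO
import pandas as pd

# Stand-in for the real train.csv; the rows are illustrative, not real data.
train_csv = StringIO(
    "id,date,country,store,product,num_sold\n"
    "0,2010-01-01,Canada,Discount Stickers,Holographic Goose,973\n"
    "1,2010-01-01,Canada,Discount Stickers,Kaggle,906\n"
)
train = pd.read_csv(train_csv)
print(train.shape)  # rows x columns
```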
I checked for null values and dropped the rows containing them from the train dataframe:-
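The null-handling step could be sketched like this, with a small stand-in dataframe in place of the real train data:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"num_sold": [973.0, np.nan, 906.0]})  # stand-in data

print(train.isnull().sum())  # count missing values per column
train = train.dropna()       # drop every row that contains a null
```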
I used Matplotlib to create a histogram to analyse the distribution of the number of stickers sold in the train dataframe:-
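A minimal sketch of the histogram, using randomly generated stand-in values for `num_sold`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for train["num_sold"]; sales counts are often right-skewed,
# so a lognormal sample makes a plausible placeholder.
num_sold = np.random.default_rng(0).lognormal(6, 0.5, 1000)

fig, ax = plt.subplots()
ax.hist(num_sold, bins=50)
ax.set_xlabel("num_sold")
ax.set_ylabel("frequency")
fig.savefig("num_sold_hist.png")
```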
I dropped the id column because pandas automatically indexes each row of a dataframe:-
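Dropping the column is a one-liner; the small dataframe here is a stand-in:

```python
import pandas as pd

train = pd.DataFrame({"id": [0, 1], "num_sold": [973, 906]})  # stand-in data
train = train.drop(columns=["id"])  # pandas keeps its own RangeIndex anyway
```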
I created a function, transform_date, that splits the dates in the train and test dataframes into separate date-based features to aid in making predictions based on the date:-
I applied the transform_date function to the train and test dataframes:-
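A sketch of what transform_date might look like; the exact set of derived features is an assumption, and the one-row dataframe stands in for train (on Kaggle the function would be applied to both train and test):

```python
import pandas as pd

def transform_date(df):
    """Derive numeric calendar features from the 'date' column."""
    dt = pd.to_datetime(df["date"])
    df["year"] = dt.dt.year
    df["month"] = dt.dt.month
    df["day"] = dt.dt.day
    df["day_of_week"] = dt.dt.dayofweek  # Monday = 0 ... Sunday = 6
    df["day_of_year"] = dt.dt.dayofyear
    return df

train = transform_date(pd.DataFrame({"date": ["2010-01-01"], "num_sold": [973]}))
```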
I then dropped the date column from the train and test dataframes because it is no longer needed:-
I used sklearn’s OrdinalEncoder to encode all of the columns of data that are of dtype object:-
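A sketch of the encoding step: the object-dtype columns (country, store, and product in this competition) are mapped to integer codes. The tiny dataframe is a stand-in:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    "country": ["Canada", "Finland"],   # stand-in categorical data
    "store": ["Discount Stickers", "Stickers for Less"],
    "num_sold": [973, 906],
})

obj_cols = train.select_dtypes(include="object").columns
encoder = OrdinalEncoder()  # assigns an integer per category, sorted order
train[obj_cols] = encoder.fit_transform(train[obj_cols])
```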
I defined the dependent and independent variables, y and X respectively:-
I used sklearn’s train_test_split function to split the X and y variables into training and validating datasets:-
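The two steps above might look like this sketch; the synthetic X and y stand in for the real features (in the notebook, X would be every column except num_sold, and y would be num_sold), and the 80/20 split ratio is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # stand-in feature matrix
y = np.arange(20)                 # stand-in target (num_sold)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```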
I defined the model, using sklearn’s ExtraTreesRegressor:-
I made predictions on the validation set:-
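A sketch of the model definition and validation predictions; the hyperparameters and the synthetic regression data are assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # stand-in features
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # stand-in target

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = ExtraTreesRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
val_preds = model.predict(X_val)
```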
The performance metric for this competition is the mean absolute percentage error (MAPE), a regression loss metric used to measure the accuracy of a model’s predictions. It calculates the average percentage difference between the actual and predicted values:-
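For example, MAPE is the mean of |actual − predicted| / |actual|, which scikit-learn exposes directly; the toy values below are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Per-row percentage errors: 10%, 10%, 0% -> mean of 0.0666...
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)
```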
I used SciPy to create a QQ plot of the validation set’s predictions:-
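A QQ plot compares the sample quantiles against a theoretical normal distribution; `scipy.stats.probplot` draws both the points and the least-squares fit line. The random sample stands in for the validation predictions:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

val_preds = np.random.default_rng(0).normal(size=500)  # stand-in predictions

fig, ax = plt.subplots()
stats.probplot(val_preds, dist="norm", plot=ax)  # quantiles vs. normal
fig.savefig("qq_plot.png")
```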
I then used Matplotlib to plot the validation set’s predictions against the actual values, along the regression line:-
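A sketch of that plot: a scatter of actual versus predicted values with a y = x reference line, using synthetic stand-in values:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.uniform(100, 1000, 200)          # stand-in actual values
val_preds = y_val + rng.normal(0, 30, 200)   # stand-in predictions

fig, ax = plt.subplots()
ax.scatter(y_val, val_preds, s=10)
lims = [y_val.min(), y_val.max()]
ax.plot(lims, lims, color="red")  # perfect-prediction line, y = x
ax.set_xlabel("actual num_sold")
ax.set_ylabel("predicted num_sold")
fig.savefig("val_predictions.png")
```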
I made predictions on the test set:-
I prepared the submission by placing the predictions in the num_sold column and converted the submission dataframe to a CSV file:-
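The final two steps might look like this sketch; the id values and predictions are stand-ins (in the notebook, the predictions would come from `model.predict(test)` and the ids from the sample submission file):

```python
import numpy as np
import pandas as pd

test_preds = np.array([812.3, 645.9])        # stand-in test predictions

submission = pd.DataFrame({"id": [0, 1]})    # stand-in sample submission
submission["num_sold"] = test_preds
submission.to_csv("submission.csv", index=False)
```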
I saved my work and submitted it to Kaggle for scoring:-
I scored a MAPE of 0.36, which is not a bad result. Of course, there are other things that could be done to improve the score, such as trying a different algorithm, but I am happy with the result that I achieved in the short amount of time that I had to work on the competition.
I have created a code review to accompany this blog post and it can be viewed here:- https://youtu.be/CF4sFslseZg
