Count stickers with Kaggle

4 min read · Jan 20, 2025

I didn’t get around to entering Kaggle’s monthly competition because I have been busy studying other things, like statistics. I did, however, put my statistics studies away to enter the season 5 episode 1 playground competition before the cutoff date.

The objective of this competition is to predict the number of stickers sold in various stores in various countries over a period of time.

I wrote the script in Python using a Jupyter Notebook and saved it in my Kaggle account.

The first thing that I did after creating the Jupyter notebook in Kaggle was to import the libraries that I would need to execute it and make predictions on the number of stickers sold. The libraries that I imported are:-

  1. Pandas to create dataframes and process data,
  2. Numpy to create numpy arrays and perform numerical computations,
  3. Os to walk the Kaggle input directory and list the data files,
  4. Sklearn to provide machine learning functionality,
  5. Pylab and scipy to draw Q-Q plots,
  6. Matplotlib to visualise the data, and
  7. Seaborn to statistically visualise the data.
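
A minimal sketch of what such an import cell could look like is shown here. It assumes the standard Kaggle layout, where input files live under /kaggle/input, and it uses matplotlib.pyplot in place of pylab; the helper name list_input_files is hypothetical and only there for illustration.

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder


def list_input_files(root="/kaggle/input"):
    """Return the path of every file below root (empty list if root is missing)."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# On Kaggle this prints the competition files; elsewhere it prints nothing.
for path in list_input_files():
    print(path)
```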

I used pandas to read the csv files extracted from the Kaggle directory and converted them to dataframes, being train, test, and submission:-
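
As a rough sketch, loading the three files could look like this. The directory name and the file names (train.csv, test.csv, sample_submission.csv) are assumptions based on the usual playground layout, and load_competition_data is a hypothetical helper, so adjust them to whatever the notebook's input listing shows.

```python
import os

import pandas as pd

# Assumed location of the competition files; check the path in your own notebook.
DATA_DIR = "/kaggle/input/playground-series-s5e1"


def load_competition_data(data_dir=DATA_DIR):
    """Read the three competition CSV files into dataframes."""
    train = pd.read_csv(os.path.join(data_dir, "train.csv"))
    test = pd.read_csv(os.path.join(data_dir, "test.csv"))
    submission = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
    return train, test, submission


# Only runs where the assumed directory exists (for example, on Kaggle).
if os.path.isdir(DATA_DIR):
    train, test, submission = load_competition_data()
    print(train.shape, test.shape, submission.shape)
```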

I checked for null values and dropped every row that contained one from the train dataframe:-
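
The null check and drop can be sketched on a toy dataframe as follows. The rows here are made up for illustration and are not competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train dataframe (values are invented).
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "country": ["Country A", "Country A", "Country B", "Country B"],
    "num_sold": [120.0, np.nan, 35.0, 40.0],
})

# Count the missing values in each column.
null_counts = train.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value and renumber the index.
train = train.dropna().reset_index(drop=True)
print(train.shape)
```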

I used matplotlib to create a histogram to analyse the number of stickers sold in the train dataframe:-
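
A histogram of this kind could be drawn as in the sketch below. The data is synthetic (a log-normal sample chosen only to give a skewed shape) and the bin count is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the num_sold column.
rng = np.random.default_rng(0)
train = pd.DataFrame({"num_sold": rng.lognormal(mean=5.0, sigma=1.0, size=1000)})

fig, ax = plt.subplots(figsize=(8, 4))
counts, bin_edges, _ = ax.hist(train["num_sold"], bins=50)
ax.set_xlabel("num_sold")
ax.set_ylabel("Number of rows")
ax.set_title("Distribution of stickers sold (synthetic data)")
```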

I dropped the id column because pandas automatically indexes each row of a dataframe:-
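
Dropping the column is a one-liner in pandas. In this sketch it is applied to both toy dataframes, which is an assumption; the ids needed for the submission file stay available in the separate submission dataframe.

```python
import pandas as pd

# Toy stand-ins for the train and test dataframes (values are invented).
train = pd.DataFrame({
    "id": [0, 1, 2],
    "date": ["2010-01-01", "2010-01-02", "2010-01-03"],
    "num_sold": [100.0, 90.0, 95.0],
})
test = pd.DataFrame({"id": [3, 4], "date": ["2010-01-04", "2010-01-05"]})

# The default RangeIndex already numbers the rows, so id adds no information.
train = train.drop(columns=["id"])
test = test.drop(columns=["id"])
print(train.columns.tolist(), test.columns.tolist())
```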

I created a function, transform_date, that converts the dates in the train and test dataframes into separate date-based variables to aid the prediction:-

I implemented the transform_date function on the train and test dataframes:-

I then dropped the date column from the train and test dataframes because it is no longer needed:-
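
The three date steps above can be sketched as follows. The exact variables produced by the original transform_date are not shown in the text, so the features chosen here (year, month, day, day of week, day of year) are assumptions, as are the toy rows.

```python
import pandas as pd


def transform_date(df, col="date"):
    """Return a copy of df with calendar features derived from the date column."""
    out = df.copy()
    dates = pd.to_datetime(out[col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["day_of_week"] = dates.dt.dayofweek  # Monday is 0, Sunday is 6
    out["day_of_year"] = dates.dt.dayofyear
    return out


# Toy stand-ins for the train and test dataframes (values are invented).
train = pd.DataFrame({"date": ["2010-01-01", "2012-02-29"], "num_sold": [100.0, 80.0]})
test = pd.DataFrame({"date": ["2017-12-31"]})

# Add the features, then drop the raw date column, which is no longer needed.
train = transform_date(train).drop(columns=["date"])
test = transform_date(test).drop(columns=["date"])
print(train)
```

Splitting a date into numeric parts lets a tree-based model pick up patterns such as weekday or seasonal effects that it could not read from a raw date string.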

I used sklearn’s OrdinalEncoder to encode all of the columns of data that are of dtype object:-
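
A minimal sketch of the encoding step is shown here. The column names and values are invented, and the sketch selects every non-numeric column, which in a freshly loaded CSV corresponds to the text columns. Fitting on train and reusing the fitted encoder on test is a design assumption that keeps the category codes consistent between the two.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the train and test dataframes (values are invented).
train = pd.DataFrame({
    "country": ["Norway", "Kenya", "Norway"],
    "store": ["Shop A", "Shop B", "Shop B"],
    "year": [2010, 2010, 2011],
    "num_sold": [100.0, 20.0, 90.0],
})
test = pd.DataFrame({"country": ["Kenya"], "store": ["Shop A"], "year": [2017]})

# Pick the text columns, i.e. everything that is not numeric.
text_cols = [c for c in train.columns if not pd.api.types.is_numeric_dtype(train[c])]

# Learn the category-to-number mapping on train, then apply it to test.
encoder = OrdinalEncoder()
train[text_cols] = encoder.fit_transform(train[text_cols])
test[text_cols] = encoder.transform(test[text_cols])
print(train)
```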

I defined the dependent and independent variables, being y and X respectively:-

I used sklearn’s train_test_split function to split the X and y variables into training and validating datasets:-
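
Defining X and y and splitting them could look like the sketch below. The data is synthetic, and the 80/20 split and random_state value are assumptions rather than the settings used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded train dataframe.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "country": rng.integers(0, 3, size=100),
    "store": rng.integers(0, 2, size=100),
    "month": rng.integers(1, 13, size=100),
    "num_sold": rng.uniform(10, 500, size=100),
})

# y is the target (dependent variable); X holds the features (independent variables).
y = train["num_sold"]
X = train.drop(columns=["num_sold"])

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)
```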

I defined the model, using sklearn’s ExtraTreesRegressor:-

I made predictions on the validation set.
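
Fitting the model and predicting on the validation set can be sketched like this. The data is synthetic and the hyperparameters (n_estimators, random_state, n_jobs) are assumptions, since the original settings are not given in the text.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the sticker features and sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 100 + 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 1, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define and fit the model (hyperparameters are illustrative).
model = ExtraTreesRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Predict on the held-out validation rows.
val_pred = model.predict(X_val)
print(val_pred[:5])
```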

The performance metric for this competition is the mean absolute percentage error (MAPE), a regression metric that measures the accuracy of a model’s predictions. It averages the absolute difference between the actual and predicted values, expressed as a fraction of the actual value, so a lower score is better:-
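
To make the metric concrete, here is a small worked example with made-up numbers. It computes MAPE by hand and with sklearn's mean_absolute_percentage_error, which returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted values.
y_true = np.array([100.0, 200.0, 50.0, 80.0])
y_pred = np.array([110.0, 180.0, 50.0, 100.0])

# Per-row errors are 0.10, 0.10, 0.00 and 0.25, so the mean is about 0.1125.
mape_manual = np.mean(np.abs((y_true - y_pred) / y_true))
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)
print(mape_manual, mape_sklearn)
```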

I used scipy to create a qq plot of the validation set’s predictions:-
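
A Q-Q plot of this kind can be produced with scipy.stats.probplot, as in the sketch below. The predictions are synthetic, and comparing them against a normal distribution is an assumption about what the original plot used.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic stand-in for the validation predictions.
rng = np.random.default_rng(0)
val_pred = rng.lognormal(mean=5.0, sigma=1.0, size=500)

# probplot draws the ordered values against normal quantiles on the given axes.
fig, ax = plt.subplots(figsize=(6, 6))
(osm, osr), (slope, intercept, r) = stats.probplot(val_pred, dist="norm", plot=ax)
ax.set_title("Q-Q plot of validation predictions (synthetic data)")
```

Points that follow the fitted line suggest the values are close to normally distributed, while a curve away from the line indicates skew.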

I then used matplotlib to plot the validation set’s predictions against the regression line:-
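
One common way to draw such a plot is to scatter predicted against actual values and add the line y = x, where a perfect prediction would fall. That reading of the plot is an assumption, and the data below is synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic actual values and noisy predictions.
rng = np.random.default_rng(0)
y_val = rng.uniform(10, 500, size=200)
val_pred = y_val * rng.normal(1.0, 0.2, size=200)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_val, val_pred, s=10, alpha=0.5, label="Validation predictions")

# Reference line: points on it are predicted exactly right.
low, high = y_val.min(), y_val.max()
ax.plot([low, high], [low, high], color="red", label="Perfect prediction (y = x)")

ax.set_xlabel("Actual num_sold")
ax.set_ylabel("Predicted num_sold")
ax.legend()
```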

I made predictions on the test set:-

I prepared the submission by placing the predictions in the num_sold column and converting the submission dataframe to a csv file:-
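
The test prediction and submission steps can be sketched together as follows. The data, feature names, and model settings are invented for illustration, and the output file name submission.csv is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic training data, test features and submission template.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=["f1", "f2", "f3"])
y = 100 + 50 * X["f1"] + rng.normal(0, 1, size=200)
X_test = pd.DataFrame(rng.uniform(0, 1, size=(5, 3)), columns=X.columns)
submission = pd.DataFrame({"id": range(1000, 1005), "num_sold": 0})

model = ExtraTreesRegressor(n_estimators=50, random_state=42).fit(X, y)

# Predict on the test features and write them into the num_sold column.
submission["num_sold"] = model.predict(X_test)

# index=False keeps the row index out of the file.
submission.to_csv("submission.csv", index=False)
print(submission.head())
```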

I saved my work and submitted it to Kaggle for scoring:-

I scored 0.36 MAPE, meaning my predictions were off by about 36% on average, which is a respectable score. Of course there are other things that could be done to improve (lower) the score, such as using a different algorithm, but I am happy with the result that I achieved in the short amount of time that I had to work on the competition.

I have created a code review to accompany this blog post and it can be viewed here:- https://youtu.be/CF4sFslseZg

Written by Crystal X

I have over five decades of experience in the world of work, in fast food, the military, business, non-profits, and the healthcare sector.
