
Predicting the number of beats per minute in a song with Python

5 min read · Sep 3, 2025

One thing that I think every person loves in one form or another is music; listening to it is one of the ways I relax when I want to rest for a little while. Therefore, when Kaggle announced that their season 5, episode 9 playground competition was to predict the number of beats per minute in a song, I eagerly jumped into solving the machine learning problem.

This playground competition involved a regression problem, so I decided to use a linear regression model, an early statistical technique that has made its way into the much newer field of machine learning.

Happily, I only had to run the completed algorithm once to achieve an acceptable score, so I will review the process that I undertook to make the predictions.

I have written the program in Python in Kaggle’s Jupyter Notebook and saved it to my Kaggle account.

Defining the problem statement is the first step in working on a machine learning project; for this competition, it is to predict the number of beats per minute of a song.

The next step, after the problem statement has been defined, is to import all of the necessary libraries into the first cell of the Jupyter Notebook, being:-

  • Pandas is a data processing library that creates the dataframes used in the program.
  • Numpy is a numerical library that performs numerical computations and creates numpy arrays and distributions.
  • Os provides access to the file system, used here to locate the competition files on Kaggle.
  • Scipy is Python’s scientific library, used to perform scientific and statistical tests.
  • Sklearn provides the machine learning functionality for the program.
  • Statsmodels houses functions that carry out statistical tests on the data.
  • Matplotlib is a visualisation library.
  • Seaborn is a statistical visualisation library.

I used os to locate the competition files on the file system, and then used pandas to read them and convert them into dataframes, being:-

  • Train
  • Test
  • Submission
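
A sketch of this step is below. On Kaggle the files live under `/kaggle/input/` and are found with `os.walk`; here tiny in-memory stand-ins keep the example runnable anywhere, and the feature name `RhythmScore` is purely illustrative:

```python
import io
import pandas as pd

# Stand-ins for the competition CSVs; on Kaggle you would pass the
# file paths discovered with os.walk to pd.read_csv instead.
train_csv = io.StringIO("id,RhythmScore,BeatsPerMinute\n0,0.5,120\n1,0.7,95\n")
test_csv = io.StringIO("id,RhythmScore\n2,0.6\n3,0.4\n")

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)
print(train.shape, test.shape)
```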

I then set a pandas display option to ensure that I could see every column of data in the dataframe:-

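
This is a one-line pandas option:

```python
import pandas as pd

# Show every column when a wide dataframe is printed
pd.set_option("display.max_columns", None)
print(pd.get_option("display.max_columns"))
```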

I conducted a Kolmogorov-Smirnov test and determined that none of the columns of data in the test dataframe were from the same distribution as the train dataframe:-

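
A minimal sketch of the two-sample Kolmogorov-Smirnov test, run here on synthetic samples rather than the actual columns; in the notebook you would pass one numeric column from each dataframe:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for one numeric column from train and from test
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.3, 1.0, 500)

stat, p = stats.ks_2samp(a, b)
# A p-value below 0.05 rejects the hypothesis that both samples
# come from the same distribution
print(round(stat, 3), p < 0.05)
```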

I dropped the id column from the train and test dataframes because pandas automatically indexes every row of a dataframe, making this column redundant.

I defined the target, which is the ‘BeatsPerMinute’ column in the train dataframe.
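
These two steps together look like the sketch below; the `RhythmScore` feature is an illustrative stand-in, while `BeatsPerMinute` is the actual target column:

```python
import pandas as pd

# Minimal stand-ins for the competition dataframes
train = pd.DataFrame({"id": [0, 1], "RhythmScore": [0.5, 0.7],
                      "BeatsPerMinute": [120.0, 95.0]})
test = pd.DataFrame({"id": [2, 3], "RhythmScore": [0.6, 0.4]})

# pandas already supplies a RangeIndex, so the id column is redundant
train = train.drop(columns="id")
test = test.drop(columns="id")

target = train["BeatsPerMinute"]
print(list(train.columns))
```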

I used matplotlib to create a histogram of the target:-

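
A sketch of the histogram, using a synthetic normal sample as a stand-in for the target column:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(120, 25, 1000)  # stand-in for the BeatsPerMinute column

fig, ax = plt.subplots()
ax.hist(target, bins=30)
ax.set_xlabel("BeatsPerMinute")
ax.set_ylabel("Count")
fig.savefig("target_hist.png")
```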

I used seaborn to create a heatmap of the train dataframe, and it can be seen that the features are not highly correlated:-

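
A sketch of the correlation heatmap; the feature names here are illustrative stand-ins for the competition's columns:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
# Independent random columns, so correlations off the diagonal are near zero
train = pd.DataFrame(rng.normal(size=(200, 4)),
                     columns=["RhythmScore", "Energy", "MoodScore", "BeatsPerMinute"])

fig, ax = plt.subplots()
sns.heatmap(train.corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
fig.savefig("corr_heatmap.png")
```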

I defined the dependent and independent variables:-


I used sklearn’s train_test_split to split the dataset into training and validating sets:-

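
The variable definitions and the split can be sketched together; feature names other than `BeatsPerMinute` are again illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
train = pd.DataFrame({"RhythmScore": rng.random(100),
                      "Energy": rng.random(100),
                      "BeatsPerMinute": rng.normal(120, 25, 100)})

X = train.drop(columns="BeatsPerMinute")  # independent variables
y = train["BeatsPerMinute"]               # dependent variable (target)

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)
print(X_train.shape, X_val.shape)
```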

I defined the model as sklearn’s LinearRegression model:-


I made predictions on the validation set, defining them as y_pred:-

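
Defining the model and predicting on the validation set is only a few lines; the synthetic data below stands in for the split produced earlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((100, 2))
y = 120 + 10 * X[:, 0] + rng.normal(0, 5, 100)  # synthetic stand-in target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

model = LinearRegression()       # ordinary least squares regression
model.fit(X_train, y_train)      # learn coefficients from the training set
y_pred = model.predict(X_val)    # predictions on the held-out validation set
print(y_pred.shape)
```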

I calculated the error, being the root mean squared error. It goes without saying that the lower the error, the better.

I also calculated the R² score, the coefficient of determination, which tells how well the model’s predictions approximate the actual data. Ideally, the score should be as close to 1 as possible, but in this instance the score is close to 0, which reveals that the model predicts no better than the mean:-

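
Both metrics come from sklearn; the numbers below are made up purely to illustrate the calculation (an R² at or below zero means the model does no better than predicting the mean):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up validation targets and near-constant predictions
y_val = np.array([118.0, 95.0, 130.0, 102.0])
y_pred = np.array([120.0, 118.0, 121.0, 119.0])

# RMSE: square root of the mean squared error, in the target's units
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
# R²: 1.0 is perfect; near or below 0 is no better than the mean
r2 = r2_score(y_val, y_pred)
print(rmse, r2)
```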

I used matplotlib to visualise the predictions, and the low R² score was confirmed in this instance:-

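
A sketch of such a plot, with synthetic values mimicking near-constant predictions scattered against the actual targets:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
y_val = rng.normal(120, 25, 200)  # stand-in actual values
# Near-constant predictions hovering around the mean, as a low R² implies
y_pred = np.full_like(y_val, y_val.mean()) + rng.normal(0, 2, 200)

fig, ax = plt.subplots()
ax.scatter(y_val, y_pred, alpha=0.5)
ax.set_xlabel("Actual BeatsPerMinute")
ax.set_ylabel("Predicted BeatsPerMinute")
fig.savefig("pred_vs_actual.png")
```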

I used statsmodels to create a Q-Q plot of the predictions, and it can be seen that they approximately follow a normal distribution:-


I made predictions on the test set and used matplotlib to create a histogram, and it can be seen that the predictions form an approximately normal distribution that hovers around the mean:-


I prepared the predictions to be submitted to Kaggle, and when I submitted my work, I achieved a root mean squared error (RMSE) of 26, which is in line with the other scores that appeared on the leaderboard of this competition:-

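
Predicting on the test set and writing the submission file can be sketched as below; the data is synthetic, and in the notebook the ids and the submission layout come from the sample submission file:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X_train = rng.random((100, 2))
y_train = 120 + 10 * X_train[:, 0] + rng.normal(0, 5, 100)
X_test = rng.random((10, 2))
test_ids = np.arange(10)  # in the competition these come from the test file

model = LinearRegression().fit(X_train, y_train)
test_pred = model.predict(X_test)

# Submission format: one predicted BeatsPerMinute per test id
submission = pd.DataFrame({"id": test_ids, "BeatsPerMinute": test_pred})
submission.to_csv("submission.csv", index=False)
print(submission.shape)
```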

If I had the time, I would experiment with other models to see if I can improve the score.

The notebook that accompanies this blog post can be viewed here:- https://www.kaggle.com/code/tracyporter/play-5-9-linear-regression


Written by Crystal X

I have over five decades of experience in the world of work, spanning fast food, the military, business, non-profits, and the healthcare sector.
