Predicting the number of beats per minute in a song with Python
One thing that I think every person loves in one form or another is music, and listening to it is one of the ways I relax when I want to rest for a little while. Therefore, when Kaggle announced that their Season 5, Episode 9 playground competition was to predict the number of beats per minute in a song, I eagerly jumped into solving the machine learning problem.
This playground competition involved a regression problem, so I decided to use a linear regression model, an early statistical model that has made its way into the much newer field of machine learning.
Happily, I only had to run the completed algorithm once to achieve an acceptable score, so I will review the process that I undertook to make the predictions.
I have written the program in Python in Kaggle’s Jupyter Notebook and saved it to my Kaggle account.
Defining the problem statement is the first step in working on any machine learning project; for this competition, it is to predict the number of beats per minute for a song.
The next step, after the problem statement has been declared, is to import all of the necessary libraries into the first cell of the Jupyter Notebook, being:-
- Pandas is a data processing library that creates dataframes used in the program.
- Numpy is a numerical library that performs numerical computations, creates numpy arrays, and creates distributions.
- Os provides access to the operating system, used here to locate the files in Kaggle’s input directory.
- Scipy is Python’s scientific library that performs scientific and statistical tests.
- Sklearn provides machine learning functionality to the program.
- Statsmodels houses functions to carry out statistical tests on the data.
- Matplotlib is a visualisation library.
- Seaborn is a statistical visualisation library.
I used os to locate the files that would be used in the program, and then used pandas to read them and convert them into dataframes, being:-
- Train
- Test
- Submission
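A sketch of that loading step. Since Kaggle’s input folder is not available outside the notebook, a temporary directory with tiny stand-in files takes its place here; the file names and the `feature_1` column are assumptions for illustration (the real target column, `BeatsPerMinute`, is used as named in the post):

```python
import os
import tempfile

import pandas as pd

# A temporary directory stands in for Kaggle's input folder
base = tempfile.mkdtemp()
pd.DataFrame({"id": [0, 1], "feature_1": [0.5, 0.6],
              "BeatsPerMinute": [120.0, 95.0]}).to_csv(
    os.path.join(base, "train.csv"), index=False)
pd.DataFrame({"id": [2], "feature_1": [0.4]}).to_csv(
    os.path.join(base, "test.csv"), index=False)
pd.DataFrame({"id": [2], "BeatsPerMinute": [0.0]}).to_csv(
    os.path.join(base, "sample_submission.csv"), index=False)

# os.walk lists every file under the input folder
for dirname, _, filenames in os.walk(base):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read each file into its own dataframe
train = pd.read_csv(os.path.join(base, "train.csv"))
test = pd.read_csv(os.path.join(base, "test.csv"))
submission = pd.read_csv(os.path.join(base, "sample_submission.csv"))
```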
I then used pandas to ensure that I could see every column of data in the dataframe:-
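The pandas option that stops column truncation is a one-liner:

```python
import pandas as pd

# None removes the limit, so every column of a dataframe is displayed
pd.set_option("display.max_columns", None)
```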
I conducted a Kolmogorov–Smirnov test and determined that none of the columns of data in the test dataframe were from the same distribution as the train dataframe:-
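The two-sample Kolmogorov–Smirnov test comes from scipy.stats. A minimal sketch on stand-in data, with a hypothetical feature column deliberately drawn from two different distributions so the test rejects the null hypothesis:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative stand-ins for the train/test dataframes (hypothetical feature,
# drawn from distributions with different means on purpose)
rng = np.random.default_rng(0)
train = pd.DataFrame({"feature_1": rng.normal(0.0, 1.0, 500)})
test = pd.DataFrame({"feature_1": rng.normal(0.5, 1.0, 500)})

# Two-sample Kolmogorov-Smirnov test per shared column; a small p-value
# rejects the hypothesis that the two samples share the same distribution
for col in test.columns:
    statistic, p_value = stats.ks_2samp(train[col], test[col])
    print(col, round(statistic, 3), round(p_value, 5))
```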
I dropped the id column from the train and test dataframes because pandas automatically indexes every row in its dataframes, making this column of data redundant.
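Dropping the redundant column might look like this, sketched on stand-in dataframes with a hypothetical feature name:

```python
import pandas as pd

# Stand-in dataframes; pandas already supplies a RangeIndex,
# so the 'id' column carries no extra information
train = pd.DataFrame({"id": [0, 1], "feature_1": [0.5, 0.6],
                      "BeatsPerMinute": [120.0, 95.0]})
test = pd.DataFrame({"id": [2], "feature_1": [0.4]})

train = train.drop(columns="id")
test = test.drop(columns="id")
```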
I defined the target, which is the ‘BeatsPerMinute’ column in the train dataframe.
I used matplotlib to create a histogram of the target:-
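A sketch of the target histogram, using synthetic values in place of the real `train['BeatsPerMinute']` column; the Agg backend keeps the script runnable without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for train['BeatsPerMinute']
rng = np.random.default_rng(0)
target = rng.normal(120.0, 25.0, 1000)

fig, ax = plt.subplots()
ax.hist(target, bins=30)
ax.set_xlabel("BeatsPerMinute")
ax.set_ylabel("Frequency")
fig.savefig("target_hist.png")
```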
I used seaborn to create a heatmap of the train dataframe, and it can be seen that the features are not highly correlated:-
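The heatmap step, sketched on a stand-in dataframe with hypothetical feature names; off-diagonal values near 0 are what "not highly correlated" looks like:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in train dataframe (hypothetical feature names, independent columns)
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, 3)),
                     columns=["feature_1", "feature_2", "BeatsPerMinute"])

# Pairwise correlations, then a heatmap of the correlation matrix
corr = train.corr()
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax)
fig.savefig("heatmap.png")
```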
I defined the dependent and independent variables:-
I used sklearn’s train_test_split to split the dataset into training and validating sets:-
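The variable definitions and the split, sketched together on stand-in data; the 80/20 split proportion and the random_state are my assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in train dataframe (hypothetical feature name)
rng = np.random.default_rng(0)
train = pd.DataFrame({"feature_1": rng.normal(size=100),
                      "BeatsPerMinute": rng.normal(120.0, 25.0, 100)})

# Independent variables: every column except the target
X = train.drop(columns="BeatsPerMinute")
# Dependent variable: the target itself
y = train["BeatsPerMinute"]

# Hold back 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```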
I defined the model as sklearn’s LinearRegression model:-
I made predictions on the validation set, defining them as y_pred:-
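Fitting the model and predicting on the validation set, with synthetic data standing in for the real splits (the known coefficients are there only so the fit has something to recover):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data with a known linear relationship
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 2))
y_train = X_train @ np.array([3.0, -2.0]) + 120 + rng.normal(0.0, 1.0, 80)
X_val = rng.normal(size=(20, 2))

# Ordinary least squares fit on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the validation split, kept as y_pred
y_pred = model.predict(X_val)
```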
I calculated the error, being the root mean squared error. It goes without saying that the lower the error, the better.
I also calculated the R² score, which is the coefficient of determination: it tells how well the model’s predictions approximate the actual data. Ideally, the score should be as close to 1 as possible, but in this instance the score is close to 0, which reveals that the model predicts no better than the mean:-
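Both metrics come from sklearn.metrics. A small worked example with illustrative numbers (these are not the competition's actual values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actuals and predictions
y_val = np.array([120.0, 95.0, 140.0, 110.0])
y_pred = np.array([118.0, 99.0, 135.0, 112.0])

# Root mean squared error: lower is better
rmse = np.sqrt(mean_squared_error(y_val, y_pred))  # sqrt(49/4) = 3.5

# R-squared (coefficient of determination): 1 is a perfect fit,
# 0 means the model predicts no better than the mean of y_val
r2 = r2_score(y_val, y_pred)
print(rmse, r2)
```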
I used matplotlib to visualise the predictions, and the low R² score was confirmed visually in this instance:-
I used statsmodels to create a qqplot of the predictions, and it can be seen that they form a normal distribution:-
I predicted on the test set and used matplotlib to create a histogram, and it can be seen that the predictions form a normal distribution that hovers around the mean:-
I prepared the predictions for submission, and when I submitted my work to Kaggle, I achieved a root mean squared error (RMSE) of 26, which is in line with the other scores on the leaderboard of this competition:-
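Preparing a Kaggle submission usually amounts to overwriting the placeholder column in the sample submission and writing a CSV; the prediction values here are illustrative stand-ins:

```python
import pandas as pd

# Stand-in submission frame and illustrative test-set predictions
submission = pd.DataFrame({"id": [0, 1, 2],
                           "BeatsPerMinute": [0.0, 0.0, 0.0]})
test_pred = [118.2, 121.7, 119.9]

# Overwrite the placeholder column and write the file Kaggle expects
submission["BeatsPerMinute"] = test_pred
submission.to_csv("submission.csv", index=False)
```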
If I had the time, I would experiment with other models to see if I can improve the score.
I have created a video to accompany this blog post and it can be viewed here:- https://www.kaggle.com/code/tracyporter/play-5-9-linear-regression
