Using the CatBoost regressor to predict the probability of a road accident
When I enter Kaggle’s monthly playground competitions I normally use models from Python’s machine learning library, sklearn. When I had a look at this month’s dataset, however, I noticed that there were quite a few columns of categorical data. I could have used sklearn’s ordinal encoder to encode all of the columns of dtype object, but decided to take a chance on CatBoost as the model because I had heard that it can train on categorical data directly.
CatBoost is a powerful, fast, and highly accurate machine learning algorithm developed by Yandex, designed specifically for handling categorical data with minimal preprocessing. CatBoost stands for categorical boosting. It is a gradient boosting algorithm that builds an ensemble of decision trees to make predictions.
The key features of CatBoost are:-
- It employs automatic categorical encoding
- It employs efficient gradient boosting
- It is robust to overfitting
- It is fast and scalable
- It is a great model to use for tabular data
I have created a Jupyter Notebook and saved the Python code in my Kaggle account.
The first thing I did was to import the libraries that I would need to execute the program, namely:-
- NumPy, which provides arrays and performs numerical computations,
- pandas, which provides dataframes and series and processes data,
- os, to interact with the operating system,
- sklearn, to provide machine learning functionality,
- CatBoost, to train the model, and
- Matplotlib and seaborn, to visualise the data.
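A minimal version of the import cell might look like this (the specific sklearn submodules are my assumption, based on the steps that follow):

```python
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
```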
I converted the columns of dtype boolean to integer in both the train and test dataframes:-
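A sketch of that conversion, assuming the competition files are named train.csv and test.csv:

```python
# Assumed file names for the competition data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Cast every boolean column to 0/1 integers in both dataframes
for df in (train, test):
    bool_cols = df.select_dtypes(include='bool').columns
    df[bool_cols] = df[bool_cols].astype(int)
```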
I dropped the column, ‘id’, in both the train and test dataframes because pandas automatically indexes rows:-
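Something along these lines; keeping a copy of the test ids is my assumption, because they are needed again for the submission file at the end:

```python
# Keep a copy of the test ids for the submission file, then drop 'id'
test_ids = test['id']
train = train.drop(columns='id')
test = test.drop(columns='id')
```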
I used matplotlib to create a histogram of the column, ‘accident_risk’, which is the target. It can be seen in the diagram below that the distribution is skewed to the right:-
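A histogram like the one described could be produced with, for example (the bin count is arbitrary):

```python
plt.hist(train['accident_risk'], bins=50)
plt.xlabel('accident_risk')
plt.ylabel('count')
plt.title("Distribution of the target, 'accident_risk'")
plt.show()
```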
I defined the variable, cat_cols, to create a list of all of the features of dtype object:-
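One way to build that list:

```python
# All remaining object-dtype columns are treated as categorical
cat_cols = train.select_dtypes(include='object').columns.tolist()
print(cat_cols)
```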
I used pandas to one hot encode all of the categorical columns of data:-
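A sketch using pandas’ get_dummies; it assumes train and test share the same category levels, otherwise the columns would need to be aligned afterwards:

```python
# One-hot encode the categorical columns in both dataframes;
# dtype=int keeps the new columns numeric rather than boolean
train = pd.get_dummies(train, columns=cat_cols, dtype=int)
test = pd.get_dummies(test, columns=cat_cols, dtype=int)
```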
I then defined the dependent (y) and independent (X) variables:-
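With ‘accident_risk’ as the target, that looks like:

```python
y = train['accident_risk']
X = train.drop(columns='accident_risk')
```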
I used sklearn to split the dataset into training and validating sets:-
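Using sklearn’s train_test_split; the 80/20 split and the random seed are assumptions:

```python
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```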
I used catboost to define the model and train it:-
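A sketch of the model definition and training; the hyperparameters shown are illustrative rather than the exact values I used:

```python
# Illustrative hyperparameters, not the exact values from the notebook
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    loss_function='RMSE',
    random_seed=42,
    verbose=100,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```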
I made predictions on the validation set and obtained a root mean squared error (RMSE) of 0.056:-
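The validation RMSE can be computed like this (taking the square root manually keeps it compatible across sklearn versions):

```python
val_preds = model.predict(X_val)
rmse = mean_squared_error(y_val, val_preds) ** 0.5
print(f'Validation RMSE: {rmse:.3f}')
```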
I used matplotlib to plot the actual values against the predictions:-
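For example, a scatter plot of actual versus predicted values, with a diagonal reference line added for orientation:

```python
plt.scatter(y_val, val_preds, s=5, alpha=0.3)
# Diagonal reference line: points on it are perfect predictions
lims = [y_val.min(), y_val.max()]
plt.plot(lims, lims, color='red')
plt.xlabel('actual accident_risk')
plt.ylabel('predicted accident_risk')
plt.show()
```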
I then predicted on the test set:-
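Since the test dataframe has been through the same preprocessing as X, prediction is a one-liner:

```python
test_preds = model.predict(test)
```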
I then prepared the predictions for submission to Kaggle, converting the dataframe into a csv file:-
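Assuming the sample submission expects ‘id’ and ‘accident_risk’ columns:

```python
submission = pd.DataFrame({'id': test_ids, 'accident_risk': test_preds})
submission.to_csv('submission.csv', index=False)
```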
When I submitted my work to Kaggle, I attained an RMSE of 0.056, which is not too bad:-
I decided that, since my first attempt yielded good results, I would not try to improve my score by employing different machine learning models.
I have created a code review of this model and it can be viewed here:- https://youtu.be/2fi6z6VT9vI
