
Using the CatBoost regressor to predict the probability of a road accident

4 min read · Oct 8, 2025

When I enter Kaggle’s monthly playground competitions I normally use models from Python’s machine learning library, sklearn. When I had a look at this month’s dataset, however, I noticed that quite a few columns held categorical data. I could have used sklearn’s ordinal encoder to encode all of the columns of dtype object, but decided to try CatBoost as the model because I had heard that it can train on categorical data directly.

CatBoost is a powerful, fast, and highly accurate machine learning algorithm developed by Yandex, designed specifically for handling categorical data with minimal preprocessing. CatBoost stands for categorical boosting. It is a gradient boosting algorithm that builds an ensemble of decision trees to make predictions.

The key features of CatBoost are:-

  1. It employs automatic categorical encoding
  2. It employs efficient gradient boosting
  3. It is robust to overfitting
  4. It is fast and scalable
  5. It is a great model to use for tabular data

I have created a Jupyter Notebook and saved the Python code in my Kaggle account.

The first thing I did was to import the libraries that I would need to execute the program, being:-

  • NumPy, which creates numpy arrays and distributions, and performs numerical computations,
  • pandas, which creates dataframes and series, and processes data,
  • os, to interact with the operating system,
  • sklearn, to provide machine learning functionality,
  • CatBoost, to train the model, and
  • Matplotlib and seaborn, to visualise the data.

I converted the columns of dtype boolean to integer in both the train and test dataframes:-

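A minimal sketch of this conversion, using a toy frame with hypothetical column names in place of the competition data:

```python
import pandas as pd

# Toy stand-ins for the competition's train/test frames (column names hypothetical)
train = pd.DataFrame({"speed_limit": [30, 50, 70], "school_season": [True, False, True]})
test = pd.DataFrame({"speed_limit": [40, 60], "school_season": [False, True]})

for df in (train, test):
    bool_cols = df.select_dtypes(include="bool").columns
    df[bool_cols] = df[bool_cols].astype(int)  # True -> 1, False -> 0
```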

I dropped the column, ‘id’, in both the train and test dataframes because pandas automatically indexes rows:-

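A sketch of that step on toy frames; keeping a copy of the test ids is an assumption, but they are needed later for the submission file:

```python
import pandas as pd

train = pd.DataFrame({"id": [0, 1], "accident_risk": [0.2, 0.7]})
test = pd.DataFrame({"id": [2, 3]})

# pandas maintains its own RangeIndex, so the competition's 'id' column is redundant
train = train.drop(columns="id")
test_ids = test["id"]          # retained for the submission file
test = test.drop(columns="id")
```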

I used matplotlib to create a histogram of the column, ‘accident_risk’, which is the target. The distribution is skewed to the right:-

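The histogram can be sketched as below; the target values here are synthetic, drawn from a right-skewed distribution purely to stand in for the real column:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
accident_risk = rng.beta(2, 5, size=1_000)  # synthetic right-skewed stand-in for the target

counts, bins, _ = plt.hist(accident_risk, bins=30, edgecolor="black")
plt.xlabel("accident_risk")
plt.ylabel("frequency")
plt.title("Distribution of accident_risk")
plt.savefig("accident_risk_hist.png")
```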

I defined the variable, cat_cols, to create a list of all of the features of dtype object:-

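A likely shape for that step, shown on a toy frame with hypothetical feature names:

```python
import pandas as pd

train = pd.DataFrame({
    "road_type": ["urban", "rural"],   # hypothetical categorical feature
    "lighting": ["day", "night"],      # hypothetical categorical feature
    "speed_limit": [30, 60],
})

# Collect every column whose dtype is object (i.e. string/categorical data)
cat_cols = train.select_dtypes(include="object").columns.tolist()
```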

I used pandas to one-hot encode all of the categorical columns of data:-

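One-hot encoding with pandas can be sketched as follows (toy data, hypothetical column names):

```python
import pandas as pd

train = pd.DataFrame({"lighting": ["day", "night", "day"], "speed_limit": [30, 60, 50]})
cat_cols = ["lighting"]

# pd.get_dummies expands each categorical column into one 0/1 column per level
train = pd.get_dummies(train, columns=cat_cols, dtype=int)
```

Worth noting: CatBoost can also consume raw categorical columns directly via its `cat_features` argument, so one-hot encoding is optional when using it.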

I then defined the dependent (y) and independent (X) variables:-

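That step, sketched on a toy frame:

```python
import pandas as pd

train = pd.DataFrame({"speed_limit": [30, 60, 50], "accident_risk": [0.2, 0.8, 0.5]})

X = train.drop(columns="accident_risk")  # independent variables (features)
y = train["accident_risk"]               # dependent variable (target)
```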

I used sklearn to split the dataset into training and validating sets:-

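The split would look something like this; the 80/20 ratio and random seed are assumptions, not necessarily the notebook's values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```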

I used catboost to define the model and train it:-


I made predictions on the validation set and obtained a root mean squared error (RMSE) of 0.056:-

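The RMSE computation can be sketched as below, using stand-in arrays in place of the real validation labels and model predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_val = np.array([0.20, 0.50, 0.80])       # stand-in validation targets
val_preds = np.array([0.25, 0.45, 0.85])   # stand-in model predictions

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_val, val_preds))
```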

I used matplotlib to plot the actual values against the predictions:-

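That plot can be sketched as below, with synthetic values standing in for the real validation targets and predictions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
y_val = rng.random(100)                          # stand-in actual values
val_preds = y_val + rng.normal(0, 0.05, 100)     # stand-in predictions

sc = plt.scatter(y_val, val_preds, alpha=0.5)
plt.plot([0, 1], [0, 1], color="red")  # perfect-prediction reference line
plt.xlabel("actual accident_risk")
plt.ylabel("predicted accident_risk")
plt.savefig("actual_vs_predicted.png")
```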

I then predicted on the test set:-


I then prepared the predictions for submission to Kaggle, converting the dataframe into a csv file:-

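Predicting on the test set and writing the submission file would look roughly like this; the ids and predictions below are stand-ins, and the column names follow the competition's `id` / `accident_risk` convention:

```python
import numpy as np
import pandas as pd

test_ids = pd.Series([517754, 517755, 517756], name="id")  # hypothetical test ids
test_preds = np.array([0.31, 0.12, 0.58])                  # stand-in model.predict(test) output

submission = pd.DataFrame({"id": test_ids, "accident_risk": test_preds})
submission.to_csv("submission.csv", index=False)
```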

When I submitted my work to Kaggle, I attained an RMSE of 0.056, which is not too bad:-


Since my first attempt yielded good results, I decided not to try to improve my score by employing different machine learning models.

I have created a code review of this model and it can be viewed here:- https://youtu.be/2fi6z6VT9vI

Written by Crystal X

I have over five decades experience in the world of work, being in fast food, the military, business, non-profits, and the healthcare sector.
