I have written several posts about sklearn’s datasets module, where the library provides various utilities that generate datasets for data scientists to experiment on. It can sometimes be difficult to find datasets to practice on, so in cases like this it can be a big time saver to use sklearn’s dataset utilities and make a dataset from scratch. My most recent post on creating datasets can be found here: https://medium.com/mlearning-ai/taking-the-mystery-out-of-sklearns-confusion-matrix-and-classification-report-2cc73dfebaa6

I have spent quite a bit of time writing posts on the creation of datasets, so I thought it would be a good idea to write…


I have recently taken Udacity’s free Introduction to Machine Learning course in an attempt to update and upgrade my current skill set. Upon completing the course, I decided to write a post about what I studied and the projects I undertook to learn the course material and, hopefully, advance in my profession:

Lesson 1

Lesson 1 of the course was the introductory lesson, which scoped out what I was expected to learn. The course was intended to last about four months, but I was in a hurry to finish it because I wanted to grasp…


When making predictions on data, it is important to evaluate the prediction’s metrics so that as many errors as possible can be identified and corrected. One evaluation metric that is quite straightforward is sklearn’s confusion matrix. The diagram below shows how a binary confusion matrix operates. Ideally, all of the positives would be true positives and all of the negatives would be true negatives, with no false positives and no false negatives:
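As a minimal sketch of how this works in code (the labels below are invented purely for illustration), sklearn’s confusion_matrix returns a 2×2 array for binary data, with the correct predictions on the diagonal:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and a classifier's predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Here the diagonal entries (3 true negatives, 3 true positives) are the correct predictions, while the off-diagonal entries (1 false positive, 1 false negative) are the errors we would like to drive to zero.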


In my last post I discussed how sklearn’s train_test_split is a helper function that splits the data into training and validation sets as part of the cross validation process, because the data being predicted on cannot be the same data that was previously trained on. After the data has been split, it must be put into a model for training and fitting, and this is where sklearn’s simplest cross validation tool can be used to iterate through the training a set number of times and produce a mean score and its standard deviation. The cross_val_score…
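A short sketch of how cross_val_score might be used — the iris dataset and the logistic regression model here are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 trains and scores the model five times on different folds
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())
print("std deviation:", scores.std())
```

Each element of `scores` is the accuracy on one held-out fold, and the mean and standard deviation summarise how stable the model is across splits.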


As I approach the final lessons of Udacity’s Introduction to Machine Learning course, I have reached the lesson where cross validation is discussed. I have to say that cross validation is one area where I am weak, so I am looking forward to studying this technique to hone my skills.

Simply put, cross validation is a statistical method used to estimate the skill of machine learning models. …
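To illustrate the idea behind the estimate, sklearn’s KFold splits the sample indices into k folds, so the model can be trained on k−1 folds and tested on the remaining one each round. A toy example on ten samples (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten dummy samples

# Five folds: each sample appears in exactly one test fold
kf = KFold(n_splits=5)
folds = list(kf.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging the score across the five test folds gives the cross-validated estimate of the model’s skill.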


The last several posts I have written have concerned how to reduce the features of a dataset in order to remove redundant or nonessential information, reduce noise, and improve the accuracy of predictions. A recent post I wrote regarding feature selection can be found here: https://medium.com/mlearning-ai/how-to-select-features-using-selectkbest-in-python-c5a5239969f0

One way to reduce the features of a dataset that is not strictly feature selection is principal component analysis, or PCA. PCA is a linear dimensionality reduction technique that uses Singular Value Decomposition of the data to project it to a lower-dimensional space. In linear algebra, SVD is a factorisation of a real or complex…
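As a minimal sketch of the projection in practice — the iris dataset and the choice of two components are arbitrary, for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the four original features onto two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```

On iris, the first two components capture well over 90% of the variance, which is why so little information is lost by dropping from four dimensions to two.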


I have recently been studying Udacity’s Introduction to Machine Learning course in an attempt to learn more about the skills involved in this technology. I have recently studied lesson 12 of the course, which covered the different techniques involved in feature selection. Feature selection is the technique of choosing the features in our data that contribute the most to the target variable. The advantages of feature selection are: a reduction in overfitting, a possible improvement in accuracy, and faster training time.

The most recent post I wrote dealt specifically with feature selection models in sklearn, where I discuss the SelectPercentile…
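A brief sketch of SelectPercentile in use — the iris dataset, the ANOVA F-statistic scoring function, and the 50th percentile are all arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)

# Keep the top 50% of features, ranked by the ANOVA F-statistic
selector = SelectPercentile(score_func=f_classif, percentile=50)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
```

The fitted selector’s `get_support()` method reveals which of the original features survived the cut.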


I was quite interested to learn that sklearn’s Decision Tree algorithm has several parameters that help to prevent overfitting. Among these parameters are min_samples_leaf and max_depth, which work together to prevent overfitting when data is trained. Cost complexity pruning provides another option to control the size of a tree. This pruning technique is parameterised by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned; as ccp_alpha increases, more of the tree is pruned, creating a decision tree that generalises better.

One way that ccp_alpha is used is in the process of…
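As an illustrative sketch — the breast cancer dataset and the value ccp_alpha=0.01 are arbitrary choices, not a recommendation — setting ccp_alpha on a DecisionTreeClassifier shrinks the fitted tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree versus one pruned with a small ccp_alpha
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

# Pruning removes nodes, so the pruned tree has at most as many nodes
print("unpruned nodes:", unpruned.tree_.node_count)
print("pruned nodes:  ", pruned.tree_.node_count)
```

For choosing the value systematically, the fitted estimator also exposes `cost_complexity_pruning_path`, which returns the effective alphas at which nodes would be pruned.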


One interesting thing that I have found is that the linear regression model Lasso can be used to select features when making predictions on a dataset. This is because Lasso puts a constraint on the sum of the absolute values of the model parameters: the sum has to be less than a fixed value (an upper bound). To achieve this, the method applies a shrinking (regularisation) process that penalises the coefficients of the regression variables, shrinking some of them to zero.

The regularisation process is controlled by the alpha parameter in the Lasso model. The…
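A minimal sketch of the shrinking effect, using synthetic data where only three of ten features are informative (the dataset and alpha=1.0 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: only 3 of the 10 features carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# The penalty shrinks the coefficients of uninformative features to zero,
# so the non-zero coefficients identify the selected features
print(lasso.coef_)
print("features selected:", int(np.sum(lasso.coef_ != 0)))
```

The features whose coefficients survive as non-zero are exactly the ones Lasso has selected; raising alpha strengthens the penalty and zeroes out more of them.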


In my previous post concerning the Universal Studios review dataset, I mentioned there were a few things I could try to improve the accuracy of the predictions, such as the synthetic minority oversampling technique, or SMOTE. I did try SMOTE, and it ran, but the accuracy on the validation set was very poor, so I abandoned it for this dataset.

The link to the most recent post I wrote concerning this dataset can be found here: https://tracyrenee61.medium.com/an-exploration-into-natural-language-processing-with-the-universal-studios-dataset-527644d42f4e

There is another technique, however, that is said to improve accuracy too. This technique is feature selection. The…
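As a small sketch of feature selection with SelectKBest — the wine dataset, the chi-squared scoring function, and k=5 are arbitrary illustrative choices, not the setup used on the Universal Studios data:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_wine(return_X_y=True)

# Keep the 5 features with the highest chi-squared scores
# (chi2 requires non-negative feature values, which wine satisfies)
selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)  # (178, 13) -> (178, 5)
```

Dropping the weakly scoring features in this way can reduce noise and overfitting, which is why it is worth trying as an accuracy improvement.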

Tracyrenee

I have over 46 years of experience in the world of work, spanning fast food, the military, business, non-profits, and the healthcare sector.
