When I start working on a dataset, one of the first things I do is check for missing values. Missing values need to be dealt with because many models will not run if the dataset contains any.
Missing values also matter because, depending on the type of missingness, they can bias the results of an analysis or prediction. Bias in data is an error that occurs when certain elements of a dataset are overweighted or overrepresented. Biased datasets don’t accurately represent a model’s use case, which leads to skewed outcomes, systematic prejudice, and low accuracy.
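As a minimal sketch of that first check, assuming the data is held in a pandas DataFrame (the toy frame and its column names below are hypothetical, made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Leeds", "York", None, "Bath"],
    "income": [32000, 41000, np.nan, np.nan],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one missing value.
rows_with_missing = df.isna().any(axis=1).mean()
print(f"{rows_with_missing:.0%} of rows have a missing value")
```

Looking at both the per-column counts and the share of affected rows gives a rough idea of whether deleting or imputing is the more sensible route.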
There are two broad ways to handle missing values:-
- Delete the missing values
- Impute the missing values
There are two ways to delete missing values in a dataset:-
- Delete the entire row
- Delete the entire column
The code in Python to delete an entire column of data is:-
If rows need to be deleted instead, then “axis=0, inplace=True” would be inserted inside the brackets of the function.
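The deletion options described above can be sketched as follows. This assumes the function in question is pandas' `DataFrame.dropna`, which is an assumption on my part; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in columns "a" and "c".
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 9.0],
})

# axis=1 drops every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# axis=0 (the default) drops every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# inplace=True modifies the frame itself and returns None,
# so it is applied to a copy here to keep the original intact.
df_copy = df.copy()
df_copy.dropna(axis=0, inplace=True)

print(cols_dropped)
print(rows_dropped)
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves the original untouched, which is often the safer choice while exploring.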
I personally prefer to just drop missing values, especially if they are in the training set. The reason is that one never knows what the missing value should have been in the first place.
There are some instances when missing data needs to be imputed, especially if it is in a test set. I have written the code below to illustrate some of the different ways that null values can be imputed in a dataset:-
Python’s machine learning library, sklearn, has facilities to impute missing values. Some of sklearn’s imputation facilities are:-
- SimpleImputer: missing values can be imputed with a provided constant value, or with the mean, median, or mode (most frequent value) of the column in which the missing value is located.
- IterativeImputer: this imputer models each feature of missing values as a function of other…
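As a rough sketch of the imputers listed above, the following applies `SimpleImputer` with each of its strategies, and `IterativeImputer` with default settings, to a small hypothetical array. The data and variable names are illustrative assumptions; note that `IterativeImputer` is still marked experimental in sklearn and needs an explicit enabling import.

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental, so it must be enabled before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: one gap in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 40.0],
    [1.0, np.nan],
    [4.0, 40.0],
])

# Column means are 2.0 and 30.0, so those values fill the gaps.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Column medians are 1.0 and 40.0.
median_imputed = SimpleImputer(strategy="median").fit_transform(X)

# The most frequent values (modes) are 1.0 and 40.0.
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Every gap is replaced with the provided constant.
constant_imputed = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Each feature with gaps is modelled from the other features;
# random_state is fixed only to make repeated runs comparable.
iterative_imputed = IterativeImputer(random_state=0).fit_transform(X)

print(mean_imputed)
print(iterative_imputed)
```

In practice the imputer would be fitted on the training set only and then applied to the test set with `transform`, so that no information leaks from the test data.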