An experiment on using p-values to select features in a house price dataset
In my previous post, I discussed how statsmodels can be used to perform hypothesis testing on a dataset. The post can be found here: https://tracyrenee61.medium.com/use-statsmodels-linear-regression-to-hypothesise-test-a-house-price-dataset-ba89bf4dad24
In this post I have conducted an experiment based upon the contents of my previous post. A p-value above 0.05 indicates that, at the conventional 5% significance level, the evidence is not strong enough to conclude that an effect exists in the population. With this in mind, it should in theory be acceptable to remove the features in a dataset that have a high p-value.
I decided to modify the code from the previous post to compare the accuracy of a model trained on the Boston House Prices dataset before and after the features that have a high p-value are removed.
I created the program using Google Colab, which is a free online Jupyter Notebook.
Once I had created the program, I imported the libraries I would need to execute it. I then loaded the Boston House Prices dataset, which is a toy dataset that is included in the sklearn library. Please note that the Boston House Prices dataset has been deprecated because of ethical considerations: