How to prune a decision tree to prevent overfitting in Python
I was quite interested to learn that sklearn’s decision tree algorithm exposes several parameters that help prevent overfitting. Two of these, min_samples_leaf and max_depth, work together to limit how far the tree grows while the data is being trained on. Cost complexity pruning provides another option to control the size of a tree. This pruning technique is parameterised by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned: as ccp_alpha increases, more of the tree is pruned, producing a decision tree that generalises better.
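As a minimal sketch of how these parameters fit together (the dataset and the parameter values here are illustrative assumptions, not the ones used later in this post), a tree fitted with a larger ccp_alpha ends up with fewer nodes than an unconstrained one:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset (not the dataset used later in this post)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pre-pruning parameters limit growth while the tree is being built
limited = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                 random_state=0).fit(X, y)

# ccp_alpha applies cost complexity pruning after the tree is grown;
# a larger value removes more nodes
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("limited nodes: ", limited.tree_.node_count)
print("unpruned nodes:", unpruned.tree_.node_count)
print("pruned nodes:  ", pruned.tree_.node_count)
```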
One way that ccp_alpha is used is in post pruning. Post pruning is a technique where we first grow a decision tree and then remove its insignificant branches. A decision tree grown to its full depth tends to overfit, while a tree that is too shallow risks underfitting; pruning is how we find a size between these extremes. The pruning technique behind ccp_alpha in sklearn is minimal cost complexity pruning, which removes the subtrees that contribute least to the tree’s performance relative to their size.
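To make the post pruning workflow concrete, here is a hedged sketch (again using an illustrative toy dataset rather than the one from this post) of how sklearn’s cost_complexity_pruning_path can enumerate candidate ccp_alpha values, after which one tree is fitted per candidate and the value that scores best on held-out data is kept:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset and train/test split
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees would be pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas = path.ccp_alphas

# Fit one tree per candidate alpha and keep the best on the test set
best_alpha, best_score = 0.0, 0.0
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha: {best_alpha:.4f}, test accuracy: {best_score:.3f}")
```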
This post will cover the technique of post pruning by utilising the parameter ccp_alpha. I created the dataset that would be used in this post by employing sklearn’s make_gaussian_quantiles to…
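The exact arguments used for make_gaussian_quantiles are not shown above, so the call below is only an assumed sketch of how such a dataset might be generated:

```python
from sklearn.datasets import make_gaussian_quantiles

# Assumed, illustrative parameters; the post's actual arguments may differ
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2,
                               n_classes=2, random_state=0)
print(X.shape, y.shape)  # (1000, 2) (1000,)
```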