Interview question — Predict on the Iris dataset using sklearn’s DecisionTreeClassifier
Because there is a great deal of competition in the data science field, it is a good idea to study as many practice interview questions as possible to discover what skills interviewers want to see.
One interview question that I have come across concerns the famous and easy-to-use Iris dataset:
Build a decision tree model on the iris dataset where the variable ‘variety’ is dependent and all other variables are independent. Find the accuracy of the model.
The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper, ‘The use of multiple measurements in taxonomic problems’, as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Fisher’s paper was published in the Annals of Eugenics (today the Annals of Human Genetics) and includes discussion of the contained techniques’ applications to the field of phrenology.
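The structure described above can be checked directly. The following is a minimal sketch that assumes scikit-learn's bundled copy of the data (`load_iris`) rather than a downloaded CSV file; the column name `variety` is my own renaming to match the interview question, since scikit-learn calls the label column `target` and stores the species as integer codes.

```python
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of Fisher's Iris data as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame

# Assumption: rename the label column to 'variety' to match the interview
# question, and replace the integer codes with the species names.
df = df.rename(columns={"target": "variety"})
df["variety"] = df["variety"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # number of rows and columns
print(df["variety"].value_counts())  # number of samples per species
```

If the bundled copy matches the description, this should report 150 rows (50 per species) with four measurement columns plus the label.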
I created the program in Python using Google Colab, a free Jupyter Notebook environment hosted by Google. Google Colab is a fantastic free resource for coding in Python; the only drawback I have found is that it does not have an undo function, so care needs to be taken not to inadvertently overwrite or delete valuable code.
Once the notebook was created, I imported the libraries that I would need to run the program:
- NumPy to create arrays and perform numeric computations,
- Pandas to create dataframes and process data,
- Scikit-learn (sklearn) to provide machine learning functionality,
- Matplotlib to visualise the data, and
- Seaborn to create statistical plots of the data.
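As a rough illustration, the imports listed above might look like the following. The aliases (`np`, `pd`, `plt`, `sns`) are the conventional ones, and the particular scikit-learn pieces shown are my assumption about what a decision tree exercise of this kind typically needs, not a record of the exact notebook cell.

```python
import numpy as np                 # arrays and numeric computation
import pandas as pd                # dataframes and data processing
import matplotlib.pyplot as plt    # general plotting
import seaborn as sns              # statistical plots built on Matplotlib

# scikit-learn is usually imported piece by piece rather than as a whole.
# Assumption: these three are the parts a decision tree exercise needs.
from sklearn.model_selection import train_test_split  # split into train/test sets
from sklearn.tree import DecisionTreeClassifier        # the model itself
from sklearn.metrics import accuracy_score             # fraction of correct predictions
```

All five libraries come preinstalled in Google Colab, so no `pip install` step should be needed there.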