Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. We’re going to try to predict the species of penguins based on their body measurements.
Firstly we’ll import the .csv file and take a look at the data.
In [129]:
import pandas as pd

df = pd.read_csv(r'C:\Users\hanna\Downloads\penguins.csv')
df.head(5)
Out[129]:
| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
df.info() and df.describe() are simple and effective ways of checking what data is available.
In [130]:
print(df.info())
print(df.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   culmen_length_mm   342 non-null    float64
 3   culmen_depth_mm    342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                334 non-null    object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
       culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g
count        342.000000       342.000000         342.000000   342.000000
mean          43.921930        17.151170         200.915205  4201.754386
std            5.459584         1.974793          14.061714   801.954536
min           32.100000        13.100000         172.000000  2700.000000
25%           39.225000        15.600000         190.000000  3550.000000
50%           44.450000        17.300000         197.000000  4050.000000
75%           48.500000        18.700000         213.000000  4750.000000
max           59.600000        21.500000         231.000000  6300.000000
Thankfully, I already know this dataset, so I'm going to move straight to dropping the rows with null values, as we won't need those.
In [131]:
df.dropna(inplace=True)
df.isna().sum()
Out[131]:
species              0
island               0
culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
Seaborn is a fantastic library for visualising datasets.
In [132]:
import seaborn as sns

sns.boxplot(x="culmen_length_mm", data=df, color="orange");
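As a quick optional sketch, seaborn's pairplot gives an at-a-glance view of how well the four measurements separate the three species when coloured by the target column:

# Scatter plots of every pair of numeric columns, coloured by species,
# show how separable the three species are on these measurements.
sns.pairplot(df, hue="species");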
Since species is what we want to predict, I want to know how many of each species we have in the dataset.
In [133]:
df.species.value_counts()
Out[133]:
Adelie       146
Gentoo       120
Chinstrap     68
Name: species, dtype: int64
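If you'd rather see those counts as a picture, seaborn's standard countplot does the job in one line:

# Bar chart of the number of penguins per species
sns.countplot(x="species", data=df);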
.corr() is a great way to instantly see the correlation between the numerical columns. A value of 1 means two columns are perfectly positively correlated; anything negative means they are negatively correlated.
In [134]:
# Note: newer pandas versions may need df.corr(numeric_only=True)
# to skip the non-numeric columns automatically.
corr = df.corr()
corr
Out[134]:
| | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| culmen_length_mm | 1.000000 | -0.228640 | 0.652126 | 0.589066 |
| culmen_depth_mm | -0.228640 | 1.000000 | -0.578730 | -0.472987 |
| flipper_length_mm | 0.652126 | -0.578730 | 1.000000 | 0.873211 |
| body_mass_g | 0.589066 | -0.472987 | 0.873211 | 1.000000 |
If you only have a few columns to view then the seaborn heatmap is a great visual tool for correlation:
In [135]:
sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(220, 20, as_cmap=True),
    square=True,
    annot=True
);
Next we split out the features we'll use for prediction. We want the feature names to be the predictors of the species, and the class names in the tree to be the species names. In this case we're going to use all of the numerical columns.
In [136]:
feature_names = [
    'culmen_length_mm', 'culmen_depth_mm',
    'flipper_length_mm', 'body_mass_g'
]
X = df[feature_names]
y = df['species']
class_names = df['species'].unique()
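A quick sanity check never hurts at this point (the exact row count depends on how many rows dropna() removed; here it should be 334):

# Features should be (rows, 4) and the target a matching 1-D Series
print(X.shape, y.shape)   # e.g. (334, 4) (334,)
print(class_names)        # ['Adelie' 'Chinstrap' 'Gentoo']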
Training the Model
Now it's time to train the model: We import the DecisionTreeClassifier. We require each leaf to contain at least 10 samples (min_samples_leaf=10) and cap the tree at a max depth of 2. (Later on we can tweak min_samples_leaf and max_depth if we want to change the tree size to better suit the dataset.) We are going to fit the X and y as defined above.
In [137]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_leaf=10,
    max_depth=2
)
clf.fit(X, y)
Out[137]:
DecisionTreeClassifier(max_depth=2, min_samples_leaf=10)
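If you want to confirm what those settings actually produced, the fitted classifier exposes its depth and leaf count:

# Depth and number of leaves of the fitted tree
print(clf.get_depth(), clf.get_n_leaves())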
Now we can import the tree module and plot a little tree below!
In [138]:
from sklearn import tree

tree.plot_tree(clf);
The Gini index (or Gini impurity) measures the probability that a randomly chosen sample in a node would be misclassified if it were labelled at random according to the class distribution in that node (a small sketch of the calculation follows these notes).
The samples figure is the number of penguins that make up each node.
The value list shows how many samples in that node fall into each category (Adelie, Chinstrap, Gentoo).
The true values go to the left (as you look at the tree) and the false values go to the right. You can see this more easily below, where the Adelie node that begins the tree grows down to the left. The false values from the first node go to the right, and that's where we can see the Gentoo node. We don't see Chinstrap until the next layer down.
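Here is the Gini calculation as a minimal sketch, using Gini = 1 − Σ pᵢ² with the class totals from value_counts() above (the gini helper is just illustrative, not part of scikit-learn):

import numpy as np

def gini(counts):
    """Gini impurity of a node, given the per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()              # class proportions within the node
    return 1.0 - np.sum(p ** 2)  # 1 - sum of squared proportions

# Root node: 146 Adelie, 68 Chinstrap, 120 Gentoo penguins
print(gini([146, 68, 120]))  # ~0.638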
The bare tree is pretty good, but this is where matplotlib comes in handy for visualising things… With filled=True, each node is coloured by its majority class, and the deeper the colour, the purer the node. These colours look pretty good to me.
In [139]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(
    nrows=1, ncols=1, figsize=(3, 4), dpi=300
)
tree.plot_tree(
    clf,
    feature_names=feature_names,
    class_names=class_names,
    filled=True,
    fontsize=4
);
Now we can make our own predictions if we wish. We can feed our own measurements into the model and see what it comes out with. The measurements must be given in the order: 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'.
The model prediction is that these measurements are for a ‘Gentoo’ penguin:
In [140]:
clf.predict([[40, 16, 220, 3500]])
Out[140]:
array(['Gentoo'], dtype=object)
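One hedged aside: because the model was fitted on a DataFrame, newer scikit-learn versions may warn that a plain list "does not have valid feature names". Wrapping the measurements in a DataFrame with the same columns avoids that (new_penguin is just an illustrative name):

# Using the same column names the model was trained with
new_penguin = pd.DataFrame(
    [[40, 16, 220, 3500]], columns=feature_names
)
clf.predict(new_penguin)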
Now it’s time to evaluate the model:
We'll take a sample of 10 rows. sample_X holds the feature columns we defined earlier; sample_y holds the species names. Let's have a look at them…
In [141]:
sample = df.sample(10)
sample_X = sample[feature_names]
sample_y = sample['species']
sample_X
Out[141]:
| | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 257 | 44.4 | 17.3 | 219.0 | 5250.0 |
| 260 | 42.7 | 13.7 | 208.0 | 3950.0 |
| 34 | 36.4 | 17.0 | 195.0 | 3325.0 |
| 102 | 37.7 | 16.0 | 183.0 | 3075.0 |
| 157 | 45.2 | 17.8 | 198.0 | 3950.0 |
| 221 | 50.0 | 16.3 | 230.0 | 5700.0 |
| 150 | 36.0 | 17.1 | 187.0 | 3700.0 |
| 172 | 42.4 | 17.3 | 181.0 | 3600.0 |
| 105 | 39.7 | 18.9 | 184.0 | 3550.0 |
| 95 | 40.8 | 18.9 | 208.0 | 4300.0 |
In [142]:
sample_y
Out[142]:
257       Gentoo
260       Gentoo
34        Adelie
102       Adelie
157    Chinstrap
221       Gentoo
150       Adelie
172    Chinstrap
105       Adelie
95        Adelie
Name: species, dtype: object
Now we'll predict the penguin species from the measurements we stored in sample_X, then compare the predictions against the known species names for those rows (sample_y) to see if the model gets them right.
In [143]:
predictions = clf.predict(sample_X)
predictions
Out[143]:
array(['Chinstrap', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap', 'Gentoo',
       'Adelie', 'Adelie', 'Adelie', 'Chinstrap'], dtype=object)
It's only a small sample, but we can check element-wise how many of the model's predictions match the known species names of the penguins.
In [144]:
predictions == sample_y
Out[144]:
257    False
260     True
34      True
102     True
157     True
221     True
150     True
172    False
105     True
95     False
Name: species, dtype: bool
We can see above that three of the ten predictions were incorrect.
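To turn that into a single number, averaging the boolean comparison gives the fraction of correct predictions:

# True counts as 1 and False as 0, so the mean is the accuracy: 7/10 here
(predictions == sample_y).mean()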
train_test_split
Now it's time to split the data so we can test the decision tree on rows it has never seen.
We’ll train the model on 80% and test it on 20% of the data.
In [145]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_names], df['species'], test_size=0.2
)
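As an optional aside, train_test_split also accepts random_state (for a reproducible split) and stratify (to keep the species proportions the same in both halves). The outputs below were produced with the simple split above, but a sketch of the variant would be:

# Reproducible split that preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_names], df['species'],
    test_size=0.2, stratify=df['species'], random_state=1
)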
In [146]:
model = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=12, random_state=1
)
model.fit(X_train, y_train)
Out[146]:
DecisionTreeClassifier(max_depth=2, min_samples_leaf=12, random_state=1)
In [147]:
ground_truth = y_test
predictions = model.predict(X_test)
predictions == ground_truth
Out[147]:
310     True
336     True
248     True
57      True
305    False
       ...
174     True
1       True
110     True
79      True
90      True
Name: species, Length: 67, dtype: bool
Time to score the model. For a classifier, model.score returns the mean accuracy on the test set, as a value between 0 and 1.
In [148]:
model.score(X_test, y_test)
Out[148]:
0.835820895522388
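Since score is just accuracy here, the same number can be reproduced with accuracy_score:

from sklearn.metrics import accuracy_score

# Same value as model.score(X_test, y_test)
accuracy_score(y_test, model.predict(X_test))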
Now we can look at the classification report for each of the penguin species (again, every metric has a maximum value of 1). Judging by the f1-scores below, the model works best with Adelie penguins.
In [149]:
from sklearn.metrics import classification_report

print(classification_report(
    y_test, predictions, labels=class_names
))
              precision    recall  f1-score   support

      Adelie       0.85      0.92      0.88        24
   Chinstrap       0.65      0.93      0.76        14
      Gentoo       1.00      0.72      0.84        29

    accuracy                           0.84        67
   macro avg       0.83      0.86      0.83        67
weighted avg       0.87      0.84      0.84        67
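To unpack those numbers with one worked example: precision is correct predictions divided by all predictions of that class, and recall is correct predictions divided by all actual members of that class. For Gentoo, every penguin the model labelled Gentoo was right (precision 21/21 = 1.00), but it only found 21 of the 29 actual Gentoos (recall 21/29 ≈ 0.72); the 21 comes from the confusion matrix below.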
Finally, we’ll take a look at the confusion matrix…Thankfully, it’s not as confusing as it may seem.
In [150]:
from sklearn.metrics import confusion_matrix

confusion_matrix(
    y_test, predictions, labels=class_names
)
Out[150]:
array([[22,  2,  0],
       [ 1, 13,  0],
       [ 3,  5, 21]], dtype=int64)
Reading the array left to right and top to bottom, the order is Adelie, Chinstrap and Gentoo: each row is the true species and each column the predicted species. The diagonal holds the correct predictions: 22 Adelie, 13 Chinstrap and 21 Gentoo. Notice the Gentoo column: every penguin the model labelled Gentoo really was one (hence the precision of 1.00 above), although 8 true Gentoos were mislabelled.
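If you'd like that matrix as a labelled plot, scikit-learn (1.0 and later) ships a display helper; a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay

# Draws the matrix above as a labelled heatmap
ConfusionMatrixDisplay.from_predictions(
    y_test, predictions, labels=class_names
);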
If you want to know more about scikit-learn decision trees and what else you can do with them, visit: https://scikit-learn.org/stable/modules/tree.html
To find this dataset visit: https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris