
< Experiment Decision Tree Classifier on Iris dataset with prepruning | Contents | Genetic Algorithm on Knapsack Problem >

Experiment: Decision Tree Classifier on iris dataset without pruning

🌳 Quick Summary: Unpruned Decision Tree (Iris Dataset)

Goal: Build a tree that perfectly fits the training data by growing until all leaves are pure or until no further splits are possible.


🧠 Core Idea:

  • Grow the tree fully, without any restrictions on depth, leaf size, or impurity decrease.
  • Captures all patterns—including noise—leading to possible overfitting.

Note: Unpruned trees are rarely used in practice, but they are useful for understanding how decision trees behave. In general, pre-pruning avoids overfitting by stopping growth early (fast, but it risks missing patterns), while post-pruning lets the tree grow fully and trims it afterwards (slower, but it usually finds a better tree).


🧮 Configuration (No Pruning Applied):

  • max_depth=None
  • min_samples_split=2
  • min_samples_leaf=1
  • ccp_alpha=0.0 (no cost complexity pruning)
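
As a minimal sketch of what this configuration does, assuming scikit-learn's bundled iris data and an illustrative 80/20 split (not the saved `iris-20test-shuffled-v1` split used in this notebook), an unpruned tree keeps splitting until every leaf is pure, so it scores perfectly on the data it was trained on:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative split only; the notebook loads its own saved train/test split.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The four "no pruning" settings listed above (they are also the defaults).
unpruned = DecisionTreeClassifier(
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    ccp_alpha=0.0,
    random_state=0,
).fit(X_tr, y_tr)

train_acc = unpruned.score(X_tr, y_tr)
test_acc = unpruned.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"depth: {unpruned.get_depth()}, leaves: {unpruned.get_n_leaves()}")
```

The gap between the two printed accuracies is the quantity to watch when judging overfitting; the exact test value depends on the split.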

🔧 Implications:

| Aspect | Expectation | Notes |
|---|---|---|
| Accuracy | Very High (train), Medium-High (test) | Will overfit due to excessive depth (high variance) |
| Overfitting | Likely | Tree memorizes training data. |
| Interpretability | Poor | Very deep trees are hard to follow. |
| Speed | Fast on Iris | But deeper trees may slow down large datasets. |

🔑 Characteristics:

  • Overfits small/noisy datasets.
  • Good for exploratory analysis.
  • Not ideal for generalization without pruning.
In [1]:
%reload_ext autoreload
%autoreload 2
In [2]:
# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# experiment helper imports
from helpers.base_imports import *

Setup experiment with data and model

In [3]:
DATASET_NAME = "iris-20test-shuffled-v1"
exp = Experiment(
    type="c",  # classification
    name="dtc-iris-unpruned",
    dataset=DATASET_NAME,
)
exp
Loading 'classification-experiments.csv'
Loading 'dtc-iris-unpruned' experiment
Loading 'dtc-iris-unpruned' estimator/model/pipeline
Out[3]:
Experiment(c, dtc-iris-unpruned, iris-20test-shuffled-v1)
In [4]:
# add the steps to the pipeline
steps = [
    # NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
    ("scaler", StandardScaler()),
    (
        "classifier",
        DecisionTreeClassifier(
            criterion="entropy",  # gini tends to be faster but similar performance
            splitter="best",  # best split or random
            max_depth=None,  # no max depth (so will likely overfit)
            min_samples_split=2,  # require at least 2 samples to split a node
            min_samples_leaf=1,  # require at least 1 sample in each leaf
            min_weight_fraction_leaf=0.0,
            max_features=None,  # consider all features when looking for the best split
            random_state=RANDOM_SEED,
            max_leaf_nodes=None,  # None means no limit on the number of leaf nodes
            min_impurity_decrease=0.0,  # node will split if decrease in impurity is at least this much
            class_weight=None,  # all classes are weighted/treated equally
            ccp_alpha=0.0,  # complexity parameter for minimal cost-complexity pruning, 0 means no pruning
        ),
    ),
]
exp.estimator = Pipeline(
    steps=steps,
    memory=CACHE_DIR,
)
In [5]:
exp.estimator.get_params()
Out[5]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))],
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}
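
The pipeline above sets `criterion="entropy"`. As a rough sketch of what that criterion measures, the snippet below computes entropy, Gini impurity, and the information gain of one split in plain Python. The class counts and the split are hypothetical numbers chosen for illustration, not values taken from this experiment.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical node with three equally frequent classes (illustrative numbers).
parent = [40, 40, 40]
# Hypothetical split that isolates the first class completely.
left, right = [40, 0, 0], [0, 40, 40]

n = sum(parent)
weighted_child_entropy = (
    sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
)
info_gain = entropy(parent) - weighted_child_entropy

print(f"parent entropy: {entropy(parent):.3f} bits")
print(f"parent gini:    {gini(parent):.3f}")
print(f"information gain of the split: {info_gain:.3f} bits")
```

With `splitter="best"`, the tree picks at each node the candidate split with the largest impurity decrease, and with no pruning it repeats this until every leaf has zero impurity.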

Get the dataset by name. EDA was already done in another notebook and the train/test split was saved, so we always work with the same data.

In [6]:
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Dataset: iris-20test-shuffled-v1
X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Train: 80.00% of total
Test: 20.00% of total
Notes: None
Created by save_dataset() helper at 2024-07-09 12:28:10

  target_names
0       setosa
1   versicolor
2    virginica
Out[6]:
((120, 4), (30, 4), (120, 1), (30, 1))
In [7]:
# inspect data
disp_df(pd.concat([X_train, y_train], axis=1))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 6.1 3.0 4.6 1.4 1
1 7.7 3.0 6.1 2.3 2
2 5.6 2.5 3.9 1.1 1
3 6.4 2.8 5.6 2.1 2
4 5.8 2.8 5.1 2.4 2
5 5.3 3.7 1.5 0.2 0
6 5.5 2.3 4.0 1.3 1
7 5.2 3.4 1.4 0.2 0
8 6.5 2.8 4.6 1.5 1
9 6.7 2.5 5.8 1.8 2
10 6.8 3.0 5.5 2.1 2
11 5.1 3.5 1.4 0.3 0
12 6.0 2.2 5.0 1.5 2
13 6.3 2.9 5.6 1.8 2
14 6.6 2.9 4.6 1.3 1
15 7.7 2.6 6.9 2.3 2
16 5.7 3.8 1.7 0.3 0
17 5.0 3.6 1.4 0.2 0
18 4.8 3.0 1.4 0.3 0
19 5.2 2.7 3.9 1.4 1
20 5.1 3.4 1.5 0.2 0
21 5.5 3.5 1.3 0.2 0
22 7.7 3.8 6.7 2.2 2
23 6.9 3.1 5.4 2.1 2
24 7.3 2.9 6.3 1.8 2
25 6.4 2.8 5.6 2.2 2
26 6.2 2.8 4.8 1.8 2
27 6.0 3.4 4.5 1.6 1
28 7.7 2.8 6.7 2.0 2
29 5.7 3.0 4.2 1.2 1
30 4.8 3.4 1.6 0.2 0
31 5.7 2.5 5.0 2.0 2
32 6.3 2.7 4.9 1.8 2
33 4.8 3.0 1.4 0.1 0
34 4.7 3.2 1.3 0.2 0
35 6.5 3.0 5.8 2.2 2
36 4.6 3.4 1.4 0.3 0
37 6.1 3.0 4.9 1.8 2
38 6.5 3.2 5.1 2.0 2
39 6.7 3.1 4.4 1.4 1
40 5.7 2.8 4.5 1.3 1
41 6.7 3.3 5.7 2.5 2
42 6.0 3.0 4.8 1.8 2
43 5.1 3.8 1.6 0.2 0
44 6.0 2.2 4.0 1.0 1
45 6.4 2.9 4.3 1.3 1
46 6.5 3.0 5.5 1.8 2
47 5.0 2.3 3.3 1.0 1
48 6.3 3.3 6.0 2.5 2
49 5.5 2.5 4.0 1.3 1
50 5.4 3.7 1.5 0.2 0
51 4.9 3.1 1.5 0.2 0
52 5.2 4.1 1.5 0.1 0
53 6.7 3.3 5.7 2.1 2
54 4.4 3.0 1.3 0.2 0
55 6.0 2.7 5.1 1.6 1
56 6.4 2.7 5.3 1.9 2
57 5.9 3.0 5.1 1.8 2
58 5.2 3.5 1.5 0.2 0
59 5.1 3.3 1.7 0.5 0
60 5.8 2.7 4.1 1.0 1
61 4.9 3.1 1.5 0.1 0
62 7.4 2.8 6.1 1.9 2
63 6.2 2.9 4.3 1.3 1
64 7.6 3.0 6.6 2.1 2
65 6.7 3.0 5.2 2.3 2
66 6.3 2.3 4.4 1.3 1
67 6.2 3.4 5.4 2.3 2
68 7.2 3.6 6.1 2.5 2
69 5.6 2.9 3.6 1.3 1
70 5.7 4.4 1.5 0.4 0
71 5.8 2.7 3.9 1.2 1
72 4.5 2.3 1.3 0.3 0
73 5.5 2.4 3.8 1.1 1
74 6.9 3.1 4.9 1.5 1
75 5.0 3.4 1.6 0.4 0
76 6.8 2.8 4.8 1.4 1
77 5.0 3.5 1.6 0.6 0
78 4.8 3.4 1.9 0.2 0
79 6.3 3.4 5.6 2.4 2
80 5.6 2.8 4.9 2.0 2
81 6.8 3.2 5.9 2.3 2
82 5.0 3.3 1.4 0.2 0
83 5.1 3.7 1.5 0.4 0
84 5.9 3.2 4.8 1.8 1
85 4.6 3.1 1.5 0.2 0
86 5.8 2.7 5.1 1.9 2
87 4.8 3.1 1.6 0.2 0
88 6.5 3.0 5.2 2.0 2
89 4.9 2.5 4.5 1.7 2
90 4.6 3.2 1.4 0.2 0
91 6.4 3.2 5.3 2.3 2
92 4.3 3.0 1.1 0.1 0
93 5.6 3.0 4.1 1.3 1
94 4.4 2.9 1.4 0.2 0
95 5.5 2.4 3.7 1.0 1
96 5.0 2.0 3.5 1.0 1
97 5.1 3.5 1.4 0.2 0
98 4.9 3.0 1.4 0.2 0
99 4.9 2.4 3.3 1.0 1
100 4.6 3.6 1.0 0.2 0
101 5.9 3.0 4.2 1.5 1
102 6.1 2.9 4.7 1.4 1
103 5.0 3.4 1.5 0.2 0
104 6.7 3.1 4.7 1.5 1
105 5.7 2.9 4.2 1.3 1
106 6.2 2.2 4.5 1.5 1
107 7.0 3.2 4.7 1.4 1
108 5.8 2.7 5.1 1.9 2
109 5.4 3.4 1.7 0.2 0
110 5.0 3.0 1.6 0.2 0
111 6.1 2.6 5.6 1.4 2
112 6.1 2.8 4.0 1.3 1
113 7.2 3.0 5.8 1.6 2
114 5.7 2.6 3.5 1.0 1
115 6.3 2.8 5.1 1.5 2
116 6.4 3.1 5.5 1.8 2
117 6.3 2.5 4.9 1.5 1
118 6.7 3.1 5.6 2.4 2
119 4.9 3.6 1.4 0.1 0

Double check the data statistics and features from the eda

In [8]:
plt.imshow(plt.imread(f"figs/{DATASET_NAME}_feature-statistics-X_train.png"))
plt.axis("off")
plt.show()

We have some outliers for sepal width.

Decision Trees are generally considered robust to outliers so we will leave them.

In [9]:
plt.imshow(plt.imread(f"figs/{DATASET_NAME}_target-class-distribution-y_train.png"))
plt.axis("off")
plt.show()

Classes in the train data are roughly balanced so we'll continue without any class balancing.

In [10]:
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
Out[10]:
| exp_name | dataset_name | n_train_samples | n_test_samples | mean_accuracy | split_criterion | train_time | query_time | kfolds | confusion_matrix | classification_report | tree_depth | n_leaves | n_tree_nodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtc-iris-unpruned | iris-20test-shuffled-v1 | 120 | 30 | 1.0 | entropy | 0 days 00:00:00.005142 | 0 days 00:00:00.000668 | Stratified 3-Fold Cross-Validation | [[11 0 0]\n [ 0 13 0]\n [ 0 0 6]] | {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... | 5 | 9 | NaN |

So, we will use cross validation to evaluate the model so that we get a more generalizable result.

But how many folds?

Find best cross-validation n_folds and type

NOTE: Learning curve already lets you see this!!! Don't need this extra step...just doing it for my own learning

How many folds should we use in our cross-validation? And what type of cross-validation should we use?

  • Stratified K-Fold CV since we'd like to keep the class distribution in each fold similar to the original dataset

  • Let's do a hyperparameter search for the number of folds

Recall, choice of k depends on computational resources, bias, variance, and the size of the dataset

  • higher k
    • each model is trained on a larger portion of the dataset (e.g. k=10 => 90%) => lower bias, higher variance
    • more folds => higher computational cost (more models to train and evaluate)
    • more "stable" estimates because the model is validated on more unique splits
  • lower k
    • each model is trained on a smaller portion of the dataset (e.g. k=5 => 80%) => higher bias, lower variance
    • fewer folds => lower computational cost (fewer models to train and evaluate)

SO....

  • with smaller datasets like iris, we can afford to use higher k

Note: if k = number of samples, then we have leave-one-out CV => low bias, high variance, high computational cost
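
The train/validation proportions quoted above follow directly from the fold sizes. A small sketch, assuming the 120-sample training set of this notebook and folds that are as even as possible (`fold_sizes` is a helper written for this illustration, not part of the notebook's helpers):

```python
def fold_sizes(n_samples, k):
    """Validation-fold sizes for k-fold CV, split as evenly as possible."""
    base, extra = divmod(n_samples, k)
    # The first `extra` folds get one additional sample.
    return [base + 1 if i < extra else base for i in range(k)]


n_samples = 120  # size of the training set in this notebook
for k in (2, 3, 5, 10, 120):
    sizes = fold_sizes(n_samples, k)
    val = sizes[0]
    train = n_samples - val
    print(f"k={k:>3}: train on {train} ({train / n_samples:.1%}), validate on {val}")
```

The last row (k=120) is leave-one-out: every model trains on 119 samples and is scored on a single one.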

NOTE TO SELF: I CAN DO THIS WITH A LINE PLOT AND FILL BETWEEN FOR THE STD DEV

In [11]:
# Convert y_train to a 1-dimensional array
y_train_array = y_train.values.ravel()

# cross-validation hyperparameter search (cv-accuracy on Y axis and k on X axis)
# number of samples in the smallest class: with more folds than this, StratifiedKFold
# cannot put that class in every fold, so it is the upper limit for k
min_class_samples = min(np.bincount(y_train_array))

k_range = range(2, min_class_samples + 1)
k_scores = []
print(f"Range of k: {k_range}")

for k in k_range:  # for each value of k, run k-fold cross-validation
    scores = cross_val_score(
        exp.estimator,
        X_train,
        y_train_array,
        cv=StratifiedKFold(n_splits=k),
        scoring="accuracy",
    )
    k_scores.append(scores)

# Plotting the box plots
plt.figure(figsize=(12, 8))
plt.boxplot(k_scores, positions=k_range, widths=0.6)
plt.title("K-Fold Cross-Validation Hyperparameter Search")
plt.xlabel("Value of K")
plt.ylabel("Cross-Validated Accuracy")

# makes it easier to read the plot
plt.grid(axis="x", linestyle="--", color="gray", alpha=0.7)

plt.savefig(f"{FIGS_DIR}/{exp.name}-kfold-cv.png")
plt.show()
Range of k: range(2, 38)

So,

  • with 2 folds: in the first iteration the model trains on 60 samples and is validated on the other 60. In the second iteration the two halves are swapped, and the two scores are averaged for the final accuracy.

Q: Why is CV accuracy lower with 12 folds? Maybe just how the samples divide up since not perfectly even?

Q: So with 10 folds there are 12 samples (120/10=12) in each fold so 108 train (9x12) and 12 test each of the 10 iterations? Yes.

In [12]:
# per-fold scores: one column per value of k, one row per fold (first 15 folds shown)
disp_df(pd.DataFrame(k_scores, index=k_range).T.head(15))
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
0 0.900000 0.975 0.966667 0.958333 0.95 0.944444 0.933333 1.000000 1.000000 1.000000 1.0 1.000000 1.000000 1.000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.000000 1.000000
1 0.916667 0.950 0.833333 0.958333 1.00 1.000000 1.000000 0.928571 0.916667 0.909091 0.9 0.900000 0.888889 0.875 1.000000 1.000000 1.000000 0.857143 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.000000 1.000000
2 NaN 0.875 0.933333 0.875000 0.95 0.941176 0.933333 1.000000 1.000000 1.000000 1.0 1.000000 1.000000 1.000 0.875000 0.857143 0.857143 1.000000 0.833333 0.833333 0.833333 0.833333 0.8 0.8 0.8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.000000 0.750000
3 NaN NaN 0.866667 0.958333 0.95 0.941176 0.866667 0.923077 0.916667 0.909091 1.0 1.000000 1.000000 1.000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.75 0.750000 1.000000
4 NaN NaN NaN 0.833333 0.95 0.941176 0.933333 1.000000 0.833333 1.000000 0.9 0.888889 0.888889 0.875 0.875000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 0.75 0.75 1.00 1.000000 1.000000
5 NaN NaN NaN NaN 0.85 0.941176 1.000000 0.923077 0.916667 1.000000 0.9 1.000000 1.000000 1.000 1.000000 0.857143 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 0.80 0.80 0.75 0.75 0.75 0.75 1.00 1.00 1.00 1.000000 1.000000
6 NaN NaN NaN NaN NaN 0.823529 0.933333 1.000000 1.000000 0.909091 1.0 1.000000 1.000000 1.000 1.000000 1.000000 0.857143 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.000000 1.000000
7 NaN NaN NaN NaN NaN NaN 0.800000 0.923077 1.000000 1.000000 0.9 0.888889 0.888889 1.000 1.000000 0.857143 1.000000 0.833333 0.833333 0.833333 0.833333 0.800000 0.8 0.8 0.8 0.80 0.80 0.75 0.75 1.00 1.00 1.00 1.00 1.00 1.000000 1.000000
8 NaN NaN NaN NaN NaN NaN NaN 0.769231 0.916667 1.000000 1.0 1.000000 1.000000 0.875 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.000000 1.000000
9 NaN NaN NaN NaN NaN NaN NaN NaN 0.750000 0.909091 0.9 1.000000 1.000000 1.000 0.857143 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.000000 1.000000
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.900000 0.9 1.000000 1.000000 1.000 1.000000 0.857143 1.000000 1.000000 1.000000 0.833333 0.800000 1.000000 1.0 1.0 1.0 1.00 1.00 1.00 1.00 1.00 1.00 0.75 1.00 1.00 1.000000 1.000000
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.888889 0.875000 1.000 1.000000 1.000000 0.857143 0.833333 0.833333 0.833333 0.800000 0.800000 0.8 0.8 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.75 0.75 1.000000 1.000000
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.888889 0.875000 0.875 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 0.8 1.00 1.00 1.00 1.00 1.00 0.75 0.75 1.00 0.75 1.000000 1.000000
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000000 0.875 0.857143 0.857143 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 1.0 0.75 1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 0.666667 1.000000
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 0.857143 1.000000 0.833333 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.0 0.8 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.000000 0.666667

Let's interpret this

E.g. with k=5 we train 5 models, each on 80% of the data, and validate each on the remaining 20% to get its accuracy

  • training on 96 samples (120/5 = 24 samples per fold * 4 folds)
  • validating on 24 samples

  • high values of k (20+) reach high accuracy because each model is trained on almost all of the training data, but the spread of the fold scores steadily increases because each validation fold holds only a few samples

  • lower values of k (5 and below) give lower accuracy because each model is trained on less of the training data, but the spread of the fold scores is also lower because each validation fold is larger

In our case let's go with k=3 since

  • smaller variance and still relatively high accuracy

k=3 and k=6 both give the same median accuracy (0.95) in the table above, so we take the smaller, cheaper k.
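
The note to self about a line plot with a filled band for the standard deviation could look roughly like the sketch below. It uses only the per-fold accuracies for k = 2 to 6 copied from the table above; in the notebook the same summary could be computed from `k_scores` for the full range of k.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Per-fold accuracies copied from the table above (columns k = 2..6).
k_scores_by_k = {
    2: [0.900000, 0.916667],
    3: [0.975, 0.950, 0.875],
    4: [0.966667, 0.833333, 0.933333, 0.866667],
    5: [0.958333, 0.958333, 0.875000, 0.958333, 0.833333],
    6: [0.95, 1.00, 0.95, 0.95, 0.95, 0.85],
}

ks = np.array(sorted(k_scores_by_k))
means = np.array([np.mean(k_scores_by_k[k]) for k in ks])
stds = np.array([np.std(k_scores_by_k[k]) for k in ks])
medians = np.array([np.median(k_scores_by_k[k]) for k in ks])

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(ks, means, marker="o", label="mean CV accuracy")
ax.fill_between(ks, means - stds, means + stds, alpha=0.2, label="±1 std dev")
ax.set_xlabel("Value of K")
ax.set_ylabel("Cross-Validated Accuracy")
ax.legend()
plt.close(fig)

for k, m, s, med in zip(ks, means, stds, medians):
    print(f"k={k}: mean={m:.3f}, std={s:.3f}, median={med:.3f}")
```

A band plot like this makes the trade-off easier to read than 36 box plots: the line shows the average accuracy per k and the band shows how much the individual folds disagree.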

Just for fun, let's visualize the selected CV folds

In [13]:
k = 3
cv = StratifiedKFold(
    n_splits=k,
)
In [14]:
# print number of samples in each fold and samples per class
for i, (train, test) in enumerate(cv.split(X_train, y_train)):
    print(
        f"Fold {i} contains {len(train)} training samples and {len(test)} testing samples"
    )

    # bincount expects a 1-dimensional array
    y_train_array_train = y_train.iloc[train].to_numpy().flatten()
    y_train_array_test = y_train.iloc[test].to_numpy().flatten()

    print(
        f"Train: {np.bincount(y_train_array_train)}"
    )  # number of samples class 0, 1, 2
    print(f"Test: {np.bincount(y_train_array_test)}")  # number of samples class 0, 1, 2
Fold 0 contains 80 training samples and 40 testing samples
Train: [26 24 30]
Test: [13 13 14]
Fold 1 contains 80 training samples and 40 testing samples
Train: [26 25 29]
Test: [13 12 15]
Fold 2 contains 80 training samples and 40 testing samples
Train: [26 25 29]
Test: [13 12 15]
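
The class counts printed above add up to 39, 37 and 44 samples per class. As a self-contained sketch of how `StratifiedKFold` spreads those counts, the snippet below rebuilds a label vector with the same class counts; the feature matrix is a placeholder of zeros, since only the labels drive the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as this training set (39, 37, 44).
y = np.repeat([0, 1, 2], [39, 37, 44])
# Placeholder features: the split only depends on y.
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=3)
fold_counts = []
for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    counts = np.bincount(y[test_idx], minlength=3)
    fold_counts.append(counts)
    print(
        f"Fold {i}: {len(train_idx)} train, {len(test_idx)} test, "
        f"test class counts {counts}"
    )

fold_counts = np.array(fold_counts)
```

Each class is divided across the folds as evenly as its count allows, which is why the per-fold counts differ by at most one sample per class.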
In [15]:
fig, ax = plt.subplots()
plot_cv_indices(
    cv=cv,
    X=X_train,
    y=y_train_array,
    ax=ax,
    n_splits=k,
)
fig.savefig(f"{FIGS_DIR}/{exp.name}-cv-indices.png")

Inspect learning curve

Recall that the learning curve (LC) shows how the model's performance changes with the number of training samples.

  • required data size: how much data is needed to get good performance before the improvement plateaus
In [16]:
cv = StratifiedKFold(
    n_splits=3,
)
# Note: LearningCurveDisplay stores the scores as attributes (train_scores, test_scores)...could save these if needed
lcd = LearningCurveDisplay.from_estimator(
    estimator=exp.estimator,
    X=X_train,
    y=y_train,
    # train_sizes=np.linspace(0.1, 1.0, 5),
    # train_sizes=np.linspace(1, 80, 5).astype(int),
    train_sizes=np.linspace(0.1, 1.0, 5),
    # splitters are instantiated with shuffle=False so the splits will be the same across all calls
    cv=cv,
    random_state=0,
    # return_times = True, # default false
)

# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)

plt.title(f"Learning Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test")
plt.ylabel("Accuracy")
plt.savefig(f"{FIGS_DIR}/{exp.name}_learning-curve.png")

Interpretation:

  • Train curve is perfect because DT can memorize the training data easily even when number of samples is small
  • Around 20 samples is where the CV curve peaks

Note: In simpler models, we often see a gradual improvement in training accuracy as the number of samples increases because they cannot memorize the training data perfectly. However, decision trees do not follow this pattern due to their high capacity to fit the training data exactly, even with a small number of samples.
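
To check the flat training curve numerically, the scores behind a learning curve can be computed with scikit-learn's `learning_curve` function. This is a sketch under assumptions: it uses the full bundled iris data rather than the notebook's saved split, drops the scaler, and shuffles before taking the training subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Full iris data for illustration (the notebook uses its own saved split).
X, y = load_iris(return_X_y=True)

train_sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=3),
    shuffle=True,  # shuffle each training fold before taking subsets
    random_state=0,
)

# Rows are training-set sizes, columns are CV folds.
for n, tr, cv_ in zip(train_sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"{n:>3} samples: train={tr:.3f}, cv={cv_:.3f}")
```

The training column stays at 1.0 for every subset size because the unpruned tree can always split until its leaves are pure, while the cross-validation column shows how much of that fit carries over to unseen samples.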

Conclusion:

  • Keep the training set size as it is, since both the dataset and the model are small
In [17]:
exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")

For this example we won't run a grid search or any other hyperparameter search.

We simply use criterion: entropy and max_depth: None (no pruning), since that is the point of the experiment. We expect the tree to overfit.

In [18]:
exp.estimator.get_params()
Out[18]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))],
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}
In [19]:
params = {
    "classifier__criterion": "entropy",
    "classifier__max_depth": None,
    "classifier__max_features": None,
    "classifier__max_leaf_nodes": None,
}
exp.estimator.set_params(**params)
exp.estimator.get_params()
Out[19]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))],
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}
In [20]:
# fit on training data
start_time = pd.Timestamp.now()
exp.estimator.fit(X=X_train, y=y_train)
train_time = pd.Timestamp.now() - start_time
In [21]:
exp.update_param("train_time", train_time)
exp.update_param(
    "mean_accuracy",
    exp.estimator.score(X_test, y_test),
    # add_column=True
)
exp.summary_df
Out[21]:
| exp_name | dataset_name | n_train_samples | n_test_samples | mean_accuracy | split_criterion | train_time | query_time | kfolds | confusion_matrix | classification_report | tree_depth | n_leaves | n_tree_nodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtc-iris-unpruned | iris-20test-shuffled-v1 | 120 | 30 | 0.966667 | entropy | 0 days 00:00:00.005195 | 0 days 00:00:00.000668 | Stratified 3-Fold Cross-Validation | [[11 0 0]\n [ 0 13 0]\n [ 0 0 6]] | {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... | 5 | 9 | NaN |

Take a look at the trained model

In [22]:
text_representation = export_text(
    exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
    f.write(text_representation)
|--- petal width (cm) <= -0.54
|   |--- class: 0
|--- petal width (cm) >  -0.54
|   |--- petal width (cm) <= 0.56
|   |   |--- petal length (cm) <= 0.65
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.65
|   |   |   |--- sepal length (cm) <= 0.26
|   |   |   |   |--- petal width (cm) <= 0.43
|   |   |   |   |   |--- class: 2
|   |   |   |   |--- petal width (cm) >  0.43
|   |   |   |   |   |--- class: 1
|   |   |   |--- sepal length (cm) >  0.26
|   |   |   |   |--- class: 2
|   |--- petal width (cm) >  0.56
|   |   |--- petal length (cm) <= 0.59
|   |   |   |--- sepal width (cm) <= 0.19
|   |   |   |   |--- class: 2
|   |   |   |--- sepal width (cm) >  0.19
|   |   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.59
|   |   |   |--- class: 2
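
The thresholds printed above are in standardized units, because the tree sits behind a `StandardScaler`. A sketch of how they could be mapped back to centimetres by inverting z = (x - mean) / scale, assuming a pipeline of the same shape fitted on the full bundled iris data (so the tree and its thresholds will not match the output above exactly):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target  # full iris data, not the notebook's saved split

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ]
).fit(X, y)

scaler = pipe.named_steps["scaler"]
tree = pipe.named_steps["classifier"].tree_

original_thresholds = []
for feat, thr in zip(tree.feature, tree.threshold):
    if feat < 0:  # a negative feature index marks a leaf node
        continue
    # Undo the standardization for the feature tested at this node.
    thr_cm = thr * scaler.scale_[feat] + scaler.mean_[feat]
    original_thresholds.append((feat, thr_cm))
    print(f"{iris.feature_names[feat]} <= {thr_cm:.2f} (scaled: {thr:.2f})")
```

Since standardization is a monotonic per-feature transform, the tree makes the same splits with or without it; converting the thresholds only makes the rules easier to read.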

In [23]:
# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list
Out[23]:
['setosa', 'versicolor', 'virginica']
In [24]:
fig = plt.figure(figsize=(25, 20))
plot_tree(
    decision_tree=exp.estimator.named_steps["classifier"],
    feature_names=X_train.columns,
    class_names=target_names_list,
    filled=True,
    rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")
In [ ]:
# # Couldn't get dtreeviz to work with a pipeline. This workaround trains a separate model on the unscaled data, so its tree can differ from the one above.
# X_train_nparray = X_train.to_numpy()
# y_train_nparray = y_train.to_numpy().flatten()
# params = {
#     "criterion": "entropy",
#     "max_depth": None,
#     "max_features": None,
#     "max_leaf_nodes": None,
# }
# dtc_temp = DecisionTreeClassifier(**params)
# dtc_temp.fit(X_train, y_train)
# viz_model = dtreeviz.model(
#     dtc_temp,
#     X_train=X_train_nparray,
#     y_train=y_train_nparray,
#     target_name="iris",
#     feature_names=X_train.columns.to_list(),
#     class_names=target_names_list,
# )
# v = viz_model.view(scale=1.5)
# # v.save(f"{FIGS_DIR}/{exp.name}_dtreeviz.svg")
# v
In [25]:
# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
In [26]:
exp.update_param("query_time", query_time)

Use a confusion matrix to see where the classifier was right and where it was wrong.

In [27]:
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
    y_true=y_test,
    y_pred=y_pred,
    # normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")

Q: I expected unpruned DT to classify everything correctly...why didn't it?

Q: Did it overfit? Train and test performance are pretty close so...no?

In [28]:
exp.update_param("confusion_matrix", np.array2string(cm))
In [29]:
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr))
cr
Out[29]:
{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 11.0},
 '1': {'precision': 1.0,
  'recall': 0.9230769230769231,
  'f1-score': 0.96,
  'support': 13.0},
 '2': {'precision': 0.8571428571428571,
  'recall': 1.0,
  'f1-score': 0.9230769230769231,
  'support': 6.0},
 'accuracy': 0.9666666666666667,
 'macro avg': {'precision': 0.9523809523809524,
  'recall': 0.9743589743589745,
  'f1-score': 0.9610256410256411,
  'support': 30.0},
 'weighted avg': {'precision': 0.9714285714285714,
  'recall': 0.9666666666666667,
  'f1-score': 0.9672820512820512,
  'support': 30.0}}
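
The numbers in this report can be reproduced by hand from the confusion matrix of this run, [[11 0 0], [0 12 1], [0 0 6]]. A plain-Python sketch of the arithmetic (rows are true classes, columns are predicted classes):

```python
# Confusion matrix from this experiment (rows = true class, columns = predicted).
cm = [
    [11, 0, 0],
    [0, 12, 1],
    [0, 0, 6],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

report = {}
for c in range(n_classes):
    tp = cm[c][c]
    predicted_as_c = sum(cm[r][c] for r in range(n_classes))  # column sum
    actually_c = sum(cm[c])  # row sum, i.e. the support
    precision = tp / predicted_as_c
    recall = tp / actually_c
    f1 = 2 * precision * recall / (precision + recall)
    report[c] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"class {c}: precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")

print(f"accuracy: {accuracy:.4f}")
```

The single error is one versicolor sample (class 1) predicted as virginica (class 2), which lowers the recall of class 1 and the precision of class 2.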
In [30]:
# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
    "split_criterion",
    exp.estimator.named_steps["classifier"].criterion,
    add_column=True,
)
exp.update_param(
    "tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
    "n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
    "n_tree_nodes",
    exp.estimator.named_steps["classifier"].tree_.node_count,
    add_column=True,
)
exp.summary_df
Out[30]:
| exp_name | dataset_name | n_train_samples | n_test_samples | mean_accuracy | split_criterion | train_time | query_time | kfolds | confusion_matrix | classification_report | tree_depth | n_leaves | n_tree_nodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtc-iris-unpruned | iris-20test-shuffled-v1 | 120 | 30 | 0.966667 | entropy | 0 days 00:00:00.005195 | 0 days 00:00:00.000654 | Stratified 3-Fold Cross-Validation | [[11 0 0]\n [ 0 12 1]\n [ 0 0 6]] | {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... | 5 | 8 | 15.0 |
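
The tree metrics in this summary (depth 5, 8 leaves, 15 nodes) can be cross-checked against the text export of the tree shown earlier. The sketch below parses that text in plain Python; it relies on two assumptions about the format: each leaf is a `class:` line, and the number of `|` characters before it gives its depth. Because every internal node has exactly two children, a tree with L leaves has 2L - 1 nodes.

```python
# Text export of the fitted tree, copied from the export_text output above.
tree_text = """\
|--- petal width (cm) <= -0.54
|   |--- class: 0
|--- petal width (cm) >  -0.54
|   |--- petal width (cm) <= 0.56
|   |   |--- petal length (cm) <= 0.65
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.65
|   |   |   |--- sepal length (cm) <= 0.26
|   |   |   |   |--- petal width (cm) <= 0.43
|   |   |   |   |   |--- class: 2
|   |   |   |   |--- petal width (cm) >  0.43
|   |   |   |   |   |--- class: 1
|   |   |   |--- sepal length (cm) >  0.26
|   |   |   |   |--- class: 2
|   |--- petal width (cm) >  0.56
|   |   |--- petal length (cm) <= 0.59
|   |   |   |--- sepal width (cm) <= 0.19
|   |   |   |   |--- class: 2
|   |   |   |--- sepal width (cm) >  0.19
|   |   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.59
|   |   |   |--- class: 2
"""

# Depth of each leaf = number of "|" markers on its line minus one.
leaf_depths = [
    line.count("|") - 1 for line in tree_text.splitlines() if "class:" in line
]
n_leaves = len(leaf_depths)
tree_depth = max(leaf_depths)
n_nodes = 2 * n_leaves - 1  # binary tree: internal nodes = leaves - 1

print(f"leaves: {n_leaves}, depth: {tree_depth}, nodes: {n_nodes}")
```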
In [ ]:
# exp.save(overwrite_existing=False)

Conclusions

  • Expectation: it will overfit the training data (high variance)

    • Yes, it fit the training data perfectly (the training curve sits at 1.0), while test accuracy was 0.967
  • Expectation: training time will be minimal since the dataset is small

    • Yes, training time was minimal (about 5 ms)

