

Experiment: Decision Tree Classifier on Iris dataset with post-pruning

🧹 Quick Summary: Post-Pruned Decision Tree (Iris Dataset)

Goal: First grow a full tree, then cut back unnecessary branches to improve generalization.


🧠 Core Idea:

  • Fully grow the tree.
  • Use cost-complexity pruning (ccp_alpha > 0) to trim subtrees that provide little predictive power.

In general, pre-pruning avoids overfitting by stopping growth early (fast, but risks missing patterns), while post-pruning fixes overfitting by trimming the tree afterwards (slower, but usually smarter).
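
As a minimal sketch of that difference in scikit-learn terms (the parameter values below are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# pre-pruning: constrain growth up front (fast, but can stop too early)
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# post-pruning: grow the full tree, then trim with cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)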


🧮 Example Configuration:

  • ccp_alpha=0.01 (the exact value depends on the validation curve)
  • Train an unpruned tree, then prune using validation loss.

🔧 Implications:

Aspect          Expectation       Notes
Accuracy        High (~95%)       Comparable to pre-pruning when tuned.
Overfitting     Reduced           Pruning cuts noise-adapted branches.
Training Time   Slightly higher   Pruning is an extra step.
Model Size      Reduced           Smaller tree after pruning.

Post-pruning should perform at least as well as pre-pruning, but since this is such a small dataset, the results may be about the same.


🔑 Characteristics:

  • Data-driven complexity control.
  • Can outperform pre-pruning when validated well.
  • Great for interpretable and robust models.
In [7]:
%reload_ext autoreload
%autoreload 2
In [8]:
# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# experiment helper imports
from sklearn.base import clone
from helpers.base_imports import *
In [9]:
exp = Experiment(
    type="c",  # classification
    name="dtc-iris-postpruned",
    dataset="iris-20test-shuffled-v1",
)
exp
Loading 'classification-experiments.csv'
Loading 'dtc-iris-postpruned' experiment
Loading 'dtc-iris-postpruned' estimator/model/pipeline
Out[9]:
Experiment(c, dtc-iris-postpruned, iris-20test-shuffled-v1)

We'll start with the regular unpruned DT, then examine ccp_alpha values with a validation curve to see if we can find a good value for pruning.

In [10]:
# add the steps to the pipeline
steps = [
    # NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
    ("scaler", StandardScaler()),
    (
        "classifier",
        DecisionTreeClassifier(
            criterion="entropy",  # gini tends to be faster but similar performance
            splitter="best",  # best split or random
            max_depth=None,  # no max depth (so will likely overfit)
            min_samples_split=2,  # require at least 2 samples to split a node
            min_samples_leaf=1,  # require at least 1 sample in each leaf
            min_weight_fraction_leaf=0.0,
            max_features=None,  # consider all features when looking for the best split
            random_state=RANDOM_SEED,
            max_leaf_nodes=None,  # no limit on the number of leaf nodes
            min_impurity_decrease=0.0,  # node will split if decrease in impurity is at least this much
            class_weight=None,  # all classes are weighted/treated equally
            # complexity parameter for minimal cost-complexity pruning, 0 means no pruning
            # greater vals increase the number of nodes pruned
            # therefore greater vals regularize the model more
            ccp_alpha=0.0,
        ),
    ),
]
exp.estimator = Pipeline(
    steps=steps,
    memory=CACHE_DIR,
)
In [11]:
exp.estimator.get_params()
Out[11]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}

Get the dataset by name (EDA was already done in another notebook and the train/test split saved, so we'll be working with the same data).

In [12]:
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Dataset: iris-20test-shuffled-v1
X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Train: 80.00% of total
Test: 20.00% of total
Notes: None
Created by save_dataset() helper at 2024-07-09 12:28:10

  target_names
0       setosa
1   versicolor
2    virginica
Out[12]:
((120, 4), (30, 4), (120, 1), (30, 1))
In [13]:
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
Out[13]:
dataset_name n_train_samples n_test_samples mean_accuracy split_criterion train_time query_time kfolds confusion_matrix classification_report tree_depth n_leaves n_tree_nodes
exp_name
dtc-iris-postpruned iris-20test-shuffled-v1 120 30 1.0 entropy 0 days 00:00:00.011213 0 days 00:00:00.000578 Stratified 3-Fold Cross-Validation [[11 0 0]\n [ 0 13 0]\n [ 0 0 6]] {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... 3 4 7.0

We know from the previous unpruned experiment that 3 folds works well for StratifiedKFold, so we'll use that here as well.

Inspect the cost complexity pruning path

In [14]:
path = exp.estimator.named_steps["classifier"].cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [9]:
fig, ax = plt.subplots()
# plot all but the last ccp_alpha value since it is the trivial tree with only one node
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
fig.savefig(f"{FIGS_DIR}/{exp.name}-ccp_alphas.png")

Interpretation:

  • As we regularize the tree (prune more by increasing ccp_alpha), the tree becomes smaller and smaller and the total impurity of its leaves increases.

  • The largest effective alpha value is removed from the plot, since it corresponds to the trivial tree with only the root node, i.e. the value that prunes the whole tree (see the cost-complexity formula below).
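
For reference, a quick sketch of the quantity being traded off (this is the standard minimal cost-complexity formulation that scikit-learn implements): each subtree $T$ is scored by

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$$

where $R(T)$ is the total impurity of the leaves of $T$ and $|\tilde{T}|$ is its number of leaves. The effective alpha of an internal node $t$ is the break-even value at which pruning the subtree $T_t$ rooted at $t$ stops hurting:

$$\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$$

Raising ccp_alpha prunes every node whose effective alpha falls below it, weakest link first; the alphas returned by cost_complexity_pruning_path are exactly these break-even points.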

Here we show that the number of nodes and tree depth decreases as alpha increases.

In [15]:
clfs = []
node_counts = []
depths = []
for ccp_alpha in ccp_alphas:
    params = {"classifier__ccp_alpha": ccp_alpha}
    # clone the pipeline so each entry in clfs is a distinct fitted model
    # (calling set_params on exp.estimator directly would leave clfs holding
    # references to one and the same object)
    clf = clone(exp.estimator).set_params(**params)
    clf.fit(X_train, y_train)
    node_counts.append(clf.named_steps["classifier"].tree_.node_count)
    depths.append(clf.named_steps["classifier"].get_depth())
    clfs.append(clf)

for i in range(len(clfs)):
    print(
        f"{i}: Number of nodes: {node_counts[i]}, depth: {depths[i]} with ccp_alpha: {ccp_alphas[i]}"
    )
0: Number of nodes: 15, depth: 5 with ccp_alpha: 0.0
1: Number of nodes: 11, depth: 4 with ccp_alpha: 0.015040168643486713
2: Number of nodes: 9, depth: 3 with ccp_alpha: 0.02704260414863776
3: Number of nodes: 7, depth: 3 with ccp_alpha: 0.02947829913342006
4: Number of nodes: 5, depth: 2 with ccp_alpha: 0.126251527242787
5: Number of nodes: 3, depth: 1 with ccp_alpha: 0.45850626956603496
6: Number of nodes: 1, depth: 0 with ccp_alpha: 0.9097361225311661
In [16]:
# remove the last tree (which has only one node)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
depths = depths[:-1]
node_counts = node_counts[:-1]

for i in range(len(clfs)):
    print(
        f"{i}: Number of nodes: {node_counts[i]}, depth: {depths[i]} with ccp_alpha: {ccp_alphas[i]}"
    )
0: Number of nodes: 15, depth: 5 with ccp_alpha: 0.0
1: Number of nodes: 11, depth: 4 with ccp_alpha: 0.015040168643486713
2: Number of nodes: 9, depth: 3 with ccp_alpha: 0.02704260414863776
3: Number of nodes: 7, depth: 3 with ccp_alpha: 0.02947829913342006
4: Number of nodes: 5, depth: 2 with ccp_alpha: 0.126251527242787
5: Number of nodes: 3, depth: 1 with ccp_alpha: 0.45850626956603496
In [17]:
# plot alpha against number of nodes and tree depth
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depths, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
fig.savefig(f"{FIGS_DIR}/{exp.name}-ccp-alpha-vs-nodes-depth.png")
In [18]:
# NOTE: a neat plot, but it scores the train/test split directly without
# cross-validation, so we rely on the validation curve below instead
# train_scores = [clf.score(X_train, y_train) for clf in clfs]
# test_scores = [clf.score(X_test, y_test) for clf in clfs]

# fig, ax = plt.subplots()
# ax.set_xlabel("alpha")
# ax.set_ylabel("accuracy")
# ax.set_title("Accuracy vs alpha for training and testing sets")
# ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
# ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
# ax.legend()
# plt.show()
In [19]:
# NOTE: RepeatedStratifiedKFold would repeat stratified k-fold several times
# with different randomization; plain StratifiedKFold is sufficient here
cv = StratifiedKFold(
    n_splits=3,
    # random_state only takes effect when shuffle=True, so it's omitted here
)
In [20]:
exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")
In [16]:
# plot validation curve for ccp_alpha to see how it affects model performance
vcd_min_samples_split = ValidationCurveDisplay.from_estimator(
    estimator=exp.estimator,
    X=X_train,
    y=y_train,
    param_name="classifier__ccp_alpha",
    param_range=ccp_alphas[:-1],  # remove the last value since it is the trivial tree
    cv=cv,
    # shuffle=True,
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)

plt.title(
    f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-ccp-alpha.png")

So grid search should find a value between 0.015 and 0.0175 (just before the CV score starts decreasing).

In [21]:
ccp_alphas
Out[21]:
array([0.        , 0.01504017, 0.0270426 , 0.0294783 , 0.12625153,
       0.45850627])
In [22]:
# grid search over the smaller ccp_alpha values to pick the best pruning strength
param_grid = {
    "classifier__ccp_alpha": ccp_alphas[:-3],
}

grid_search = GridSearchCV(
    estimator=exp.estimator,
    param_grid=param_grid,
    # scoring="", # accuracy is default but we can use another or many others
    cv=cv,
)
grid_search.fit(X_train, y_train)
grid_search.best_params_, grid_search.best_score_
Out[22]:
({'classifier__ccp_alpha': 0.02704260414863776}, 0.9416666666666665)
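
To see the per-alpha scores behind that choice, one option (a quick sketch, not part of the original run) is to tabulate grid_search.cv_results_:

import pandas as pd  # likely already available via helpers.base_imports

cv_df = pd.DataFrame(grid_search.cv_results_)[
    ["param_classifier__ccp_alpha", "mean_test_score", "std_test_score"]
]
print(cv_df.sort_values("param_classifier__ccp_alpha"))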
In [23]:
grid_search.best_estimator_.get_params()
Out[23]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier',
   DecisionTreeClassifier(ccp_alpha=0.02704260414863776, criterion='entropy',
                          random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(ccp_alpha=0.02704260414863776, criterion='entropy',
                        random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.02704260414863776,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}
In [24]:
# I prefer the smallest tree that doesn't sacrifice too much accuracy
my_params = {
    "classifier__ccp_alpha": 0.06  # the smallest tree that still gets about .95 on CV accuracy
}
exp.estimator.set_params(**my_params)
exp.estimator.get_params()
Out[24]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier',
   DecisionTreeClassifier(ccp_alpha=0.06, criterion='entropy', random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(ccp_alpha=0.06, criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.06,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}
In [25]:
# fit on training data
start_time = pd.Timestamp.now()
exp.estimator.fit(X=X_train, y=y_train)
train_time = pd.Timestamp.now() - start_time
In [26]:
exp.update_param("train_time", train_time)
exp.update_param(
    "mean_accuracy",
    exp.estimator.score(X_test, y_test),
    # add_column=True
)
exp.summary_df
Out[26]:
dataset_name n_train_samples n_test_samples mean_accuracy split_criterion train_time query_time kfolds confusion_matrix classification_report tree_depth n_leaves n_tree_nodes
exp_name
dtc-iris-postpruned iris-20test-shuffled-v1 120 30 0.966667 entropy 0 days 00:00:00.004568 0 days 00:00:00.000578 Stratified 3-Fold Cross-Validation [[11 0 0]\n [ 0 13 0]\n [ 0 0 6]] {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... 3 4 7.0

Take a look at the trained model

In [27]:
text_representation = export_text(
    exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
    f.write(text_representation)
|--- petal width (cm) <= -0.54
|   |--- class: 0
|--- petal width (cm) >  -0.54
|   |--- petal width (cm) <= 0.56
|   |   |--- petal length (cm) <= 0.65
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.65
|   |   |   |--- class: 2
|   |--- petal width (cm) >  0.56
|   |   |--- class: 2

In [28]:
# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list
Out[28]:
['setosa', 'versicolor', 'virginica']
In [29]:
fig = plt.figure(figsize=(25, 20))
plot_tree(
    decision_tree=exp.estimator.named_steps["classifier"],
    feature_names=X_train.columns,
    class_names=target_names_list,
    filled=True,
    rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")
In [30]:
# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
In [31]:
exp.update_param("query_time", query_time)
In [ ]:
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
    y_true=y_test,
    y_pred=y_pred,
    # normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.title("Confusion Matrix")
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")
In [33]:
exp.update_param("confusion_matrix", np.array2string(cm), overwrite_existing=True)
In [34]:
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr), overwrite_existing=True)
cr
Out[34]:
{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 11.0},
 '1': {'precision': 1.0,
  'recall': 0.9230769230769231,
  'f1-score': 0.96,
  'support': 13.0},
 '2': {'precision': 0.8571428571428571,
  'recall': 1.0,
  'f1-score': 0.9230769230769231,
  'support': 6.0},
 'accuracy': 0.9666666666666667,
 'macro avg': {'precision': 0.9523809523809524,
  'recall': 0.9743589743589745,
  'f1-score': 0.9610256410256411,
  'support': 30.0},
 'weighted avg': {'precision': 0.9714285714285714,
  'recall': 0.9666666666666667,
  'f1-score': 0.9672820512820512,
  'support': 30.0}}
In [35]:
# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
    "split_criterion",
    exp.estimator.named_steps["classifier"].criterion,
    add_column=True,
)
exp.update_param(
    "tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
    "n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
    "n_tree_nodes",
    exp.estimator.named_steps["classifier"].tree_.node_count,
    add_column=True,
)
exp.summary_df
Out[35]:
dataset_name n_train_samples n_test_samples mean_accuracy split_criterion train_time query_time kfolds confusion_matrix classification_report tree_depth n_leaves n_tree_nodes
exp_name
dtc-iris-postpruned iris-20test-shuffled-v1 120 30 0.966667 entropy 0 days 00:00:00.004568 0 days 00:00:00.000906 Stratified 3-Fold Cross-Validation [[11 0 0]\n [ 0 12 1]\n [ 0 0 6]] {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... 3 4 7.0

One more thing, just for fun: it's sometimes helpful to visualize the decision tree's decision boundary to see how it separates the classes. Here's a general example of that (trained on just the first two iris features so the boundary can be drawn in 2D).

In [37]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
feature_1, feature_2 = np.meshgrid(
    np.linspace(iris.data[:, 0].min(), iris.data[:, 0].max()),
    np.linspace(iris.data[:, 1].min(), iris.data[:, 1].max()),
)
grid = np.vstack([feature_1.ravel(), feature_2.ravel()]).T
tree = DecisionTreeClassifier().fit(iris.data[:, :2], iris.target)
y_pred = np.reshape(tree.predict(grid), feature_1.shape)
display = DecisionBoundaryDisplay(xx0=feature_1, xx1=feature_2, response=y_pred)
display.plot()
display.ax_.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target, edgecolor="black")
plt.show()
In [32]:
exp.save(overwrite_existing=True)
Loading 'classification-experiments.csv'
Overwriting existing experiment dtc-iris-postpruned
Saving experiment dtc-iris-postpruned to results/classification-experiments.csv
Dumping estimator dtc-iris-postpruned to .cache/dtc-iris-postpruned.joblib

Conclusions

TODO: so pruning gives the same result as the unpruned tree, because grid search optimized for accuracy alone.

I found that grid search for the best ccp_alpha hyperparameter in my post-pruning experiment selects 0 for ccp_alpha (no pruning), I guess because that maximizes CV accuracy... isn't that problematic? Accuracy alone never rewards the smaller, more interpretable tree, which is why I picked ccp_alpha=0.06 by hand above. One way to automate that preference is sketched below.
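
A possible remedy, sketched under the assumption that we're willing to trade a little CV accuracy for a smaller tree: GridSearchCV accepts a callable for refit that receives cv_results_ and returns the index of the model to refit, so the "smallest tree that doesn't sacrifice too much accuracy" preference can be encoded directly. The helper name and tol value below are illustrative assumptions, not tuned choices.

import numpy as np

def smallest_tree_within_tol(cv_results, tol=0.01):
    # hypothetical selector: among alphas whose mean CV score is within
    # `tol` of the best, pick the largest alpha (i.e. the smallest tree)
    scores = cv_results["mean_test_score"]
    alphas = np.asarray(cv_results["param_classifier__ccp_alpha"], dtype=float)
    candidates = np.flatnonzero(scores >= scores.max() - tol)
    return int(candidates[np.argmax(alphas[candidates])])

grid_search = GridSearchCV(
    estimator=exp.estimator,
    param_grid=param_grid,
    cv=cv,
    refit=smallest_tree_within_tol,  # prefer the simplest near-tied model
)
grid_search.fit(X_train, y_train)
grid_search.best_params_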


