Experiment: Decision Tree Classifier on Iris dataset with postpruning¶
🧹 Quick Summary: Post-Pruned Decision Tree (Iris Dataset)¶
Goal: First grow a full tree, then cut back unnecessary branches to improve generalization.
🧠 Core Idea:¶
- Fully grow the tree.
- Use cost-complexity pruning (ccp_alpha > 0) to trim subtrees that provide little predictive power.
In general, pre-pruning avoids overfitting by stopping growth early: fast, but it risks missing patterns. Post-pruning fixes overfitting by trimming the fully grown tree afterwards: slower, but usually smarter.
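As a quick illustration of the difference, here is a minimal sketch (the parameter values are arbitrary):
# Sketch: pre-pruning limits growth up front, post-pruning trims a fully grown tree
from sklearn.tree import DecisionTreeClassifier
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)  # stop growth early
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)  # grow fully, then cost-complexity prune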
🧮 Example Configuration:¶
- ccp_alpha=0.01 (depends on the validation curve)
- Train the unpruned tree, then prune using validation loss.
🔧 Implications:¶
| Aspect | Expectation | Notes |
|---|---|---|
| Accuracy | High (~95%) | Comparable to pre-pruning when tuned. |
| Overfitting | Reduced | Pruning cuts noise-adapted branches. |
| Training Time | Slightly higher | Pruning is an extra step. |
| Model Size | Reduced | Smaller tree after pruning. |
Post-pruning should perform better than pre-pruning, but since this is such a small dataset, the results may be about the same.
🔑 Characteristics:¶
- Data-driven complexity control.
- Can outperform pre-pruning when validated well.
- Great for interpretable and robust models.
%reload_ext autoreload
%autoreload 2
# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
# experiment helper imports
from sklearn.base import clone
from helpers.base_imports import *
Setup experiment with data and model¶
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
exp = Experiment(
type="c", # classification
name="dtc-iris-postpruned",
dataset="iris-20test-shuffled-v1",
)
exp
We'll start with the regular unpruned decision tree, then check candidate values of ccp_alpha with a validation curve to see if we can find a good value for pruning.
# add the steps to the pipeline
steps = [
# NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
("scaler", StandardScaler()),
(
"classifier",
DecisionTreeClassifier(
criterion="entropy", # gini tends to be faster but similar performance
splitter="best", # best split or random
max_depth=None, # no max depth (so will likely overfit)
min_samples_split=2, # require at least 2 samples to split a node
min_samples_leaf=1, # require at least 1 sample in each leaf
min_weight_fraction_leaf=0.0,
max_features=None, # consider all features when looking for the best split
random_state=RANDOM_SEED,
max_leaf_nodes=None, # no limit on the number of leaf nodes
min_impurity_decrease=0.0, # node will split if decrease in impurity is at least this much
class_weight=None, # all classes are weighted/treated equally
# complexity parameter for minimal cost-complexity pruning, 0 means no pruning
# greater vals increase the number of nodes pruned
# therefore greater vals regularize the model more
ccp_alpha=0.0,
),
),
]
exp.estimator = Pipeline(
steps=steps,
memory=CACHE_DIR,
)
exp.estimator.get_params()
Get the dataset by name (EDA was already done in another notebook and the train/test split saved, so we will be working with the same data)
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
We know from the previous unpruned experiment that 3 splits work well for StratifiedKFold, so we will use that here as well.
Inspect the cost complexity pruning path¶
# call cost_complexity_pruning_path on the classifier directly (bypassing the scaler) - fine since DTs don't need scaling
path = exp.estimator.named_steps["classifier"].cost_complexity_pruning_path(
X_train, y_train
)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots()
# plot all but the last ccp_alpha value since it is the trivial tree with only one node
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
fig.savefig(f"{FIGS_DIR}/{exp.name}-ccp_alphas.png")
Interpretation:
- As we regularize the tree (prune more by increasing ccp_alpha), the tree becomes smaller and smaller and the total impurity of the leaves increases.
- The maximum effective alpha value is removed from the plot since it corresponds to the trivial tree with just the root node.
- The highest effective alpha value is the one that prunes the whole tree away.
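For reference, minimal cost-complexity pruning (as implemented in scikit-learn) minimizes the penalized objective
$$R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|$$
where $R(T)$ is the total impurity of the leaves and $|\widetilde{T}|$ is the number of leaves. Increasing $\alpha$ makes each leaf more expensive, so subtrees whose impurity reduction doesn't justify their size are pruned first; that is why the total leaf impurity rises as the tree shrinks.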
Here we show that the number of nodes and the tree depth decrease as alpha increases.
clfs = []
node_counts = []
depths = []
for ccp_alpha in ccp_alphas:
params = {"classifier__ccp_alpha": ccp_alpha}
clf = clone(exp.estimator).set_params(**params)  # clone so each alpha gets its own fitted pipeline
# print(f"clf params: {clf.get_params()}")
clf.fit(X_train, y_train)
node_counts.append(clf.named_steps["classifier"].tree_.node_count)
depths.append(clf.named_steps["classifier"].get_depth())
clfs.append(clf)
for i in range(len(clfs)):
print(
f"{i}: Number of nodes: {node_counts[i]}, depth: {depths[i]} with ccp_alpha: {ccp_alphas[i]}"
)
# remove the last tree (which has only one node)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
depths = depths[:-1]
node_counts = node_counts[:-1]
for i in range(len(clfs)):
print(
f"{i}: Number of nodes: {node_counts[i]}, depth: {depths[i]} with ccp_alpha: {ccp_alphas[i]}"
)
# plot alpha against the number of nodes and the tree depth
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depths, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
fig.savefig(f"{FIGS_DIR}/{exp.name}-ccp-alpha-vs-nodes-depth.png")
# TEMP - cool, but this compares train/test accuracy directly and doesn't use cross-validation, so it's left commented out
# train_scores = [clf.score(X_train, y_train) for clf in clfs]
# test_scores = [clf.score(X_test, y_test) for clf in clfs]
# fig, ax = plt.subplots()
# ax.set_xlabel("alpha")
# ax.set_ylabel("accuracy")
# ax.set_title("Accuracy vs alpha for training and testing sets")
# ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
# ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
# ax.legend()
# plt.show()
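A cross-validated version of the same comparison can be done manually with cross_val_score; this is just a sketch (the validation curve below does the same thing more conveniently):
# Sketch: mean CV accuracy for each candidate ccp_alpha
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv3 = StratifiedKFold(n_splits=3)
cv_means = []
for a in ccp_alphas[:-1]:  # skip the trivial single-node tree
    est = clone(exp.estimator).set_params(classifier__ccp_alpha=a)
    cv_means.append(cross_val_score(est, X_train, y_train, cv=cv3).mean())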
# TODO - what's RepeatedStratifiedKFold? (see the note after this cell)
cv = StratifiedKFold(
n_splits=3,
# random_state=RANDOM_SEED, # only has an effect when shuffle=True
)
exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")
# plot validation curve for ccp_alpha to see how it affects model performance
vcd_ccp_alpha = ValidationCurveDisplay.from_estimator(
estimator=exp.estimator,
X=X_train,
y=y_train,
param_name="classifier__ccp_alpha",
param_range=ccp_alphas[:-1], # remove the last value since it is the trivial tree
cv=cv,
# shuffle=True,
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)
plt.title(
f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-ccp-alpha.png")
So grid search should find a value between 0.015 and 0.0175 (just before the CV score starts decreasing).
ccp_alphas
# set hyperparameters and train model (+report)
# perform and report experiments with different hyperparameters
param_grid = {
"classifier__ccp_alpha": ccp_alphas[:-3],
}
grid_search = GridSearchCV(
estimator=exp.estimator,
param_grid=param_grid,
# scoring="", # accuracy is default but we can use another or many others
cv=cv,
)
grid_search.fit(X_train, y_train)
grid_search.best_params_, grid_search.best_score_
grid_search.best_estimator_.get_params()
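Grid search by itself optimizes accuracy alone, so it will happily return the least-pruned tree. One way to encode the "smallest tree that doesn't sacrifice too much accuracy" preference programmatically is to read cv_results_ and take the largest ccp_alpha whose mean CV score stays within a tolerance of the best; a sketch, where the 0.01 tolerance is an arbitrary choice:
# Sketch: pick the largest ccp_alpha (smallest tree) within 0.01 of the best mean CV accuracy
results = pd.DataFrame(grid_search.cv_results_)
best_score = results["mean_test_score"].max()
within_tol = results[results["mean_test_score"] >= best_score - 0.01]
within_tol["param_classifier__ccp_alpha"].astype(float).max()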
# I prefer the smallest tree that doesn't sacrifice too much accuracy
my_params = {
"classifier__ccp_alpha": 0.06 # the smallest tree that still gets about .95 on CV accuracy
}
exp.estimator.set_params(**my_params)
exp.estimator.get_params()
# fit on training data
start_time = pd.Timestamp.now()
exp.estimator.fit(X=X_train, y=y_train)
train_time = pd.Timestamp.now() - start_time
exp.update_param("train_time", train_time)
exp.update_param(
"mean_accuracy",
exp.estimator.score(X_test, y_test),
# add_column=True
)
exp.summary_df
Take a look at the trained model¶
text_representation = export_text(
exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
f.write(text_representation)
# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list
fig = plt.figure(figsize=(25, 20))
plot_tree(
decision_tree=exp.estimator.named_steps["classifier"],
feature_names=X_train.columns,
class_names=target_names_list,
filled=True,
rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")
# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
exp.update_param("query_time", query_time)
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
y_true=y_test,
y_pred=y_pred,
# normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.title("Confusion Matrix")
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")
exp.update_param("confusion_matrix", np.array2string(cm), overwrite_existing=True)
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr), overwrite_existing=True)
cr
# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
"split_criterion",
exp.estimator.named_steps["classifier"].criterion,
add_column=True,
)
exp.update_param(
"tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
"n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
"n_tree_nodes",
exp.estimator.named_steps["classifier"].tree_.node_count,
add_column=True,
)
exp.summary_df
One more thing just for fun... it's sometimes helpful to visualize the decision tree's decision boundary to see how it separates the classes. Here's a general example of that (using only the first two iris features so the boundary can be drawn in 2D).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
feature_1, feature_2 = np.meshgrid(
np.linspace(iris.data[:, 0].min(), iris.data[:, 0].max()),
np.linspace(iris.data[:, 1].min(), iris.data[:, 1].max()),
)
grid = np.vstack([feature_1.ravel(), feature_2.ravel()]).T
tree = DecisionTreeClassifier().fit(iris.data[:, :2], iris.target)
y_pred = np.reshape(tree.predict(grid), feature_1.shape)
display = DecisionBoundaryDisplay(xx0=feature_1, xx1=feature_2, response=y_pred)
display.plot()
display.ax_.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target, edgecolor="black")
plt.show()
exp.save(overwrite_existing=True)
Conclusions¶
TODO - so pruning gives essentially the same result as the unpruned tree because grid search optimized for accuracy alone.
I found that grid search for the best ccp_alpha hyperparameter in my post-pruning experiment selects 0 (no pruning), presumably because that maximizes CV accuracy. Isn't that problematic? It at least suggests using a selection rule that also rewards smaller trees, which is why I manually chose the smallest ccp_alpha that still keeps CV accuracy around 0.95 above.
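A common remedy is the one-standard-error rule: pick the simplest model (largest ccp_alpha) whose mean CV score is within one standard deviation across folds of the best. A rough sketch using the grid search results above (column names are the standard cv_results_ keys):
# Sketch: one-standard-error-style selection from the grid search results
res = pd.DataFrame(grid_search.cv_results_)
best_idx = res["mean_test_score"].idxmax()
threshold = res.loc[best_idx, "mean_test_score"] - res.loc[best_idx, "std_test_score"]
res[res["mean_test_score"] >= threshold]["param_classifier__ccp_alpha"].astype(float).max()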