Experiment: Decision Tree Classifier on Iris dataset with prepruning¶
✂️ Quick Summary: Pre-Pruned Decision Tree (Iris Dataset)¶
Goal: Prevent the tree from growing too complex by limiting its size during training.
🧠 Core Idea:¶
- Use hyperparameters like max_depth, min_samples_leaf, etc., to halt growth early, reducing overfitting risk.
In general, pre-pruning avoids overfitting by stopping tree growth early (fast, but it risks missing patterns), while post-pruning fixes overfitting by trimming the tree after training (slower, but usually smarter).
🧮 Example Configuration:¶
max_depth=3
min_samples_split=4
min_samples_leaf=2
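For concreteness, this configuration maps directly onto scikit-learn's DecisionTreeClassifier; a minimal sketch, separate from the pipeline built further down:
from sklearn.tree import DecisionTreeClassifier
# Pre-pruned tree: growth stops as soon as any of these limits is hit
dtc = DecisionTreeClassifier(
    max_depth=3,  # no root-to-leaf path longer than 3 splits
    min_samples_split=4,  # a node needs at least 4 samples to be considered for splitting
    min_samples_leaf=2,  # every leaf must keep at least 2 samples
)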
🔧 Implications:¶
| Aspect | Expectation | Notes |
|---|---|---|
| Accuracy | High (~95%) | Good balance of bias and variance. |
| Overfitting | Less likely | Stops before fully memorizing data. |
| Interpretability | Good | Tree remains small and readable. |
| Tuning Needed | Yes | Hyperparameters control complexity. |
Expectation: prepruning will generalize better than the unpruned tree and train faster.
🔑 Characteristics:¶
- Simple, interpretable, and fast.
- Works well when tuned.
- Ideal for small-to-medium tabular datasets.
%reload_ext autoreload
%autoreload 2
# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
# experiment helper imports
from helpers.base_imports import *
Set up experiment with data and model¶
exp = Experiment(
type="c", # classification
name="dtc-iris-prepruned",
dataset="iris-20test-shuffled-v1",
)
exp
Start with unpruned parameters; then we will use validation curves and grid search to decide what to set max_depth (and the other pruning hyperparameters) to.
# add the steps to the pipeline
steps = [
# NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
("scaler", StandardScaler()),
(
"classifier",
DecisionTreeClassifier(
criterion="entropy", # gini tends to be faster but similar performance
splitter="best", # best split or random
max_depth=None, # no max depth (so will likely overfit)
min_samples_split=2, # require at least 2 samples to split a node
min_samples_leaf=1, # require at least 1 sample in each leaf
min_weight_fraction_leaf=0.0,
max_features=None, # consider all features when looking for the best split
random_state=RANDOM_SEED,
max_leaf_nodes=None, # allow an unlimited number of leaf nodes
min_impurity_decrease=0.0, # node will split if decrease in impurity is at least this much
class_weight=None, # all classes are weighted/treated equally
ccp_alpha=0.0, # complexity parameter for minimal cost-complexity pruning, 0 means no pruning
),
),
]
exp.estimator = Pipeline(
steps=steps,
memory=CACHE_DIR,
)
exp.estimator.get_params()
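Because the classifier sits inside a Pipeline, its hyperparameters are addressed with the classifier__ prefix; a quick sketch using the exp.estimator defined above:
# Show only the decision-tree hyperparameters as exposed through the pipeline
dt_params = {
    k: v for k, v in exp.estimator.get_params().items() if k.startswith("classifier__")
}
print(dt_params)
# These prefixed names (e.g. classifier__max_depth) are what the validation curves
# and grid search below expect.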
Get the dataset by name (EDA was already done in another notebook and the train/test split saved, so we will be working with the same data).
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# inspect data
disp_df(pd.concat([X_train, y_train], axis=1))
Double-check the data statistics and features from the EDA.
plt.imshow(plt.imread("figs/iris-20test-shuffled-v0_feature-statistics-X_train.png"))
plt.axis("off")
plt.show()
We have some outliers for sepal width.
Decision Trees are generally considered robust to outliers, so we will leave them in.
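For reference, a quick way to flag those sepal-width outliers with the usual 1.5 x IQR rule (assuming the column is named "sepal width (cm)" as in the standard sklearn Iris frame):
# Flag sepal-width values outside 1.5 * IQR of the quartiles
col = "sepal width (cm)"  # assumed column name
q1, q3 = X_train[col].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = X_train[(X_train[col] < q1 - 1.5 * iqr) | (X_train[col] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential sepal-width outliers")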
plt.imshow(plt.imread("figs/iris_target-class-distribution-y_train.png"))
plt.axis("off")
plt.show()
Classes in the train data are roughly balanced, so we'll continue without any class balancing.
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
Inspect learning curve¶
- Required data size: how much data is needed to reach good performance before improvement plateaus.
cv = StratifiedKFold(
n_splits=3,
)
# Note: LearningCurveDisplay contains the scores as parameters (train_scores, test_scores)...could save these if needed
lcd = LearningCurveDisplay.from_estimator(
estimator=exp.estimator,
X=X_train,
y=y_train,
# train_sizes=np.linspace(1, 80, 5).astype(int),
train_sizes=np.linspace(0.1, 1.0, 5),
# splitters are instantiated with shuffle=False so the splits will be the same across all calls
cv=cv,
random_state=0,
# return_times = True, # default false
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)
plt.title(f"Learning Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test")
plt.ylabel("Accuracy")
plt.savefig(f"{FIGS_DIR}/{exp.name}_learning-curve.png")
All of this is exactly the same as the unpruned DT so far.
exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")
Hyperparam search¶
# figure out ranges for hyperparameters using validation curves
vcd_max_depth = ValidationCurveDisplay.from_estimator(
estimator=exp.estimator,
X=X_train,
y=y_train,
param_name="classifier__max_depth",
param_range=np.arange(1, 5),
cv=cv,
# shuffle=True,
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)
plt.title(
f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-max-depth.png")
So a max_depth of 2 or 3 seems okay. By eye, I'd go with 3 (we'll see what GridSearchCV says next, but I'm guessing it will be 3).
# figure out ranges for hyperparameters using validation curves
vcd_min_samples_split = ValidationCurveDisplay.from_estimator(
estimator=exp.estimator,
X=X_train,
y=y_train,
param_name="classifier__min_samples_split",
param_range=np.arange(2, 10),
cv=cv,
# shuffle=True,
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)
plt.title(
f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-min-samples-split.png")
For min_samples_split, the validation curve suggests that 2 already gives a good accuracy score.
Now that we've seen the validation curves for these hyperparameters, we have a good feel for the ranges grid search should look in and what it will likely return.
Let's see if we're right.
# set hyperparameters and train model (+report)
# perform and report experiments with different hyperparameters
param_grid = {
"classifier__max_depth": np.arange(1, 4),
"classifier__min_samples_split": np.arange(2, 5),
}
grid_search = GridSearchCV(
estimator=exp.estimator,
param_grid=param_grid,
# scoring="", # accuracy is default but we can use another or many others
cv=cv,
)
grid_search.fit(X_train, y_train)
grid_search.best_params_
grid_search.best_score_
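Before committing to the best combination, it's worth a glance at the full grid, e.g. by viewing cv_results_ as a DataFrame:
# Mean CV accuracy across the whole grid, best combinations first
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results[
    [
        "param_classifier__max_depth",
        "param_classifier__min_samples_split",
        "mean_test_score",
        "std_test_score",
    ]
].sort_values("mean_test_score", ascending=False)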
Grid search picks 3 as the best max_depth and 2 as the best min_samples_split. That's what we will go with for our prepruned model.
exp.estimator.set_params(**grid_search.best_params_)
exp.estimator.get_params()
start_time = pd.Timestamp.now()
exp.estimator.fit(X_train, y_train)
train_time = pd.Timestamp.now() - start_time
exp.update_param("train_time", train_time)
exp.update_param(
"mean_accuracy",
exp.estimator.score(X_test, y_test),
# add_column=True
)
exp.summary_df
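As a quick overfitting check, training and test accuracy of the fitted pipeline can be compared; a large gap would suggest the tree is still memorizing the training data:
# Train vs. test accuracy for the prepruned pipeline
train_acc = exp.estimator.score(X_train, y_train)
test_acc = exp.estimator.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")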
Take a look at our trained prepruned model¶
text_representation = export_text(
exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
f.write(text_representation)
# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list
fig = plt.figure(figsize=(25, 20))
plot_tree(
decision_tree=exp.estimator.named_steps["classifier"],
feature_names=X_train.columns,
class_names=target_names_list,
filled=True,
rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")
We can see it's basically the same as the unpruned model, but with a max_depth of 3.
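To put numbers on "basically the same but smaller", one option is to fit an unpruned copy of the pipeline purely for a size comparison (a sketch; the unpruned experiment lives in its own notebook):
from sklearn.base import clone
# Unpruned copy, refit here only to compare tree sizes
unpruned = clone(exp.estimator).set_params(
    classifier__max_depth=None, classifier__min_samples_split=2
)
unpruned.fit(X_train, y_train)
for label, pipe in [("prepruned", exp.estimator), ("unpruned", unpruned)]:
    dt = pipe.named_steps["classifier"]
    print(label, "depth:", dt.get_depth(), "leaves:", dt.get_n_leaves())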
# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
exp.update_param("query_time", query_time)
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
y_true=y_test,
y_pred=y_pred,
# normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")
It misclassifies one test sample but still gets the rest right.
exp.update_param("confusion_matrix", np.array2string(cm))
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr))
cr
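Since the report was generated with output_dict=True, it can also be viewed as a DataFrame for readability:
# Per-class precision/recall/f1 as a table
pd.DataFrame(cr).T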
# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
"split_criterion",
exp.estimator.named_steps["classifier"].criterion,
add_column=True,
)
exp.update_param(
"tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
"n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
"n_tree_nodes",
exp.estimator.named_steps["classifier"].tree_.node_count,
add_column=True,
)
exp.summary_df
exp.save(overwrite_existing=False)
Conclusions¶
- Expectation: prepruning will generalize better than the unpruned tree and learn faster.
- Yes, the prepruned tree trained faster.
- It is harder to say it generalized better: test accuracy was a bit lower (one misclassification, where the unpruned tree got none wrong). Cross-validation accuracy is probably a better indicator of generalization than a single small test set, so this comparison isn't conclusive; see the sketch below.
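One way to make the generalization comparison more concrete (a sketch, not run as part of this experiment) is to compare cross-validation accuracy of the prepruned pipeline against an unpruned copy on the training data:
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
# Unpruned copy, as in the size-comparison sketch above
unpruned = clone(exp.estimator).set_params(
    classifier__max_depth=None, classifier__min_samples_split=2
)
for label, est in [("prepruned", exp.estimator), ("unpruned", unpruned)]:
    scores = cross_val_score(est, X_train, y_train, cv=cv)
    print(f"{label}: mean CV accuracy {scores.mean():.3f} +/- {scores.std():.3f}")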