Experiment: Decision Tree Classifier on Iris dataset without pruning¶
🌳 Quick Summary: Unpruned Decision Tree (Iris Dataset)¶
Goal: Build a tree that perfectly fits the training data by growing until all leaves are pure or until no further splits are possible.
🧠 Core Idea:¶
- Grow the tree fully, without any restrictions on depth, leaf size, or impurity decrease.
- Captures all patterns—including noise—leading to possible overfitting.
Note: unpruned trees are rarely used in practice, but they are useful for understanding the behavior of decision trees. In general, pre-pruning avoids overfitting by stopping tree growth early (fast, but risks missing patterns), while post-pruning fixes overfitting by trimming a fully grown tree afterwards (slower, but usually more effective). A minimal sketch of both appears below.
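For reference only (not used in this experiment), here is a minimal sketch of how the two styles of pruning are typically configured in scikit-learn; the specific numbers (max_depth=3, min_samples_leaf=5, ccp_alpha=0.01) are illustrative assumptions, and ccp_alpha would normally be tuned, e.g. via cost_complexity_pruning_path plus cross-validation.
from sklearn.tree import DecisionTreeClassifier
# Pre-pruning: stop growth early with structural constraints (illustrative values)
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
# Post-pruning: grow the tree fully, then trim it with minimal cost-complexity pruning
# (ccp_alpha > 0 removes the weakest links; 0.01 is an arbitrary example value)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)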
🧮 Configuration (No Pruning Applied):¶
max_depth=None
min_samples_split=2
min_samples_leaf=1
ccp_alpha=0.0 (no cost-complexity pruning)
🔧 Implications:¶
Aspect | Expectation | Notes |
---|---|---|
Accuracy | Very High (train), Medium-High (test) | Will overfit due to excessive depth (high variance) |
Overfitting | Likely | Tree memorizes training data. |
Interpretability | Poor | Very deep trees are hard to follow. |
Speed | Fast on Iris | But deeper trees may slow down large datasets. |
🔑 Characteristics:¶
- Overfits small/noisy datasets.
- Good for exploratory analysis.
- Not ideal for generalization without pruning.
%reload_ext autoreload
%autoreload 2
# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
# experiment helper imports
from helpers.base_imports import *
Set up the experiment with data and model¶
DATASET_NAME = "iris-20test-shuffled-v1"
exp = Experiment(
type="c", # classification
name="dtc-iris-unpruned",
dataset=DATASET_NAME,
)
exp
# add the steps to the pipeline
steps = [
# NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
("scaler", StandardScaler()),
(
"classifier",
DecisionTreeClassifier(
criterion="entropy", # gini tends to be faster but similar performance
splitter="best", # best split or random
max_depth=None, # no max depth (so will likely overfit)
min_samples_split=2, # require at least 2 samples to split a node
min_samples_leaf=1, # require at least 1 sample in each leaf
min_weight_fraction_leaf=0.0,
max_features=None, # consider all features when looking for the best split
random_state=RANDOM_SEED,
max_leaf_nodes=None, # allow an unlimited number of leaf nodes
min_impurity_decrease=0.0, # node will split if decrease in impurity is at least this much
class_weight=None, # all classes are weighted/treated equally
ccp_alpha=0.0, # complexity parameter for minimal cost-complexity pruning, 0 means no pruning
),
),
]
exp.estimator = Pipeline(
steps=steps,
memory=CACHE_DIR,
)
exp.estimator.get_params()
Get the dataset by name (EDA was already done in another notebook and the train/test split saved, so we will be working with the same data).
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# inspect data
disp_df(pd.concat([X_train, y_train], axis=1))
Double-check the data statistics and features from the EDA.
plt.imshow(plt.imread(f"figs/{DATASET_NAME}_feature-statistics-X_train.png"))
plt.axis("off")
plt.show()
We have some outliers for sepal width.
Decision trees are generally considered robust to outliers, so we will leave them as they are. A quick way to flag them is sketched below.
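As an optional check (not required for the experiment), a simple 1.5×IQR rule can flag these outliers; the column name below is an assumption and should be adjusted to whatever actually appears in X_train.columns.
col = "sepal width (cm)"  # assumed column name; adjust to match X_train.columns
q1, q3 = X_train[col].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = X_train[(X_train[col] < q1 - 1.5 * iqr) | (X_train[col] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential sepal-width outliers out of {len(X_train)} training samples")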
plt.imshow(plt.imread(f"figs/{DATASET_NAME}_target-class-distribution-y_train.png"))
plt.axis("off")
plt.show()
Classes in the train data are roughly balanced so we'll continue without any class balancing.
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
We will use cross-validation to evaluate the model so that we get a more reliable estimate of how well it generalizes.
But how many folds?
Find best cross-validation n_folds and type¶
NOTE: the learning curve already gives a view of this, so this extra step isn't strictly necessary; it's included here for learning purposes.
How many folds should we use in our cross-validation? And what type of cross-validation should we use?
We'll use Stratified K-Fold CV, since we'd like to keep the class distribution in each fold similar to the original dataset.
Let's do a hyperparameter search for the number of folds.
Recall, the choice of k depends on computational resources, bias, variance, and the size of the dataset:
- higher k
  - each model is trained on a larger portion of the dataset (e.g. k=10 => 90%) => lower bias, higher variance
  - more folds => more computational cost (more models to train and evaluate)
  - more "stable" estimates because the model is validated on more unique splits
- lower k
  - each model is trained on a smaller portion of the dataset (e.g. k=5 => 80%) => higher bias, lower variance
  - fewer folds => lower computational cost (fewer models to train and evaluate)
So:
- with smaller datasets like Iris, we can afford to use a higher k
Note: if k = number of samples, then we have leave-one-out CV => low bias, high variance, high computational cost; a quick sketch of this extreme follows.
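A minimal sketch of that leave-one-out extreme, assuming the X_train, y_train, and exp.estimator defined above (120 folds, each validating on a single sample):
from sklearn.model_selection import LeaveOneOut, cross_val_score
loo_scores = cross_val_score(
    exp.estimator,
    X_train,
    y_train.values.ravel(),
    cv=LeaveOneOut(),  # one model per training sample
    scoring="accuracy",
)
print(f"Leave-one-out mean accuracy: {loo_scores.mean():.3f}")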
Note: this could also be shown as a line plot of the mean with a shaded band (fill_between) for the standard deviation; a sketch of that view follows the box plot below.
# Convert y_train to a 1-dimensional array
y_train_array = y_train.values.ravel()
# cross-validation hyperparameter search (cv-accuracy on Y axis and k on X axis)
min_class_samples = min(
    np.bincount(y_train_array)
)  # size of the smallest class; StratifiedKFold requires n_splits to be at most this, since each fold must preserve the class proportions
k_range = range(2, min_class_samples + 1)
k_scores = []
print(f"Range of k: {k_range}")
for k in k_range:  # for each value of k, run k-fold cross-validation
    scores = cross_val_score(
        exp.estimator,
        X_train,
        y_train_array,
        cv=StratifiedKFold(n_splits=k),
        scoring="accuracy",
    )
    k_scores.append(scores)
# Plotting the box plots
plt.figure(figsize=(12, 8))
plt.boxplot(k_scores, positions=k_range, widths=0.6)
plt.title("K-Fold Cross-Validation Hyperparameter Search")
plt.xlabel("Value of K")
plt.ylabel("Cross-Validated Accuracy")
# makes it easier to read the plot
plt.grid(axis="x", linestyle="--", color="gray", alpha=0.7)
plt.savefig(f"{FIGS_DIR}/{exp.name}-kfold-cv.png")
plt.show()
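As noted above, the same search can also be viewed as a line plot of the mean CV accuracy with a shaded ±1 standard deviation band; this sketch reuses the k_scores computed in the previous cell.
mean_scores = np.array([s.mean() for s in k_scores])
std_scores = np.array([s.std() for s in k_scores])
plt.figure(figsize=(12, 6))
plt.plot(list(k_range), mean_scores, marker="o", label="mean CV accuracy")
plt.fill_between(
    list(k_range),
    mean_scores - std_scores,
    mean_scores + std_scores,
    alpha=0.2,
    label="±1 std dev",
)
plt.xlabel("Value of K")
plt.ylabel("Cross-Validated Accuracy")
plt.legend()
plt.show()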
So,
- with 2 folds: in the first iteration the model trains on 60 samples and validates on the other 60; in the second iteration the folds swap roles, and the two accuracies are averaged for the final score.
Q: Why is CV accuracy lower with 12 folds? Possibly just how the samples happen to divide up, since the class counts don't split perfectly evenly across 12 stratified folds.
Q: So with 10 folds there are 12 samples per fold (120/10 = 12), meaning each of the 10 iterations trains on 108 samples (9 x 12) and validates on 12? Yes, as the quick check below confirms.
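A quick sanity check of that arithmetic (sketch, reusing X_train and y_train_array from above):
from sklearn.model_selection import StratifiedKFold
for fold_train, fold_val in StratifiedKFold(n_splits=10).split(X_train, y_train_array):
    print(len(fold_train), len(fold_val))  # expect 108 train / 12 validation samples
    break  # one fold is enough to confirm the sizes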
# display the per-fold scores for each value of k (first 15 fold scores after transposing)
disp_df(pd.DataFrame(k_scores, index=k_range).T.head(15))
Let's interpret this.
E.g. if k=5 we are training 5 models, each on 80% of the data, validating on the remaining 20%, and recording the accuracy:
- training on 96 samples (120/5 = 24 samples per fold, times 4 folds)
- validating on 24 samples
Higher values of k (20+) reach high accuracy because each model is trained on most of the training data, but the variance steadily increases.
Lower values of k (around 5 or below) show lower accuracy because each model is trained on less of the training data, but the variance is also lower, since each validation fold is larger.
In our case let's go with k=3, since it has smaller variance while accuracy is still relatively high; k=3 and k=6 give roughly the same accuracy, so we take the cheaper option (see the quick comparison below).
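A quick comparison sketch, pulling the k=3 and k=6 scores out of the k_scores list computed above (k_range starts at 2, so k maps to index k - 2):
for k in (3, 6):
    s = k_scores[k - 2]  # k_range starts at 2, so k maps to index k - 2
    print(f"k={k}: mean accuracy = {s.mean():.3f}, std = {s.std():.3f}")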
Just for fun, let's visualize the selected CV folds¶
k = 3
cv = StratifiedKFold(
n_splits=k,
)
# print number of samples in each fold and samples per class
for i, (train, test) in enumerate(cv.split(X_train, y_train)):
    print(
        f"Fold {i} contains {len(train)} training samples and {len(test)} testing samples"
    )
    # bincount expects a 1-dimensional array
    y_train_array_train = y_train.iloc[train].to_numpy().flatten()
    y_train_array_test = y_train.iloc[test].to_numpy().flatten()
    print(
        f"Train: {np.bincount(y_train_array_train)}"
    )  # number of samples in class 0, 1, 2
    print(f"Test: {np.bincount(y_train_array_test)}")  # number of samples in class 0, 1, 2
fig, ax = plt.subplots()
plot_cv_indices(
cv=cv,
X=X_train,
y=y_train_array,
ax=ax,
n_splits=k,
)
fig.savefig(f"{FIGS_DIR}/{exp.name}-cv-indices.png")
Inspect learning curve¶
Recall, the learning curve shows us how the model's performance changes with the number of training samples.
- required data size: how much data is needed to get good performance before improvement plateaus
cv = StratifiedKFold(
n_splits=3,
)
# Note: LearningCurveDisplay contains the scores as parameters (train_scores, test_scores)...could save these if needed
lcd = LearningCurveDisplay.from_estimator(
estimator=exp.estimator,
X=X_train,
y=y_train,
# train_sizes=np.linspace(0.1, 1.0, 5),
# train_sizes=np.linspace(1, 80, 5).astype(int),
train_sizes=np.linspace(0.1, 1.0, 5),
# splitters are instantiated with shuffle=False so the splits will be the same across all calls
cv=cv,
random_state=0,
# return_times = True, # default false
)
# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)
plt.title(f"Learning Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test")
plt.ylabel("Accuracy")
plt.savefig(f"{FIGS_DIR}/{exp.name}_learning-curve.png")
Interpretation:
- Train curve is perfect because the DT can easily memorize the training data, even when the number of samples is small
- Around 20 samples is where the CV curve peaks
Note: In simpler models, we often see a gradual improvement in training accuracy as the number of samples increases because they cannot memorize the training data perfectly. However, decision trees do not follow this pattern due to their high capacity to fit the training data exactly, even with a small number of samples.
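To illustrate that note, here is a sketch (not part of the experiment, and assuming RANDOM_SEED, StratifiedKFold, and LearningCurveDisplay from the cells above) of the same learning curve for a heavily constrained tree, a depth-1 stump, which cannot memorize the training data, so its training curve is no longer pinned at 1.0:
stump = DecisionTreeClassifier(max_depth=1, random_state=RANDOM_SEED)
LearningCurveDisplay.from_estimator(
    estimator=stump,
    X=X_train,
    y=y_train.values.ravel(),
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=3),
)
plt.title("Learning Curve (max_depth=1 stump, for contrast)")
plt.ylabel("Accuracy")
plt.show()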
Conclusion:
- Keep the training set size as is, since the dataset (and the resulting model) is small.
exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")
Hyperparameter search¶
For this experiment we aren't going to run a grid search or any other hyperparameter search.
We are just going to use criterion="entropy" and max_depth=None (no pruning), since that is the point of the experiment. We expect it to overfit.
exp.estimator.get_params()
params = {
"classifier__criterion": "entropy",
"classifier__max_depth": None,
"classifier__max_features": None,
"classifier__max_leaf_nodes": None,
}
exp.estimator.set_params(**params)
exp.estimator.get_params()
# fit on training data
start_time = pd.Timestamp.now()
exp.estimator.fit(X=X_train, y=y_train)
train_time = pd.Timestamp.now() - start_time
exp.update_param("train_time", train_time)
exp.update_param(
"mean_accuracy",
exp.estimator.score(X_test, y_test),
# add_column=True
)
exp.summary_df
Take a look at the trained model¶
text_representation = export_text(
exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
    f.write(text_representation)
# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list
fig = plt.figure(figsize=(25, 20))
plot_tree(
decision_tree=exp.estimator.named_steps["classifier"],
feature_names=X_train.columns,
class_names=target_names_list,
filled=True,
rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")
# # couldn't get dtreeviz to work with a Pipeline, so this workaround trains a separate model outside the pipeline, which may differ from the pipeline's classifier; left commented out for that reason
# X_train_nparray = X_train.to_numpy()
# y_train_nparray = y_train.to_numpy().flatten()
# params = {
# "criterion": "entropy",
# "max_depth": None,
# "max_features": None,
# "max_leaf_nodes": None,
# }
# dtc_temp = DecisionTreeClassifier(**params)
# dtc_temp.fit(X_train, y_train)
# viz_model = dtreeviz.model(
# dtc_temp,
# X_train=X_train_nparray,
# y_train=y_train_nparray,
# target_name="iris",
# feature_names=X_train.columns.to_list(),
# class_names=target_names_list,
# )
# v = viz_model.view(scale=1.5)
# # v.save(f"{FIGS_DIR}/{exp.name}_dtreeviz.svg")
# v
# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
exp.update_param("query_time", query_time)
Use a confusion matrix to see its classification performance: what it got right and what it got wrong.
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
y_true=y_test,
y_pred=y_pred,
# normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")
Q: I expected the unpruned DT to classify everything correctly...why didn't it? An unpruned tree only guarantees a perfect fit on the training data; test samples that fall on the wrong side of its overfit decision boundaries are still misclassified.
Q: Did it overfit? Train and test performance are fairly close, so perhaps not dramatically on this small, clean dataset; the quick check below compares the two directly.
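A quick check (sketch) that makes the comparison explicit, reusing the fitted pipeline from above:
train_acc = exp.estimator.score(X_train, y_train)
test_acc = exp.estimator.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}  Test accuracy: {test_acc:.3f}")
# the gap between the two is the practical measure of overfitting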
exp.update_param("confusion_matrix", np.array2string(cm))
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr))
cr
# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
"split_criterion",
exp.estimator.named_steps["classifier"].criterion,
add_column=True,
)
exp.update_param(
"tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
"n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
"n_tree_nodes",
exp.estimator.named_steps["classifier"].tree_.node_count,
add_column=True,
)
exp.summary_df
# exp.save(overwrite_existing=False)
Conclusions¶
Expectation: it will overfit the training data (high variance).
- Yes, it overfit the training data.
Expectation: training time will be minimal since the dataset is small.
- Yes, training time was minimal.