Experiment: Decision Tree Classifier on the Iris dataset with pre-pruning

✂️ Quick Summary: Pre-Pruned Decision Tree (Iris Dataset)

Goal: Prevent the tree from growing too complex by limiting its size during training.

🧠 Core Idea:

Use hyperparameters like max_depth, min_samples_leaf, etc., to halt growth early, reducing overfitting risk.

In general, pre-pruning avoids overfitting by stopping growth early: it is fast but risks missing patterns. Post-pruning fixes overfitting by trimming the tree after it has been fully grown: it is slower but usually ends up with a better-chosen tree.
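
To make the contrast concrete, here is a minimal sketch of both approaches in scikit-learn. It is not part of the experiment: it assumes the stock `load_iris` data with its own 80/20 split (not the saved `iris-20test-shuffled-v1` split), and the `ccp_alpha` value is simply taken from the middle of the pruning path for illustration, not tuned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data and a fresh split, not the notebook's saved dataset.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Pre-pruning: cap growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow the full tree first, then prune it back
# with minimal cost-complexity pruning (ccp_alpha).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)
# Illustrative alpha from the middle of the path (not tuned); clamp at 0
# to guard against tiny negative values from floating-point error.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(
        f"{name:12s} depth={model.get_depth()} "
        f"leaves={model.get_n_leaves()} "
        f"test_acc={model.score(X_te, y_te):.3f}"
    )
```

In practice the alpha would be chosen by cross-validation over `path.ccp_alphas` rather than picked by position.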

🧮 Example Configuration:

max_depth=3
min_samples_split=4
min_samples_leaf=2
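
As a rough illustration of what these three settings do, the sketch below fits a tree with exactly this configuration and then checks the constraints on the fitted structure. It assumes the full stock `load_iris` data rather than the notebook's saved train split, so the resulting tree may differ from the one trained later.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: full stock iris data, used only to illustrate the constraints.
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=3,          # never grow deeper than 3 levels
    min_samples_split=4,  # a node needs at least 4 samples to be split
    min_samples_leaf=2,   # every leaf must keep at least 2 samples
    random_state=0,
).fit(X, y)

# Inspect the fitted tree: leaves are the nodes without a left child (-1).
tree = clf.tree_
is_leaf = tree.children_left == -1
print("depth:", clf.get_depth())
print("smallest leaf size:", tree.n_node_samples[is_leaf].min())
print("smallest split node size:", tree.n_node_samples[~is_leaf].min())
```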

🔧 Implications:

Aspect	Expectation	Notes
Accuracy	High (~95%)	Good balance of bias and variance.
Overfitting	Less likely	Stops before fully memorizing data.
Interpretability	Good	Tree remains small and readable.
Tuning Needed	Yes	Hyperparameters control complexity.

Expectation: pre-pruning will generalize better than the unpruned tree and train faster.

🔑 Characteristics:

Simple, interpretable, and fast.
Works well when tuned.
Ideal for small-to-medium tabular datasets.

In [4]:

%reload_ext autoreload
%autoreload 2

In [5]:

# decision tree specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# experiment helper imports
from helpers.base_imports import *

Set up experiment with data and model

In [6]:

exp = Experiment(
    type="c",  # classification
    name="dtc-iris-prepruned",
    dataset="iris-20test-shuffled-v1",
)
exp

Loading 'classification-experiments.csv'
Creating experiment: 'dtc-iris-prepruned'
Loading 'dtc-iris-prepruned' estimator/model/pipeline

Out[6]:

Experiment(c, dtc-iris-prepruned, iris-20test-shuffled-v1)

Start with the unpruned parameters; the validation curves and grid search below will then show what max_depth should be set to.

In [7]:

# add the steps to the pipeline
steps = [
    # NOTE: DTs don't need scaling, but we include it here for consistency when comparing to other classifiers
    ("scaler", StandardScaler()),
    (
        "classifier",
        DecisionTreeClassifier(
            criterion="entropy",  # gini tends to be faster but similar performance
            splitter="best",  # best split or random
            max_depth=None,  # no max depth (so will likely overfit)
            min_samples_split=2,  # require at least 2 samples to split a node
            min_samples_leaf=1,  # require at least 1 sample in each leaf
            min_weight_fraction_leaf=0.0,
            max_features=None,  # consider all features when looking for the best split
            random_state=RANDOM_SEED,
            max_leaf_nodes=None,  # None means no limit on the number of leaf nodes
            min_impurity_decrease=0.0,  # node will split if decrease in impurity is at least this much
            class_weight=None,  # all classes are weighted/treated equally
            ccp_alpha=0.0,  # complexity parameter for minimal cost-complexity pruning, 0 means no pruning
        ),
    ),
]
exp.estimator = Pipeline(
    steps=steps,
    memory=CACHE_DIR,
)
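
The note in the pipeline says decision trees do not need scaling. A quick way to check that claim is to fit the same tree with and without a `StandardScaler` and compare predictions. This is a sketch on the stock `load_iris` data (an assumption, not the notebook's saved split); because standardization is a monotonic per-feature transform, the two models should split the samples the same way, up to floating-point effects.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data, used only for this comparison.
X, y = load_iris(return_X_y=True)

# Same tree settings, once on raw features and once behind a scaler.
raw = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0
).fit(X, y)
scaled = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0),
).fit(X, y)

# Fraction of samples on which both models predict the same class.
agreement = np.mean(raw.predict(X) == scaled.predict(X))
print(f"fraction of identical predictions: {agreement:.3f}")
```

One side effect of keeping the scaler is that the split thresholds of the fitted tree are expressed in standardized units rather than centimetres.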

In [8]:

exp.estimator.get_params()

Out[8]:

{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))],
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': None,
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': 2,
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}

Get the dataset by name (EDA was already done in another notebook and the train/test split was saved, so we are working with the same data).

In [9]:

notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Dataset: iris-20test-shuffled-v1
X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Train: 80.00% of total
Test: 20.00% of total
Notes: None
Created by save_dataset() helper at 2024-07-02 12:20:00

  target_names
0       setosa
1   versicolor
2    virginica

Out[9]:

((120, 4), (30, 4), (120, 1), (30, 1))

In [10]:

# inspect data
disp_df(pd.concat([X_train, y_train], axis=1))

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	6.4	3.1	5.5	1.8	2
1	5.4	3.0	4.5	1.5	1
2	5.2	3.5	1.5	0.2	0
3	6.1	3.0	4.9	1.8	2
4	6.4	2.8	5.6	2.2	2
5	5.2	2.7	3.9	1.4	1
6	5.7	3.8	1.7	0.3	0
7	6.0	2.7	5.1	1.6	1
8	5.9	3.0	4.2	1.5	1
9	5.8	2.6	4.0	1.2	1
10	6.8	3.0	5.5	2.1	2
11	4.7	3.2	1.3	0.2	0
12	6.9	3.1	5.1	2.3	2
13	5.0	3.5	1.6	0.6	0
14	5.4	3.7	1.5	0.2	0
15	5.0	2.0	3.5	1.0	1
16	6.5	3.0	5.5	1.8	2
17	6.7	3.3	5.7	2.5	2
18	6.0	2.2	5.0	1.5	2
19	6.7	2.5	5.8	1.8	2
20	5.6	2.5	3.9	1.1	1
21	7.7	3.0	6.1	2.3	2
22	6.3	3.3	4.7	1.6	1
23	5.5	2.4	3.8	1.1	1
24	6.3	2.7	4.9	1.8	2
25	6.3	2.8	5.1	1.5	2
26	4.9	2.5	4.5	1.7	2
27	6.3	2.5	5.0	1.9	2
28	7.0	3.2	4.7	1.4	1
29	6.5	3.0	5.2	2.0	2
30	6.0	3.4	4.5	1.6	1
31	4.8	3.1	1.6	0.2	0
32	5.8	2.7	5.1	1.9	2
33	5.6	2.7	4.2	1.3	1
34	5.6	2.9	3.6	1.3	1
35	5.5	2.5	4.0	1.3	1
36	6.1	3.0	4.6	1.4	1
37	7.2	3.2	6.0	1.8	2
38	5.3	3.7	1.5	0.2	0
39	4.3	3.0	1.1	0.1	0
40	6.4	2.7	5.3	1.9	2
41	5.7	3.0	4.2	1.2	1
42	5.4	3.4	1.7	0.2	0
43	5.7	4.4	1.5	0.4	0
44	6.9	3.1	4.9	1.5	1
45	4.6	3.1	1.5	0.2	0
46	5.9	3.0	5.1	1.8	2
47	5.1	2.5	3.0	1.1	1
48	4.6	3.4	1.4	0.3	0
49	6.2	2.2	4.5	1.5	1
50	7.2	3.6	6.1	2.5	2
51	5.7	2.9	4.2	1.3	1
52	4.8	3.0	1.4	0.1	0
53	7.1	3.0	5.9	2.1	2
54	6.9	3.2	5.7	2.3	2
55	6.5	3.0	5.8	2.2	2
56	6.4	2.8	5.6	2.1	2
57	5.1	3.8	1.6	0.2	0
58	4.8	3.4	1.6	0.2	0
59	6.5	3.2	5.1	2.0	2
60	6.7	3.3	5.7	2.1	2
61	4.5	2.3	1.3	0.3	0
62	6.2	3.4	5.4	2.3	2
63	4.9	3.0	1.4	0.2	0
64	5.7	2.5	5.0	2.0	2
65	6.9	3.1	5.4	2.1	2
66	4.4	3.2	1.3	0.2	0
67	5.0	3.6	1.4	0.2	0
68	7.2	3.0	5.8	1.6	2
69	5.1	3.5	1.4	0.3	0
70	4.4	3.0	1.3	0.2	0
71	5.4	3.9	1.7	0.4	0
72	5.5	2.3	4.0	1.3	1
73	6.8	3.2	5.9	2.3	2
74	7.6	3.0	6.6	2.1	2
75	5.1	3.5	1.4	0.2	0
76	4.9	3.1	1.5	0.2	0
77	5.2	3.4	1.4	0.2	0
78	5.7	2.8	4.5	1.3	1
79	6.6	3.0	4.4	1.4	1
80	5.0	3.2	1.2	0.2	0
81	5.1	3.3	1.7	0.5	0
82	6.4	2.9	4.3	1.3	1
83	5.4	3.4	1.5	0.4	0
84	7.7	2.6	6.9	2.3	2
85	4.9	2.4	3.3	1.0	1
86	7.9	3.8	6.4	2.0	2
87	6.7	3.1	4.4	1.4	1
88	5.2	4.1	1.5	0.1	0
89	6.0	3.0	4.8	1.8	2
90	5.8	4.0	1.2	0.2	0
91	7.7	2.8	6.7	2.0	2
92	5.1	3.8	1.5	0.3	0
93	4.7	3.2	1.6	0.2	0
94	7.4	2.8	6.1	1.9	2
95	5.0	3.3	1.4	0.2	0
96	6.3	3.4	5.6	2.4	2
97	5.7	2.8	4.1	1.3	1
98	5.8	2.7	3.9	1.2	1
99	5.7	2.6	3.5	1.0	1
100	6.4	3.2	5.3	2.3	2
101	6.7	3.0	5.2	2.3	2
102	6.3	2.5	4.9	1.5	1
103	6.7	3.0	5.0	1.7	1
104	5.0	3.0	1.6	0.2	0
105	5.5	2.4	3.7	1.0	1
106	6.7	3.1	5.6	2.4	2
107	5.8	2.7	5.1	1.9	2
108	5.1	3.4	1.5	0.2	0
109	6.6	2.9	4.6	1.3	1
110	5.6	3.0	4.1	1.3	1
111	5.9	3.2	4.8	1.8	1
112	6.3	2.3	4.4	1.3	1
113	5.5	3.5	1.3	0.2	0
114	5.1	3.7	1.5	0.4	0
115	4.9	3.1	1.5	0.1	0
116	6.3	2.9	5.6	1.8	2
117	5.8	2.7	4.1	1.0	1
118	7.7	3.8	6.7	2.2	2
119	4.6	3.2	1.4	0.2	0

Double-check the data statistics and features from the EDA.

In [11]:

plt.imshow(plt.imread("figs/iris-20test-shuffled-v0_feature-statistics-X_train.png"))
plt.axis("off")
plt.show()

We have some outliers for sepal width.

Decision Trees are generally considered robust to outliers so we will leave them.

In [12]:

plt.imshow(plt.imread("figs/iris_target-class-distribution-y_train.png"))
plt.axis("off")
plt.show()

Classes in the train data are roughly balanced so we'll continue without any class balancing.

In [13]:

exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df

Out[13]:

	dataset_name	n_train_samples	n_test_samples	mean_accuracy	train_time	query_time	kfolds	confusion_matrix	classification_report
exp_name
dtc-iris-prepruned	iris-20test-shuffled-v1	120	30	NaN	NaN	NaN	NaN	NaN	NaN

Inspect learning curve

Required data size: how much data is needed to get good performance before the improvement plateaus.

In [16]:

cv = StratifiedKFold(
    n_splits=3,
)
# Note: the LearningCurveDisplay object stores the scores as attributes (train_scores, test_scores); these could be saved if needed
lcd = LearningCurveDisplay.from_estimator(
    estimator=exp.estimator,
    X=X_train,
    y=y_train,
    # train_sizes=np.linspace(0.1, 1.0, 5),
    # train_sizes=np.linspace(1, 80, 5).astype(int),
    train_sizes=np.linspace(0.1, 1.0, 5),
    # splitters are instantiated with shuffle=False so the splits will be the same across all calls
    cv=cv,
    random_state=0,
    # return_times = True, # default false
)

# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)

plt.title(f"Learning Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test")
plt.ylabel("Accuracy")
plt.savefig(f"{FIGS_DIR}/{exp.name}_learning-curve.png")

All of this is exactly the same as the unpruned DT so far.
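
If the raw numbers behind the learning curve are needed (for example to save them alongside the figure), `learning_curve` returns them directly. The sketch below assumes the stock `load_iris` data and a bare tree instead of `exp.estimator`; because that data is ordered by class, it sets `shuffle=True` so that small training subsets contain all three classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data and a plain tree, not the notebook's pipeline.
X, y = load_iris(return_X_y=True)

train_sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=3),
    shuffle=True,  # shuffle training indices before taking subsets
    random_state=0,
)

# Each score array has one row per training size and one column per fold.
for n, tr, cv_row in zip(train_sizes, train_scores, cv_scores):
    print(
        f"n={n:3d}  train={tr.mean():.3f}  "
        f"cv={cv_row.mean():.3f} +/- {cv_row.std():.3f}"
    )
```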

In [37]:

exp.update_param("kfolds", f"Stratified {cv.get_n_splits()}-Fold Cross-Validation")

Hyperparameter search

In [17]:

# figure out ranges for hyperparameters using validation curves
vcd_max_depth = ValidationCurveDisplay.from_estimator(
    estimator=exp.estimator,
    X=X_train,
    y=y_train,
    param_name="classifier__max_depth",
    param_range=np.arange(1, 5),
    cv=cv,
    # shuffle=True,
)

# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)

plt.title(
    f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)

plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-max-depth.png")

So a max_depth of 2 or 3 seems okay. By eye, I'd go with 3 (we'll see what GridSearchCV says next, but I'm guessing it will be 3).

In [18]:

# figure out ranges for hyperparameters using validation curves
vcd_min_samples_split = ValidationCurveDisplay.from_estimator(
    estimator=exp.estimator,
    X=X_train,
    y=y_train,
    param_name="classifier__min_samples_split",
    param_range=np.arange(2, 10),
    cv=cv,
    # shuffle=True,
)

# Update the legend to change "Test" to "Cross-Validation"
handles, labels = plt.gca().get_legend_handles_labels()
labels = ["Training", "Cross-Validation"] if "Test" in labels else labels
plt.legend(handles, labels)

plt.title(
    f"Validation Curve\nStratified {cv.get_n_splits()}-Fold Cross-Validation Test"
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_validation-curve-min-samples-split.png")

For min_samples_split, the validation curve suggests that 2 gives a good accuracy score.

Now that we've seen the validation curves for these hyperparams, we have a good feel for the range that grid search should be looking in and what it will likely return.

Let's see if we're right.

In [19]:

# set hyperparameters and train model (+report)
# perform and report experiments with different hyperparameters
param_grid = {
    "classifier__max_depth": np.arange(1, 4),
    "classifier__min_samples_split": np.arange(2, 5),
}

grid_search = GridSearchCV(
    estimator=exp.estimator,
    param_grid=param_grid,
    # scoring="", # accuracy is default but we can use another or many others
    cv=cv,
)
grid_search.fit(X_train, y_train)

Out[19]:

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
             estimator=Pipeline(memory='.cache',
                                steps=[('scaler', StandardScaler()),
                                       ('classifier',
                                        DecisionTreeClassifier(criterion='entropy',
                                                               random_state=0))]),
             param_grid={'classifier__max_depth': array([1, 2, 3]),
                         'classifier__min_samples_split': array([2, 3, 4])})

In [20]:

grid_search.best_params_

Out[20]:

{'classifier__max_depth': np.int64(3),
 'classifier__min_samples_split': np.int64(2)}

In [21]:

grid_search.best_score_

Out[21]:

np.float64(0.9499999999999998)

GridSearchCV picks 3 as the best max_depth and 2 as the best min_samples_split (mean CV accuracy of about 0.95). That's what we will go with for our pre-pruned model.
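
`best_params_` only reports the top-ranked combination. To see how close the other candidates were, and whether several of them tie, the full `cv_results_` table can be inspected. The sketch below is an illustration on the stock `load_iris` data with a bare tree (so the parameter names have no `classifier__` prefix); the scores will not match the notebook's exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data and a plain tree, not the notebook's pipeline.
X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": [1, 2, 3], "min_samples_split": [2, 3, 4]},
    cv=StratifiedKFold(n_splits=3),
).fit(X, y)

# One row per parameter combination, best rank first.
columns = [
    "param_max_depth",
    "param_min_samples_split",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

If several rows share rank 1, the choice among them is a tie-break rather than evidence that one setting is better.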

In [22]:

exp.estimator.set_params(**grid_search.best_params_)
exp.estimator.get_params()

Out[22]:

{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier',
   DecisionTreeClassifier(criterion='entropy', max_depth=np.int64(3),
                          min_samples_split=np.int64(2), random_state=0))],
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': DecisionTreeClassifier(criterion='entropy', max_depth=np.int64(3),
                        min_samples_split=np.int64(2), random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__ccp_alpha': 0.0,
 'classifier__class_weight': None,
 'classifier__criterion': 'entropy',
 'classifier__max_depth': np.int64(3),
 'classifier__max_features': None,
 'classifier__max_leaf_nodes': None,
 'classifier__min_impurity_decrease': 0.0,
 'classifier__min_samples_leaf': 1,
 'classifier__min_samples_split': np.int64(2),
 'classifier__min_weight_fraction_leaf': 0.0,
 'classifier__monotonic_cst': None,
 'classifier__random_state': 0,
 'classifier__splitter': 'best'}

In [23]:

start_time = pd.Timestamp.now()
exp.estimator.fit(X_train, y_train)
train_time = pd.Timestamp.now() - start_time
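
A single wall-clock measurement of a fit that takes a few milliseconds is noisy. As a hedged alternative, the sketch below repeats the fit and reports the median using `time.perf_counter`. It assumes the stock `load_iris` data and a bare tree; it is not the timing method used for the numbers recorded in this notebook.

```python
import statistics
import time

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data and a plain tree, used only to show the pattern.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)

# Repeat the fit several times so one slow run does not dominate.
timings = []
for _ in range(20):
    start = time.perf_counter()
    clf.fit(X, y)
    timings.append(time.perf_counter() - start)

print(f"median fit time: {statistics.median(timings) * 1e3:.3f} ms")
print(f"min / max: {min(timings) * 1e3:.3f} / {max(timings) * 1e3:.3f} ms")
```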

In [24]:

exp.update_param("train_time", train_time)
exp.update_param(
    "mean_accuracy",
    exp.estimator.score(X_test, y_test),
    # add_column=True
)
exp.summary_df

Out[24]:

	dataset_name	n_train_samples	n_test_samples	mean_accuracy	train_time	query_time	kfolds	confusion_matrix	classification_report
exp_name
dtc-iris-prepruned	iris-20test-shuffled-v1	120	30	0.966667	0 days 00:00:00.003480	NaN	NaN	NaN	NaN

Take a look at our trained pre-pruned model

In [25]:

text_representation = export_text(
    exp.estimator.named_steps["classifier"], feature_names=X_train.columns
)
print(text_representation)
with open(f"{RES_DIR}/{exp.name}-dtree.txt", "w") as f:
    f.write(text_representation)

|--- petal width (cm) <= -0.56
|   |--- class: 0
|--- petal width (cm) >  -0.56
|   |--- petal width (cm) <= 0.67
|   |   |--- petal length (cm) <= 0.64
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.64
|   |   |   |--- class: 2
|   |--- petal width (cm) >  0.67
|   |   |--- petal length (cm) <= 0.58
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  0.58
|   |   |   |--- class: 2
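
Note that the thresholds above (for example -0.56) are in standardized units, because the tree sits behind the `StandardScaler`. They can be mapped back to centimetres with `threshold * scale_ + mean_` for the feature in question. The sketch below shows the conversion on a pipeline fitted to the stock `load_iris` data (an assumption; the resulting values will differ slightly from those of the saved train split).

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data, not the notebook's saved train split.
iris = load_iris()
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        (
            "classifier",
            DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0),
        ),
    ]
).fit(iris.data, iris.target)

scaler = pipe.named_steps["scaler"]
tree = pipe.named_steps["classifier"].tree_

# Walk over the split nodes and undo the standardization of each threshold.
original_thresholds = []
for node in range(tree.node_count):
    if tree.children_left[node] == -1:  # leaf node, no threshold
        continue
    f = tree.feature[node]
    thr_cm = tree.threshold[node] * scaler.scale_[f] + scaler.mean_[f]
    original_thresholds.append((f, thr_cm))
    print(
        f"node {node}: {iris.feature_names[f]} <= {thr_cm:.2f} "
        f"(scaled: {tree.threshold[node]:.2f})"
    )
```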

In [26]:

# convert target names series to list
target_names_list = target_names["target_names"].tolist()
target_names_list

Out[26]:

['setosa', 'versicolor', 'virginica']

In [27]:

fig = plt.figure(figsize=(25, 20))
plot_tree(
    decision_tree=exp.estimator.named_steps["classifier"],
    feature_names=X_train.columns,
    class_names=target_names_list,
    filled=True,
    rounded=True,
)
plt.savefig(f"{FIGS_DIR}/{exp.name}_tree.png")

We can see it's basically the same as the unpruned model but with a max_depth of 3.

In [28]:

# get precision, recall, f1, accuracy
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time

In [34]:

exp.update_param("query_time", query_time)

In [29]:

target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
    y_true=y_test,
    y_pred=y_pred,
    # normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")

It misclassifies one test sample (a virginica predicted as versicolor) but still gets the rest right.
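
The per-class metrics can be read straight off the confusion matrix: recall divides each diagonal entry by its row sum (true class), precision divides it by its column sum (predicted class). The sketch below uses the matrix values recorded in this experiment's summary, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix from this experiment (rows: true class, columns: predicted).
cm = np.array(
    [
        [11, 0, 0],  # setosa
        [0, 13, 0],  # versicolor
        [0, 1, 5],   # virginica
    ]
)

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of that class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of that class
accuracy = np.trace(cm) / cm.sum()        # 29 of 30 test samples correct

for name, p, r in zip(["setosa", "versicolor", "virginica"], precision, recall):
    print(f"{name:10s} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

The single misclassified virginica lowers virginica recall to 5/6 and versicolor precision to 13/14, which is what the classification report shows.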

In [30]:

exp.update_param("confusion_matrix", np.array2string(cm))

In [31]:

cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr))
cr

Out[31]:

{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 11.0},
 '1': {'precision': 0.9285714285714286,
  'recall': 1.0,
  'f1-score': 0.9629629629629629,
  'support': 13.0},
 '2': {'precision': 1.0,
  'recall': 0.8333333333333334,
  'f1-score': 0.9090909090909091,
  'support': 6.0},
 'accuracy': 0.9666666666666667,
 'macro avg': {'precision': 0.9761904761904763,
  'recall': 0.9444444444444445,
  'f1-score': 0.957351290684624,
  'support': 30.0},
 'weighted avg': {'precision': 0.9690476190476189,
  'recall': 0.9666666666666667,
  'f1-score': 0.9657687991021324,
  'support': 30.0}}

In [32]:

# add custom decision tree classification specific metrics to the summary_df
exp.update_param(
    "split_criterion",
    exp.estimator.named_steps["classifier"].criterion,
    add_column=True,
)
exp.update_param(
    "tree_depth", exp.estimator.named_steps["classifier"].get_depth(), add_column=True
)
exp.update_param(
    "n_leaves", exp.estimator.named_steps["classifier"].get_n_leaves(), add_column=True
)
exp.update_param(
    "n_tree_nodes",
    exp.estimator.named_steps["classifier"].tree_.node_count,
    add_column=True,
)
exp.summary_df

Adding column: split_criterion
Adding column: tree_depth
Adding column: n_leaves

Out[32]:

	dataset_name	n_train_samples	n_test_samples	mean_accuracy	train_time	query_time	kfolds	confusion_matrix	classification_report	split_criterion	tree_depth	n_leaves
exp_name
dtc-iris-prepruned	iris-20test-shuffled-v1	120	30	0.966667	0 days 00:00:00.003480	NaN	NaN	[[11 0 0]\n [ 0 13 0]\n [ 0 1 5]]	{'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc...	entropy	3	5
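
The summary table above shows tree_depth and n_leaves but no n_tree_nodes column. Since scikit-learn trees are binary (every split node has exactly two children), the node count can be cross-checked as `2 * n_leaves - 1`, which for the 5 leaves reported here would be 9. The sketch below checks that relation on a tree fitted to the stock `load_iris` data (an assumption; its leaf count may differ from the notebook's).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data, used only to check the node-count relation.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0
).fit(X, y)

n_leaves = clf.get_n_leaves()
n_nodes = clf.tree_.node_count

# In a binary tree, nodes = split nodes + leaves = (leaves - 1) + leaves.
print(f"depth={clf.get_depth()} leaves={n_leaves} nodes={n_nodes}")
print("nodes == 2 * leaves - 1:", n_nodes == 2 * n_leaves - 1)
```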

In [38]:

exp.save(overwrite_existing=False)

Loading 'classification-experiments.csv'
Overwriting existing experiment dtc-iris-prepruned
Saving experiment dtc-iris-prepruned to results/classification-experiments.csv
Dumping estimator dtc-iris-prepruned to .cache/dtc-iris-prepruned.joblib

Conclusions

Expectation: pre-pruning will generalize better than the unpruned tree and train faster.
- Yes, the pre-pruned tree trained faster.
- I am not sure we can say it generalized better: test accuracy was a bit lower, with one test sample misclassified where the unpruned tree misclassified none. I think CV accuracy is the better indicator of how well it generalizes (TODO: follow up on this).
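
One way to follow up on the open question is to compare cross-validated accuracy for the two models rather than a single 30-sample test score. This is only a sketch: it assumes the stock `load_iris` data instead of the saved train split, bare trees instead of the pipelines, and the same stratified 3-fold scheme used above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: stock iris data, not the notebook's saved train split.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=3)

models = {
    "unpruned": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "pre-pruned": DecisionTreeClassifier(
        criterion="entropy", max_depth=3, random_state=0
    ),
}

# Accuracy per fold for each model, evaluated on the same folds.
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name:10s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only three folds on 150 samples, small differences between the two means are within noise, so the comparison is suggestive at best.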
