

Experiment: Support Vector Machine Classifier on the Iris dataset

⚖️ Quick Summary

Goal: Find the optimal decision boundary that maximizes the margin between classes.


🧠 Core Idea:

  • SVM finds the hyperplane that separates the classes with the largest margin.
  • Uses support vectors (the critical boundary cases) to define the hyperplane; see the toy sketch after this summary.
  • Can use kernels to handle nonlinear separation.

🧮 Example Configuration / Hyperparams:

  • kernel='rbf' (default: good for nonlinear boundaries)
  • C=1.0 (regularization strength)
  • gamma='scale' (kernel coefficient)

🔧 Expectations:

Aspect            Expectation          Notes
------            -----------          -----
Accuracy          High                 Excellent for small, clean datasets like Iris
Overfitting       Moderate (tunable)   Controlled via C and kernel choice
Training Time     Moderate             Slower than KNN or DTC; fast on small datasets
Interpretability  Low                  Not easily visualizable or interpretable
Kernel Choice     Critical             RBF for nonlinear, linear for simpler separable cases

🔑 Characteristics:

  • Powerful for both linear and nonlinear problems.
  • Sensitive to feature scaling.
  • Kernel trick makes SVM flexible, but tuning is important.
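
To make the max-margin idea concrete, here is a minimal, self-contained toy sketch (data and variable names invented for illustration) that fits a linear SVC and reads off the support vectors and the margin width 2/||w||:

import numpy as np
from sklearn.svm import SVC

# toy 2D linearly separable data (illustration only)
X_toy = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)
print(clf.support_vectors_)                    # the critical boundary cases
w = clf.coef_[0]                               # coef_ is available because kernel="linear"
print("margin width:", 2 / np.linalg.norm(w))  # the quantity the SVM maximizes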
In [1]:
%reload_ext autoreload
%autoreload 2
In [2]:
# SVC-specific imports
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.inspection import DecisionBoundaryDisplay

# experiment helper imports
from helpers.base_imports import *

Set up the experiment with data and model

In [3]:
exp = Experiment(
    type="c",  # classification
    name="svc-iris-linear",
    dataset="iris-20test-shuffled-v1",
)
exp
Loading 'classification-experiments.csv'
Creating experiment: 'svc-iris-linear'
Loading 'svc-iris-linear' estimator/model/pipeline
/Users/yarik/vc_projects/ML/machine-learning/.venv-py31013/lib/python3.10/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator StandardScaler from version 1.5.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/Users/yarik/vc_projects/ML/machine-learning/.venv-py31013/lib/python3.10/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator SVC from version 1.5.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/Users/yarik/vc_projects/ML/machine-learning/.venv-py31013/lib/python3.10/site-packages/sklearn/base.py:380: InconsistentVersionWarning: Trying to unpickle estimator Pipeline from version 1.5.0 when using version 1.6.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Out[3]:
Experiment(c, svc-iris-linear, iris-20test-shuffled-v1)

Note that sklearn has svm.SVC(kernel="linear", C=C) and svm.LinearSVC(C=C, max_iter=10000) for linear SVMs. The former is more flexible and can use other kernels, while the latter is optimized for linear kernels and a bit faster. For this small dataset we'll just use the former.
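
For a side-by-side, a minimal sketch of the two constructions (defaults otherwise; note that LinearSVC uses the liblinear solver with squared hinge loss and one-vs-rest multiclass, so its results can differ slightly from SVC's):

from sklearn.svm import SVC, LinearSVC

svc_linear = SVC(kernel="linear", C=1.0)       # libsvm solver; exposes support_vectors_, supports other kernels
linear_svc = LinearSVC(C=1.0, max_iter=10000)  # liblinear solver; linear only, scales better to large datasets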

In [4]:
# add the steps to the pipeline
steps = [
    ("scaler", StandardScaler()),
    (
        "classifier",
        SVC(
            kernel="linear",  # linear, poly, rbg, etc
            gamma="auto",  # kernel coefficient (1/n features)
            C=1.0,  # regularization parameter
            random_state=RANDOM_SEED,
        ),
    ),
]
exp.estimator = Pipeline(
    steps=steps,
    memory=CACHE_DIR,
)
In [5]:
exp.estimator.get_params()
Out[5]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', SVC(gamma='auto', kernel='linear', random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': SVC(gamma='auto', kernel='linear', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__C': 1.0,
 'classifier__break_ties': False,
 'classifier__cache_size': 200,
 'classifier__class_weight': None,
 'classifier__coef0': 0.0,
 'classifier__decision_function_shape': 'ovr',
 'classifier__degree': 3,
 'classifier__gamma': 'auto',
 'classifier__kernel': 'linear',
 'classifier__max_iter': -1,
 'classifier__probability': False,
 'classifier__random_state': 0,
 'classifier__shrinking': True,
 'classifier__tol': 0.001,
 'classifier__verbose': False}

Get the dataset by name (EDA was already done in another notebook and the train/test split saved, so we will be working with the same data)

In [6]:
notes, X_train, X_test, y_train, y_test, target_names = get_dataset(exp.dataset)
print(notes)
print(target_names)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Dataset: iris-20test-shuffled-v1
X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Train: 80.00% of total
Test: 20.00% of total
Notes: None
Created by save_dataset() helper at 2024-07-09 12:28:10

  target_names
0       setosa
1   versicolor
2    virginica
Out[6]:
((120, 4), (30, 4), (120, 1), (30, 1))
In [10]:
# Make sure both label arrays are 1D numpy arrays
# (already the case here, so the conversion below is left commented out)
# y_train = y_train.to_numpy().ravel()
# y_test = y_test.to_numpy().ravel()
print("y_train:", type(y_train), y_train.shape)
print("y_test:", type(y_test), y_test.shape)
y_train: <class 'numpy.ndarray'> (120,)
y_test: <class 'numpy.ndarray'> (30,)
In [12]:
classes, counts = np.unique(y_train, return_counts=True)
for label, count in zip(classes, counts):
    print(f"Class {label}: {count} samples")
Class 0: 39 samples
Class 1: 37 samples
Class 2: 44 samples

SVMs can be biased toward the majority class, but the classes in the training data are roughly balanced, so we'll continue without any class balancing.
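
If the classes had been noticeably skewed, one common remedy (a sketch only, not applied here) is to reweight C inversely to class frequency:

# hypothetical alternative for imbalanced data
svc_weighted = SVC(kernel="linear", C=1.0, class_weight="balanced", random_state=RANDOM_SEED)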

In [13]:
exp.update_param("n_train_samples", X_train.shape[0])
exp.update_param("n_test_samples", X_test.shape[0])
exp.summary_df
Out[13]:
dataset_name n_train_samples n_test_samples mean_accuracy train_time query_time kfolds confusion_matrix classification_report
exp_name
svc-iris-linear iris-20test-shuffled-v1 120 30 NaN NaN NaN NaN NaN NaN

Optimize hyperparameters

In [14]:
# add the steps to the pipeline
steps = [
    # NOTE: unlike tree-based models, SVMs are sensitive to feature scaling, so the scaler stays in the pipeline
    ("scaler", StandardScaler()),
    (
        "classifier",
        SVC(
            kernel="linear",  # linear, poly, rbg, etc
            gamma="auto",  # kernel coefficient (1/n features)
            C=1.0,  # regularization parameter
            random_state=RANDOM_SEED,
        ),
    ),
]

exp.estimator = Pipeline(
    steps=steps,
    memory=CACHE_DIR,
)
In [16]:
param_grid = {
    "classifier__kernel": ["linear", "rbf", "poly"],
    "classifier__C": [0.1, 1, 10, 100],
    "classifier__gamma": ["scale", "auto", 0.1, 0.01, 0.001],
}

grid = GridSearchCV(exp.estimator, param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best Cross-Validation Score:", grid.best_score_)
Best Params: {'classifier__C': 1, 'classifier__gamma': 'scale', 'classifier__kernel': 'linear'}
Best Cross-Validation Score: 0.975

OK: it does best with C=1 (regularization strength) and a linear kernel. Note that gamma is ignored by the linear kernel, so the gamma='scale' in the best params is incidental.
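
To compare how the other kernels fared rather than only the single winner, the grid's cv_results_ can be summarized; a quick sketch using the grid object above (assumes pandas is available as pd, as in the later cells):

res = pd.DataFrame(grid.cv_results_)
cols = ["param_classifier__kernel", "param_classifier__C", "mean_test_score"]
print(res[cols].sort_values("mean_test_score", ascending=False).head(10))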

In [18]:
grid.best_estimator_.get_params()
Out[18]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', SVC(C=1, kernel='linear', random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': SVC(C=1, kernel='linear', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__C': 1,
 'classifier__break_ties': False,
 'classifier__cache_size': 200,
 'classifier__class_weight': None,
 'classifier__coef0': 0.0,
 'classifier__decision_function_shape': 'ovr',
 'classifier__degree': 3,
 'classifier__gamma': 'scale',
 'classifier__kernel': 'linear',
 'classifier__max_iter': -1,
 'classifier__probability': False,
 'classifier__random_state': 0,
 'classifier__shrinking': True,
 'classifier__tol': 0.001,
 'classifier__verbose': False}
In [19]:
my_params = {
    "classifier__kernel": "linear",
    "classifier__C": 1.0,
    # "classifier__gamma": "auto",
}
exp.estimator.set_params(**my_params)
exp.estimator.get_params()
Out[19]:
{'memory': '.cache',
 'steps': [('scaler', StandardScaler()),
  ('classifier', SVC(gamma='auto', kernel='linear', random_state=0))],
 'transform_input': None,
 'verbose': False,
 'scaler': StandardScaler(),
 'classifier': SVC(gamma='auto', kernel='linear', random_state=0),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'classifier__C': 1.0,
 'classifier__break_ties': False,
 'classifier__cache_size': 200,
 'classifier__class_weight': None,
 'classifier__coef0': 0.0,
 'classifier__decision_function_shape': 'ovr',
 'classifier__degree': 3,
 'classifier__gamma': 'auto',
 'classifier__kernel': 'linear',
 'classifier__max_iter': -1,
 'classifier__probability': False,
 'classifier__random_state': 0,
 'classifier__shrinking': True,
 'classifier__tol': 0.001,
 'classifier__verbose': False}
In [20]:
# fit on training data
start_time = pd.Timestamp.now()
exp.estimator.fit(X=X_train, y=y_train)
train_time = pd.Timestamp.now() - start_time
In [21]:
exp.update_param("train_time", train_time)
exp.update_param(
    "mean_accuracy",
    exp.estimator.score(X_test, y_test),
    # add_column=True
)
exp.summary_df
Out[21]:
dataset_name n_train_samples n_test_samples mean_accuracy train_time query_time kfolds confusion_matrix classification_report
exp_name
svc-iris-linear iris-20test-shuffled-v1 120 30 0.966667 0 days 00:00:00.003818 NaN NaN NaN NaN

Take a look at the trained model

In [22]:
# predict on the test set (timed as query time); precision/recall/f1 are computed below
start_time = pd.Timestamp.now()
y_pred = exp.estimator.predict(X_test)
query_time = pd.Timestamp.now() - start_time
In [23]:
exp.update_param("query_time", query_time)
In [25]:
target_names_list = target_names["target_names"].tolist()
cm = confusion_matrix(
    y_true=y_test,
    y_pred=y_pred,
    # normalize="true"
)
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names_list)
cmd.plot()
plt.title("Confusion Matrix")
# plt.savefig(f"{FIGS_DIR}/{exp.name}_confusion-matrix.png")
Out[25]:
Text(0.5, 1.0, 'Confusion Matrix')
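
As an aside, the DecisionBoundaryDisplay imported at the top can visualize the decision boundary, but only in two dimensions. A sketch that refits the same pipeline on the first two features purely for plotting (assumes X_train is array-like):

X2 = np.asarray(X_train)[:, :2]  # first two features only, for a 2D plot
clf2 = Pipeline([("scaler", StandardScaler()), ("classifier", SVC(kernel="linear", C=1.0))])
clf2.fit(X2, y_train)
disp = DecisionBoundaryDisplay.from_estimator(clf2, X2, response_method="predict", alpha=0.4)
disp.ax_.scatter(X2[:, 0], X2[:, 1], c=y_train, edgecolor="k")
disp.ax_.set_title("Linear SVC decision regions (first two features)")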
In [26]:
exp.update_param("confusion_matrix", np.array2string(cm))
In [27]:
cr = classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
exp.update_param("classification_report", str(cr))
cr
Out[27]:
{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 11.0},
 '1': {'precision': 1.0,
  'recall': 0.9230769230769231,
  'f1-score': 0.96,
  'support': 13.0},
 '2': {'precision': 0.8571428571428571,
  'recall': 1.0,
  'f1-score': 0.9230769230769231,
  'support': 6.0},
 'accuracy': 0.9666666666666667,
 'macro avg': {'precision': 0.9523809523809524,
  'recall': 0.9743589743589745,
  'f1-score': 0.9610256410256411,
  'support': 30.0},
 'weighted avg': {'precision': 0.9714285714285714,
  'recall': 0.9666666666666667,
  'f1-score': 0.9672820512820512,
  'support': 30.0}}
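
Because the kernel is linear, the fitted classifier also exposes interpretable coefficients; a quick look via standard SVC attributes:

clf = exp.estimator.named_steps["classifier"]
print("support vectors per class:", clf.n_support_)
print("coef_ shape:", clf.coef_.shape)  # one row per class pair (libsvm trains one-vs-one internally)
print("intercepts:", clf.intercept_)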
In [29]:
# add SVC-specific hyperparameters (kernel, C, gamma) to the summary_df
exp.update_param(
    "kernel",
    exp.estimator.named_steps["classifier"].kernel,
    add_column=True,
)

exp.update_param(
    "C",
    exp.estimator.named_steps["classifier"].C,
    add_column=True,
)
exp.update_param(
    "gamma",
    exp.estimator.named_steps["classifier"].gamma,
    add_column=True,
)
exp.summary_df
Adding column: kernel
Adding column: C
Adding column: gamma
Out[29]:
dataset_name n_train_samples n_test_samples mean_accuracy train_time query_time kfolds confusion_matrix classification_report kernel C gamma
exp_name
svc-iris-linear iris-20test-shuffled-v1 120 30 0.966667 0 days 00:00:00.003818 0 days 00:00:00.001419 NaN [[11 0 0]\n [ 0 12 1]\n [ 0 0 6]] {'0': {'precision': 1.0, 'recall': 1.0, 'f1-sc... linear 1.0 auto
In [ ]:
exp.save(overwrite_existing=False)

