Welcome to the 6th practical session of CS233 - Introduction to Machine Learning.
We will start using scikit-learn, a Python package of machine learning methods.
Use it to train SVM with different kernel functions and learn how tune their parameters.
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%load_ext autoreload
%autoreload 2
Scikit-Learn is a machine learning library written in python. Most of the machine learning algorithms and tools are already implemented for you to use. In this exercise, we'll use this package to train SVM. In order to install scikit-learn follow instructions at this link:https://scikit-learn.org/stable/install.html
Scikit-Learn has modules implemented broadly for
All the magic happens under the hood, but gives you freedom to try out more complicated stuff.
Different methods to be noted here are
fit
: Train a model with the datapredict
: Use the model to predict on test datascore
: Return mean accuracy on the given test dataHave a look at this for simple example.
SVM tries to solve optimization problem of the primal form:
$\underset{\mathbf{w},b,\zeta}{\operatorname{argmin}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum^N_{n=1}\zeta_n$
subject to $\zeta_n \geq 0$ and $t_n(\mathbf{w}^T\mathbf{\phi(x_n)}+\mathbf{b}) \geq 1-\zeta_n$ , for all n
where, $\mathbf{w}$,$\mathbf{b}$ are weight and bias. C is penalty term, $\zeta_n$ is error in terms of how far data point is beyond correct margin and $t_n \in\{-1,1\}$. $\|\mathbf{w}\|$ is inversely related to margin width, so minimizing it means maximizing the margin, hence we minimize $\|\mathbf{w}\|$. As our data may not be linearly separable, hence maximizing margin will lead to some misclassifications. $\zeta_n$ is greater than zero when a data point is beyond margin and how many such data points are allowed is controlled by C. We choose the right value for C, given the data, through validation set. Hence with this objective function we get a maximum margin with certain amount of misclassification.
The corresponding dual problem is given by:
maximize $L(a) = \sum_{n=1}^N a_n - \frac 1 2 \sum_{n=1}^N\sum_{m=1}^N a_na_mt_nt_mk(x_n,x_m)$
subject to $C \geq a_n \geq 0 \text{ for all }n$ and $\sum_{n=1}^N a_nt_n = 0$
Question
Answer
Have a look at the SVM function here. The main parameters you should look for are:
Let's try with simple binary classification using SVM.
First we begin with simple linearly separable data.
from helpers import get_simple_dataset
from plots import plot
from matplotlib import pyplot as plt
# Get the simple dataset
X, Y = get_simple_dataset()
plot(X,Y,None,dataOnly=True)
from sklearn import svm
import numpy as np
# CODE HERE #
#Get SVM model with linear kernel and C=0.1
clf = svm.SVC(kernel='linear', C=0.1)
#call the fit method
clf.fit(X, Y)
#plot the decision boundary
plot(X, Y, clf)
The above plot shows the decision boundary and margins of the learnt model. Encircled points are the support vectors.
WARNING: if the margins go beyound the limits of axis, they are not shown or shown close to decision plane.
Use dual_coef_
attribute of the model to get the dual coefficients $a_n$ of the support vectors.
Question How does this value relate to margin penalty C? Scikit returns dual coeff in slightly different form, can you identify the difference?
#CODE HERE#
# use dual_coef_ to see dual coefficients
clf.dual_coef_
Answer Dual coefficients $a_n$ must satisfy constraint $0\leq a_n \leq C$. Scikit return coefficients with label of the class {-1,1}, i.e. it returns $a_nt_n$, where $t_n \in \{1,-1\}$. Also, the coefficients are only of support vectors.
Support vector which lies on margin has $a_n
Let's try different values of C. In the code, vary the C value from 0.001 to 100 and notice the changes.
Question How does the margin vary with C? Hint: have a look at the optimization formulation above.
# CODE HERE #
# Vary C and plot the boundaries
# Use np.logspace to generate C values
C = np.logspace(-3,2,6)
for c in C:
#Get SVM model with linear kernel
clf = svm.SVC(kernel='linear', C=c)
#call the fit method
clf.fit(X, Y)
#plot the decision boundary
plot(X, Y, clf)
Answer: Bigger C, weighs misclassfication term more hence small margin. Lower C allows more misclassification and hence larger margin.
We will try out Polynomial and RBF kernels. Have a look at the type of decision boundary created.
#CODE HERE#
# Use SVM with Polynomial Kernel of different degrees and C
clf = svm.SVC(kernel='poly', C=0.0001, degree=4)
clf.fit(X, Y)
plot(X, Y, clf)
# Use SVM with RBF with different gamma
clf = svm.SVC(kernel='rbf', C=0.1, gamma=1)
clf.fit(X, Y)
plot(X, Y, clf)
Util now we have worked with linear data in the given feature space. Most of the time that's not the case. We'll use non linear data and use non linear kernels to do classification. These kernels project the data to higher dimension and which helps in finding out a linear decision boundary.
from helpers import get_circle_dataset
X,Y = get_circle_dataset()
plot(X,Y,None,dataOnly=True)
#CODE HERE#
# Use Polynomial Kernel
clf = svm.SVC(kernel='poly', C=0.1, degree=2)
clf.fit(X, Y)
plot(X, Y, clf)
#CODE HERE#
# Use RBF Kernel
clf = svm.SVC(kernel='rbf', C=0.1, gamma=10)
clf.fit(X, Y)
plot(X, Y, clf)
While trying the above exercises with different values (called hyperparameters) of C, gamma or degree, or kernels, we obtain different boundaries and classfication. Naturally, question arises what are the best parameters and how can we choose one?
Cross Validation(CV) technique is used to choose the best parameters for separate validation set, where the hope is that this set will be close to the (unseen) test set. We train our model on train set and use hyperparameters which give best results on validation set. This is a general technique to choose your best machine learning model.
K-fold is a type of CV technique that we'll see here. We'll split data in K parts and use kth part for validation, whereas k-1 for training. This process will be repeated K times. Metric used to evaluate the model is accuracy on the validation set.
Sklearn already provide a method to do Cross Validation, which you can use. But to understand the concept, we'll code this time. Here, you can find explanation with a toy example.
We'll use Iris dataset, which you have seen before. Use some of the data to train and rest as validation.
from helpers import get_iris_dataset
X,Y = get_iris_dataset()
plot(X,Y,None,dataOnly=True)
# Function to split data indices
# num_examples: total samples in the dataset
# k_fold: number fold of CV
# returns: array of shuffled indices with shape (k_fold, num_examples//k_fold)
def fold_indices(num_examples,k_fold):
ind = np.arange(num_examples)
split_size = num_examples//k_fold
#important to shuffle your data
np.random.seed(42)
np.random.shuffle(ind)
k_fold_indices = []
#CODE HERE#
# Generate k_fold set of indices
k_fold_indices = [ind[k*split_size:(k+1)*split_size] for k in range(k_fold)]
return np.array(k_fold_indices)
total_samples = X.shape[0]
print(fold_indices(total_samples,4))
from sklearn.preprocessing import StandardScaler
# it's good to standarize our data. Helps in faster convergence.
scaler = StandardScaler()
# Function for using kth split as validation set to get accuracy
# and k-1 splits to train our model
def do_cross_validation(clf,k,k_fold_ind,X,Y):
# use one split as val
val_ind = k_fold_ind[k]
# use k-1 split to train
train_splits = [i for i in range(k_fold_ind.shape[0]) if i is not k]
train_ind = k_fold_ind[train_splits,:].reshape(-1)
#Get train and val
X_train = X[train_ind,:]
Y_train = Y[train_ind]
X_val = X[val_ind,:]
Y_val = Y[val_ind]
scaler.fit(X_train)
X_train = scaler.fit_transform(X_train)
#fit the train transformation on val
X_val = scaler.fit_transform(X_val)
#fit on train set
clf.fit(X_train,Y_train)
#get accuracy for val
acc = clf.score(X_val,Y_val)
return acc
We use Grid Search technique to sample from the hyperparameter space. Then for each set of hyperparameter, K-Fold CV is done. We take the mean of K accuracies obtain by CV, to represent performance of model with those hyperparameters.
# better to search in log space
grid_search_c = np.logspace(-6,12,num=19)
grid_search_gamma = np.logspace(-9,5,num=15)
#save the values for the combination of hyperparameters
grid_val = np.zeros((len(grid_search_c),len(grid_search_gamma)))
# Do 4 fold CV
k_fold = 4
k_fold_ind = fold_indices(X.shape[0],k_fold)
for i, c in enumerate(grid_search_c):
for j, g in enumerate(grid_search_gamma):
print('Evaluating for C:{} gamma:{} ...'.format(c,g))
#CODE HERE#
# call SVM with c,g as param.
clf = svm.SVC(kernel='rbf', C=c, gamma=g)
acc = np.zeros(k_fold)
# do cross validation
for k in range(k_fold):
acc[k] = do_cross_validation(clf,k,k_fold_ind,X,Y)
#aggregrate the accuracy score for k_fold runs. What function would you choose?
grid_val[i,j] = np.mean(acc)
from plots import plot_cv_result
plot_cv_result(grid_val,grid_search_c,grid_search_gamma)
#best acc
print('Best acc:{}'.format(np.max(grid_val)))
#best params
cin,gin = np.unravel_index(np.argmax(grid_val),grid_val.shape)
print('Best Params- C:{}, Gamma:{}'.format(grid_search_c[cin],grid_search_gamma[gin]))
Above heatmap shows accuracies for different Gamma and C values. The best params are used on test set.
Question Is there a relation between C and Gamma?
Hint:Think how increase in one value changes other. Look at the heatmap to get the idea.
Answer: High Gamma will lead to overfitting, hence smaller C would more misclassification to counteract. Lower Gamma is underfitting and hence high C is to stop misclassification