Welcome to the fifth practical session of CS233 - Introduction to Machine Learning.

In this exercise class, we will work with higher-dimensional feature spaces.

In [1]:

```
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%load_ext autoreload
%autoreload 2
```

In this week's lecture we have encountered Cover's theorem:

$\textit{"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly is more likely}$ $\textit{to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."}$

Within this exercise we will empirically validate Cover's theorem by simulating toy data and by re-using the previous weeks' perceptron code.

**Note**: The slides use a slightly different notation where, $p$ is number of $N$- dimensional data points. To keep things coherent with the previous exercises we use, N and D. N- number of data points and D is their dimentionality.

To train our perceptron we create a synthetic data set $\left(X,y\right)$. X is a $\left[ N \times D \right]$ matrix of observations $x_{i=1,...,N}$ with entries $x_{i,j=1,...,D} \stackrel{i.i.d.}{\sim} Bernoulli(\lambda=0.5)$. y is a $\left[ N \times 1 \right]$ vector of target values $y_{i=1,...,N} \stackrel{i.i.d.}{\sim} Bernoulli(\lambda=0.5)$.

Write a function `sample_data`

that returns a synthetic data set $\left(X,y\right)$ given $N$ and $D$.

Hint: You can draw $50$ samples from a Bernoulli distribution $Bernoulli(\lambda=0.5)$ via

In [2]:

```
import numpy as np
lamb, D = 0.5, 50
np.random.binomial(1, lamb, D)
```

Out[2]:

Please fill in the function `sample_data`

below:

In [3]:

```
def sample_data(N,D):
"""
generate the synthetic data.
X is of size [N x D]
y is of size [N]
"""
k, lamb = 1, 0.5
X = np.empty((N, D)) # X is [N x D]
y = np.empty(N) # y is [N]
# generate one observation by drawing 10 samples
# from a Bernoulli distribution (Binomial with k=1)
X = np.random.binomial(k, lamb, (N,D))
# generate target label
y = np.random.binomial(k, lamb, N)
# if all target labels are identical then flip a label
# (some perceptron implementations require at least two different label values)
if len(np.unique(y))==1: y[0] = 1-y[0]
return X, y
```

We wish to evaluate the fraction $f$ of $r$ runs for which $\left(X,y\right)$ is linearly separable and evaluate this fraction as a function of $N$: Within a function `simulate`

, systematically vary $N \in \left(1,200\right]$ and set $D=50, r=30$. For each $N$ call a function `run_trial`

that generates a data set as described in section 2.1 and trains a perceptron on it. Count the number $r_+$ out of the $r$ trials that are linearly separable and plot $f(N) = \frac{r_+(N)}{r}$ over $N/D$.

The outcome should resemble the figure on slide $12$ ("Numerical Approximation") of this week's lecture.

Hints: Feel free to use your own perceptron code but keep in mind that for some instances the perceptron may not converge.

For plotting you can use the provided function `plotC`

.

Please fill in the functions `simulate`

and `run_trial`

below:

In [4]:

```
import numpy as np
from plots import plotC
from helpers import Perceptron
def run_trial(N, D):
"""
generate one trial of data and run the perceptron on it
X is of size [N x D]
y is of size [N]
"""
# call sample_data
X, y = sample_data(N,D)
# append a constant value to the input for the bias
# such that X is of size [N x D+1]
X = np.concatenate([np.ones((N,1)),X],1)
# initialize weights w
w = np.ones(D+1)
# run the perceptron
_,num_errs = Perceptron(X, y, w, lr=0.1, n_epochs=100)
# the data is linearly separable if and only if the perceptron
# classifies the given data with zero error
linSep = num_errs == 0
return linSep
def simulate():
"""run a simulation generating & classifying data sets for a varying sample size"""
trials = 50 # number of trials per parameter
D = 50 # dimensionality of each observation
domain = np.round(np.linspace(2,200,25)).astype(int)
fracts = np.empty(len(domain))
for pdx, N in enumerate(domain):
# initialize the count of linearly separable trials
linSep = 0
for trial in range(trials):
# set random number generator as a function of trial number
# to make results comparable across conditions & get smoother C(N,D) curve
np.random.seed(trial)
linSep += run_trial(N, D)
# compute the fraction of these problems for which the
# perceptron learning rule converges
fracts[pdx] = linSep / trials
plotC(fracts, D, trials)
simulate()
```

**Questions:**

- Justify the shape of the observed curve! Why does the Perceptron fail when the number of dimensions is less than the number of data points?
- We sampled data from a Bernoulli distribution with fair chance while keeping the dimension size fixed. The more samples we draw, the higher the probability that the drawn samples are not linearly separable.

- In this exercise we randomly sampled data points. Is this distribution representative of real data? Why (not)?
- What is not representative: real data is not typically distributed according to a Bernoulli distribution and may have more mass in one region of the sample space than in others.
- What is representative: real data may not be separable in the given dimensionality either.

- Toy around with different values of $N$ and $D$. What do you observe?
- The curve changes. If $N$ is relatively large or $D$ is relatively small then the data quickly becomes linearly inseparable.

In this week's lecture we have heard of the benefits of higher-dimensional feature spaces for linear classifiers. To obtain higher-dimensional data one may

- directly map observations into a higher-dimensional feature space or
- make use of a kernel term that can be utilized to compute a dot product in a latent higher-dimensional space

Within this exercise we will encounter both techniques and kernelize the previously encountered perceptron.

Let us define the following toy data set

In [5]:

```
def simData():
"""construct a XOR problem toy data set"""
X = np.array([[0,0,1,1],[1,0,1,0]]).T
y = np.array([1,0,0,1])
return X, y
```

**Question:**

- Does training a perceptron on this data set converge? Why (not)?
- The perceptron does not converge because the perceptron does only converge on linearly separable data and the data is not linearly separable.

Use the provided function `plot3Dscatter`

to plot the given data set.

- Consider the function $\phi(\mathbf{x_1}): {\rm I\!R}^d \xrightarrow{} {\rm I\!R}^m, \left[ x_{11},x_{12}\right] \xrightarrow{} \left[ x_{11}^2, x_{12}^2, \sqrt{2} x_{11} x_{12} \right]$ for $\mathbf{x} \in X$ and $d=2,m=3$. Write a function
`transform_data`

that transforms the given data set accordingly.

**Question:**

Does training a perceptron on this transformed data set converge? Why (not)? Use the provided function `plot3Dscatter`

to plot the transformed data set.

- The perceptron does converge because the perceptron does converge on linearly separable data and the data is linearly separable.

Please fill in the function `transform_data`

below:

In [6]:

```
from plots import plot3Dscatter
def transform_data(X):
"""transform the given data set"""
Z = np.sqrt(2)*X[:,0]*X[:,1]
X[:,0] = X[:,0]**2
X[:,1] = X[:,1]**2
X = np.hstack([X, Z[:,np.newaxis]])
return X
X, y = simData()
plot3Dscatter(X)
plot3Dscatter(transform_data(X))
```

A kernel $k$ is a function $k(\mathbf{x_1},\mathbf{x_2}): {\rm I\!R}^d \times {\rm I\!R}^d \xrightarrow{} {\rm I\!R}, \left( \mathbf{x_1},\mathbf{x_2}\right) \xrightarrow{} \phi(\mathbf{x_1})^T\phi(\mathbf{x_2})$ for $\mathbf{x_1},\mathbf{x_2} \in X$, $d\in {\rm I\!N}$ and a given function $\phi(\mathbf{x})$.

Notice how computing $k$ directly does not involve projecting the data to feature space ${\rm I\!R}^m$ as an intermediate step (as in the case of computing $\phi$).

**Question:**

- Show that the kernel $k(\mathbf{x_1},\mathbf{x_2}) = (\mathbf{x_1}^T\mathbf{x_2})^2$ implicitly specifies the feature transformation $\phi(\mathbf{x_1}) = \left[ x_{11}^2, x_{12}^2, \sqrt{2} x_{11} x_{12} \right]$ of section 3.1.
- Just plug $\phi(\mathbf{x})$ into the kernel function's definition and show equality.

- This is known as the kernel trick. What does the kernel trick imply in terms of computational runtime? When is it (not) beneficial?
- By computing the kernel function instead of projecting into a high-dimensional feature space and taking the dot product, we do these calculations implicitly. This is beneficial when the feature space ${\rm I\!R}^m$ is high-dimensional. This it not beneficial if the number $N$ of data points is large, as our kernel matrix is of size $[N \times N]$.

- One can express the perceptron's weight vector $w$ as a linear combination $w=\sum_i \alpha_i y_i \phi(x_i)$ with $\alpha_i$ the number of times that observation $x_i$ was misclassified such that

Hence, final class is represented by $\hat{y}$.

Use this term to write a perceptron involving the kernel $k(\mathbf{x_1},\mathbf{x_2})$ defined as in section 3.2.

- Run the perceptron on the data $\left( X,y \right)$ transformed via $\phi(\mathbf{x})$ as defined in section 3.1.

**Question:**

- What do you observe?
- The data is linearly separable. We implicitly performed the feature space projection previously done via function transform_data.

Please fill in the functions `kernel`

and `kernelizedPrediction`

below:

In [7]:

```
def kernel(x,x_):
"""
Kernel which takes two inputs and
returns scalar value. Implement Kernel
as discussed in section 3.2
x: data point
x_: data point
Return: float value according to kernel
"""
return x.dot(x_)**2
```

In [8]:

```
def kernelizedPrediction(X, y, alpha, idx):
"""
kernelized prediction function
Implement y_pred as in the equation above
X: data points
y: data label
alpha: number of misclassification of each data point
idx: index of data point for which prediction is to be made
Return: float
"""
return np.sum([alpha[idx_]*y[idx_]*kernel(X[idx],X[idx_]) for idx_,_ in enumerate(X)])
```

Following implementation of Kernel Preceptron Learning is provided for you. Please, go through to understand the update step better

In [9]:

```
def predict(X, y, alpha, idx):
"""predict label of observation"""
prediction = kernelizedPrediction(X, y, alpha, idx)
heaviside = prediction >= 0
rescale_pred = (heaviside - 0.5) * 2.0
return rescale_pred
def KernelPerceptron(X, y, alpha, lr=1.0, n_epochs=100):
"""kernelized perceptron"""
# in case the target labels should be {0,1} then remap to {-1,1}
y[y==0] = -1
num_samples = X.shape[0]
for ep in range(n_epochs):
# draw indices randomly
idxs = np.random.permutation(num_samples)
num_errs = 0 # count errors within current epoch
for idx in idxs:
# check whether the current observation was classified correctly
correct = predict(X, y, alpha, idx) == y[idx]
# Update error counts
if not correct:
alpha[idx] += lr*1
num_errs += 1
# stopping criterion
if num_errs == 0: break
return alpha, num_errs
```

Here we call

- Simple Perceptron with transformed data
- Kernelized Perceptron

Run to see that both now produce zero error

In [10]:

```
from helpers import Perceptron
# get the data and apply transform
X, y = simData()
X_trans = transform_data(X)
# append a bias to the data
X = np.concatenate([np.ones((4,1)),X],1)
X_trans = np.concatenate([np.ones((4,1)),X_trans],1)
# initialize the weight vector and run the perceptron
w_3D = np.array([0., 1., 1., 1.])
w_final, num_errs = Perceptron(X_trans, y, w_3D)
print("The perceptron learned to classify (X,y) with {} errors.".format(num_errs))
# initialize the error vector and run the kernel perceptron
alpha = np.zeros(X.shape[0])
alpha_final, num_errs_kernel = KernelPerceptron(X, y, alpha)
print("The kernel perceptron learned to classify (X,y) with {} errors.".format(num_errs_kernel))
```