$\renewcommand{\real}{\mathbb{R}}$ $\renewcommand{\xb}{\mathbf{x}}$ $\renewcommand{\wb}{\mathbf{w}}$ $\renewcommand{\Xb}{\mathbf{X}}$ $\renewcommand{\Lb}{\mathbf{L}}$
$\DeclareMathOperator*{\argmin}{argmin}$
Let the perceptron be defined as a function $y(\xb) = f(\wb^{\top}\xb)$, where $\wb$ is a vector of (learned) weights and $f$ is a step function $ f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$, as seen in the lecture. Such a perceptron can be used to model various logic gate functions. However, in our particular case the logic gates do not output the usual values $\{0, 1\}$ but rather $\mathbf{\{-1, 1\}}$ to stay consistent with the perceptron's step function $f$:
Given a perceptron taking an input $\xb = [x_{1}, x_{2}]$, where $x_{1}, x_{2} \in \{0, 1\}$, find a weight vector $\wb_{AND} = [w_{1}, w_{2}]$ which models the logic gate AND, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{AND}\ x_{2}$. Does any such $\wb_{AND}$ exist? If not, why not?
$x_{1}$ | $x_{2}$ | $y$ |
---|---|---|
0 | 0 | -1 |
0 | 1 | -1 |
1 | 0 | -1 |
1 | 1 | 1 |
Now let us add a bias term to the perceptron, i.e. an additional input $x_{0}$ with $x_{0} = 1$. The input to the perceptron is now $\xb = [x_{0}, x_{1}, x_{2}] = [1, x_{1}, x_{2}]$. Find a weight vector $\wb_{AND} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron models the logic gate AND.
Find a weight vector $\wb_{OR} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron models the logic gate OR, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{OR}\ x_{2}$.
$x_{1}$ | $x_{2}$ | $y$ |
---|---|---|
0 | 0 | -1 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 1 |
Find a weight vector $\wb_{XOR} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron models the logic gate XOR, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{XOR}\ x_{2}$. Does $\wb_{XOR}$ exist? If not, why not?
$x_{1}$ | $x_{2}$ | $y$ |
---|---|---|
0 | 0 | -1 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | -1 |
We have seen that adding a bias term to the perceptron from task 1 (AND gate, no bias) saved the day and allowed us to model the AND gate properly. What if we add one more bias input to the perceptron modeling the XOR gate from task 4 (XOR gate), so that $\xb = [x_{0_{a}}, x_{0_{b}}, x_{1}, x_{2}] = [1, 1, x_{1}, x_{2}]$ and $\wb_{XOR} = [w_{0_{a}}, w_{0_{b}}, w_{1}, w_{2}]$? Does it allow us to model the XOR gate now? Why?
$\wb_{AND}$ does not exist.
Without a bias term the decision boundary must pass through the coordinate system origin $[0, 0]$. In fact, for the input $\xb = [0, 0]$ we get $\wb^{\top}\xb = 0$ and hence $y = +1$ for any $\wb$, which already contradicts the required output $-1$. Even though the data are linearly separable, they are not centered, i.e. there is no line through the origin that perfectly separates the classes.
There are infinitely many solutions, e.g. $\wb_{AND} = [-1.5, 1, 1]$.
Adding a bias term allows the decision boundary to be placed anywhere in the 2D $x_{1}x_{2}$-plane, and since the data are linearly separable, a valid $\wb_{AND}$ is guaranteed to exist.
There are infinitely many solutions, e.g. $\wb_{OR} = [-0.5, 1, 1]$.
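Both claimed solutions are easy to verify numerically. The following quick check is a minimal, self-contained sketch using NumPy: it runs all four bias-augmented inputs through the step function for the AND and OR weight vectors given above.
import numpy as np
# Bias-augmented inputs [x0, x1, x2] for the four rows of the truth tables.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
def step(a):
    # Step function f mapping w^T x >= 0 to +1 and w^T x < 0 to -1.
    return np.where(a >= 0, 1, -1)
print(step(X @ np.array([-1.5, 1, 1])))  # AND -> expected [-1 -1 -1  1]
print(step(X @ np.array([-0.5, 1, 1])))  # OR  -> expected [-1  1  1  1]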
$\wb_{XOR}$ does not exist.
The data is not linearly separable.
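To see this numerically, a brute-force sketch (the grid range and resolution are arbitrary choices) can search over many candidate weight vectors $[w_{0}, w_{1}, w_{2}]$ and confirm that none of them reproduces the XOR truth table.
import numpy as np
import itertools
# Bias-augmented inputs [x0, x1, x2] and the XOR labels.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y_xor = np.array([-1, 1, 1, -1])
grid = np.linspace(-3, 3, 61)
found = any(
    np.array_equal(np.where(X @ np.array([w0, w1, w2]) >= 0, 1, -1), y_xor)
    for w0, w1, w2 in itertools.product(grid, repeat=3))
print('XOR weights found:', found)  # expected: False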
$\wb_{XOR}$ does not exist.
Adding another fixed input term does not change the fact that the original data are not linearly separable. Furthermore, having two bias inputs $x_{0_{a}}$, $x_{0_{b}}$ with corresponding weights $w_{0_{a}}$, $w_{0_{b}}$ is mathematically equivalent to having a single bias input $x_{0}$ with corresponding weight $w_{0}$:
$$ x_{0_{a}}w_{0_{a}} + x_{0_{b}}w_{0_{b}} = 1\cdot w_{0_{a}} + 1\cdot w_{0_{b}} = 1 \cdot (w_{0_{a}} + w_{0_{b}}) = 1 \cdot w_{0} = x_{0}w_{0}, $$ where $w_{0} = w_{0_{a}} + w_{0_{b}}$.
Note that adding a bias term $x_{0}$ only helped when going from task 1 (AND, no bias) to task 2 (AND): the data were linearly separable in the first place, just not centered, and the bias term allowed the decision boundary to shift away from the origin $[0, 0]$.
%load_ext autoreload
%autoreload 2
%matplotlib notebook
# project files
import sys
import helpers
# 3rd party
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron
As in the last part of the previous week's exercise, we will work with the Iris flower dataset. For simplicity, we will only use 2 out of the 4 features and 2 out of the 3 classes (named setosa and versicolor). Our dataset is therefore a binary classification problem with two-dimensional samples.
Let us first load the data and visualize the two classes.
# Load data and visualize the dataset.
data, labels = helpers.load_ds1(center=True)
fig = helpers.vis_ds1(data, labels)
Visually, the classes appear to be linearly separable, which is good news: as the perceptron convergence theorem states, if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.
Given an input vector $\xb$ and a weight vector $\wb$, the prediction performed by perceptron is defined as
$$ y(\xb) = \begin{cases} +1 & \text{if}\ \wb^{\top}\xb \ge 0 \\ -1 & \text{otherwise} \end{cases} $$ Please implement the function `predict()` accordingly. Recall that you can use the function `np.dot()` to perform matrix multiplication.
def predict(x, w):
""" Perceptron prediction.
Args:
x (np.array): Input data of shape (N, D), N is # samples, D is dimension.
w (np.array): Perceptron weights of shape (D, ).
Returns:
np.array: Predictions of shape (N, ), where each value is in {-1, 1}.
"""
return ((x.dot(w) >= 0.0) - 0.5) * 2.0
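As a quick sanity check (a toy example, not part of the Iris data), two hand-made samples on opposite sides of the boundary defined by $\wb = [1, 1]$ should map to $+1$ and $-1$:
x_toy = np.array([[1.0, 2.0], [-1.0, -2.0]])  # two 2-D samples
w_toy = np.array([1.0, 1.0])
print(predict(x_toy, w_toy))  # expected: [ 1. -1.]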
In order to track the training progress it is useful to monitor the current accuracy of the perceptron on the whole dataset, which is defined as
$$ f_{\text{acc}} = \frac{\text{# correct predictions}}{\text{# all predictions}}$$ Please implement the function `accuracy()` accordingly.
def accuracy(labels_gt, labels_pred):
""" Computes accuracy.
Args:
labels_gt (np.array): GT (ground truth) labels of shape (N, ).
labels_pred (np.array): Predicted labels of shape (N, ).
Returns:
float: Accuracy, in range [0, 1].
"""
return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]
Finally, in order to find the weight vector $\wb$ corresponding to a decision boundary separating the classes, we need a learning algorithm. The perceptron training objective can be formulated as an energy minimization problem
$$ \wb^{*} = \argmin_{\wb} E(\wb), $$with an energy function $E$ defined as
$$ E(\wb) = -\sum_{n \in M}(\wb^{\top}\xb_{n})l_{n}, $$ where $M$ is the set of misclassified samples $\xb_{n}$ with corresponding GT (ground truth) labels $l_{n}$. To find $\wb^{*}$ we apply stochastic gradient descent, where the change in $\wb$ after one step is given as
$$ \wb^{t + 1} = \wb^{t} - \eta \nabla E(\wb^{t}) $$where $\eta$ is the learning rate. In general, we try to choose a learning rate that is small enough to converge and large enough to converge quickly.
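For a single misclassified sample $(\xb_{n}, l_{n})$ the energy contribution is $E_{n}(\wb) = -(\wb^{\top}\xb_{n})l_{n}$, so the stochastic gradient step reduces to the classical perceptron update:
$$ \nabla E_{n}(\wb) = -\xb_{n}l_{n} \qquad\Rightarrow\qquad \wb^{t+1} = \wb^{t} + \eta\,\xb_{n}l_{n}. $$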
The algorithm repeatedly iterates through the dataset and performs the weight update each time it encounters a misclassified sample (predicted using the most recent weight vector $\wb$). The algorithm stops when no misclassified sample is found or when the maximum number of epochs is reached. It can be described by the following pseudocode:
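A minimal sketch of that loop (consistent with the update rule above):
# for epoch = 1 .. max_epochs:
#     shuffle the dataset
#     for each sample (x_n, l_n):
#         if predict(x_n, w) != l_n:        # misclassified sample
#             w <- w + eta * x_n * l_n      # stochastic gradient step
#     if no sample was misclassified in this epoch: stop (converged)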
Please fill in the function `fit_perceptron()`:
def fit_perceptron(data, labels, w, eta, max_epochs, vis=None,
history_save_period=100, verbose=True):
""" Trains the perceptron to find a decision boundary between
two classes given `data` and corresponding `labels`.
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
w (np.array): Initial weights of shape (D, ).
eta (float): The learning rate.
max_epochs (int): Maximum number of epochs for which
the training is run.
Returns:
w (np.array): Weights after convergence or `max_epochs`
reached. Shape (D, )
history (list): History of accuracy values monitored during training.
"""
history = []
num_samples = data.shape[0]
for ep in range(max_epochs):
# Shuffle the data.
inds = np.random.permutation(num_samples)
num_errs = 0
# Train for one epoch.
for it, ind in enumerate(inds):
c_pred = predict(data, w)
acc = accuracy(labels, c_pred)
# Print and save stats.
if verbose:
print('\rep {}/{}, it {}/{}, accuracy {:.3f}'.
format(ep + 1, max_epochs, it + 1, num_samples, acc),
end='')
if (it + 1) % history_save_period == 0:
history.append(acc)
# Extract one sample from the dataset.
sample = data[ind]
lab = labels[ind]
# Find whether the sample is correctly classified.
correct = predict(sample, w) == lab
# Update weights.
if not correct:
num_errs += 1
w_new = w + eta * sample * lab
# Visualize current state.
if vis is not None:
vis.step(sample, w, w_new, c_pred)
w = w_new
if num_errs == 0:
break
print()
return w, history
Let us test the perceptron learning algorithm on the Iris dataset which we have already loaded. Note that for now the dataset is centered, i.e. the decision boundary can pass through the coordinate system origin $[0, 0]$, and therefore the bias term is not needed for the time being.
The cell below runs the perceptron learning algorithm and lets you manually step through the iterations. Once you run the cell, an input field appears below the plot. Please click in the field and keep pressing the `Enter` key - this advances the training step by step. Each time the algorithm encounters an incorrectly classified sample, it stops and visualizes the current decision boundary and its normal given by $\wb^{t}$, the selected data sample (black arrow), and the new decision boundary and its normal given by $\wb^{t+1}$. If you have trouble running this cell (the plot does not show, the cell gets stuck, etc.), please call the `fit_perceptron` function as `fit_perceptron(data, labels, w, eta, max_epochs, verbose=True, vis=None)` instead and observe the text output (accuracy).
Please fill in the code below, namely select a suitable number of `max_epochs` and initialize the weight vector `w`. Use the visualization tool to debug your implementation of `predict()` and `fit_perceptron()`.
Hints:
- If the training algorithm stops but has not yet fully converged, either there is a problem in your implementation of `predict()` and `fit_perceptron()`, or you have selected too small a value for `max_epochs`.
- Set the weight vector components `w` to reasonably small values (e.g. $<1$), otherwise the decision boundary normal will not be fully visible in the visualization.
# Get the visualizer
vis = helpers.PerceptronStepVisualizer(data, labels)
# Training settings
max_epochs = 1000
w = np.array([0.5, 0.5]) # (w1, w2)
eta = 1.0
# Training and prediction.
w_star, history = fit_perceptron(data, labels, w, eta, max_epochs, verbose=False, vis=vis)
labels_pred = predict(data, w_star)
# Visualize the result.
vis._reset()
_ = helpers.vis_ds1_dec_bound(data, labels, labels_pred, w_star, fig=vis._fig)
Let us load the same dataset again, but this time it will not be centered, as you can see in the plot below (note that the data samples are now centered around the point $\sim[5.5, 3.25]$).
# Load non-centered dataset and visualize.
data_nc, labels = helpers.load_ds1(center=False)
fig = helpers.vis_ds1(data_nc, labels)
In the cell below, run the same perceptron learning algorithm (for sufficiently many epochs). Do not forget to initialize the weight vector $\wb$ again. Did the algorithm converge? If not, why?
# Fit perceptron without bias to non-centered data.
max_epochs = 500
w = np.array([1., 1.]) # (w1, w2)
eta = 1.0
# Training and prediction
w_star, history = fit_perceptron(data_nc, labels, w, eta, max_epochs, verbose=True)
labels_pred = predict(data_nc, w_star)
# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data_nc, labels, labels_pred, w_star)
The algorithm does not converge regardless of the hyperparameter settings (learning rate, number of epochs, weight initialization), because there is no linear decision boundary that both passes through the origin $[0, 0]$ and separates the two classes. The classes are still linearly separable, but a bias term is needed to give the decision boundary line an additional degree of freedom.
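As an illustrative check (a sketch, not required by the exercise), one can sweep boundary orientations through the origin and keep the best accuracy; without a bias it stays below 1.0 on the non-centered data.
# Best accuracy achievable by any decision boundary through the origin (no bias).
thetas = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
accs = [accuracy(labels, predict(data_nc, np.array([np.cos(t), np.sin(t)]))) for t in thetas]
print('Best accuracy without a bias: {:.3f}'.format(max(accs)))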
It is clear from the previous experiment that a bias term is needed in order to find a decision boundary for the non-centered data. In the cell below, augment each data sample $\xb_{n}$ with a bias term to get $\hat{\xb}_{n} = [1\ \xb_{n}^{\top}]$ and run the perceptron training algorithm again. Do not forget that since a bias term was added, the weight vector also needs to be extended, i.e. now $\wb = [w_{0}, w_{1}, w_{2}] \in \real^{3}$.
Note that you should not use a `for` loop to augment the data with the bias term, i.e. your line of code should correspond to $\hat{\Xb} = [\mathbf{1}\ \Xb]$, where $\mathbf{1} \in \real^{N}$ is a (column) vector of ones. You can create a vector of ones using `np.ones()` and you can use the function `np.concatenate()` to concatenate the ones vector with your data.
# Extend data by bias.
data_nc_bias = np.concatenate([np.ones((data_nc.shape[0], 1)), data_nc], axis=1)
# Initialize the weights.
w = np.array([0., 1., 1.]) # (w0, w1, w2)
# Training settings.
max_epochs = 1000
eta = 1.0
# Train and predict.
w_star, history = fit_perceptron(data_nc_bias, labels, w, eta,
max_epochs, verbose=True)
labels_pred = predict(data_nc_bias, w_star)
# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data_nc, labels, labels_pred, w_star)
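Optionally, the accuracy history returned by fit_perceptron can be plotted (a sketch; note that history is only sampled every history_save_period iterations, so it may contain few points for a small dataset).
# Plot the accuracy monitored during training.
plt.figure()
plt.plot(history)
plt.xlabel('recorded iteration (every history_save_period steps)')
plt.ylabel('accuracy')
plt.show()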
Since the perceptron is a well-known machine learning algorithm, standard implementations exist in various libraries. scikit-learn is a popular library which implements many machine learning algorithms and provides a Python interface. Let us use `scikit-learn`'s implementation of the perceptron and compare the results to our implementation.
In general, a given model has to be instantiated first, then its method `fit` is used to find a set of parameters (weights) which meets a given objective. Finally, a prediction can be made with the model's function `predict` and it can be evaluated by e.g. its `score` function (which, in the case of the perceptron, computes the accuracy). You will learn more about `scikit-learn` in next week's exercise.
Start by studying the documentation of the perceptron model. Then, please fill in the code in the cell below to create and train the perceptron, extract the learned parameters, and visualize the decision boundary.
# Train a perceptron implemented in sklearn.
data = data_nc
labels = labels
# Instantiate the perceptron model
clf = Perceptron(fit_intercept=True, max_iter=1000, tol=None, verbose=0)
# Train it by feeding it the non-centered dataset without a bias.
clf.fit(data, labels)
# Get the predicted labels.
labels_pred = clf.predict(data)
# Print out the accuracy.
acc = clf.score(data, labels)
print('Accuracy: {}'.format(acc))
# Extract the weight vector, w = (w0, w1, w2)
w_star = np.concatenate([clf.intercept_, clf.coef_.flatten()], axis=0)
# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data, labels, labels_pred, w_star)
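As a final cross-check (a sketch, assuming data_nc_bias from the earlier cell is still defined and the labels are encoded as $\{-1, 1\}$), the weights extracted from scikit-learn can be plugged into our own predict() to confirm they describe an equally good decision boundary.
# Re-use our predict() and accuracy() with the weights learned by sklearn.
acc_ours = accuracy(labels, predict(data_nc_bias, w_star))
print('Accuracy of our predict() with sklearn weights: {:.3f}'.format(acc_ours))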