$\renewcommand{\real}{\mathbb{R}}$ $\renewcommand{\xb}{\mathbf{x}}$ $\renewcommand{\wb}{\mathbf{w}}$ $\renewcommand{\Xb}{\mathbf{X}}$ $\renewcommand{\Lb}{\mathbf{L}}$

$\DeclareMathOperator*{\argmin}{argmin}$

Exercise Session 3

Paper and pencil exercises

Logic gates

Let the perceptron be defined as a function $y(\xb) = f(\wb^{\top}\xb)$, where $\wb$ is a vector of (learned) weights and $f$ is a step function $ f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$, as seen in the lecture. Such a perceptron can be used to model various logic gate functions. However, in our particular case the logic gates do not output the usual values $\{0, 1\}$ but rather $\mathbf{\{-1, 1\}}$ to stay consistent with the perceptron's step function $f$:

1. AND gate (no bias)

Given a perceptron taking an input $\xb = [x_{1}, x_{2}]$ where $x_{1}, x_{2} \in \{0, 1\}$, find a weight vector $\wb_{AND} = [w_{1}, w_{2}]$ which models logic gate AND, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{AND}\ x_{2}$. Does any $\wb_{AND}$ exist? If not, why?

| $x_{1}$ | $x_{2}$ | $y$ |
|---------|---------|-----|
| 0       | 0       | -1  |
| 0       | 1       | -1  |
| 1       | 0       | -1  |
| 1       | 1       | 1   |

2. AND gate

Now let us add a bias term to the perceptron, i.e. an additional input $x_{0}$, where $x_{0} = 1$. The input to the perceptron is now $\xb = [x_{0}, x_{1}, x_{2}] = [1, x_{1}, x_{2}]$. Find a weight vector $\wb_{AND} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron would model a logic gate AND.

3. OR gate

Find a weight vector $\wb_{OR} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron would model logic gate OR, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{OR}\ x_{2}$.

| $x_{1}$ | $x_{2}$ | $y$ |
|---------|---------|-----|
| 0       | 0       | -1  |
| 0       | 1       | 1   |
| 1       | 0       | 1   |
| 1       | 1       | 1   |

4. XOR gate

Find a weight vector $\wb_{XOR} = [w_{0}, w_{1}, w_{2}]$ such that the perceptron would model logic gate XOR, i.e. $y(x_{1}, x_{2}) = x_{1}\ \text{XOR}\ x_{2}$. Does $\wb_{XOR}$ exist? If not, why?

| $x_{1}$ | $x_{2}$ | $y$ |
|---------|---------|-----|
| 0       | 0       | -1  |
| 0       | 1       | 1   |
| 1       | 0       | 1   |
| 1       | 1       | -1  |

5. XOR gate (2 biases)

We have seen that adding a bias term to the perceptron from task 1. AND gate (no bias) saved the day and helped us model the AND gate properly using a perceptron. What if we add one more bias to the perceptron modeling the XOR gate from task 4. XOR gate, such that we would have $\xb = [x_{0_{a}}, x_{0_{b}}, x_{1}, x_{2}] = [1, 1, x_{1}, x_{2}]$ and $\wb_{XOR} = [w_{0_{a}}, w_{0_{b}}, w_{1}, w_{2}]$? Does it allow us to model the XOR gate now? Why?

Paper and pencil exercises - SOLUTIONS

Logic gates

1. AND gate (no bias)

$\wb_{AND}$ does not exist.

There is no bias, which means that the decision boundary has to pass through the coordinate system origin $[0, 0]$. Even though the data is linearly separable, it is not centered, i.e. there is no line passing through the origin which would perfectly separate the classes.

2. AND gate

There is an infinite number of solutions, e.g. $\wb_{AND} = [-1.5, 1, 1]$.

Adding a bias term allows the decision boundary to be placed anywhere in the 2D $x_{1}x_{2}$-plane, and since the data is linearly separable, $\wb_{AND}$ surely exists.

3. OR gate

There is an infinite number of solutions, e.g. $\wb_{OR} = [-0.5, 1, 1]$.
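
As a quick sanity check (a minimal sketch, not part of the exercise), the example weight vectors from tasks 2 and 3 can be verified against the truth tables with a few lines of NumPy:

import numpy as np

# Example weight vectors from tasks 2 and 3 (bias weight first).
w_and = np.array([-1.5, 1.0, 1.0])
w_or = np.array([-0.5, 1.0, 1.0])

# All four inputs, each augmented with the bias input x0 = 1.
x = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])

# Step function f: maps w^T x >= 0 to +1 and w^T x < 0 to -1.
print(np.where(x @ w_and >= 0, 1, -1))  # expected: [-1 -1 -1  1]
print(np.where(x @ w_or >= 0, 1, -1))   # expected: [-1  1  1  1]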

4. XOR gate

$\wb_{XOR}$ does not exist.

The data is not linearly separable.

5. XOR gate (2 biases)

$\wb_{XOR}$ does not exist.

Adding another fixed input term does not change the fact that the original data is not linearly separable. Furthermore, having two bias terms $x_{0_{a}}$, $x_{0_{b}}$ and corresponding weights $w_{0_{a}}$, $w_{0_{b}}$ is mathematically equivalent to having a single bias term $x_{0}$ and corresponding weight $w_{0}$:

$$ x_{0_{a}}w_{0_{a}} + x_{0_{b}}w_{0_{b}} = 1\cdot w_{0_{a}} + 1\cdot w_{0_{b}} = 1 \cdot (w_{0_{a}} + w_{0_{b}}) = 1 \cdot w_{0} = x_{0}w_{0}, $$

where $w_{0} = w_{0_{a}} + w_{0_{b}}$.

Note that adding a bias term $x_{0}$ only helped in going from task 1. AND gate (no bias) to task 2. AND gate, since the data were linearly separable in the first place, just not centered, and the bias term allowed the decision boundary to shift away from the origin $[0, 0]$.

Programming exercises

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib notebook

# project files
import sys
import helpers

# 3rd party
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron

1. Perceptron

1.1 Loading and visualizing data

As we've done in the last part of the previous week's exercise, we will work with the Iris flower dataset. For simplicity, we will only use 2 of the 4 features and 2 of the 3 classes (named setosa and versicolor). Therefore, our dataset is as follows:

  • data: $\Xb \in \real^{N \times 2}$, $\forall \xb \in \Xb: \xb \in \real^{2}$
  • labels: $\Lb \in \real^{N}$, $\forall l \in \Lb: l \in \{-1, 1\}$

Let us first load the data and visualize the two classes.

In [2]:
# Load data and visualize the dataset.
data, labels = helpers.load_ds1(center=True)
fig = helpers.vis_ds1(data, labels)

Visually, the classes appear to be linearly separable, which is good news: as the perceptron convergence theorem states, if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.

1.2 Perceptron learning algorithm

Given an input vector $\xb$ and a weight vector $\wb$, the prediction performed by perceptron is defined as

$$ y(\xb) = \begin{cases} +1 & \text{if}\ \wb^{\top}\xb \ge 0 \\ -1 & \text{otherwise} \end{cases} $$

Please implement the function predict() accordingly. Recall that you can use the function np.dot() to do matrix multiplication.

In [3]:
def predict(x, w):
    """ Perceptron prediction.
    
    Args:
        x (np.array): Input data of shape (N, D), N is # samples, D is dimension.
        w (np.array): Perceptron weights of shape (D, ).
        
    Returns:
        np.array: Predictions of shape (N, ), where each value is in {-1, 1}.
    """
    # Map the boolean condition (w^T x >= 0) to {+1.0, -1.0}:
    # True -> +1.0, False -> -1.0.
    return ((x.dot(w) >= 0.0) - 0.5) * 2.0

In order to track the training progress it is useful to monitor the current accuracy of the perceptron on the whole dataset, which is defined as

$$ f_{\text{acc}} = \frac{\text{# correct predictions}}{\text{# all predictions}}$$

Please implement the function accuracy() accordingly.

In [4]:
def accuracy(labels_gt, labels_pred):
    """ Computes accuracy.
    
    Args:
        labels_gt (np.array): GT (ground truth) labels of shape (N, ).
        labels_pred (np.array): Predicted labels of shape (N, ).
        
    Returns:
        float: Accuracy, in range [0, 1].
    """
    return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]

Finally, in order to find the weight vector $\wb$ corresponding to a decision boundary separating the classes, we need a learning algorithm. The perceptron training objective can be formulated as an energy minimization problem

$$ \wb^{*} = \argmin_{\wb} E(\wb), $$

with an energy function $E$ defined as

$$ E(\wb) = -\sum_{n \in M}(\wb^{\top}\xb_{n})l_{n}, $$

where $M$ is a set of wrongly classified samples $\xb_{n}$ with corresponding GT (ground truth) labels $l_{n}$. To find $\wb^{*}$ we apply stochastic gradient descent, where the change in $\wb$ after one step is given as

$$ \wb^{t + 1} = \wb^{t} - \eta \nabla E(\wb^{t}) $$

where $\eta$ is the learning rate. In general, we try to choose a learning rate that is small enough to converge and large enough to converge quickly.
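
Concretely, for a single misclassified sample $\xb_{n}$ (the stochastic setting used here), the contribution to the energy is $-(\wb^{\top}\xb_{n})l_{n}$, whose gradient with respect to $\wb$ is $-\xb_{n}l_{n}$, so the per-sample update becomes

$$ \wb^{t + 1} = \wb^{t} + \eta\, \xb_{n} l_{n}. $$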

The algorithm continuously iterates through the dataset and performs a weight update each time it stumbles upon a misclassified sample (predicted given the most recent weight vector $\wb$). The algorithm stops once no misclassified sample is found in a full pass over the dataset, or once the maximum number of epochs is reached. It can be described by the following pseudocode:

  1. Initialize the weight vector $\wb$ arbitrarily.
  2. Shuffle dataset and for each data sample $\xb_{n}$:
    1. If $\xb_{n}$ is not correctly classified, update weights.
  3. If no misclassified sample encountered or maximum epochs reached, end. Else, go to 2.

Please fill in the function fit_perceptron():

  • Extract one sample and a corresponding label and find out whether it is correctly classified.
  • Mathematically derive $\nabla E(\wb^{t})$ and use it to update the weights. (Check the lecture slides for hints on what $\nabla E(\wb^{t})$ is).

In [5]:
def fit_perceptron(data, labels, w, eta, max_epochs, vis=None, 
                   history_save_period=100, verbose=True):
    """ Trains the perceptron to find a decision boundary between 
    two classes given `data` and corresponding `labels`.
    
    Args:
        data (np.array): Dataset of shape (N, D).
        labels (np.array): Labels of shape (N, ).
        w (np.array): Initial weights of shape (D, ).
        eta (float): The learning rate.
        max_epochs (int): Maximum number of epochs for which 
            the training is run.
        vis: Optional visualizer (e.g. helpers.PerceptronStepVisualizer) 
            used to display the training steps, or None.
        history_save_period (int): Period (in iterations) with which 
            the accuracy is appended to `history`.
        verbose (bool): Whether to print training progress.
            
    Returns:
        w (np.array): Weights after convergence or `max_epochs`
            reached. Shape (D, )
        history (list): History of accuracy values monitored during 
            training.
    """
    
    history = []
    num_samples = data.shape[0]
    
    for ep in range(max_epochs):
        # Shuffle the data.
        inds = np.random.permutation(num_samples)
        num_errs = 0
        
        # Train for one epoch.
        for it, ind in enumerate(inds):
            c_pred = predict(data, w)
            acc = accuracy(labels, c_pred)

            # Print and save stats.
            if verbose:
                print('\rep {}/{}, it {}/{}, accuracy {:.3f}'.
                      format(ep + 1, max_epochs, it + 1, num_samples, acc), 
                      end='')
            if (it + 1) % history_save_period == 0:
                history.append(acc)
            
            # Extract one sample from the dataset.
            sample = data[ind]
            lab = labels[ind]
            
            # Find whether the sample is correctly classified.
            correct = predict(sample, w) == lab
            
            # Update weights.
            if not correct:
                num_errs += 1    
                w_new = w + eta * sample * lab
            
                # Visualize current state.
                if vis is not None:
                    vis.step(sample, w, w_new, c_pred)
                w = w_new
        
        if num_errs == 0:
            break

    print()
    return w, history

1.3 Training a perceptron on centered data

Let us test the perceptron learning algorithm on the Iris dataset which we have already loaded. Note that for now the dataset is centered, i.e. a separating decision boundary can pass through the coordinate system origin $[0, 0]$, and therefore the bias term is not needed for the time being.

The cell below runs the perceptron learning algorithm and lets you manually step through the iterations. Once you run the cell, an input field appears below the plot. Click in the field and keep pressing the Enter key - this advances the training step by step. Each time the algorithm encounters an incorrectly classified sample, it stops and visualizes the current decision boundary and its normal given by $\wb^{t}$, the selected data sample (black arrow), and the new decision boundary and its normal given by $\wb^{t+1}$. If you have trouble running this cell (the plot does not show, the cell gets stuck, etc.), please call the fit_perceptron function as follows instead: fit_perceptron(data, labels, w, eta, max_epochs, verbose=True, vis=None) and observe the text output (accuracy).

Please fill in the code below, namely, select a suitable value for max_epochs and initialize the weight vector w. Use the visualization tool to debug your implementation of predict() and fit_perceptron().

Hints:

  • If the training algorithm stops but has not yet fully converged, either there is a problem in your implementation of predict() and fit_perceptron(), or you have selected too small a value for max_epochs.

  • Set the weight vector components w to reasonably small values (e.g. $<1$); otherwise the decision boundary normal will not be fully visible in the visualization.

In [6]:
# Get the visualizer
vis = helpers.PerceptronStepVisualizer(data, labels)

# Training settings
max_epochs = 1000
w = np.array([0.5, 0.5])  # (w1, w2)
eta = 1.0

# Training and prediction.
w_star, history = fit_perceptron(data, labels, w, eta, max_epochs, verbose=False, vis=vis)
labels_pred = predict(data, w_star)

# Visualize the result.
vis._reset()
_ = helpers.vis_ds1_dec_bound(data, labels, labels_pred, w_star, fig=vis._fig)

1.4 Training a perceptron on non-centered data

Let us load the same dataset, but this time it will not be centered, as you can see in the plot below (note that all the data samples are now clustered around the point $\sim[5.5, 3.25]$).

In [7]:
# Load non-centered dataset and visualize.
data_nc, labels = helpers.load_ds1(center=False)
fig = helpers.vis_ds1(data_nc, labels)

In the cell below, run the same perceptron learning algorithm (for sufficiently many epochs). Do not forget to initialize the weight vector $\wb$ again. Did the algorithm converge? If not, why?

In [8]:
# Fit perceptron without bias to non-centered data.
max_epochs = 500
w = np.array([1., 1.])  # (w1, w2)
eta = 1.0

# Training and prediction
w_star, history = fit_perceptron(data_nc, labels, w, eta, max_epochs, verbose=True)
labels_pred = predict(data_nc, w_star)

# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data_nc, labels, labels_pred, w_star)
ep 500/500, it 100/100, accuracy 0.900

1.4 - SOLUTION

The algorithm does not converge regardless of the hyperparameter settings (learning rate, number of epochs, weight initialization), because a linear decision boundary which both passes through the origin $[0, 0]$ and linearly separates the two classes does not exist. The classes are linearly separable, but a bias term is needed in order to add an additional degree of freedom to the decision boundary line.

1.5 Adding a bias term

It is clear from the previous experiment that a bias term is needed in order to find a decision boundary for the non-centered data. In the cell below, augment each data sample $\xb_{n}$ with a bias term to get $\hat{\xb}_{n} = [1\ \xb_{n}^{\top}]$ and run the perceptron training algorithm again. Do not forget that since a bias term was added, the weight vector also needs to be extended, i.e. now $\wb = [w_{0}, w_{1}, w_{2}] \in \real^{3}$.

Note that you should not use a for loop to augment the data with a bias term, i.e. your line of code should correspond to $\hat{\Xb} = [\mathbf{1}\ \Xb]$, where $\mathbf{1} \in \real^{N}$ is a (column) vector of ones. You can create a vector of ones using np.ones() and you can use the function np.concatenate() to concatenate the ones vector with your data.

In [9]:
# Extend data by bias.
data_nc_bias = np.concatenate([np.ones((data_nc.shape[0], 1)), data_nc], axis=1)

# Initialize the weights.
w = np.array([0., 1., 1.])  # (w0, w1, w2)

# Training settings.
max_epochs = 1000
eta = 1.0

# Train and predict.
w_star, history = fit_perceptron(data_nc_bias, labels, w, eta,
                                 max_epochs, verbose=True)
labels_pred = predict(data_nc_bias, w_star)

# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data_nc, labels, labels_pred, w_star)
ep 439/1000, it 100/100, accuracy 1.000

1.6 BONUS - Perceptron in sklearn

Since the perceptron is a well-known machine learning algorithm, standard implementations exist in various libraries. scikit-learn is one of the popular libraries which implement many machine learning algorithms and provide a Python interface. Let us use scikit-learn's implementation of the perceptron and compare the results to our implementation.

In general, a given model has to be instantiated first, then its method fit is used to find a set of parameters (weights) which meet a given objective. Finally, a prediction can be made by the model's predict function, and it can be evaluated by e.g. its score function (which, in the case of the perceptron, computes the accuracy). You will learn more about scikit-learn in next week's exercise.

Start by studying the documentation of the perceptron model. Then, please fill in the code in the cell below to create and train the perceptron, extract the learned parameters, and visualize the decision boundary.

In [10]:
# Train a perceptron implemented in sklearn.

data = data_nc
labels = labels

# Instantiate the perceptron model
clf = Perceptron(fit_intercept=True, max_iter=1000, tol=None, verbose=0)

# Train it by feeding it the non-centered dataset without a bias.
clf.fit(data, labels)

# Get the predicted labels.
labels_pred = clf.predict(data)

# Print out the accuracy.
acc = clf.score(data, labels)
print('Accuracy: {}'.format(acc))
      
# Extract the weight vector, w = (w0, w1, w2)
w_star = np.concatenate([clf.intercept_, clf.coef_.flatten()], axis=0)

# Visualize the decision boundary.
_ = helpers.vis_ds1_dec_bound(data, labels, labels_pred, w_star)
Accuracy: 1.0