$\renewcommand{\real}{\mathbb{R}}$ $\renewcommand{\xb}{\mathbf{x}}$ $\renewcommand{\wb}{\mathbf{w}}$ $\renewcommand{\Xb}{\mathbf{X}}$ $\renewcommand{\Lb}{\mathbf{L}}$
$\DeclareMathOperator*{\argmin}{argmin}$
%load_ext autoreload
%autoreload 2
%matplotlib notebook
# project files
import helpers as helpers
# 3rd party
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
This week's exercise is about binary-class logistic regression. Let's start by reloading the Iris Flower Dataset from last week. To keep the plotting simple, we will only use 2 of the 4 features. For the first part we will use 2 of the 3 classes (setosa and versicolor). Later on, we will also use the third class, virginica. For this part, our two-class dataset is therefore:
Notice that our labels are {0, 1}, not {-1, 1} as they were for the perceptron. This is because the formulation we have seen in the course for logistic regression was designed for {0, 1} labels. It can easily be modified to suit other value pairs.
Also note that $\Xb$ is a matrix of shape $(N \times D)$. However, a single example $\xb_n$ is a column vector of shape $(D \times 1)$. When you want to multiply one data sample with the weights vector $\wb$ (also a column vector of shape $(D \times 1)$) you will see: $\xb_n^T\cdot\wb$. When you want to multiply the entire data matrix with the weights vector, you will see: $\Xb\cdot\wb$.
data, labels = helpers.load_ds_iris(sep_l=True, sep_w=True, pet_l=False, pet_w=False,
setosa=True, versicolor=True, virginica=False)
fig = helpers.scatter2d_multiclass(data, labels)
In logistic regression, the probability of a datapoint belonging to a class is found as: $$P(l_n=1|\xb_n, \wb) = \frac{e^{(\xb_n^T\cdot \wb)}}{1+e^{(\xb_n^T\cdot \wb)}} $$
This is called the sigmoid function! The sigmoid function is defined as: $$\sigma(t)= \frac {e^t}{1+e^t}= \frac{1}{1+e^{-t}}$$
Let's try to code this function. You can use the numpy function np.exp(x) to take the exponential of a number.
def sigmoid(t):
""" Sigmoid function
Args:
t (np.array): Input data of shape (N, )
Returns:
np.array: Probabilities of shape (N, ), where each value is in (0, 1).
"""
return 1/(1 + np.exp(-t))
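As a quick sanity check (a small standalone snippet, with sigmoid repeated so it runs on its own): $\sigma(0)=0.5$, and $\sigma(-t)=1-\sigma(t)$ by symmetry.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid: 1 / (1 + e^{-t})."""
    return 1 / (1 + np.exp(-t))

print(sigmoid(0.0))                         # exactly 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # values symmetric around 0.5
```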
A side note: Since the probabilities sum up to 1, we find $$P(l_n=0|\xb_n, \wb) = 1-P(l_n=1|\xb_n, \wb).$$
Q: Using $P(l_n|\xb_n, \wb)$, verify that the likelihood is equal to:
$$P(\Lb|\Xb, \wb) = \prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n} $$
Solution:
$$P(\Lb|\Xb, \wb) = \prod_n P(l_n|\xb_n, \wb) \\ =\prod_{n: l_n = 1} P(l_n = 1|\xb_n, \wb) \prod_{n: l_n = 0} P(l_n = 0|\xb_n, \wb) \\ =\prod_{n: l_n = 1} \sigma(\xb_n^T\cdot\wb) \prod_{n: l_n = 0} (1-\sigma(\xb_n^T\cdot\wb)) \\ P(\Lb|\Xb, \wb) =\prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n} $$
Q: Using the likelihood, verify that the loss (also called energy) function is equal to:
$$E(\wb) = -\sum_{n}l_n\log(\sigma(\xb_n^T\cdot \wb))+(1-l_n)\log(1-\sigma(\xb_n^T\cdot \wb))$$
Solution:
$$E(\wb) = -\log(P(\Lb|\Xb, \wb))\\ = -\log(\prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n}) \\ E(\wb) = -\sum_{n}l_n\log(\sigma(\xb_n^T\cdot \wb))+(1-l_n)\log(1-\sigma(\xb_n^T\cdot \wb))$$
This energy function is also equal to: $$E(\wb) = \sum_{n}(\log(1+e^{\xb_n^T\cdot \wb})-l_n(\xb_n^T\cdot \wb))$$ (Feel free to work this out and see for yourself that they are equal!)
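For reference, the equivalence follows from $\log(\sigma(t)) = t - \log(1+e^{t})$ and $\log(1-\sigma(t)) = -\log(1+e^{t})$. Writing $t_n = \xb_n^T\cdot\wb$, each summand of $E(\wb)$ becomes
$$-l_n\log(\sigma(t_n))-(1-l_n)\log(1-\sigma(t_n)) = -l_n(t_n - \log(1+e^{t_n})) + (1-l_n)\log(1+e^{t_n}) = \log(1+e^{t_n}) - l_n t_n,$$
and summing over $n$ gives the second form.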
Below we have provided the function for computing the loss, loss_logistic().
def loss_logistic(data, labels, w):
""" Logistic regression loss function for binary classes
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
float: Loss of logistic regression.
"""
return np.sum(np.log(1+np.exp(data.dot(w)))-labels*(data.dot(w)))
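A side note on numerics: np.exp(data.dot(w)) overflows once the scores get large. If you run into this, one standard remedy (a sketch, not part of the exercise) is np.logaddexp, since $\log(1+e^{t}) = \mathrm{logaddexp}(0, t)$:

```python
import numpy as np

def loss_logistic_stable(data, labels, w):
    """Binary logistic loss via log(1 + e^t) = logaddexp(0, t),
    which avoids overflow for large positive scores t."""
    t = data.dot(w)
    return np.sum(np.logaddexp(0.0, t) - labels * t)
```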
We need to find the gradient of the loss function in order to move our weights towards the optimal weights. The gradient of the loss function is: $$\nabla E(\wb)= \sum_n (\sigma(\xb_n^T\cdot \wb) - l_n)\xb_n $$ Let us put this into a nice matrix format: $$\nabla E(\wb)= \Xb^T(\sigma(\Xb\cdot \wb) - \Lb)$$
Fill in the function for finding the gradient, gradient_logistic(). You can use the numpy function np.dot() for matrix multiplication.
def gradient_logistic(data, labels, w):
""" Logistic regression gradient function for binary classes
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
np.array: Gradient array of shape (D, )
"""
return data.T.dot(sigmoid(data.dot(w))-labels)
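A useful habit before training: verify the analytic gradient against central finite differences of the loss. Below is a self-contained sketch (it redefines the three functions above so it runs on its own, on small random data):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def loss_logistic(data, labels, w):
    return np.sum(np.log(1 + np.exp(data.dot(w))) - labels * (data.dot(w)))

def gradient_logistic(data, labels, w):
    return data.T.dot(sigmoid(data.dot(w)) - labels)

def grad_check(data, labels, w, eps=1e-6):
    """Max abs difference between the analytic gradient
    and a central finite-difference estimate."""
    num = np.zeros_like(w)
    for d in range(w.shape[0]):
        e = np.zeros_like(w)
        e[d] = eps
        num[d] = (loss_logistic(data, labels, w + e)
                  - loss_logistic(data, labels, w - e)) / (2 * eps)
    return np.max(np.abs(num - gradient_logistic(data, labels, w)))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = (rng.rand(20) > 0.5).astype(float)
w = rng.randn(3) * 0.1
print("max |analytic - numeric|:", grad_check(X, y, w))
```

If the two disagree by more than roughly the square root of machine precision, there is almost certainly a bug in the gradient code.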
Now let us write a function to perform classification using logistic regression, logistic_regression_classify(). This function uses the weights we find during training to predict the labels for the data.
Hints: The elements of an array x that are equal to some value, say 2, can be found with (x==2). We can then set these elements to 0 by x[x==2]=0.
def logistic_regression_classify(data, w):
""" Classification function for binary class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
np.array: Label assignments of data of shape (N, )
"""
#### write your code here: find predictions and threshold.
predictions = sigmoid(data@w)
predictions[predictions<0.5]=0
predictions[predictions>=0.5]=1
return predictions
Let us also use our accuracy() function from last week's exercise. Recall that it implements
$$ f_{\text{acc}} = \frac{\text{# correct predictions}}{\text{# all predictions}}$$
We have provided it for you below:
def accuracy(labels_gt, labels_pred):
""" Computes accuracy.
Args:
labels_gt (np.array): GT labels of shape (N, ).
labels_pred (np.array): Predicted labels of shape (N, ).
Returns:
float: Accuracy, in range [0, 1].
"""
return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]
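For example (a tiny standalone check, with accuracy repeated so the snippet runs on its own):

```python
import numpy as np

def accuracy(labels_gt, labels_pred):
    return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]

gt   = np.array([0, 1, 1, 0])
pred = np.array([0, 1, 0, 0])
print(accuracy(gt, pred))  # 3 correct out of 4 -> 0.75
```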
In order to find good weights and achieve high accuracy, we need to train our model. Fill in the missing parts of the function logistic_regression_train().
The function first initializes the weights according to a Gaussian distribution. In each iteration, you should find the gradient using gradient_logistic and take a gradient step using this gradient. Recall that a gradient step is: $$ \wb^{t + 1} = \wb^{t} - \eta \nabla E(\wb^{t}) $$
The loss and plot parameters print out the loss and display the predictions as a plot at each iteration, respectively. You do not need to modify these parts.
def logistic_regression_train(data, labels, max_iters=10, lr=0.001, loss=True, plot=True):
""" Training function for binary class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
max_iters (integer): Maximum number of iterations. Default:10
lr (float): The learning rate of the gradient step. Default: 0.001
loss (boolean): Calculates the loss of logistic regression at each iteration and prints it. Default:True
plot (boolean): Plots the predictions at each iteration. Default:True
Returns:
np.array: Weights of shape (D, )
"""
#initialize weights randomly according to gaussian distribution
weights = np.random.normal(0, 0.1, [data.shape[1],])
for iter in range(max_iters):
########## write your code here: find gradient and do a gradient step
gradient = gradient_logistic(data, labels, weights)
weights = weights - lr*gradient
##################################
predictions = logistic_regression_classify(data, weights)
if loss:
print('loss at iteration', iter, ":", loss_logistic(data, labels, weights))
if plot:
fig = helpers.visualize_predictions(data=data, labels_gt=labels, labels_pred=predictions)
plt.title("iteration "+ str(iter))
if accuracy(labels, predictions) == 1:
break
return weights
Run the code below to see your training in action. What do you observe? Try playing with the learning rate and number of max iterations.
weights = logistic_regression_train(data, labels, max_iters=20, lr=1e-3, loss=True, plot=True)
predictions = logistic_regression_classify(data, weights)
print("Accuracy is", accuracy(labels, predictions))
Now that we have classified two classes, we can move on to multi-class logistic regression.
Load the synthetic data by running the code segment below. We will use this dataset for now as it is easy to work with. Our data is:
helpers.generate_dataset_synth()
data_multi, labels_multi = helpers.load_dataset_synth()
fig = helpers.scatter2d_multiclass(data_multi, labels_multi, fig=None, fig_size=None, color_map=None,
legend=True, legend_map=None, grid=False, show=False)
Multi-class logistic regression is an extension of binary logistic regression.
Let us consider logistic regression for K classes. We keep our weights in a weight matrix $\mathbf{W}$, where every column is $\wb_k$ for class $k$. Therefore, for every class $k$, we learn a separate $\wb_k$ during training. The weights matrix will be of size $(D \times K)$.
The generalized probability for logistic regression is $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_j^K e^{\xb_n^T\cdot \wb_j}},$$ which is called the softmax function! Let us denote this function by $f_\text{softmax}$.
Study the implementation of this function below. It is used to assign the probabilities of a datapoint belonging to each class. For example, for a single datapoint, for 3 classes you might have the following probability assignments {0.2, 0.7, 0.1}. The probabilities all sum up to 1.
def f_softmax(data, w):
""" Softmax function
Args:
data (np.array): Input data of shape (N, D)
w (np.array): Weights of shape (D, K) where K is # of classes
Returns:
np.array: Probabilities of shape (N, K), where each value is in (0, 1) and each row sums to 1.
"""
#data.shape[0] is the number of datapoints, and w.shape[1] is the number of classes.
res = np.zeros([data.shape[0], w.shape[1]])
#The normalization term only has to be calculated once for all classes
norm = np.sum(np.exp(data@w), axis=1)
#We iterate for each class
for k in range(w.shape[1]):
exp_top = np.exp(data@w[:, k])
res[:, k] = exp_top/norm
return res
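The loop above is clear, but np.exp(data@w) can overflow for large scores. A common, fully vectorized alternative (a sketch, not required for the exercise) subtracts the per-row maximum before exponentiating; the extra factor $e^{-\max}$ appears in both numerator and denominator and cancels, so the result is unchanged:

```python
import numpy as np

def f_softmax_stable(data, w):
    """Vectorized softmax with the max-subtraction trick for stability."""
    scores = data @ w                                # (N, K)
    scores -= np.max(scores, axis=1, keepdims=True)  # shift rows; result unchanged
    exp = np.exp(scores)
    return exp / np.sum(exp, axis=1, keepdims=True)
```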
Using these probabilities, we find that the loss function we are trying to minimize is
$$E(\mathbf{W}) = -\sum_{n}\sum_{k}\mathbb{1}_{[l_n=k]}\log(f_\text{softmax}(\xb_{n}^T \cdot \wb_k))$$
The expression $\mathbb{1}_{[l_n=k]}$ is the indicator function: it is 1 when the label $l_n$ is equal to $k$, and 0 otherwise. The loss function code is provided for you below. Note the use of (labels==k) for the indicator function $\mathbb{1}_{[l_n=k]}$.
def loss_logistic_multi(data, labels, w):
""" Loss function for multi class logistic regression
Args:
data (np.array): Input data of shape (N, D)
labels (np.array): Labels of shape (N, )
w (np.array): Weights of shape (D, K)
Returns:
float: Loss value
"""
loss = 0
res = f_softmax(data, w)
for k in range(w.shape[1]):
loss -= np.sum((labels==k)*np.log(res[:,k]))
return loss
To find the gradient, we find the gradient of $E(\mathbf{W})$ with respect to each $\wb_k$ vector separately. For model $\wb_k$ we have $$\nabla E(\mathbf{W})_{\wb_k}=\sum_n(f_\text{softmax}(\xb_{n}^T \cdot \wb_k)-\mathbb1_{[l_n=k]})\xb_{n}$$ Let's put this into matrix format as well: $$\nabla E(\mathbf{W})_{\wb_k}= \Xb^T(f_\text{softmax}(\Xb\cdot \wb_k) - \mathbf{1}_{[l = k]})$$
Now, you will fill in the gradient function, gradient_logistic_multi(), given below.
def gradient_logistic_multi(data, labels, w):
""" Gradient function for multi class logistic regression
Args:
data (np.array): Input data of shape (N, D)
labels (np.array): Labels of shape (N, )
w (np.array): Weights of shape (D, K)
Returns:
np.array: Gradients of shape (D, K)
"""
grad_w = np.zeros(w.shape)
res = f_softmax(data, w)
for k in range(w.shape[1]):
grad_w[:, k] = data.T@(res[:,k]-(labels==k))
return grad_w
Write the functions for classification and training.
Hints: To find which class a point is assigned to, use the numpy function argmax. This function returns the index of the highest value in an array; set its axis argument to 1 to take the argmax over classes.
Training will be the same as the binary case. First, we will find the gradient. Then we will update the weights using gradient descent.
def logistic_regression_classify_multi(data, w):
""" Classification function for multi class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
w (np.array): Weights of logistic regression model of shape (D, K)
Returns:
np.array: Label assignments of data of shape (N, ).
"""
#### write your code here: find predictions, argmax to find the correct label
predictions = np.argmax(f_softmax(data, w), axis=1)
return predictions
def logistic_regression_train_multi(data, labels, k=3, max_iters=10, lr=0.001, loss=True, plot=True):
""" Classification function for multi class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, )
k (integer): Number of classes. Default=3
max_iters (integer): Maximum number of iterations. Default:10
lr (float): The learning rate of the gradient step. Default: 0.001
loss (boolean): Calculates the loss of logistic regression at each iteration and prints it. Default:True
plot (boolean): Plots the predictions at each iteration. Default:True
Returns:
np.array: Weights of shape (D, K).
"""
weights = np.random.normal(0, 0.1, [data.shape[1], k])
for iter in range(max_iters):
gradient = gradient_logistic_multi(data, labels, weights)
weights = weights - lr*gradient
##################################
predictions = logistic_regression_classify_multi(data, weights)
if loss:
print('loss at iteration', iter, ":", loss_logistic_multi(data, labels, weights))
if plot:
fig = helpers.visualize_predictions(data=data, labels_gt=labels, labels_pred=predictions)
plt.title("iteration "+ str(iter))
if accuracy(labels, predictions) == 1:
break
return weights
weights_multi = logistic_regression_train_multi(data_multi, labels_multi, max_iters=15, lr=5e-4, loss=True, plot=True)
predictions_multi = logistic_regression_classify_multi(data_multi, weights_multi)
print("Accuracy is", accuracy(labels_multi, predictions_multi))
A side note: Notice that using this simple formulation, we have trained K classifiers for K classes. Our probability assignments are according to the softmax function. $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_j^K e^{\xb_n^T\cdot \wb_j}}$$ And $$\sum_{k}^{K}P(l_n=k|\xb_n, \mathbf{W})=1$$ However, in the binary case we were training 1 classifier for 2 classes. The probabilities are assigned according to the sigmoid function. $$P(l_n=1|\xb_n, \wb) = \frac{e^{(\xb_n^T\cdot \wb)}}{1+e^{(\xb_n^T\cdot \wb)}} \\ P(l_n=0|\xb_n, \wb) = 1-P(l_n=1|\xb_n, \wb) = \frac{1}{1+e^{(\xb_n^T\cdot \wb)}}$$
Similar to the binary case, we can train K-1 classifiers for K classes. For classes $k = 1, \ldots, K-1$, we modify the probability assignment function to $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}$$
Q: What is $P(l_n=K|\xb_n, \mathbf{W})$? $$P(l_n=K|\xb_n, \mathbf{W}) = 1- \sum_{k}^{K-1}P(l_n=k|\xb_n, \mathbf{W})\\ 1- \sum_{k}^{K-1}\frac{e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}} = 1-\frac{\sum_{k}^{K-1}e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}\\ P(l_n=K|\xb_n, \mathbf{W}) = \frac{1}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}$$
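A short aside on why this works: the K-1 classifier model is exactly the softmax model with $\wb_K$ fixed to the zero vector. Substituting $e^{\xb_n^T\cdot \wb_K} = e^0 = 1$ into the softmax denominator gives
$$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_{j=1}^{K-1} e^{\xb_n^T\cdot \wb_j} + 1},$$
which is the expression above. This also shows that the full softmax parameterization is redundant: adding the same vector to every $\wb_k$ leaves all the probabilities unchanged.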
Further reading: https://en.wikipedia.org/wiki/Multinomial_logistic_regression
Now let us test our code on the iris dataset. We load and display the dataset below.
data_multi, labels_multi = helpers.load_ds_iris(sep_l=True, sep_w=True, pet_l=False, pet_w=False, setosa=True, versicolor=True, virginica=True)
fig = helpers.scatter2d_multiclass(data_multi, labels_multi, fig=None, fig_size=None, color_map=None,
legend=True, legend_map=None, grid=False, show=False)
weights_multi = logistic_regression_train_multi(data_multi, labels_multi, max_iters=100, lr=1e-3, loss=True, plot=False)
predictions_multi = logistic_regression_classify_multi(data_multi, weights_multi)
print("Accuracy is", accuracy(labels_multi, predictions_multi))
fig = helpers.visualize_predictions(data=data_multi, labels_gt=labels_multi, labels_pred=predictions_multi)
Q: Comment on the accuracy. What is the difference between the synthetic dataset and this one?
It does a good job of classifying class 0, but not classes 1 and 2. This is because, for these two features, class 0 is linearly separable from classes 1 and 2, but classes 1 and 2 are not linearly separable from each other. In the synthetic dataset, all classes were linearly separable.