$\renewcommand{\real}{\mathbb{R}}$ $\renewcommand{\xb}{\mathbf{x}}$ $\renewcommand{\wb}{\mathbf{w}}$ $\renewcommand{\Xb}{\mathbf{X}}$ $\renewcommand{\Lb}{\mathbf{L}}$
$\DeclareMathOperator*{\argmin}{argmin}$
%load_ext autoreload
%autoreload 2
%matplotlib notebook
# project files
import helpers as helpers
# 3rd party
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
This week's exercise is about binary-class logistic regression. Let's start by reloading the Iris Flower Dataset from last week. To keep the plotting simple, we will only use 2 of the 4 features. For the first part we will use 2 of the 3 classes (setosa and versicolor). Later on, we will also use the third class, virginica. For this part, our two-class dataset is therefore:
Notice that our labels are {0, 1}, not {-1, 1} as they were for the perceptron. This is because the formulation we have seen in the course for logistic regression was designed for {0, 1} labels. It can easily be modified to suit other value pairs.
Also note that $\Xb$ is a matrix of shape $(N \times D)$. However, a single example $\xb_n$ is a column vector of shape $(D \times 1)$. When you want to multiply one data sample with the weights vector $\wb$ (also a column vector of shape $(D \times 1)$) you will see: $\xb_n^T\cdot\wb$. When you want to multiply the entire data matrix with the weights vector, you will see: $\Xb\cdot\wb$.
data, labels = helpers.load_ds_iris(sep_l=True, sep_w=True, pet_l=False, pet_w=False,
setosa=True, versicolor=True, virginica=False)
fig = helpers.scatter2d_multiclass(data, labels)
In logistic regression, the probability of a datapoint belonging to a class is found as: $$P(l_n=1|\xb_n, \wb) = \frac{e^{(\xb_n^T\cdot \wb)}}{1+e^{(\xb_n^T\cdot \wb)}} $$
This is called the sigmoid function! The sigmoid function is defined as: $$\sigma(t)= \frac {e^t}{1+e^t}= \frac{1}{1+e^{-t}}$$
Let's try to code this function. You can use the numpy function np.exp(x) to take the exponential of a number.
def sigmoid(t):
""" Sigmoid function
Args:
t (np.array): Input data of shape (N, )
Returns:
np.array: Probabilities of shape (N, ), where each value is in (0, 1).
"""
return 1/(1 + np.exp(-t))
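As a quick sanity check (a small standalone snippet, with sigmoid repeated so it runs on its own): $\sigma(0)=0.5$, and $\sigma(-t)=1-\sigma(t)$ by symmetry.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid: 1 / (1 + e^{-t})."""
    return 1 / (1 + np.exp(-t))

print(sigmoid(0.0))                         # exactly 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # values symmetric around 0.5
```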
A side note: Since the probabilities sum up to 1, we find $$P(l_n=0|\xb_n, \wb) = 1-P(l_n=1|\xb_n, \wb).$$
Q: Using $P(l_n|\xb_n, \wb)$, verify that the likelihood is equal to:
$$P(\Lb|\Xb, \wb) = \prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n} $$
Solution:
$$P(\Lb|\Xb, \wb) = \prod_n P(l_n|\xb_n, \wb) \\ =\prod_{n: l_n = 1} P(l_n = 1|\xb_n, \wb) \prod_{n: l_n = 0} P(l_n = 0|\xb_n, \wb) \\ =\prod_{n: l_n = 1} \sigma(\xb_n^T\cdot\wb) \prod_{n: l_n = 0} (1-\sigma(\xb_n^T\cdot\wb)) \\ P(\Lb|\Xb, \wb) =\prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n} $$
Q: Using the likelihood, verify that the loss (also called energy) function is equal to:
$$E(\wb) = -\sum_{n}l_n\log(\sigma(\xb_n^T\cdot \wb))+(1-l_n)\log(1-\sigma(\xb_n^T\cdot \wb))$$
Solution:
$$E(\wb) = -\log(P(\Lb|\Xb, \wb))\\ = -\log(\prod_n\sigma(\xb_n^T\cdot\wb)^{l_n} (1-\sigma(\xb_n^T\cdot\wb))^{1-l_n}) \\ E(\wb) = -\sum_{n}l_n\log(\sigma(\xb_n^T\cdot \wb))+(1-l_n)\log(1-\sigma(\xb_n^T\cdot \wb))$$
This energy function is also equal to: $$E(\wb) = \sum_{n}(\log(1+e^{\xb_n^T\cdot \wb})-l_n(\xb_n^T\cdot \wb))$$ (Feel free to work this out and see for yourself that they are equal!)
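For reference, the equivalence follows from $\log(\sigma(t)) = t - \log(1+e^{t})$ and $\log(1-\sigma(t)) = -\log(1+e^{t})$. Writing $t_n = \xb_n^T\cdot\wb$, each summand of $E(\wb)$ becomes
$$-l_n\log(\sigma(t_n))-(1-l_n)\log(1-\sigma(t_n)) = -l_n(t_n - \log(1+e^{t_n})) + (1-l_n)\log(1+e^{t_n}) = \log(1+e^{t_n}) - l_n t_n,$$
and summing over $n$ gives the second form.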
Below we have provided the function for computing the loss, loss_logistic().
def loss_logistic(data, labels, w):
""" Logistic regression loss function for binary classes
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
float: Loss of logistic regression.
"""
return np.sum(np.log(1+np.exp(data.dot(w)))-labels*(data.dot(w)))
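A side note on numerics: np.exp(data.dot(w)) overflows once the scores get large. If you run into this, one standard remedy (a sketch, not part of the exercise) is np.logaddexp, since $\log(1+e^{t}) = \mathrm{logaddexp}(0, t)$:

```python
import numpy as np

def loss_logistic_stable(data, labels, w):
    """Binary logistic loss via log(1 + e^t) = logaddexp(0, t),
    which avoids overflow for large positive scores t."""
    t = data.dot(w)
    return np.sum(np.logaddexp(0.0, t) - labels * t)
```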
We need to find the gradient of the loss function in order to move our weights towards the optimal weights. The gradient of the loss function is: $$\nabla E(\wb)= \sum_n (\sigma(\xb_n^T\cdot \wb) - l_n)\xb_n $$ Let us put this into a nice matrix format: $$\nabla E(\wb)= \Xb^T(\sigma(\Xb\cdot \wb) - \Lb)$$
Fill in the function for finding the gradient, gradient_logistic(). You can use the numpy function np.dot() for matrix multiplication.
def gradient_logistic(data, labels, w):
""" Logistic regression gradient function for binary classes
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
np.array: Gradient array of shape (D, )
"""
return data.T.dot(sigmoid(data.dot(w))-labels)
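A useful habit before training: verify the analytic gradient against central finite differences of the loss. Below is a self-contained sketch (it redefines the three functions above so it runs on its own, on small random data):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def loss_logistic(data, labels, w):
    return np.sum(np.log(1 + np.exp(data.dot(w))) - labels * (data.dot(w)))

def gradient_logistic(data, labels, w):
    return data.T.dot(sigmoid(data.dot(w)) - labels)

def grad_check(data, labels, w, eps=1e-6):
    """Max abs difference between the analytic gradient
    and a central finite-difference estimate."""
    num = np.zeros_like(w)
    for d in range(w.shape[0]):
        e = np.zeros_like(w)
        e[d] = eps
        num[d] = (loss_logistic(data, labels, w + e)
                  - loss_logistic(data, labels, w - e)) / (2 * eps)
    return np.max(np.abs(num - gradient_logistic(data, labels, w)))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = (rng.rand(20) > 0.5).astype(float)
w = rng.randn(3) * 0.1
print("max |analytic - numeric|:", grad_check(X, y, w))
```

If the two disagree by more than roughly the square root of machine precision, there is almost certainly a bug in the gradient code.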
Now let us write a function to perform classification using logistic regression, logistic_regression_classify(). This function uses the weights we find during training to predict the labels for the data.
Hints: The elements of an array x that are equal to some value, say 2, can be found with (x==2). We can then set these elements to 0 by x[x==2]=0.
def logistic_regression_classify(data, w):
""" Classification function for binary class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
w (np.array): Weights of logistic regression model of shape (D, )
Returns:
np.array: Label assignments of data of shape (N, )
"""
#### write your code here: find predictions and threshold.
predictions = sigmoid(data@w)
predictions[predictions<0.5]=0
predictions[predictions>=0.5]=1
return predictions
Let us also use our accuracy() function from last week's exercise. Recall that it implements
$$ f_{\text{acc}} = \frac{\text{# correct predictions}}{\text{# all predictions}}$$
We have provided it for you below:
def accuracy(labels_gt, labels_pred):
""" Computes accuracy.
Args:
labels_gt (np.array): GT labels of shape (N, ).
labels_pred (np.array): Predicted labels of shape (N, ).
Returns:
float: Accuracy, in range [0, 1].
"""
return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]
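For example (a tiny standalone check, with accuracy repeated so the snippet runs on its own):

```python
import numpy as np

def accuracy(labels_gt, labels_pred):
    return np.sum(labels_gt == labels_pred) / labels_gt.shape[0]

gt   = np.array([0, 1, 1, 0])
pred = np.array([0, 1, 0, 0])
print(accuracy(gt, pred))  # 3 correct out of 4 -> 0.75
```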
In order to find good weights and achieve high accuracy, we need to train our model. Fill in the missing parts of the function logistic_regression_train().
The function first initializes the weights according to a Gaussian distribution. In each iteration, you should find the gradient using gradient_logistic and take a gradient step using this gradient. Recall that a gradient step is: $$ \wb^{t + 1} = \wb^{t} - \eta \nabla E(\wb^{t}) $$
The loss and plot parameters print out the loss and display the predictions as a plot at each iteration, respectively. You do not need to modify these parts.
def logistic_regression_train(data, labels, max_iters=10, lr=0.001, loss=True, plot=True):
""" Training function for binary class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, ).
max_iters (integer): Maximum number of iterations. Default:10
lr (float): The learning rate of the gradient step. Default: 0.001
loss (boolean): Calculates the loss of logistic regression at each iteration and prints it. Default:True
plot (boolean): Plots the predictions at each iteration. Default:True
Returns:
np.array: Weights of shape (D, )
"""
#initialize weights randomly according to gaussian distribution
weights = np.random.normal(0, 0.1, [data.shape[1],])
for iter in range(max_iters):
########## write your code here: find gradient and do a gradient step
gradient = gradient_logistic(data, labels, weights)
weights = weights - lr*gradient
##################################
predictions = logistic_regression_classify(data, weights)
if loss:
print('loss at iteration', iter, ":", loss_logistic(data, labels, weights))
if plot:
fig = helpers.visualize_predictions(data=data, labels_gt=labels, labels_pred=predictions)
plt.title("iteration "+ str(iter))
if accuracy(labels, predictions) == 1:
break
return weights
Run the code below to see your training in action. What do you observe? Try playing with the learning rate and number of max iterations.
weights = logistic_regression_train(data, labels, max_iters=20, lr=1e-3, loss=True, plot=True)
predictions = logistic_regression_classify(data, weights)
print("Accuracy is", accuracy(labels, predictions))
Now that we have classified two classes, we can move on to multi-class logistic regression.
Load the synthetic data by running the code segment below. We will use this dataset for now as it is easy to work with. Our data is:
helpers.generate_dataset_synth()
data_multi, labels_multi = helpers.load_dataset_synth()
fig = helpers.scatter2d_multiclass(data_multi, labels_multi, fig=None, fig_size=None, color_map=None,
legend=True, legend_map=None, grid=False, show=False)
Multi-class logistic regression is an extension of binary logistic regression.
Let us consider logistic regression for K classes. We keep our weights in a weight matrix $\mathbf{W}$, where every column is $\wb_k$ for class $k$. Therefore, for every class $k$, we learn a separate $\wb_k$ during training. The weights matrix will be of size $(D \times K)$.
The generalized probability for logistic regression is $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_j^K e^{\xb_n^T\cdot \wb_j}},$$ which is called the softmax function! Let us denote this function by $f_\text{softmax}$.
Study the implementation of this function below. It is used to assign the probabilities of a datapoint belonging to each class. For example, for a single datapoint, for 3 classes you might have the following probability assignments {0.2, 0.7, 0.1}. The probabilities all sum up to 1.
def f_softmax(data, w):
""" Softmax function
Args:
data (np.array): Input data of shape (N, D)
w (np.array): Weights of shape (D, K) where K is # of classes
Returns:
np.array: Probabilities of shape (N, K), where each value is in (0, 1) and each row sums to 1.
"""
#data.shape[0] is the number of datapoints, and w.shape[1] is the number of classes.
res = np.zeros([data.shape[0], w.shape[1]])
#The normalization term only has to be calculated once for all classes
norm = np.sum(np.exp(data@w), axis=1)
#We iterate for each class
for k in range(w.shape[1]):
exp_top = np.exp(data@w[:, k])
res[:, k] = exp_top/norm
return res
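The loop above is clear, but np.exp(data@w) can overflow for large scores. A common, fully vectorized alternative (a sketch, not required for the exercise) subtracts the per-row maximum before exponentiating; the extra factor $e^{-\max}$ appears in both numerator and denominator and cancels, so the result is unchanged:

```python
import numpy as np

def f_softmax_stable(data, w):
    """Vectorized softmax with the max-subtraction trick for stability."""
    scores = data @ w                                # (N, K)
    scores -= np.max(scores, axis=1, keepdims=True)  # shift rows; result unchanged
    exp = np.exp(scores)
    return exp / np.sum(exp, axis=1, keepdims=True)
```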
Using these probabilities, we find that the loss function we are trying to minimize is
$$E(\mathbf{W}) = -\sum_{n}\sum_{k}\mathbb{1}_{[l_n=k]}\log(f_\text{softmax}(\xb_{n}^T \cdot \wb_k))$$
The expression $\mathbb{1}_{[l_n=k]}$ is the indicator function: it is 1 when the label $l_n$ is equal to $k$, and 0 otherwise. The loss function code is provided for you below. Note the use of (labels==k) for the indicator function $\mathbb{1}_{[l_n=k]}$.
def loss_logistic_multi(data, labels, w):
""" Loss function for multi class logistic regression
Args:
data (np.array): Input data of shape (N, D)
labels (np.array): Labels of shape (N, )
w (np.array): Weights of shape (D, K)
Returns:
float: Loss value
"""
loss = 0
res = f_softmax(data, w)
for k in range(w.shape[1]):
loss -= np.sum((labels==k)*np.log(res[:,k]))
return loss
To find the gradient, we find the gradient of $E(\mathbf{W})$ with respect to each $\wb_k$ vector separately. For model $\wb_k$ we have $$\nabla E(\mathbf{W})_{\wb_k}=\sum_n(f_\text{softmax}(\xb_{n}^T \cdot \wb_k)-\mathbb1_{[l_n=k]})\xb_{n}$$ Let's put this into matrix format as well: $$\nabla E(\mathbf{W})_{\wb_k}= \Xb^T(f_\text{softmax}(\Xb\cdot \wb_k) - \mathbf{1}_{[l = k]})$$
Now, you will fill in the gradient function, gradient_logistic_multi(), given below.
def gradient_logistic_multi(data, labels, w):
""" Gradient function for multi class logistic regression
Args:
data (np.array): Input data of shape (N, D)
labels (np.array): Labels of shape (N, )
w (np.array): Weights of shape (D, K)
Returns:
np.array: Gradients of shape (D, K)
"""
grad_w = np.zeros(w.shape)
res = f_softmax(data, w)
for k in range(w.shape[1]):
grad_w[:, k] = data.T@(res[:,k]-(labels==k))
return grad_w
Write the functions for classification and training.
Hints: To find which class a point is assigned to, use the numpy function argmax. This function returns the index of the highest value in an array; set its axis argument to 1 to take the argmax over classes.
Training will be the same as the binary case. First, we will find the gradient. Then we will update the weights using gradient descent.
def logistic_regression_classify_multi(data, w):
""" Classification function for multi class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
w (np.array): Weights of logistic regression model of shape (D, K)
Returns:
np.array: Label assignments of data of shape (N, ).
"""
#### write your code here: find predictions, argmax to find the correct label
predictions = np.argmax(f_softmax(data, w), axis=1)
return predictions
def logistic_regression_train_multi(data, labels, k=3, max_iters=10, lr=0.001, loss=True, plot=True):
""" Classification function for multi class logistic regression.
Args:
data (np.array): Dataset of shape (N, D).
labels (np.array): Labels of shape (N, )
k (integer): Number of classes. Default=3
max_iters (integer): Maximum number of iterations. Default:10
lr (float): The learning rate of the gradient step. Default: 0.001
loss (boolean): Calculates the loss of logistic regression at each iteration and prints it. Default:True
plot (boolean): Plots the predictions at each iteration. Default:True
Returns:
np.array: Weights of shape (D, K).
"""
weights = np.random.normal(0, 0.1, [data.shape[1], k])
for iter in range(max_iters):
gradient = gradient_logistic_multi(data, labels, weights)
weights = weights - lr*gradient
##################################
predictions = logistic_regression_classify_multi(data, weights)
if loss:
print('loss at iteration', iter, ":", loss_logistic_multi(data, labels, weights))
if plot:
fig = helpers.visualize_predictions(data=data, labels_gt=labels, labels_pred=predictions)
plt.title("iteration "+ str(iter))
if accuracy(labels, predictions) == 1:
break
return weights
weights_multi = logistic_regression_train_multi(data_multi, labels_multi, max_iters=15, lr=5e-4, loss=True, plot=True)
predictions_multi = logistic_regression_classify_multi(data_multi, weights_multi)
print("Accuracy is", accuracy(labels_multi, predictions_multi))
A side note: Notice that using this simple formulation, we have trained K classifiers for K classes. Our probability assignments are according to the softmax function. $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_j^K e^{\xb_n^T\cdot \wb_j}}$$ And $$\sum_{k}^{K}P(l_n=k|\xb_n, \mathbf{W})=1$$ However, in the binary case we were training 1 classifier for 2 classes. The probabilities are assigned according to the sigmoid function. $$P(l_n=1|\xb_n, \wb) = \frac{e^{(\xb_n^T\cdot \wb)}}{1+e^{(\xb_n^T\cdot \wb)}} \\ P(l_n=0|\xb_n, \wb) = 1-P(l_n=1|\xb_n, \wb) = \frac{1}{1+e^{(\xb_n^T\cdot \wb)}}$$
Similar to the binary case, we can train K-1 classifiers for K classes. For classes $k = 1, \ldots, K-1$, we modify the probability assignment function to $$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}$$
Q: What is $P(l_n=K|\xb_n, \mathbf{W})$? $$P(l_n=K|\xb_n, \mathbf{W}) = 1- \sum_{k}^{K-1}P(l_n=k|\xb_n, \mathbf{W})\\ 1- \sum_{k}^{K-1}\frac{e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}} = 1-\frac{\sum_{k}^{K-1}e^{\xb_n^T\cdot \wb_k}}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}\\ P(l_n=K|\xb_n, \mathbf{W}) = \frac{1}{1+\sum_j^{K-1} e^{\xb_n^T\cdot \wb_j}}$$
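A short aside on why this works: the K-1 classifier model is exactly the softmax model with $\wb_K$ fixed to the zero vector. Substituting $e^{\xb_n^T\cdot \wb_K} = e^0 = 1$ into the softmax denominator gives
$$P(l_n=k|\xb_n, \mathbf{W}) = \frac{e^{\xb_n^T\cdot \wb_k}}{\sum_{j=1}^{K-1} e^{\xb_n^T\cdot \wb_j} + 1},$$
which is the expression above. This also shows that the full softmax parameterization is redundant: adding the same vector to every $\wb_k$ leaves all the probabilities unchanged.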
Further reading: https://en.wikipedia.org/wiki/Multinomial_logistic_regression
Now let us test our code on the iris dataset. We load and display the dataset below.
data_multi, labels_multi = helpers.load_ds_iris(sep_l=True, sep_w=True, pet_l=False, pet_w=False, setosa=True, versicolor=True, virginica=True)
fig = helpers.scatter2d_multiclass(data_multi, labels_multi, fig=None, fig_size=None, color_map=None,
legend=True, legend_map=None, grid=False, show=False)
weights_multi = logistic_regression_train_multi(data_multi, labels_multi, max_iters=100, lr=1e-3, loss=True, plot=False)
predictions_multi = logistic_regression_classify_multi(data_multi, weights_multi)
print("Accuracy is", accuracy(labels_multi, predictions_multi))
fig = helpers.visualize_predictions(data=data_multi, labels_gt=labels_multi, labels_pred=predictions_multi)
Q: Comment on the accuracy. What is the difference between the synthetic dataset and this one?
It does a good job of classifying class 0, but not classes 1 and 2. This is because, for these two features, class 0 is linearly separable from classes 1 and 2, but classes 1 and 2 are not linearly separable from each other. In the synthetic dataset, all classes were linearly separable.