Dear TAs,
I have a question regarding ex. 7.2. To me it is ambiguous where batch normalization should be placed; I see two possible choices:
1) Based on the paper, batch normalization should be applied before the activation function of each layer (after computing w*x+b and before applying the nonlinearity).
2) However, the question says "(i.e. before each layer with learnable parameters)", which means after the activation function of the previous layer.
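
To make the two options concrete, here is a minimal sketch in plain Python of how I understand them. The helper names (batch_norm, linear, relu, option_1, option_2) are made up for illustration, ReLU is only assumed as the nonlinearity, and the learnable scale and shift of batch normalization are left out for brevity, so this is not the exercise's reference implementation.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) over the batch.
    Simplification: no learnable scale/shift, no running statistics."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]

def linear(batch, W, b):
    """Compute x @ W + b for every row x; W has shape (inputs, outputs)."""
    return [[sum(x_i * w_i for x_i, w_i in zip(row, col)) + b_j
             for col, b_j in zip(zip(*W), b)]
            for row in batch]

def relu(batch):
    """Elementwise max(0, v); assumed here as the nonlinearity."""
    return [[max(0.0, v) for v in row] for row in batch]

def option_1(x, W, b):
    # Choice 1: normalize the pre-activation w*x+b, then apply the nonlinearity.
    return relu(batch_norm(linear(x, W, b)))

def option_2(x, W, b):
    # Choice 2: normalize the layer's input (the previous layer's output),
    # then apply the layer with learnable parameters and its nonlinearity.
    return relu(linear(batch_norm(x), W, b))

# Hypothetical toy data: a batch of 3 samples with 2 features, 2 output units.
x = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
out_1 = option_1(x, W, b)
out_2 = option_2(x, W, b)
```
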
According to this link, both are correct and can be used.
Which one should we implement in this question?
Best regards,
Saleh