Hello,
I've got some questions about the 'simple baseline' of mp3. First of all, what do you mean by 'moving average of the returns'? Should we take the mean over the return of each episode, or the mean over the returns of the steps of all episodes? By returns, do you mean discounted returns or rewards? And finally, should we subtract it from the discounted rewards, or from the rewards themselves before computing the discount?
Thanks in advance,
Sébastien