CS-456: MP3 moving average baseline

Hello,

I've got some questions about the 'simple baseline' of MP3. First of all, what do you mean by 'moving average of the returns'? Should we take the mean over the returns of an episode, or the mean over the returns of the steps of all episodes? By returns, do you mean discounted returns or rewards? And finally, should we subtract it from the discounted rewards, or from the rewards themselves before computing the discount?

Thanks in advance,

Sébastien

Re: MP3 moving average baseline

by Nicolas El Maalouly - Sunday, 19 May 2019, 10:58 AM

Hi Sébastien,

The goal is to compute a single moving baseline: one scalar value, not a vector, that does not depend on the time step or the state. You have two options:

- You can compute an average over reward values (the average of all rewards encountered at all time steps of all episodes), in which case you should subtract it from the rewards before discounting.

- You can compute an average over discounted returns (the average of all discounted returns computed at all time steps of all episodes), in which case you should subtract it from the discounted returns.

In both cases, you can take a plain average over the steps of each episode, and then a moving average of these per-episode averages across episodes.
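
For illustration, here is a minimal sketch of both options in plain Python. It is not taken from the MP3 handout: the names `MovingAverageBaseline`, `targets_from_rewards`, `targets_from_returns` and the `decay` parameter are hypothetical, the moving average is assumed to be exponential (a sliding window over the last few episodes would work as well), and the baseline is assumed to be updated only after the current episode's targets are computed.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


class MovingAverageBaseline:
    """One scalar: a moving average of per-episode averages (assumed exponential)."""

    def __init__(self, decay=0.9):
        self.decay = decay   # hypothetical smoothing factor, not from the handout
        self.value = None    # no baseline before the first episode

    def current(self):
        return 0.0 if self.value is None else self.value

    def update(self, episode_values):
        # Plain average over the steps of this episode...
        episode_mean = sum(episode_values) / len(episode_values)
        # ...then a moving average of these averages across episodes.
        if self.value is None:
            self.value = episode_mean
        else:
            self.value = self.decay * self.value + (1 - self.decay) * episode_mean
        return self.value


def targets_from_rewards(rewards, gamma, baseline):
    """Option 1: baseline over rewards, subtracted before discounting."""
    b = baseline.current()
    centered = [r - b for r in rewards]
    targets = discounted_returns(centered, gamma)
    baseline.update(rewards)
    return targets


def targets_from_returns(rewards, gamma, baseline):
    """Option 2: baseline over discounted returns, subtracted from the returns."""
    returns = discounted_returns(rewards, gamma)
    b = baseline.current()
    targets = [g - b for g in returns]
    baseline.update(returns)
    return targets
```

Either function would be called once per episode with that episode's list of rewards, and the resulting targets would take the place of the raw discounted returns in the policy gradient update. Each option needs its own `MovingAverageBaseline` instance, since one tracks rewards and the other tracks discounted returns.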

Re: MP3 moving average baseline

by Sébastien Jacques Roger Morand - Sunday, 19 May 2019, 2:07 PM

Hey Nicolas,

Thanks a lot for your quick answer, on a Sunday on top of that!

Cheers,

Sébastien