CS-456: REINFORCE with baseline algorithm

Dear TAs,

I have a question regarding "REINFORCE with baseline". In the lecture, the update of the value network parameters (w) includes a discount factor (\gamma^t), which does not make sense to me. I checked several references, including the 2018 edition of the RL book, and they omit the discount factor when updating the value network. In the homework implementation we also simply used the MSE, which omits the factor. Are both versions OK, or does one of them make more sense?

Thanks,

Wanhao

Re: REINFORCE with baseline algorithm

by Nicolas El Maalouly, Thursday 27 June 2019, 12:48

Both are correct, since the factor only affects the learning rate: \gamma^t just rescales the step size used at time step t. For the value network it is probably better to remove it. The Sutton and Barto book included it in older versions but removed it in later ones.
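
To make the learning-rate point concrete, here is a minimal sketch in plain Python. It is not the lecture or homework code: it assumes a linear value function v(s) = w . x(s) with one-hot features instead of a network, and the names discounted_returns and value_update are made up for this illustration. The only difference between the two variants is the extra \gamma^t multiplier on the step size.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def value_update(w, features, returns, gamma, alpha, use_discount_factor):
    """Per-step gradient updates of a linear value function v(s) = w . x(s).

    Assumption for illustration: a linear model stands in for the value network.
    With use_discount_factor=True the step at time t is scaled by gamma**t
    (the lecture version); with False it is the plain MSE gradient step.
    """
    w = list(w)
    for t, (x, g) in enumerate(zip(features, returns)):
        v = sum(wi * xi for wi, xi in zip(w, x))
        delta = g - v  # Monte Carlo error G_t - v(S_t)
        scale = gamma ** t if use_discount_factor else 1.0
        w = [wi + alpha * scale * delta * xi for wi, xi in zip(w, x)]
    return w


# Toy episode: three distinct states (one-hot features), reward 1 at each step.
gamma, alpha = 0.5, 0.1
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
returns = discounted_returns([1.0, 1.0, 1.0], gamma)

w_plain = value_update([0.0] * 3, features, returns, gamma, alpha, False)
w_disc = value_update([0.0] * 3, features, returns, gamma, alpha, True)
```

In this toy case each step touches a different weight, so the update made at step t in w_disc is exactly \gamma^t times the one in w_plain: same direction, smaller effective learning rate for later steps. That shrinking step size for late time steps is one reason why dropping the factor is usually preferable for the value estimate.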