Miniproject 3: Reinforcement update rule

by Julien Niklas Heitmann

Hello,

Our group has implemented the "REINFORCE with Baseline" algorithm, but there is one thing I can't get my head around. I understand the math behind the update step of the policy parameters theta (in particular the factor gamma to the power of t), and I imagine that the derivation of the value-function update rule must be similar. What I struggle with is how giving an extremely small weight to the last steps of an episode (because gamma is raised to a large power) benefits the overall estimation of the value function. Eventually training will converge and we will have satisfying estimates of the V-values, even for steps close to the end of an episode, but isn't there a better way to do this? Why are we only trying (at least that is how I understand it) to get the best possible estimate of the first step's V-value?
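
To make the question concrete, here is a minimal sketch of the per-step updates as I currently understand them (Sutton & Barto notation; the placement of the gamma^t factor in the value-function update is my assumption and is exactly the part I am asking about):

% Per-step updates for one sampled episode (Sutton & Barto notation).
% The \gamma^t factor in the \mathbf{w}-update below is my assumption,
% not necessarily the exact rule from the course material.
\begin{align*}
G_t &= \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k
    && \text{return from step } t\\
\delta_t &= G_t - \hat{v}(S_t,\mathbf{w})
    && \text{baseline-corrected return}\\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\,\gamma^{t}\,\delta_t\,\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})
    && \text{value-function update}\\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\,\gamma^{t}\,\delta_t\,\nabla_{\boldsymbol{\theta}}\ln\pi(A_t\mid S_t,\boldsymbol{\theta})
    && \text{policy update}
\end{align*}

With this form, a transition near the end of a long episode is scaled by gamma^t, which is close to zero, so it contributes almost nothing to the update of w. That is what I find hard to reconcile with wanting good V-estimates for those late states as well.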

Thanks in advance,

Julien