MP2-A2C n-step bootstrapping

by Skander Moalla
Hello,

What you are describing is similar to generalized advantage estimation, where you combine several n-step returns (with exponential weights) into a single estimate of the return of one step. This is not what is asked in the project.

The project asks you to compute a single return per step: at time t, if you sampled n steps, you use all n steps to estimate the return of state s_t, then the n-1 steps starting at t+1 to estimate the return of state s_{t+1}, ..., down to state s_{t+n-1}, whose return is estimated only from the reward at t+n. (All of them bootstrap on the value of s_{t+n} to finish the return estimate, obviously.)
There is no average to be taken to estimate the return of a given state.
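
As a minimal sketch of this backward recursion (the function name, argument layout, and single-environment assumption below are mine, not the project's API):

```python
import torch

def nstep_returns(rewards, last_value, dones, gamma=0.99):
    """One bootstrapped return per sampled step.

    rewards:    shape (n,), the rewards r_{t+1}, ..., r_{t+n}
    last_value: scalar, V(s_{t+n}) used to bootstrap every return
    dones:      shape (n,), 1.0 where the episode terminated
    Returns G_t, G_{t+1}, ..., G_{t+n-1}, shape (n,).
    """
    n = rewards.shape[0]
    returns = torch.empty(n)
    running = last_value  # bootstrap value V(s_{t+n})
    # Walk backwards: G_k = r_{k+1} + gamma * G_{k+1},
    # cutting the recursion at episode boundaries.
    for k in reversed(range(n)):
        running = rewards[k] + gamma * (1.0 - dones[k]) * running
        returns[k] = running
    return returns
```

Each state thus gets exactly one return, reusing the suffix of the same rollout; the last one reduces to r_{t+n} + gamma * V(s_{t+n}).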

Yes, you should account for StopGradient, as the method is a semi-gradient method: the bootstrapped value in the target should not receive gradients.

Prefer using torch.no_grad, as it does not build the computation graph for autodiff in the first place and is thus faster than .detach, which only removes the tensor from the graph after the graph has already been built.
Both are "reliable".
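
To illustrate the difference on a target computation (value_net and the shapes here are hypothetical, just for the sketch):

```python
import torch

# Hypothetical critic and batch of next states, for illustration only.
value_net = torch.nn.Linear(4, 1)
next_states = torch.randn(8, 4)

# Preferred: no computation graph is built for the target at all.
with torch.no_grad():
    bootstrap = value_net(next_states).squeeze(-1)

# Also works: the graph for this forward pass is built first,
# then .detach() returns a tensor cut off from it.
bootstrap_alt = value_net(next_states).squeeze(-1).detach()
```

Either way the target carries no gradient into the critic, which is exactly the StopGradient mentioned above.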