My teammate and I are quite confused about the n-step bootstrapping algorithm we are required to implement (we read a previous post on this topic on this forum, but it did not help us). What we understood is:
1. From any step t (far enough from termination), compute the n-step return from t, the (n-1)-step return from t, ..., down to the 1-step return from t.
2. Average those n targets (call the average R_av) and compute the advantage at time t as A_t = R_av - V(s_t).
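To make sure we are describing the same computation, here is a minimal pure-Python sketch of the two steps as we understand them (the helper names `k_step_return` and `averaged_target` are ours, and gradients are ignored here; we use the convention that rewards[i] is the reward received on the transition from step i to i+1):

```python
def k_step_return(rewards, values, t, k, gamma):
    """k-step return from time t:
    G_t^(k) = sum_{i=0}^{k-1} gamma^i * r_{t+1+i} + gamma^k * V(s_{t+k})."""
    g = sum(gamma**i * rewards[t + i] for i in range(k))
    return g + gamma**k * values[t + k]

def averaged_target(rewards, values, t, n, gamma):
    """Average of the 1-step through n-step returns, all computed from time t."""
    return sum(k_step_return(rewards, values, t, k, gamma)
               for k in range(1, n + 1)) / n

# Tiny example: constant rewards of 1, constant value estimates of 0.5.
rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5, 0.5]
r_av = averaged_target(rewards, values, t=0, n=2, gamma=0.9)
advantage = r_av - values[0]  # A_t = R_av - V(s_t)
```

With these numbers, the 1-step return is 1 + 0.9 * 0.5 = 1.45 and the 2-step return is 1 + 0.9 + 0.81 * 0.5 = 2.305, so R_av = 1.8775 and A_0 = 1.3775. Is that the intended computation?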
Is that correct? Moreover, we believe we need to apply the stop-gradient operation when computing the targets, i.e. only in step 1, not in step 2 (because we need the gradient of V(s_t), as opposed to the gradients of the value estimates at later steps used in step 1). Are we correct?
Your advice on whether to use .detach() or torch.no_grad() is also welcome! .detach() seems more flexible, but we are not sure it is as reliable as torch.no_grad().
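For context, here is a small sketch of how we currently understand the difference: .detach() cuts a single tensor out of the autograd graph while the rest of the computation still tracks gradients, whereas torch.no_grad() disables gradient tracking for everything inside the block (the linear layer below is just a stand-in for a value network):

```python
import torch

v = torch.nn.Linear(2, 1)   # stand-in for the value network V
s = torch.randn(1, 2)       # stand-in for a state s_t

# .detach(): only the bootstrapped value is cut from the graph;
# the gradient still flows through the second v(s) in the loss.
target = 1.0 + 0.9 * v(s).detach()
loss = (target - v(s)).pow(2).mean()
loss.backward()
assert v.weight.grad is not None  # V(s_t) received a gradient

# torch.no_grad(): nothing inside the block tracks gradients at all.
with torch.no_grad():
    target2 = 1.0 + 0.9 * v(s)
assert not target2.requires_grad
```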
Many thanks for your help!