Clarifications on the A2C algorithm (MP2)

by Abigaïl Rebecca Lise Ingster

Hello,

I am very confused about whether we need to update the value and policy networks after each environment step (and thus update them multiple times per episode, after EACH interaction with the environment), or only once a full episode has run.

I initially understood that the data collection step (policy rollout) and the learning step (network updates) needed to be conducted sequentially. I think this is what is implemented in the "Clean RL" code provided in Mini-Project 2, i.e. there is no interaction with the environment while the networks are being updated. However, a sentence in the MP2 instructions (K=1 worker case) is confusing me: "Update the networks after each environment step". This seems to suggest that we should carry out the policy rollout while updating our networks. So, should we not wait for the end of the episode? And if we perform such "online" updates (i.e. we don't separate the sampling and learning phases), how would we even implement n-step bootstrapping? Would we need to look ahead n steps at each global step?

I hope my questions are clear enough, thank you very much for your help!

In reply to Abigaïl Rebecca Lise Ingster

Re: Clarifications on the A2C algorithm (MP2)

by Skander Moalla

I initially understood that the data collection step (policy rollout) and the learning step (network updates) needed to be conducted sequentially.

This is correct.

So, should we not wait for the end of the episode?

No, you shouldn't wait for the end of an episode. After every n steps in the environment (per environment), you do one learning step.
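
To make this concrete: with one worker and n-step bootstrapping, each iteration first collects n transitions with the current policy, and then does one gradient update on them using the n-step target G_t = r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n}), cut off early if the episode terminates. Below is a rough sketch of that loop; the environment, network sizes, and hyperparameters are illustrative placeholders, not the MP2 setup.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Illustrative setup: environment, network sizes, and hyperparameters
# are placeholders, not the MP2 configuration.
env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value.parameters(), lr=1e-3)

n, gamma, num_iterations = 5, 0.99, 2000
obs, _ = env.reset()

for _ in range(num_iterations):
    # Phase 1: roll out n environment steps with the current policy
    # (no gradient update happens during this phase).
    obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
    for _ in range(n):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        obs_buf.append(obs_t)
        act_buf.append(action)
        rew_buf.append(float(reward))
        # For simplicity, truncation is treated like termination here; a
        # more careful implementation would bootstrap on truncation.
        done_buf.append(terminated or truncated)
        if terminated or truncated:
            obs, _ = env.reset()

    # Phase 2: one learning step on these n transitions. Bootstrap the
    # n-step return from the value of the state reached after the last
    # step, unless that step ended the episode.
    with torch.no_grad():
        R = 0.0 if done_buf[-1] else value(torch.as_tensor(obs, dtype=torch.float32)).item()
    returns = []
    for r, d in zip(reversed(rew_buf), reversed(done_buf)):
        R = r + gamma * R * (1.0 - float(d))  # reset at episode boundaries
        returns.append(R)
    returns.reverse()

    obs_batch = torch.stack(obs_buf)       # (n, obs_dim)
    ret_t = torch.tensor(returns)          # (n,)
    values = value(obs_batch).squeeze(-1)  # (n,)
    adv = ret_t - values.detach()          # advantage estimates

    log_probs = torch.distributions.Categorical(
        logits=policy(obs_batch)
    ).log_prob(torch.stack(act_buf))
    policy_loss = -(log_probs * adv).mean()
    value_loss = ((ret_t - values) ** 2).mean()

    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```

Note that the done flags cut the bootstrapped return at episode boundaries, so an episode ending in the middle of a rollout is not a problem: the rollout simply continues in the freshly reset episode, and the next learning step uses whatever n transitions were collected.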