CS-456: MP2 - A2C Loss plots

Good evening,

For the past week, we have been trying to understand why our loss plots (actor and critic) exhibit "weird" explosive patterns, and we have started to believe that this is actually expected. (N.B. We have looked thoroughly at the other questions on the forum, so we have already made sure that only the correct gradients are used.)

Our rationale is the following (please let us know if we are wrong):

Until the agent is sufficiently trained, the critic network is bad at predicting the value of a near-failure state.

Therefore, at the termination of early episodes, the gap between the target and the predicted V-value is small, so the critic loss is rather small. As training progresses, the agent performs longer runs, and when it fails, the critic loss suddenly explodes because the critic was not able to predict the value of the terminal state. This goes on until the agent is sufficiently trained and can reach 500 steps.
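To make the argument concrete, here is a minimal sketch of the effect we have in mind. It is not our actual training code: the helper names, the reward of 1 per step, gamma = 0.99, and the constant critic outputs (0.5 early, 90 later) are all assumptions chosen for illustration. It compares the critic loss on a 5-step segment that ends in failure early in training, the same segment later in training, and a later segment that does not end in failure.

```python
import numpy as np

def n_step_targets(rewards, next_value, terminated, gamma=0.99):
    """Bootstrapped n-step returns for one rollout segment (hypothetical helper).

    rewards:    rewards collected during the segment
    next_value: critic estimate for the state after the segment
    terminated: True if the segment ended in a failure state (no bootstrap)
    """
    ret = 0.0 if terminated else next_value
    targets = np.empty(len(rewards))
    for i in reversed(range(len(rewards))):
        ret = rewards[i] + gamma * ret
        targets[i] = ret
    return targets

def critic_loss(values, targets):
    """Mean squared error between critic predictions and targets."""
    return float(np.mean((values - targets) ** 2))

rewards = np.ones(5)  # assumption: reward of 1 per step, 5 steps per update

# Early training: critic outputs are still small, so failure costs little.
early_values = np.full(5, 0.5)
early_loss = critic_loss(early_values, n_step_targets(rewards, 0.0, True))

# Later: the critic has learned large values from long episodes,
# but the target at failure drops to the few remaining rewards.
late_values = np.full(5, 90.0)
late_fail_loss = critic_loss(late_values, n_step_targets(rewards, 0.0, True))

# Later, no failure: the bootstrapped target stays close to the prediction.
late_ok_loss = critic_loss(late_values, n_step_targets(rewards, 90.0, False))

print(early_loss, late_fail_loss, late_ok_loss)
```

Under these assumptions, the loss on the failing segment late in training is several orders of magnitude larger than both the early failure and the late non-failing segment, which matches the isolated bursts we see.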

When we look at our critic loss plot, we see the loss "bursts" grow in size and the spacing between them increase as the episodes become longer. This now makes sense to us, but we originally expected an exponentially decaying critic loss.

Thank you very much for your time!

5 steps per update - 1 worker

Re: MP2 - A2C Loss plots

by Skander Moalla - Monday, 13 May 2024, 10:42

Hello, it's all correct. Your critic is fitting a moving target as your policy gets better and better. It is only expected to converge when the policy has converged. So it's normal to see spikes while the agent is still training.

For the project, for Agent 1 without stochasticity, we expect your agent to converge quickly, and thus your critic to converge too (value loss less than 1e-4). You will likely have to adapt the scale of your y-axis to observe it.
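One possible way to adapt the y-axis is a logarithmic scale. The snippet below is only a sketch, assuming you plot with matplotlib; the `value_loss` array is a synthetic placeholder curve, not real results, and stands in for whatever losses you logged.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic placeholder standing in for logged critic losses (not real data).
steps = np.arange(1, 2001)
value_loss = 50.0 * np.exp(-steps / 150.0) + 1e-5

fig, ax = plt.subplots()
ax.plot(steps, value_loss)
ax.set_yscale("log")  # a linear axis would hide everything below the early spikes
ax.axhline(1e-4, linestyle="--", color="gray")  # the 1e-4 level mentioned above
ax.set_xlabel("update")
ax.set_ylabel("critic loss")
# Call fig.savefig(...) or plt.show() to view the figure.
```

On a linear axis the early large losses dominate the plot, so values around 1e-4 are indistinguishable from zero; the log scale keeps both ranges visible.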