Good evening,
For the past week, we have been trying to understand why our actor and critic loss plots exhibit "weird" explosive patterns, and we have started to believe this behaviour is actually expected. (N.B. We have looked thoroughly at the other questions on the forum and have already made sure that only the correct gradients are used.)
Our rationale is the following (please let us know if we are wrong):
Until the agent is sufficiently trained, the critic network is bad at predicting the value of a near-failure state.
Therefore, at the termination of early episodes, the gap between the target and the predicted V-value is small, so the critic loss is rather small. As training progresses, the agent performs longer runs, and when it finally fails, the critic loss suddenly explodes because the critic was not able to predict the value of the terminal state. This goes on until the agent is sufficiently trained and can reach 500 steps.
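To illustrate the reasoning above, here is a minimal sketch (not our actual training code). It assumes a one-step TD target with a squared-error critic loss, a reward of +1 per step, and a discount factor of 0.99; the function names and the numeric values are hypothetical and chosen only to show the order of magnitude of the effect.

```python
def td_target(reward, next_value, done, gamma=0.99):
    """One-step TD target; the bootstrap term is dropped on terminal transitions."""
    return reward + gamma * next_value * (1.0 - float(done))


def critic_loss(value, reward, next_value, done, gamma=0.99):
    """Squared TD error for a single transition."""
    return (td_target(reward, next_value, done, gamma) - value) ** 2


gamma = 0.99  # assumed discount factor

# Early in training (assumed): the critic still outputs small values,
# so the error on the terminal transition is small.
early = critic_loss(value=2.0, reward=1.0, next_value=2.0, done=True, gamma=gamma)

# Later (assumed): the critic has learned that states are worth roughly the
# discounted return of a 200-step run, and wrongly predicts the same value
# for the last state before failure.
late_value = (1 - gamma**200) / (1 - gamma)
late = critic_loss(value=late_value, reward=1.0, next_value=late_value, done=True, gamma=gamma)

print(f"early terminal loss: {early:.1f}")
print(f"late terminal loss:  {late:.1f}")
```

Under these assumptions, the loss on the terminal transition grows roughly with the square of the value the critic has learned to predict, which would be consistent with bursts that get larger as the episodes get longer.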
When we look at our critic loss plot, we see that the loss "bursts" grow in size and that the spacing between them increases as the episodes become longer. This now makes sense to us, but we originally expected an exponentially decaying critic loss.
Thank you very much for your time!