MP2 - Sparse Rewards / K workers

by Thomas Brunet -
Number of replies: 3

Hi, 

On Tuesday you told us we should be careful when using parallel envs from gymnasium, in particular at truncation, where we should retrieve the truncated state to compute the loss. We implemented this and what we get for agent 0 seems a bit better (even if it still seems quite unstable depending on the seed).
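
For reference, here is a minimal sketch of what we do at truncation (assuming gymnasium's pre-1.0 autoresetting vector envs, which expose the last observation of a finished episode in infos["final_observation"]; the variable names are ours, not from the handout):

```python
import gymnasium as gym

# Sketch: recover the true final observation at truncation so the value target
# bootstraps from the right state instead of the reset observation.
envs = gym.vector.make("CartPole-v1", num_envs=6)

obs, _ = envs.reset(seed=0)
actions = envs.action_space.sample()  # placeholder for the policy
next_obs, rewards, terminated, truncated, infos = envs.step(actions)

# For finished episodes, next_obs already holds the reset observation,
# so we patch in the true final observation before computing targets.
bootstrap_obs = next_obs.copy()
if "final_observation" in infos:
    for i, final_o in enumerate(infos["final_observation"]):
        if final_o is not None:
            bootstrap_obs[i] = final_o

# Target sketch: bootstrap through truncations but not through terminations.
# v_target = rewards + gamma * (1 - terminated) * V(bootstrap_obs)
```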

But when we get to sparse rewards, it really doesn't seem to reach the optimal policy: it does go up to 500, but it's terribly unstable. Is that normal? Since the rewards are very sparse I would say yes, but since you said we should reach the optimal policy I'm not sure what you mean, and I also don't think it corresponds to what you showed us. Same thing when we go to 6 workers: it gets better than with only 1 worker, but it still seems quite unstable, and we don't know whether that is normal or not.

Moreover, it seems like each time the mean episodic reward goes up to 500 very fast (in about 10-15k steps), which doesn't make sense to us.

- To compute the mean episodic reward, at each new step we add the reward (so 1) to a variable, and each time there is a termination or a truncation, we append this value to a list, reset the variable to 0, and go on. Then every 1000 steps we take the mean of the list (when we have k workers we have k of these lists, so k values, and we take the average of those k values). This is what we call the mean episodic reward and the value we plot (see the first sketch after this list). Is that a correct approach?

- For the sparse rewards, we give a reward of 0 90% of the time to the learner to compute the loss, but not to the logging (see the second sketch below). Did we understand it correctly?
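
Concretely, the return tracking from the first point looks roughly like this in our code (a simplified sketch, with our own variable names):

```python
import numpy as np

num_workers = 6
running_return = np.zeros(num_workers)               # reward accumulated in the current episode
finished_returns = [[] for _ in range(num_workers)]  # completed episode returns, one list per worker

def track_returns(rewards, terminated, truncated, step):
    """Our mean-episodic-reward logging: per-worker lists, averaged every 1000 steps."""
    global running_return
    running_return += rewards
    done = np.logical_or(terminated, truncated)
    for i in np.where(done)[0]:
        finished_returns[i].append(running_return[i])
        running_return[i] = 0.0
    if step % 1000 == 0:
        per_worker_means = [np.mean(r) for r in finished_returns if r]
        if per_worker_means:
            print(step, np.mean(per_worker_means))  # the value we plot
```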
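
And for the second point, the sparsification we apply (again a sketch; the rng and the names are ours, the 90% is as we understood it from the handout):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(rewards, p_drop=0.9):
    """Zero the reward 90% of the time for the learner; logging keeps the raw reward."""
    mask = rng.random(rewards.shape) >= p_drop  # keep the reward only 10% of the time
    return rewards * mask

# learner_rewards = sparsify(rewards)  # used to compute the loss
# running_return += rewards            # logging still uses the dense reward
```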

Are we missing something? Would you have any other hint on where the problem could come from? We are really lost and we don't understand what we are doing wrong here.

Thanks in advance.

The order of the plots is: first agent 0, then agent 1, then agent 2.

In reply to Thomas Brunet

Re: MP2 - Sparse Rewards / K workers

by Skander Moalla -
Hello,

"On Tuesday you told us we should be careful when using parallel envs from gymnasium, in particular at truncation where we should retrieve the truncated state to compute the loss. We implemented this and what we get for the agent 0 seems a bit better (even if it still seems quite unstable depending on the seed)."
Great. The crucial difference you should see is in the value loss, which should now become much lower (see the success criteria).

"it does go up to 500 but it's terribly unstable ... you say we should reach the optimal policy I'm not so sure what you mean"
The optimal policy is the one that gets a return of 500. Reaching it here means that you observe a return of 500 multiple times, although it may not stay at a flat 500 for the rest of training. This is what you have to comment on regarding stability.

"To compute the mean episodic reward ... Is that a correct approach ?"
For a single run, I would recommend logging the returns as soon as episodes finish so that you see the latest evolution.
When aggregating multiple seeds, taking an aggregating over the episodes in the last 1k steps looks good.
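
For instance, something along these lines (just a sketch of the aggregation, with hypothetical names):

```python
import numpy as np

def aggregate_last_window(logged, step, window=1000):
    """logged: list of (global_step, episode_return) pairs for one seed.
    Mean return over the episodes that finished in the last `window` steps."""
    recent = [ret for s, ret in logged if step - window < s <= step]
    return np.mean(recent) if recent else float("nan")

# Across seeds: compute this per seed at the same global step, then average over the seeds.
```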

"For the sparse rewards, we give a reward of 0 90% of the time to the learner to compute the loss, but not to the logging, did we understand it well ?"
Yes.

"Are we missing something ? Would you have any other hint on where the problem could come from ?"
I don't see an issue in the returns of agent 0.
For agent 1, the blue curve is a bit weird. You can show us other metrics during the TA session.
For agent 2, your curve goes against what you should intuitively expect. If you are using the same hyperparameters as described in the project, I suspect that the way you aggregate the loss from the K workers is not correct. You should take a mean.
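
To make the aggregation explicit (a sketch with hypothetical tensor names, assuming PyTorch):

```python
import torch

# per_step_losses has shape (T, K): one loss term per rollout step and per worker.
per_step_losses = torch.randn(32, 6)  # placeholder values

loss = per_step_losses.mean()   # average over both time steps and workers
# loss = per_step_losses.sum()  # summing over workers scales the gradient with K
```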
In reply to Skander Moalla

Re: MP2 - Sparse Rewards / K workers

by Thomas Brunet -
Hello,
Thanks for your complete answer. What's weird is that we are indeed using a mean for the k workers... Another question: when we use k workers, should we count each individual step of each worker as 1 step (so 1 vectorized step is actually 6 steps), or does 1 step of the k workers count as 1 step?

Thank you,
In reply to Thomas Brunet

Re: MP2 - Sparse Rewards / K workers

by Skander Moalla -
You count the number of steps that the agent has used to learn. So all steps of each worker are counted.
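
For example (a sketch, assuming a vector env with K = 6 workers):

```python
num_workers = 6
global_step = 0

# One vectorized step collects one transition per worker:
# obs, rewards, terminated, truncated, infos = envs.step(actions)
global_step += num_workers  # one call to envs.step() counts as K agent steps
```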