Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Number of replies: 4

Hello, for the auxiliary reward I am adding the following components:

- State-dependent: I reward the agent for progressing towards the top. It ranges from 0 to 1, where 0 corresponds to the middle (the valley) or the left side and 1 to the top.

- Velocity-dependent: I reward positive velocity, also scaled between 0 and 1 (a rough sketch of this shaping is given below).
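
Roughly, the shaping I have in mind looks like this (a simplified sketch, not my exact code; the constants are just the MountainCar-v0 state bounds, goal position 0.5 and maximum speed 0.07):

```python
import numpy as np

# Simplified sketch of the auxiliary reward (illustrative scaling only).
# MountainCar-v0: position in [-1.2, 0.6], velocity in [-0.07, 0.07],
# valley bottom around -0.5, goal at 0.5.

def auxiliary_reward(position, velocity):
    # State-dependent term: 0 in the middle / on the left side, 1 at the top.
    state_bonus = np.clip((position + 0.5) / 1.0, 0.0, 1.0)
    # Velocity-dependent term: reward positive (rightward) velocity, scaled to [0, 1].
    velocity_bonus = np.clip(velocity / 0.07, 0.0, 1.0)
    return state_bonus + velocity_bonus  # maximum 2 per step
```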


When I train the agent, the loss keeps decreasing as long as the episodes get truncated without reaching the top, but once the agent starts reaching the goal more frequently, the loss starts increasing and essentially diverges.

This is the length of the episodes:

[episode length plot - not displayed]

and this is the loss:

[loss plot - not displayed]

Here they are shown for ~1500 episodes, but the behaviour is the same up to 3000 episodes.

What do you think could be the reason? What do you recommend?

In reply to Jakhongir Saydaliev

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Lucas Louis Gruaz -
It is expected that the loss increases once the agent discovers the goal, since there is an abrupt change in the episodes. You should be careful with your rewards: if the maximum reward you can get per step is more than 0, the agent may prefer to avoid finishing the episode in order to accumulate more reward (because ending the episode would result in a reward of 0 from then on). With your formulation, it seems like the maximum auxiliary reward is 2 per step, which may be an issue.
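
A toy calculation of this effect, with hypothetical numbers (the point is only that a positive net per-step reward makes loitering look attractive):

```python
# Hypothetical numbers, just to illustrate the incentive problem.
gamma = 0.99
net_step_reward = 0.5   # e.g. a large auxiliary bonus outweighing the env's -1
steps_to_goal = 20

# Discounted return if the agent finishes after `steps_to_goal` steps (then 0 forever).
finish_return = sum(net_step_reward * gamma**t for t in range(steps_to_goal))

# Discounted return if the agent never finishes and keeps collecting the bonus.
loiter_return = net_step_reward / (1 - gamma)

print(round(finish_return, 1), round(loiter_return, 1))  # ~9.1 vs 50.0 -> loitering wins
```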

Your plots don't show in the forum so I cannot comment on that.
In reply to Lucas Louis Gruaz

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Thank you for your response.

I realised that I was not handling terminal states correctly: I had to set the output of the target network to 0 for terminal states.
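
In code, the fix amounts to masking the bootstrap term with the termination flag when building the target. A generic sketch (not my exact implementation; `q_net`, `target_net` and the batch tensors are placeholders):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, terminated, gamma=0.99):
    """Generic DQN loss with terminal masking (sketch, placeholder names).

    `terminated` is 1.0 where the episode truly ended (goal reached) and 0.0
    otherwise; time-limit truncation should NOT zero the bootstrap term.
    """
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        # The target network's contribution is set to 0 at terminal states.
        targets = rewards + gamma * (1.0 - terminated) * next_q
    return F.mse_loss(q_values, targets)
```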

Now the loss indeed increases once the agent starts learning, then decreases slightly and flattens, as in the plot I have attached. It is similar for RND as well. Is that expected?

Another question, regarding the normalization of the RND rewards. During the exercise session, one of the TAs told me to normalize the states with a moving (running) mean and std, but to normalize the rewards with the mean and std of only the first n rewards (not moving statistics). However, the project description says the reward normalization should be a moving one as well. I have already implemented it the way the TA described and wrote that in the report. Is it okay if I normalize the rewards using only the initial n rewards (instead of moving statistics)?
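
To make the difference concrete, this is roughly what I mean by the two variants (a sketch, not my actual implementation):

```python
import numpy as np

class RunningMeanStd:
    """Moving (running) mean/std, updated with every new batch of values --
    the variant described in the project description."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        b_mean, b_var, b_count = x.mean(), x.var(), x.size
        delta, total = b_mean - self.mean, self.count + b_count
        new_var = (self.var * self.count + b_var * b_count
                   + delta**2 * self.count * b_count / total) / total
        self.mean, self.var, self.count = self.mean + delta * b_count / total, new_var, total

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)

# The variant I implemented for the intrinsic rewards: freeze the statistics
# of the first n rewards and reuse them for the rest of training.
def fixed_normalizer(first_n_rewards):
    mu, sigma = np.mean(first_n_rewards), np.std(first_n_rewards) + 1e-8
    return lambda r: (r - mu) / sigma
```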

p.s. The second plot is for the RND loss.
Attachment 3.3-loss.png
Attachment 3.4-loss.png
In reply to Jakhongir Saydaliev

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Lucas Louis Gruaz -
Yes, that is expected. For your plots, you should probably use an averaging window to see the trend more clearly.
For the normalization, the two approaches have different benefits (the way you did it vs. the way it is explained in the project description). You are expected to follow the project description; in your case it is okay to keep it as it is, but mention the reason in your code/report.
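
Regarding the averaging window, something as simple as a rolling mean over the logged values is enough (a sketch with numpy; variable names are placeholders):

```python
import numpy as np

def moving_average(values, window=50):
    """Rolling mean for smoothing noisy loss / episode-length curves."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=np.float64), kernel, mode="valid")

# e.g. plt.plot(moving_average(losses, window=100))
```
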
In reply to Lucas Louis Gruaz

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Regarding the reward: even if the auxiliary total can reach 2, I am adding it on top of the environment reward, which is -1 everywhere except at the top. So the sum of the auxiliary and environment rewards is at most 2 (at the top) and lower than 1 everywhere else.
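
A quick sanity check of that bound, using the shaping sketched earlier in the thread (values are illustrative):

```python
import numpy as np

# Away from the top: state term strictly below 1, velocity term at most 1,
# environment reward -1 per step -> combined reward stays below 1.
state_bonus = np.linspace(0.0, 0.999, 200)     # < 1 everywhere except at the top
velocity_bonus = np.linspace(0.0, 1.0, 200)
combined = state_bonus[:, None] + velocity_bonus[None, :] - 1.0
print(combined.max())  # 0.999 < 1
```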