Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Number of replies: 4

Hello, for the auxiliary reward I am adding the following components:

- State-dependent: I reward the agent for progressing towards the top. It ranges from 0 to 1, where 0 corresponds to the middle (the valley) or the left side and 1 to the top.

- Velocity-dependent: I reward positive velocity, also scaled between 0 and 1 (a rough sketch of this shaping is given below).
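
Roughly, the shaping I have in mind looks like this (a simplified sketch, not my exact code; the constants are just the MountainCar-v0 state bounds, goal position 0.5 and maximum speed 0.07):

```python
import numpy as np

# Simplified sketch of the auxiliary reward (illustrative scaling only).
# MountainCar-v0: position in [-1.2, 0.6], velocity in [-0.07, 0.07],
# valley bottom around -0.5, goal at 0.5.

def auxiliary_reward(position, velocity):
    # State-dependent term: 0 in the middle / on the left side, 1 at the top.
    state_bonus = np.clip((position + 0.5) / 1.0, 0.0, 1.0)
    # Velocity-dependent term: reward positive (rightward) velocity, scaled to [0, 1].
    velocity_bonus = np.clip(velocity / 0.07, 0.0, 1.0)
    return state_bonus + velocity_bonus  # maximum 2 per step
```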


When I train the agent, the loss keeps decreasing as long as the episodes get truncated without reaching the top, but once the agent starts reaching the goal more frequently, the loss starts increasing and essentially diverges.

This is the length of the episodes:

[episode length plot - not displayed]

and this is the loss:

[loss plot - not displayed]

Here they are shown for ~1500 episodes, but the behaviour is the same up to 3000 episodes.

What do you think could be the reason? What do you recommend?

In reply to Jakhongir Saydaliev

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Lucas Louis Gruaz -
It is expected that the loss increases once the agent discovers the goal, since there is an abrupt change in the episodes. You should be careful with your rewards: if the maximum reward you can get per step is more than 0, the agent may prefer to avoid finishing the episode in order to accumulate more reward (because ending the episode would result in a reward of 0 from then on). With your formulation, it seems like the maximum auxiliary reward is 2 per step, which may be an issue.
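
A toy calculation of this effect, with hypothetical numbers (the point is only that a positive net per-step reward makes loitering look attractive):

```python
# Hypothetical numbers, just to illustrate the incentive problem.
gamma = 0.99
net_step_reward = 0.5   # e.g. a large auxiliary bonus outweighing the env's -1
steps_to_goal = 20

# Discounted return if the agent finishes after `steps_to_goal` steps (then 0 forever).
finish_return = sum(net_step_reward * gamma**t for t in range(steps_to_goal))

# Discounted return if the agent never finishes and keeps collecting the bonus.
loiter_return = net_step_reward / (1 - gamma)

print(round(finish_return, 1), round(loiter_return, 1))  # ~9.1 vs 50.0 -> loitering wins
```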

Your plots don't show in the forum so I cannot comment on that.
In reply to Lucas Louis Gruaz

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Thank you for your response.

I realised that I was not handling terminal states correctly: I had to set the output of the target network to 0 for terminal states.
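
In code, the fix amounts to masking the bootstrap term with the termination flag when building the target. A generic sketch (not my exact implementation; `q_net`, `target_net` and the batch tensors are placeholders):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, terminated, gamma=0.99):
    """Generic DQN loss with terminal masking (sketch, placeholder names).

    `terminated` is 1.0 where the episode truly ended (goal reached) and 0.0
    otherwise; time-limit truncation should NOT zero the bootstrap term.
    """
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        # The target network's contribution is set to 0 at terminal states.
        targets = rewards + gamma * (1.0 - terminated) * next_q
    return F.mse_loss(q_values, targets)
```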

Now the loss indeed increases once the agent starts learning, then decreases slightly and flattens, as in the plot I have attached. It is similar for RND as well. Is that expected?

Another question, regarding the normalization of the RND rewards. During the exercise session, one of the TAs told me to normalize the states with a moving (running) mean and std, but to normalize the rewards with the mean and std of only the first n rewards (not moving statistics). However, the project description says the reward normalization should be a moving one as well. I have already implemented it the way the TA described and wrote that in the report. Is it okay if I normalize the rewards using only the initial n rewards (instead of moving statistics)?
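
To make the difference concrete, this is roughly what I mean by the two variants (a sketch, not my actual implementation):

```python
import numpy as np

class RunningMeanStd:
    """Moving (running) mean/std, updated with every new batch of values --
    the variant described in the project description."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        b_mean, b_var, b_count = x.mean(), x.var(), x.size
        delta, total = b_mean - self.mean, self.count + b_count
        new_var = (self.var * self.count + b_var * b_count
                   + delta**2 * self.count * b_count / total) / total
        self.mean, self.var, self.count = self.mean + delta * b_count / total, new_var, total

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)

# The variant I implemented for the intrinsic rewards: freeze the statistics
# of the first n rewards and reuse them for the rest of training.
def fixed_normalizer(first_n_rewards):
    mu, sigma = np.mean(first_n_rewards), np.std(first_n_rewards) + 1e-8
    return lambda r: (r - mu) / sigma
```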

p.s. The second plot is for the RND loss.
Attachment 3.3-loss.png
Attachment 3.4-loss.png
In reply to Jakhongir Saydaliev

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Lucas Louis Gruaz -
Yes, that is expected. For your plots, you should probably use an averaging window to see the trend more clearly.
For the normalization, the two approaches have different benefits (the way you did it vs. the way it is explained in the project description). You are expected to follow the project description; in your case it is okay to keep it as it is, but mention the reason in your code/report.
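
Regarding the averaging window, something as simple as a rolling mean over the logged values is enough (a sketch with numpy; variable names are placeholders):

```python
import numpy as np

def moving_average(values, window=50):
    """Rolling mean for smoothing noisy loss / episode-length curves."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=np.float64), kernel, mode="valid")

# e.g. plt.plot(moving_average(losses, window=100))
```
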
In reply to Lucas Louis Gruaz

Re: Mountain Car - DQN auxiliary reward loss behaviour

by Jakhongir Saydaliev -
Regarding the reward: even if the auxiliary total can reach 2, I am adding it on top of the environment reward, which is -1 everywhere except at the top. So the sum of the auxiliary and environment rewards is at most 2 (at the top) and lower than 1 everywhere else.
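
A quick sanity check of that bound, using the shaping sketched earlier in the thread (values are illustrative):

```python
import numpy as np

# Away from the top: state term strictly below 1, velocity term at most 1,
# environment reward -1 per step -> combined reward stays below 1.
state_bonus = np.linspace(0.0, 0.999, 200)     # < 1 everywhere except at the top
velocity_bonus = np.linspace(0.0, 1.0, 200)
combined = state_bonus[:, None] + velocity_bonus[None, :] - 1.0
print(combined.max())  # 0.999 < 1
```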