DDPG

by Nicolás Hernán Reátegui Rodríguez -

Hi, I have a couple of questions on two lectures:

  • Lecture 8: 
    • Why do we need a correction weight when using prioritized replay? (See the short sketch after this list for what I mean.)
    • About the difference between on-policy and off-policy methods: from what I understood, off-policy methods update a different policy than the one used to generate the data. Furthermore, according to Lecture 8 we cannot naively use a replay buffer with on-policy methods. I would expect the DDPG algorithm to be on-policy, since it uses the policy network both for exploration and when updating the weights. Why does using a replay buffer work well in that situation?
  • Lecture 3: Does a stationary  \rho^{\pi}  mean that the policy  \pi(a|s)  has converged? In the formula for  \bar{r} , there should be  \rho^{\pi}_s  instead of  \pi^{\pi}_s , right?
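
For reference, this is the correction weight I am asking about in the prioritized replay question. It is only a rough sketch of how I understand it (assuming the proportional prioritization from Schaul et al., 2015; the function and variable names are my own, not from the lecture code):

```python
import numpy as np

def sample_with_correction(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample replay indices by priority and return importance-sampling weights."""
    # Sampling probabilities P(i) ~ p_i^alpha instead of uniform 1/N.
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)

    # Importance-sampling weights w_i = (N * P(i))^(-beta) undo the bias that
    # non-uniform sampling introduces into the expected gradient; normalizing
    # by the largest weight in the batch only ever scales updates down.
    n = len(priorities)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights  # multiply each sample's TD loss by its weight
```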

In reply to Nicolás Hernán Reátegui Rodríguez

Re: DDPG

by Ali Bakly -
[Not a TA] So in DDPG the action is chosen from the policy network plus exploration noise, which I think would make it off-policy, and hence you can use a replay buffer. I would also be interested in hearing a TA's opinion on this.
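
For concreteness, here is a minimal sketch of the DDPG update I have in mind (assuming a standard PyTorch-style setup; the actor, critic, and batch names are just placeholders, not the course code). The point is that the critic target at s' uses the current target actor's action, not whatever action the behaviour policy actually took next, which is what makes it a Q-learning-style off-policy update:

```python
import torch

def select_action(actor, state, noise_std=0.1):
    # Behaviour policy: deterministic actor output plus exploration noise,
    # so the data in the buffer is not generated by the actor alone.
    with torch.no_grad():
        action = actor(state)
    return action + noise_std * torch.randn_like(action)

def ddpg_update(actor, critic, target_actor, target_critic, batch, gamma=0.99):
    # Transitions may come from an older actor (or the noisy behaviour policy).
    s, a, r, s_next, done = batch  # done: 1.0 if terminal else 0.0

    # Critic target uses the current target actor's action at s_next,
    # not the action stored in the buffer for the next step.
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)

    # Actor is updated to maximize Q(s, actor(s)) on the same replayed states.
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```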