DDPG

by Nicolás Hernán Reátegui Rodríguez
Number of replies: 1

Hi, I have a couple of questions about two lectures:

  • Lecture 8: 
    • Why do we need a correction weight when using prioritized replay? (See the sketch after this list for what I mean by the correction weight.)
    • About the difference between on-policy and off-policy methods: from what I understood, off-policy methods update a different policy than the one used to generate the data. Furthermore, according to Lecture 8 we cannot naively use a replay buffer with on-policy methods. I would expect the DDPG algorithm to be on-policy, since it uses the policy network both during exploration and when updating the weights. Why does using a replay buffer work well in that situation?
  • Lecture 3: Does a stationary \rho^{\pi} mean that the policy \pi(a|s) has converged? Also, in the formula for \bar{r} there should be \rho^{\pi}_s instead of \pi^{\pi}_s, right?
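
To make the first question concrete, this is roughly the correction I am asking about: a small sketch (variable names are my own) of the importance-sampling weights from prioritized experience replay (Schaul et al., 2016), as far as I understood them.

    import numpy as np

    def is_weights(priorities, alpha=0.6, beta=0.4):
        # sampling distribution of prioritized replay: P(i) = p_i^alpha / sum_k p_k^alpha
        p = np.asarray(priorities, dtype=np.float64) ** alpha
        probs = p / p.sum()
        # correction weight: w_i = (N * P(i))^(-beta); it undoes the bias
        # introduced by sampling transitions non-uniformly from the buffer
        w = (len(p) * probs) ** (-beta)
        return w / w.max()  # normalize so the weights only scale updates down

    # each sampled transition's TD loss is multiplied by w[i] before the gradient step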

In reply to Nicolás Hernán Reátegui Rodríguez

Re: DDPG

by Ali Bakly
[Not a TA] In DDPG the action is chosen from the policy network plus exploration noise, which I think makes it off-policy, and hence you can use a replay buffer. I would also be interested in hearing a TA's opinion on this.
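
To illustrate, here is a minimal sketch (the names and hyperparameters are made up, not from the lecture) of how DDPG chooses actions and reuses stored transitions:

    import random
    from collections import deque
    import numpy as np

    replay_buffer = deque(maxlen=100_000)  # stores (s, a, r, s_next, done) tuples

    def select_action(mu, state, noise_std=0.1):
        # behaviour policy = deterministic actor output + exploration noise,
        # so the data-generating policy is never exactly the actor mu itself
        a = mu(state)
        a = a + np.random.normal(0.0, noise_std, size=np.shape(a))
        return np.clip(a, -1.0, 1.0)

    def sample_batch(batch_size=64):
        # sampled transitions may come from older versions of the actor (plus noise),
        # i.e. from a different policy than the one being updated -> off-policy
        return random.sample(list(replay_buffer), batch_size)

Because the buffer holds transitions generated by earlier copies of the actor and by the exploration noise, the data-generating policy differs from the policy being updated, which is exactly the off-policy setting in which a replay buffer is valid.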