DDPG

by Nicolás Hernán Reátegui Rodríguez -

Hi, I have a couple of questions on two lectures:

  • Lecture 8: 
    • Why do we need a correction weight when using prioritized replay? (See the short sketch after this list for what I mean.)
    • About the difference between on-policy and off-policy methods: from what I understood, off-policy methods update a different policy than the one used to generate the data. Furthermore, according to Lecture 8 we cannot naively use a replay buffer with on-policy methods. I would expect the DDPG algorithm to be on-policy, since it uses the policy network both for exploration and when updating the weights. Why does using a replay buffer work well in that situation?
  • Lecture 3: Does a stationary  \rho^{\pi}  mean that the policy  \pi(a|s)  has converged? In the formula for  \bar{r} , there should be  \rho^{\pi}_s  instead of  \pi^{\pi}_s , right?
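
For reference, this is the correction weight I am asking about in the prioritized replay question. It is only a rough sketch of how I understand it (assuming the proportional prioritization from Schaul et al., 2015; the function and variable names are my own, not from the lecture code):

```python
import numpy as np

def sample_with_correction(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample replay indices by priority and return importance-sampling weights."""
    # Sampling probabilities P(i) ~ p_i^alpha instead of uniform 1/N.
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)

    # Importance-sampling weights w_i = (N * P(i))^(-beta) undo the bias that
    # non-uniform sampling introduces into the expected gradient; normalizing
    # by the largest weight in the batch only ever scales updates down.
    n = len(priorities)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights  # multiply each sample's TD loss by its weight
```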

In reply to Nicolás Hernán Reátegui Rodríguez

Re: DDPG

by Ali Bakly -
[Not a TA] So in DDPG the action is chosen from the policy network plus exploration noise, which I think would make it off-policy, and hence you can use a replay buffer. I would also be interested in hearing a TA's opinion on this.
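
For concreteness, here is a minimal sketch of the DDPG update I have in mind (assuming a standard PyTorch-style setup; the actor, critic, and batch names are just placeholders, not the course code). The point is that the critic target at s' uses the current target actor's action, not whatever action the behaviour policy actually took next, which is what makes it a Q-learning-style off-policy update:

```python
import torch

def select_action(actor, state, noise_std=0.1):
    # Behaviour policy: deterministic actor output plus exploration noise,
    # so the data in the buffer is not generated by the actor alone.
    with torch.no_grad():
        action = actor(state)
    return action + noise_std * torch.randn_like(action)

def ddpg_update(actor, critic, target_actor, target_critic, batch, gamma=0.99):
    # Transitions may come from an older actor (or the noisy behaviour policy).
    s, a, r, s_next, done = batch  # done: 1.0 if terminal else 0.0

    # Critic target uses the current target actor's action at s_next,
    # not the action stored in the buffer for the next step.
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)

    # Actor is updated to maximize Q(s, actor(s)) on the same replayed states.
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```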