Model-free RL, policy gradient

by Maria Yuffa Meshcheryakova

Dear TAs,

I have a question about the steps of the policy gradient derivation in Deep RL. On slide 25 of week 8 (see screenshot below), why do we take the samples and probabilities from the policy with weights theta prime rather than theta? And what is the difference between J(theta) on slide 24 and on slide 25?

[Screenshot: policy gradient, slide 25, week 8, model-free RL]

Thank you in advance!

Best wishes,

Maria

In reply to Maria Yuffa Meshcheryakova

Re: Model-free RL, policy gradient

by Sophia Becker
Hi,

Remember that V_theta(s_0) depends only on s_0: to compute this V-value, we have already taken the expectation over the state-action sequence sampled by the policy with weights theta, conditioned on the initial state s_0. That means that when we wrap V_theta(s_0) in an expectation over a state-action sequence sampled by theta' (going from Eq. (4) to Eq. (5)), the value does not change, because V_theta(s_0) is a constant with respect to that expectation.
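As a small sketch of that step (assuming, as the slides seem to do, that J(theta) is the value of the fixed initial state, and using tau and p_theta' as our own notation for the theta'-sampled state-action sequence and its distribution):

$$
J(\theta) \;=\; V_\theta(s_0) \;=\; \mathbb{E}_{\tau \sim p_{\theta'}}\big[\, V_\theta(s_0) \,\big],
$$

where the last equality holds because V_theta(s_0) is a constant with respect to tau, so taking the outer expectation over theta'-sampled sequences (Eq. (4) to Eq. (5)) leaves the value unchanged.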

Hope it's more clear now!

Best,
Your TAs