CS-456: The log-likelihood trick

(Sorry, I did not know where to post this)

I am struggling to see how we are allowed to go from to . If R(H) is independent of theta, I can see where this comes from, since (log f(x))' = (1/f(x)) * f'(x), but by the core-rule, grad(R)*log(p) =/= grad(R).

Is R independent of theta, or do we simply have another way of modeling the future rewards?

(I assume they are applied later with a discount.)

Re: The log-likelihood trick

by Nicolas El Maalouly, Sunday 9 June 2019, 17:12

Since this is a policy gradient method, we are only modeling the policy (so theta denotes the policy parameters only) and not the return or value function. R(H) is the reward collected along trajectory H, which is given by the environment and does not depend on the policy parameters, so it is treated as a constant with respect to theta for a given trajectory H.
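
To make this concrete, here is a minimal numerical sketch, not taken from the course material. It assumes one-step trajectories (each H is a single action), a softmax policy over three actions, and made-up rewards; all function names are hypothetical. Because R(H) is fixed for each H, only p(H) depends on theta, so grad E[R(H)] = sum_H R(H) grad p(H) = sum_H p(H) R(H) grad log p(H). The code compares a finite-difference gradient of the expected return with the log-likelihood form.

```python
import math

def softmax(theta):
    # Policy p(H | theta): assumed softmax over one-step trajectories (actions).
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    # E[R(H)] = sum_H p(H | theta) * R(H); R(H) does not depend on theta.
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def finite_difference_grad(theta, rewards, eps=1e-6):
    # Central finite differences on E[R(H)], used as the reference gradient.
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

def score_function_grad(theta, rewards):
    # Log-likelihood form: sum_H p(H) * R(H) * grad log p(H).
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for h, (p_h, r_h) in enumerate(zip(p, rewards)):
        for i in range(len(theta)):
            # For a softmax, d/dtheta_i log p(h) = 1[i == h] - p_i.
            dlogp = (1.0 if i == h else 0.0) - p[i]
            grad[i] += p_h * r_h * dlogp
    return grad

theta = [0.2, -0.5, 1.0]    # hypothetical policy parameters
rewards = [1.0, 0.0, 3.0]   # hypothetical R(H), fixed by the environment
fd = finite_difference_grad(theta, rewards)
sf = score_function_grad(theta, rewards)
print(fd)
print(sf)
```

The two printed gradients should agree up to finite-difference error. In practice the sum over H is replaced by an average over sampled trajectories, which is what makes the log-likelihood form useful.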