The log-likelihood trick

by Jørn Bøni Hofstad

(Sorry, I did not know where to post this)

I kind of struggle to see how we are allowed to go from $\nabla_\theta\,\mathbb{E}_{H \sim p_\theta}[R(H)]$ to $\mathbb{E}_{H \sim p_\theta}[R(H)\,\nabla_\theta \log p_\theta(H)]$. If $R(H)$ is independent of $\theta$, I can see where this is coming from, since $(\log f(x))' = f'(x)/f(x)$, but by the product rule, $\nabla_\theta\bigl(R(H)\,\log p_\theta(H)\bigr) \neq R(H)\,\nabla_\theta \log p_\theta(H)$ if $R(H)$ also depends on $\theta$.


Is $R$ independent of $\theta$, or do we simply have another way of modeling the future rewards?

(I assume a discount factor is applied to them later.)
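
Concretely, the step I mean is the log-derivative identity (writing $p_\theta(H)$ for the probability of trajectory $H$ under the policy):

\[
\nabla_\theta p_\theta(H) \;=\; p_\theta(H)\,\nabla_\theta \log p_\theta(H),
\]

which is where the $(\log f)' = f'/f$ rule comes in, followed by multiplying both sides by $R(H)$ — which only seems fine if $R(H)$ does not depend on $\theta$.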

In reply to Jørn Bøni Hofstad

Re: The log-likelihood trick

by Nicolas El Maalouly
Since this is a policy gradient method, we only model the policy (so $\theta$ contains the policy parameters only), not the return or value function. $R(H)$ is the reward collected along trajectory $H$; it is given by the environment and does not depend on the policy parameters, so for a given trajectory $H$ it is treated as a constant with respect to $\theta$.
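
To write the step out in full, here is a sketch of the derivation under that assumption (discrete trajectories for simplicity; with continuous trajectories the sum becomes an integral):

\[
\begin{aligned}
\nabla_\theta\,\mathbb{E}_{H \sim p_\theta}\bigl[R(H)\bigr]
&= \nabla_\theta \sum_H p_\theta(H)\,R(H)
= \sum_H R(H)\,\nabla_\theta p_\theta(H) \\
&= \sum_H p_\theta(H)\,R(H)\,\nabla_\theta \log p_\theta(H)
= \mathbb{E}_{H \sim p_\theta}\bigl[R(H)\,\nabla_\theta \log p_\theta(H)\bigr].
\end{aligned}
\]

The only ingredients are that $R(H)$ passes through the gradient because it does not depend on $\theta$, and the identity $\nabla_\theta p_\theta(H) = p_\theta(H)\,\nabla_\theta \log p_\theta(H)$. No model of the return is introduced anywhere; discounting only changes how $R(H)$ is computed from the rewards along $H$, not its dependence on $\theta$.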