The log-likelihood trick

Re: The log-likelihood trick

by Nicolas El Maalouly
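Since this is a policy gradient method, we model only the policy, so theta denotes the policy parameters only; we do not model the return or a value function. R(H) is the reward collected along trajectory H. It is given by the environment and does not depend on the policy parameters, so for a given trajectory H it is a constant with respect to theta. The policy parameters only affect how likely each trajectory is, which is why the gradient acts on the probability of H (and, after the log-likelihood trick, on its log) and not on R(H).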
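
If it helps, here is a small numerical sanity check of that argument. It is only a sketch under simplifying assumptions that are mine, not from the lecture: each trajectory H is a single action, the policy is a softmax over theta, and the rewards are made-up constants. All function names are hypothetical. It treats R(H) as a constant, differentiates only log P(H; theta), and compares the result with a finite-difference gradient of the expected return.

```python
import math

def softmax(theta):
    # Numerically stable softmax: P(H; theta) for each single-action trajectory H.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    # J(theta) = sum_H P(H; theta) * R(H); only P depends on theta.
    probs = softmax(theta)
    return sum(p * r for p, r in zip(probs, rewards))

def score_function_gradient(theta, rewards):
    # Log-likelihood trick: grad J = sum_H P(H; theta) * R(H) * grad log P(H; theta).
    # R(H) is used as a plain constant here; it is never differentiated.
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for h, (p_h, r_h) in enumerate(zip(probs, rewards)):
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[h] is 1[h == k] - probs[k].
            dlogp = (1.0 if h == k else 0.0) - probs[k]
            grad[k] += p_h * r_h * dlogp
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    # Central differences on J(theta), as an independent reference.
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

# Assumed toy values: three possible trajectories with fixed rewards.
theta = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 3.0]

analytic = score_function_gradient(theta, rewards)
numeric = finite_difference_gradient(theta, rewards)
print("score-function gradient:", analytic)
print("finite-difference gradient:", numeric)
```

The two printed gradients should agree up to finite-difference error, which is what you would expect if R(H) really carries no dependence on theta. In an actual policy gradient method the sum over H is replaced by an average over sampled trajectories.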