CS-456: The log-likelihood trick

Since this is a policy gradient method, we model only the policy, so theta denotes the policy parameters alone; there is no model of the return or the value function. R(H) is the reward collected along trajectory H. It is given by the environment and does not depend on the policy parameters, so for a given trajectory H it is a constant with respect to theta. The parameters affect only the probability of sampling H, not the return of H itself.
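As a rough illustration of why treating R(H) as a constant is enough, here is a minimal sketch that checks the identity grad_theta E[R(H)] = E[R(H) * grad_theta log p_theta(H)] numerically. Everything in it is an assumption made for the example and not part of the exercise: a one-step problem where a trajectory H is a single action, a softmax policy with one logit per action, the made-up reward table REWARDS, and all function names.

```python
import math

def softmax(theta):
    # Numerically stable softmax over the policy logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical rewards: R(H) is fixed by the environment for each
# trajectory H (here, each action) and never reads theta.
REWARDS = [1.0, 0.0, 3.0]

def expected_return(theta):
    # J(theta) = sum_H p_theta(H) * R(H)
    pi = softmax(theta)
    return sum(p * r for p, r in zip(pi, REWARDS))

def grad_log_pi(theta, a):
    # For a softmax policy: d/dtheta_k log pi(a) = 1[k == a] - pi(k)
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]

def score_function_gradient(theta):
    # E_H[ R(H) * grad_theta log p_theta(H) ], with R(H) used as a
    # plain number: only log p_theta(H) is differentiated.
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, REWARDS)):
        g = grad_log_pi(theta, a)
        for k in range(len(theta)):
            grad[k] += p * r * g[k]
    return grad

def finite_difference_gradient(theta, eps=1e-6):
    # Central differences on J(theta), used only as a reference.
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((expected_return(up) - expected_return(down)) / (2 * eps))
    return grad

theta = [0.2, -0.5, 0.1]
print(score_function_gradient(theta))
print(finite_difference_gradient(theta))
```

The two printed gradients should agree up to finite-difference error. In practice the expectation over H would be replaced by an average over sampled trajectories, but the role of R(H) is the same: it multiplies the gradient of the log-probability and is never differentiated itself.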
