Re: Exam 2022, question 4) (iii)

by Michael Alain Hauri
Your reasoning is correct; I think this is just a typo. For policy gradient algorithms we perform gradient ascent, since we want the policy that maximizes the total expected discounted reward, so the update should read ∆θ = α δ ∇θ log πθ(a|s) with a plus sign. A quick sanity check is to look at the sign consistency of the update, as you did: if the TD error δ is positive and the derivative of the log-policy is positive, then ∆θ is positive, which is what we expect.
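
For reference, here is a minimal sketch of that sign check in Python, assuming a one-step actor-critic update with a linear-softmax policy. The feature vector, TD error, and learning rate are made-up numbers for illustration (not from the exam), and the helper names are my own.

```python
# Sign check for the actor-critic update  delta_theta = alpha * delta * grad log pi.
# Linear-softmax policy over 2 actions; all numeric values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_features = 2, 3
theta = rng.normal(size=(n_actions, n_features))  # policy parameters

def softmax_policy(theta, phi):
    """Action probabilities pi(a|s) for a linear-softmax policy."""
    logits = theta @ phi
    logits = logits - logits.max()     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, a):
    """Gradient of log pi(a|s) w.r.t. theta: phi * (1[b=a] - pi(b|s))."""
    pi = softmax_policy(theta, phi)
    grad = -np.outer(pi, phi)          # -pi(b|s) * phi for every action b
    grad[a] += phi                     # +phi for the action actually taken
    return grad

phi = np.array([1.0, 0.5, -0.2])       # hypothetical state features
a = 0                                  # action taken
alpha = 0.1                            # learning rate
td_error = 0.8                         # hypothetical positive TD error

g = grad_log_pi(theta, phi, a)
delta_theta = alpha * td_error * g
theta = theta + delta_theta            # gradient ASCENT: we add the update

# With a positive TD error, each component of delta_theta has the same sign
# as the corresponding component of grad log pi, as argued above.
print("sign(grad log pi):", np.sign(g))
print("sign(delta theta):", np.sign(delta_theta))
```

Running it, the two sign patterns agree componentwise, which is exactly the consistency check from your question: a positive TD error cannot flip the direction suggested by the log-policy gradient.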