CS-456: Exam 2022, question 4) (iii)

Hello,

I am having trouble understanding why the solution to question 4) (iii) of the 2022 exam has a negative sign. Does that contradict the algorithm of Barto and Sutton given in class?

My understanding is that if delta is positive, the agent did something that gave a bigger reward than anticipated, so we want to ascend the gradient for that particular action.

Thank you

Re: Exam 2022, question 4) (iii)

by Michael Alain Hauri - Friday, 14 June 2024, 16:17

Your reasoning is correct; I think this is just a typo in the solution. Indeed, policy gradient algorithms perform gradient ascent to find the policy that maximizes the expected total discounted reward. A quick sanity check is to look at the consistency of the update, as you did: if both the TD error and the derivative of the log-policy are positive, ∆θ is positive, which is what we expect.
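
To make the sign check concrete, here is a minimal sketch. It assumes a two-action softmax policy with one preference parameter per action; the function names, the learning rate, and the numbers are illustrative only and are not taken from the exam or the lecture notes. It compares the update with a plus sign (gradient ascent) to the same update with a minus sign.

```python
import math

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_policy(prefs, action):
    # For a softmax policy: d/d theta_k log pi(action) = 1[k == action] - pi(k).
    probs = softmax(prefs)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(prefs))]

def actor_update(prefs, action, delta, alpha=0.1, sign=1.0):
    # Delta theta = sign * alpha * delta * grad log pi(action).
    # sign=+1.0 is gradient ascent; sign=-1.0 mimics the suspected typo.
    grad = grad_log_policy(prefs, action)
    return [p + sign * alpha * delta * g for p, g in zip(prefs, grad)]

# Hypothetical situation: action 0 was taken and the TD error is positive.
theta = [0.0, 0.0]
action, delta = 0, 1.0

p_before = softmax(theta)[action]
p_ascent = softmax(actor_update(theta, action, delta, sign=1.0))[action]
p_descent = softmax(actor_update(theta, action, delta, sign=-1.0))[action]

print("before:", p_before)
print("plus sign:", p_ascent)
print("minus sign:", p_descent)
```

With a positive delta, the plus-sign update raises the probability of the action that was taken, while the minus-sign update lowers it, which is the opposite of what a better-than-expected outcome should do.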