Exam 2022, question 4) (iii)

by Robin Patriarca

Hello,

I have trouble understanding why the solution to question 4) (iii) of the 2022 exam has a negative sign. Does that contradict the algorithm of Sutton and Barto given in class?

My intuition is that if delta is positive, the agent took an action that yielded a larger reward than anticipated, so we want to ascend the gradient of the log-probability of that particular action.

Thank you

In reply to Robin Patriarca

Re: Exam 2022, question 4) (iii)

by Michael Alain Hauri
Your reasoning is correct; I think this is just a typo in the solution. For policy gradient algorithms, we perform gradient ascent to find the policy that maximizes the expected total discounted reward. A quick sanity check is to look at the consistency of the update, as you did: if the TD error and the derivative of the log-policy are both positive, then ∆θ is positive, which is what we expect.
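
If it helps, the same sign check can be run numerically. The sketch below is only an illustration, not the exam's setup: it assumes a softmax policy over three actions with one parameter per action, and the names `softmax`, `grad_log_softmax`, `actor_update`, and the step size `alpha` are hypothetical. It applies ∆θ = α · δ · ∇θ log π(a) with a positive TD error and compares it with the flipped sign.

```python
import numpy as np

def softmax(prefs):
    # Numerically stable softmax over action preferences.
    z = prefs - prefs.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, action):
    # For a softmax policy with one parameter per action (an assumption),
    # d/dtheta log pi(action) = onehot(action) - pi.
    g = -softmax(theta)
    g[action] += 1.0
    return g

def actor_update(theta, action, delta, alpha=0.1, sign=+1.0):
    # Gradient ascent step: theta <- theta + alpha * delta * grad log pi(action).
    # sign=-1.0 reproduces the update with the suspicious minus sign.
    return theta + sign * alpha * delta * grad_log_softmax(theta, action)

theta = np.zeros(3)   # uniform initial policy
action = 1            # action that was taken
delta = 0.5           # positive TD error: outcome better than predicted

p_before = softmax(theta)[action]
p_ascent = softmax(actor_update(theta, action, delta))[action]
p_flipped = softmax(actor_update(theta, action, delta, sign=-1.0))[action]

print("before:", p_before)
print("with + sign:", p_ascent)
print("with - sign:", p_flipped)
```

With the plus sign, the probability of the action that did better than expected goes up; with the minus sign it goes down, which is the opposite of what a positive TD error should cause.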