Q-learning

Re: Q-learning

by Nicolas El Maalouly

The update in Q-learning does not depend on which action the agent actually took in the environment: it bootstraps from the maximum Q-value over the next actions, i.e. the Q-value of the best action. The Q-function learned at convergence therefore does not depend directly on the policy used to collect the data, which is why Q-learning is called off-policy. Note that this only holds at convergence: before that, the data still depends on the behavior policy, but with enough data (i.e., enough exploration) the learned Q-function is independent of the policy used.
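Concretely, the standard tabular Q-learning update (writing it out here with step size α and discount γ, which were not in the original post) is

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]

The max over a' is taken regardless of which action the agent actually executes in s'.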

In contrast, the SARSA update uses the Q-value of the action the agent actually took in the next state, so the learned Q-function does depend on the policy being followed; that is why SARSA is called on-policy.
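To make the contrast concrete, here is a minimal sketch of the two tabular updates (the function names, the NumPy array layout Q[state, action], and the default alpha/gamma values are my own choices for illustration, not from the course material):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best next action, regardless of
    # which action the behavior policy will actually take in s_next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action a_next the agent actually
    # takes in s_next, so the learned Q depends on the policy followed.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference is the bootstrap target: Q-learning maximizes over the next actions, while SARSA plugs in the next action the agent actually chose.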