Off-policy vs On-policy training

by Ariane Delrocq

Hi Amine,

The difference is in the name: off- vs on-policy.

In both cases, we want to do the update step with multiple transitions, because mini-batch updates (as opposed to updates based on a single sample) make learning more stable.
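
You can see the statistical effect in a tiny sketch (purely illustrative, nothing to do with the exercise code): the average error over a mini-batch fluctuates much less than the error of a single transition.

```python
# Illustrative only: averaging noisy per-transition errors over a
# mini-batch reduces the variance of each update step.
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(size=32 * 500)               # pretend per-transition TD errors

single_sample_updates = errors                                # one transition per update
minibatch_updates = errors.reshape(-1, 32).mean(axis=1)       # mini-batches of 32

print(np.std(single_sample_updates))             # ~1.0
print(np.std(minibatch_updates))                 # ~1/sqrt(32), i.e. much smaller
```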

In the on-policy algorithm, "on-policy" means that we learn values that depend on the policy: the value of an action depends on how future actions will be chosen, i.e. on the policy we are currently following. So we need to learn from transitions that reflect the current policy. That's why we cannot pick transitions from an old memory buffer (the policy was different when they were collected). Instead, the way to get multiple transitions generated by the current policy is to take transitions from multiple parallel agents.
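
In pseudo-code it looks something like this (just a sketch; `envs`, `policy.act` and the `reset`/`step` interface are made-up names, not our actual framework):

```python
# Sketch of on-policy collection: every transition in the batch is
# generated by the *current* policy, via several parallel environments.
# The environment/policy interface here is assumed, not the course API.
def collect_on_policy_batch(envs, policy, n_steps):
    batch = []
    states = [env.reset() for env in envs]
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            action = policy.act(states[i])                     # current policy only
            next_state, reward, done, info = env.step(action)
            batch.append((states[i], action, reward, next_state, done))
            states[i] = env.reset() if done else next_state
    return batch  # used for one update, then thrown away
```

Note that the batch is discarded right after the update, because the updated policy would no longer match it.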

In the off-policy algorithm, the values we learn do not depend on the policy that generated the data (for example, Q-learning targets the greedy policy no matter how the transitions were collected). So we can afford to reuse transitions from previous steps, stored in the replay buffer.
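
For the off-policy case, the replay buffer can be as simple as this (again just a sketch with my own naming, not the actual implementation from the exercise):

```python
# Sketch of off-policy reuse: transitions collected under old policies
# stay usable, so we store them and sample random mini-batches later.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest transitions fall out

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Uniform random mini-batch, possibly mixing many old policies."""
        idx = random.sample(range(len(self.storage)), batch_size)
        return [self.storage[i] for i in idx]
```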

Hope that helps!