Off-policy vs On-policy training

by Amine Belghmi
Number of replies: 1

Hello,

I don't understand why, in off-policy deep RL training, we sample data from a memory buffer, whereas in on-policy deep RL, we run multiple agents in parallel.

Thanks.

In reply to Amine Belghmi

Re: Off-policy vs On-policy training

by Ariane Delrocq
Hi Amine,

The difference is in the name off- vs on-policy.

In both cases, we want to do the update step with multiple transitions, because mini-batches (as opposed to updates based on a single sample) make learning more stable.
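
To make the stability point concrete, here is a rough sketch (just an illustration; `grad_fn` and the transition format are placeholders, not anything from the course code): averaging the gradient over a mini-batch gives a much less noisy parameter update than a single-sample step.

```python
import numpy as np

def single_sample_update(theta, grad_fn, transition, lr=1e-3):
    # One transition -> one (noisy) gradient estimate -> one noisy step.
    return theta - lr * grad_fn(theta, transition)

def mini_batch_update(theta, grad_fn, transitions, lr=1e-3):
    # Average the gradient over, say, 32 transitions: the estimate has
    # roughly 1/batch_size of the variance, so the step is much smoother.
    grads = [grad_fn(theta, t) for t in transitions]
    return theta - lr * np.mean(grads, axis=0)
```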

In an on-policy algorithm, "on-policy" means that we learn values that depend on the policy (the value of a state or action depends on how future actions will be chosen). So we need to learn from transitions generated by the current policy. That's why we cannot pick transitions from an old memory buffer: the policy was different when they were collected. Instead, the way to get multiple transitions from the current policy is to collect them from multiple agents running in parallel.
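
A minimal sketch of that collection loop, assuming gym-style environments (`env.step` returns `(obs, reward, done, info)`) and a `policy` callable; both are placeholders, not the course's actual code:

```python
def collect_parallel_transitions(envs, observations, policy):
    """One synchronized step of every parallel agent under the CURRENT policy."""
    batch = []
    for i, env in enumerate(envs):
        action = policy(observations[i])               # current policy, nothing stale
        next_obs, reward, done, _ = env.step(action)
        batch.append((observations[i], action, reward, next_obs, done))
        observations[i] = env.reset() if done else next_obs
    return batch, observations

# The resulting mini-batch is used for ONE update and then discarded,
# because after the update the policy (and so the data distribution) has changed.
```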

In an off-policy algorithm, the values we learn do not depend on the policy that generated the data. So we can afford to reuse transitions from earlier steps, stored in the buffer, even if they were collected under older versions of the policy.
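
A rough sketch of such a replay buffer (names and sizes are illustrative, not from the course code):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)          # old transitions are kept

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        # The mini-batch may mix transitions collected under many past
        # versions of the policy; that is fine because the learned values
        # do not depend on the behaviour policy.
        return random.sample(self.storage, batch_size)
```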

Hope that helps!