Mp2-3.4: Training batch when the episode is done

by Hyeongdon Moon,
Number of replies: 4

Hi, we have a question about the implementation instruction in section 3.4. We are wondering whether we should update on the batch at the end of an episode even if the batch is not full. Given the combination with the K-worker implementation, it seems natural to fill the batch up to the batch size before updating, but is this part of the documentation asking us to update when the episode terminates early? Thanks!

In reply to Hyeongdon Moon

Re: Mp2-3.4: Training batch when the episode is done

by Skander Moalla,
Hello,

I'm not sure I understand the question. What is one "experience"? What does it mean to fill the batch?

The task is to collect n steps from each of the K workers, resulting in a batch of K*n samples. Compute the (up-to-)n-step advantages on each sample in the batch and use all of them at once to compute the gradients.
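A minimal sketch of this collection loop, assuming a vectorized setup with K workers stepped in lockstep for n steps. `ToyEnv`, the variable names, and the reset-on-done convention are illustrative assumptions, not part of the assignment handout:

```python
import numpy as np

K = 4  # number of parallel workers (assumed value)
N = 6  # n-step rollout length (assumed value)

class ToyEnv:
    """Hypothetical stand-in environment: episode ends after a random horizon."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.reset()
    def reset(self):
        self.t = 0
        self.horizon = int(self.rng.integers(2, 8))
        return float(self.t)
    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return float(self.t), 1.0, done  # obs, reward, done

envs = [ToyEnv(s) for s in range(K)]
obs = [e.reset() for e in envs]

# Collect exactly N steps from each worker; if an episode ends mid-rollout,
# reset the env and keep collecting. The batch always holds K * N samples.
batch_obs, batch_rewards, batch_dones = [], [], []
for t in range(N):
    for k, env in enumerate(envs):
        o, r, done = env.step(action=0)
        batch_obs.append(obs[k])
        batch_rewards.append(r)
        batch_dones.append(done)
        obs[k] = env.reset() if done else o

assert len(batch_obs) == K * N  # one flat batch of K*n samples
```

Storing the `done` flags alongside the rewards is what later lets the advantage computation cut the bootstrap at episode boundaries.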
In reply to Skander Moalla

Re: Mp2-3.4: Training batch when the episode is done

by Hyeongdon Moon,
Oh, sorry for using terms from my personal implementation. By "experience" I meant one step. I just want to ask whether we should start a new batch when an episode terminates with fewer than n steps in the batch. If n=6 and episode A ends after 3 steps, and the next episode is B, is it okay to put [A1,A2,A3,B1,B2,B3] in a single batch, or should I split it into [A1,A2,A3] and [B1,B2,B3]?
In reply to Hyeongdon Moon

Re: Mp2-3.4: Training batch when the episode is done

by Skander Moalla,
Thanks for the clarification. Yes, all n steps should be included even if the env is reset in between. In that case, you would include the A and B samples as you describe, and would have to bootstrap correctly so that samples from B don't end up in the value estimation of the samples from A.
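One way to implement that bootstrap cut is a backward recursion masked by the `done` flags: a `done` at step i zeroes the discounted tail, so returns for A1..A3 never see rewards or the bootstrap value from B. This is a sketch under those assumptions (function name, gamma value, and the toy numbers are illustrative):

```python
import numpy as np

def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """(Up-to-)n-step returns for one worker's rollout of length n.

    rewards, dones: arrays of shape (n,) for steps t .. t+n-1.
    bootstrap_value: V(s_{t+n}), used only for the part of the rollout
    after the last reset; the (1 - done) mask cuts the recursion at
    every episode boundary.
    """
    n = len(rewards)
    returns = np.empty(n)
    R = bootstrap_value
    for i in reversed(range(n)):
        # done[i] == 1 means step i was terminal: the return at i is just
        # the reward, and nothing from later steps (episode B) leaks back.
        R = rewards[i] + gamma * R * (1.0 - dones[i])
        returns[i] = R
    return returns

# Example: n=6, episode A ends at step 3 (done=1), episode B fills steps 4-6.
rewards = np.array([1., 1., 1., 1., 1., 1.])
dones   = np.array([0., 0., 1., 0., 0., 0.])
rets = n_step_returns(rewards, dones, bootstrap_value=10.0, gamma=0.5)
# rets -> [1.75, 1.5, 1.0, 3.0, 4.0, 6.0]
# A's returns (first three) only sum A's rewards; the bootstrap value 10.0
# contributes only to the B part of the rollout.
```

The advantage for each sample is then `returns[i] - V(s_i)` as usual; the masking alone is what keeps the two episodes separate inside the shared batch.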