Tips for the RL miniproject

by Manu Srinath Halvagal

Hello everyone,


We have been getting lots of questions about how to get your REINFORCE agent to start learning to win games. Here are some tips that should answer the most frequent questions and save you some time on the miniproject:
  1. Please use the update rule theta <- theta + alpha*G*grad(logprob(action)). You might notice that this differs from the REINFORCE update in the lecture by leaving out the extra gamma^t factor (G is still the discounted return). This corresponds to a different objective function than unmodified REINFORCE (details below).
  2. Note that learning the task can take quite a while, and you might have to wait up to 3000 episodes before your agent scores points consistently. However, if you have reward_shaping on, you should already see the reward increasing within the first 500 episodes or so (the curve will still be noisy).
  3. As already mentioned in the notebook, please normalize your returns by their standard deviation. This should reduce training time considerably.
  4. Pay attention to the fact that you need to perform gradient ascent (not descent) on the policy objective, and that the cross-entropy loss already includes a negative sign on the log probabilities. A short sketch putting tips 1, 3 and 4 together follows right after this list.
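
Here is a minimal sketch of one episode's update, assuming PyTorch; the names reinforce_update, policy_optimizer, log_probs and rewards are placeholders and not taken from the notebook. It is only an illustration of the tips above, not the reference implementation.

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a finished episode (placeholder names, PyTorch assumed)."""
    # Discounted return G_t for every timestep, computed backwards through the episode.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Tip 3: normalize the returns by their standard deviation.
    returns = returns / (returns.std() + 1e-8)

    # Tips 1 and 4: theta <- theta + alpha * G * grad(logprob(action)).
    # We do gradient *ascent* by minimizing the negated objective. If you use a
    # cross-entropy loss instead of log_prob(), the minus sign is already included,
    # so do not negate a second time.
    loss = -(torch.stack(log_probs) * returns).sum()

    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```

You would call this once at the end of every episode, e.g. reinforce_update(policy_optimizer, episode_log_probs, episode_rewards), where the episode buffers hold one log-probability tensor and one scalar reward per timestep.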
Re: the update rule
REINFORCE can be implemented with different objective functions. The extra gamma^t term you have seen in the lecture and the textbook arises from defining the maximized objective J(theta) as the expected return (value) of the starting state. An alternative objective is the average return over all states, which does not have this term. Incidentally, most state-of-the-art actor-critic algorithms such as A3C use this second objective (e.g. algorithm S3 in https://arxiv.org/pdf/1602.01783.pdf). The snippet below illustrates the difference.
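
To make the difference concrete, here is a hedged illustration of how the two objectives change the per-timestep weighting of the loss; returns, log_probs and gamma are the same hypothetical quantities as in the sketch above.

```python
# Illustration only; `returns`, `log_probs` and `gamma` are the quantities from the
# sketch above (G_t and log pi(a_t|s_t) for each timestep, and the discount factor).

# Objective = expected return of the starting state (lecture/textbook REINFORCE):
# every timestep carries an extra gamma^t factor.
loss_start_state = -sum((gamma ** t) * G * lp
                        for t, (G, lp) in enumerate(zip(returns, log_probs)))

# Objective = average return over states (the update rule in tip 1): no gamma^t factor.
loss_average_return = -sum(G * lp for G, lp in zip(returns, log_probs))
```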

I hope this helps you in your efforts to beat the computer at Pong!
Happy reinforcing!

Best regards,
The TA team