CS-456: A clarification for the mini-projects

Hello everyone,

Following up on a couple of questions in today's exercise session, we would like to add a clarification.

You were told that "For DQN, we do not constrain actions to only available actions. However, whenever the agent takes an unavailable action, we end the game and give the agent a negative reward of value r_{unav} = −1." We would like to clarify that, under the epsilon-greedy policy for action selection, this statement applies only to exploitation steps, i.e., when the greedy action arg max_a Q(s,a) (taken with probability 1 - epsilon) is unavailable. Exploratory actions (taken with probability epsilon) should instead be sampled uniformly from the available actions only, so an exploratory step can never select an unavailable action.
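
For illustration, here is a minimal sketch of this selection rule. It is not part of the official project code: the function name `select_action`, its arguments, and the representation of Q-values as a plain list indexed by action are all assumptions made for this example.

```python
import random


def select_action(q_values, available_actions, epsilon, rng=random):
    """Epsilon-greedy action selection (hypothetical helper, for illustration).

    q_values: list of Q(s, a) for ALL actions, available or not.
    available_actions: non-empty list of indices of the available actions.
    epsilon: exploration probability in [0, 1].
    rng: object providing random() and choice(), e.g. random.Random(seed).
    """
    if rng.random() < epsilon:
        # Exploration: sample uniformly from the AVAILABLE actions only.
        return rng.choice(available_actions)
    # Exploitation: arg max over ALL actions, so the result may be unavailable.
    # Ties are broken in favor of the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.9, 0.3]   # action 1 has the highest Q-value
available = [0, 2]    # but action 1 is not available

greedy = select_action(q, available, epsilon=0.0)
print(greedy, greedy in available)
```

In this sketch, the caller remains responsible for checking whether the returned action is available and, if it is not, ending the game and assigning r_{unav} = −1.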

We hope this clarification helps.

Best,
TAs