Dear Prof and TAs:
I would like to ask about the second true/false question. Why does setting δ = r_{t+1} give REINFORCE without a baseline, rather than setting δ = r_{t+1} + γ V(s_{t+1}), given that G in the policy gradient is the discounted cumulative reward?
Thank you very much!