Since this is a policy gradient method, we are only modeling the policy, not the return or value function, so theta denotes the policy parameters only. R(H) is the reward collected along trajectory H. It is given by the environment, so for a given trajectory H it does not depend on theta and is treated as a constant when differentiating with respect to theta. The policy parameters only change how likely each trajectory is, not the return of a fixed trajectory.
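
To make this concrete, here is a minimal numerical sketch. It is a toy setup, not anything from the original question: each trajectory H is assumed to be a single action, the policy is assumed to be a softmax over theta, and the names `R`, `softmax`, `policy_gradient` and `finite_difference` are hypothetical. The gradient is computed by differentiating only log pi(H) and multiplying by R(H) as a constant, then compared against a finite-difference gradient of the expected return.

```python
import math

def softmax(theta):
    # Numerically stable softmax over the policy parameters.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy environment: trajectory H is one action, and R[H] is the
# fixed return the environment assigns to it. It has no theta in it.
R = [1.0, 0.0, 3.0]

def expected_return(theta):
    # J(theta) = sum_H pi(H; theta) * R(H)
    return sum(p * r for p, r in zip(softmax(theta), R))

def policy_gradient(theta):
    # grad J = sum_H pi(H) * R(H) * grad log pi(H)
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for h in range(n):
        for j in range(n):
            # For a softmax policy: d log pi(h) / d theta_j = 1[h == j] - pi[j]
            score = (1.0 if h == j else 0.0) - pi[j]
            # R[h] only scales the score; it is never differentiated.
            grad[j] += pi[h] * R[h] * score
    return grad

def finite_difference(theta, eps=1e-6):
    # Central-difference gradient of J, used purely as a check.
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grad.append((expected_return(up) - expected_return(down)) / (2 * eps))
    return grad

theta = [0.2, -0.5, 0.1]
pg = policy_gradient(theta)
fd = finite_difference(theta)
print(pg)
print(fd)
```

The two printed gradients should agree up to finite-difference error, which is what you would expect if treating R(H) as a constant with respect to theta is valid.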