(Sorry, I did not know where to post this)
I am struggling to see how we are allowed to go from to . If R(H) is independent of theta, I can see where this comes from, since log(f(x))' = (1/f(x)) * f'(x), but by the chain rule, grad(R)*log(p) =/= grad(R).
Is R independent of theta, or do we simply have another way of modeling the future rewards?
(I assume they are applied later with a discount.)
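
To make the case I can follow concrete, here is a minimal numerical sketch. It assumes R(H) is a fixed number per outcome H (so it does not depend on theta) and that p(H; theta) is a softmax over three outcomes. The softmax policy, the function names, and the values of theta and rewards are all made up for illustration; they are not from the material I am asking about. Under those assumptions, the log-derivative form sum_H p(H) * grad(log p(H)) * R(H) should match a finite-difference gradient of E[R]:

```python
import math

def softmax(theta):
    # Numerically stable softmax: p_i = exp(theta_i) / sum_k exp(theta_k)
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, rewards):
    # E[R] = sum_H p(H; theta) * R(H), with R(H) held fixed (assumption)
    p = softmax(theta)
    return sum(pi * r for pi, r in zip(p, rewards))

def score_function_gradient(theta, rewards):
    # sum_H p(H) * grad(log p(H)) * R(H)
    # For a softmax, d log p_i / d theta_j = 1[i == j] - p_j
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for i in range(n):
        for j in range(n):
            dlogp = (1.0 if i == j else 0.0) - p[j]
            grad[j] += p[i] * dlogp * rewards[i]
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    # Central differences on E[R], used only as an independent check
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]    # hypothetical policy parameters
rewards = [1.0, 0.0, 3.0]   # hypothetical fixed rewards R(H)

analytic = score_function_gradient(theta, rewards)
numeric = finite_difference_gradient(theta, rewards)
print(analytic)
print(numeric)
```

If R did depend on theta, the product rule would add a second term, sum_H p(H) * grad(R(H)), which this sketch does not include. That extra term is exactly what my question is about.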