CS-456: Question about loss function for 2019 Exam question 6

In this question we are asked to define a loss function for 2-step SARSA with a neural network.

I presume this means defining a loss like:

$$L = \sum_{\text{episodes}} \sum_{t} \frac{1}{2} \left( r_t + \gamma r_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2}) - Q(s_t, a_t) \right)^2$$
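To make concrete what I have in mind, here is a minimal sketch that only evaluates this loss on recorded episodes (no gradients). Everything in it is my own assumption rather than something from the exam: the function name `two_step_sarsa_loss`, the `(state, action, reward)` layout of a step, and `q` standing in for the network's output Q(s, a). It also simply skips the last two timesteps of each episode instead of handling terminal states.

```python
from typing import Callable, List, Tuple

# One recorded step: (state, action, reward). Illustrative layout only.
Step = Tuple[int, int, float]


def two_step_sarsa_loss(
    episodes: List[List[Step]],
    q: Callable[[int, int], float],
    gamma: float,
) -> float:
    """Sum of 1/2 * (two-step TD error)^2 over all episodes and timesteps."""
    total = 0.0
    for episode in episodes:
        # The target at time t needs (s_{t+2}, a_{t+2}), so stop two steps early.
        for t in range(len(episode) - 2):
            s_t, a_t, r_t = episode[t]
            _, _, r_next = episode[t + 1]
            s_t2, a_t2, _ = episode[t + 2]
            target = r_t + gamma * r_next + gamma ** 2 * q(s_t2, a_t2)
            total += 0.5 * (target - q(s_t, a_t)) ** 2
    return total


# Toy check with a hypothetical Q that returns 1.0 for every (s, a).
episode = [(0, 0, 1.0), (1, 1, 0.0), (2, 0, 0.0)]
loss = two_step_sarsa_loss([episode], lambda s, a: 1.0, gamma=0.5)
```

In the toy check only t = 0 contributes: the target is 1.0 + 0.5 * 0.0 + 0.25 * 1.0 = 1.25, the error is 0.25, and the loss is 0.5 * 0.25^2 = 0.03125.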

Should we introduce statistical weights here and then use a log-likelihood trick similar to the gradient ascent derivation of policy gradient?

If so, how would we go about this? It seems to me that we would then run into the issue that the policy is unknown.

Thank you in advance,

Lucas Crijns