Hello,
Would it be possible to get some more explanation of this exercise?
Specifically:
1. In question a, why is the update made with Q(s6,a2)?
2. In question b, why does the first update in the third trial happen for Q(s,a), and why is that update made with Q(s4,a1)?
Thank you,
Virginie
1) A nonzero reward only occurs after state s5, so the Q-value for s5 gets the first update. In s6 the only possible action is up, which is a2, so the next Q-value after state s5 is Q(s6,a2).
2) This should be s3 instead of s (just as in the second trial we had s4).
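To make both points concrete, here is a minimal sketch of how the updates move back one state per trial. It is not the exercise's actual setup: the learning rate, the discount factor, the reward of 1, the path s3 -> s4 -> s5 -> s6 and the actions taken in s3 and s5 are all assumptions made for illustration. Only Q(s4,a1) and Q(s6,a2) come from the discussion above. Because each next state is paired with a single action, it makes no difference here whether the target uses the maximum over next actions or the action actually taken.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed, not given in the thread)
GAMMA = 0.9  # discount factor (assumed, not given in the thread)


def td_update(Q, s, a, r, s_next, a_next):
    """Move Q(s, a) toward the target r + GAMMA * Q(s_next, a_next)."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])


# Hypothetical episode: (state, action, reward received after the action).
# The only nonzero reward comes after leaving s5; the episode ends in s6,
# where the single available action is a2.
steps = [
    ("s3", "a1", 0.0),
    ("s4", "a1", 0.0),
    ("s5", "a1", 1.0),
    ("s6", "a2", 0.0),  # used only as the "next" state-action pair
]

Q = defaultdict(float)  # all Q-values start at 0
first_changed = []      # first Q-value that changes in each trial

for trial in range(3):
    changed = None
    for (s, a, r), (s_next, a_next, _) in zip(steps, steps[1:]):
        before = Q[(s, a)]
        td_update(Q, s, a, r, s_next, a_next)
        if changed is None and Q[(s, a)] != before:
            changed = (s, a)
    first_changed.append(changed)

for trial, pair in enumerate(first_changed, start=1):
    print(f"trial {trial}: first update at Q{pair}")
```

In the first trial every target is zero until the reward appears, so only the s5 entry changes, and its target contains Q(s6,a2). In the second trial the s4 entry changes first because Q(s5, .) is now nonzero, and in the third trial the s3 entry changes first because its target now contains a nonzero Q(s4,a1).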
Thank you!