Re: MP1

by Anja Surina
In this context, solving the problem means obtaining a good policy that consistently reaches the final state in fewer than 200 steps. The policy will not achieve the same reward in every episode, because the agent is placed stochastically in the environment. There is, however, a pattern that emerges, which we ask you about later in the document.
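If it helps, here is a rough sketch of how one might check "consistently" from a list of episode lengths. It is only an illustration, not part of the assignment: the function names, the 0.95 threshold, and the sample step counts are all made up, and it assumes an episode that never reaches the final state is recorded with a length of 200.

```python
def success_rate(episode_lengths, max_steps=200):
    """Fraction of episodes that reached the final state in fewer than max_steps steps."""
    if not episode_lengths:
        raise ValueError("need at least one episode")
    # Assumption: an episode that hit the step limit is logged with length == max_steps.
    successes = sum(1 for n in episode_lengths if n < max_steps)
    return successes / len(episode_lengths)


def is_solved(episode_lengths, max_steps=200, required_rate=0.95):
    """True if the policy finishes early in at least required_rate of the episodes."""
    # required_rate is an assumed cutoff for "consistently"; the assignment does not give one.
    return success_rate(episode_lengths, max_steps) >= required_rate


# Hypothetical step counts from five evaluation episodes. They differ because
# the start position is random, but every episode ends before the limit.
lengths = [112, 143, 98, 160, 127]
print(success_rate(lengths))  # 1.0
print(is_solved(lengths))     # True
```

The point is that the criterion is about how often the policy finishes early across many episodes, not about getting an identical reward each time.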