In this notebook, the goal is to implement four policy gradient algorithms: the regular policy gradient (REINFORCE), the policy gradient with a subtracted baseline (REINFORCE with baseline), the one-step actor-critic method, and the advantage actor-critic method. Our agent will play a game in a continuous state space, for which we will use a basis-function representation to limit ourselves to a finite number of parameters.
Let's start with a few imports:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from environments.environment2 import *
import matplotlib
import matplotlib.cm
import matplotlib.colors
import numpy as np
import random
from tqdm import tqdm
from scipy.special import softmax  # only softmax is needed, for the softmax policy
# reasonable plot size
plt.rcParams['figure.dpi'] = 100
The environment is a 2D grid, with a goal in the center, and a "U"-shaped barrier. Our goal is to learn how to reach the objective in the center regardless of where we start. In particular, we will study how the agent learns to go around the barrier when starting at the bottom of the grid.
Here is what the grid looks like (the objective is in green, the walls are in red):
env = UMaze()
env.render()
We assume that the state space is continuous, i.e. that the agent can be at any allowed position in the grid. At each position, we can move a fixed distance $0.2$ horizontally, vertically, or diagonally. Hence, there are $8$ possible actions:
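For illustration only, the eight moves could be enumerated as the eight compass directions scaled to a step of length $0.2$; the actual MOVES list is presumably provided by the environment module and may order or scale the diagonal moves differently.
# hypothetical sketch of the 8-action set; not necessarily identical to the environment's MOVES
directions = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
moves_sketch = [tuple(0.2 * np.array(d) / np.linalg.norm(d)) for d in directions]
print(moves_sketch)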
We grant a reward of $1$ when the agent reaches the objective. Actions that would cause the agent to hit a wall are ignored.
Note: the agent is not penalized when it makes a wrong move, otherwise it might become scared of walls and get stuck below the U.
To model our policy function for the continuous state space, we first need to assign a value to each state and each action. For that, we will use a grid of $N$ Gaussian basis functions with a fixed standard deviation $\sigma$. We will study how these two parameters influence the speed and quality of learning.
Let's define the action value function $q(\vec s, a)$ as follows: $$q(\vec s, a) = \sum_{i=1}^N w_{i,a} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$$ where $\vec x_i$ denotes the position of the $i$-th basis function. For our policy function $\pi$, we will use a softmax policy: $$\pi(\vec s, a) = \frac{\exp {q(\vec s, a)}}{\sum_{b=0}^7 \exp {q(\vec s, b)}}.$$ Our goal is to learn the $N \times 8$ weights $w_{i, a}$.
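For later use, write $\phi_i(\vec s) = \exp\left(-\frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$ for the basis functions. With the softmax parameterization above, the gradient of the log-policy then has a simple closed form (this is what the compute_action_log_grad helper below implements): $$\frac{\partial \ln \pi(\vec s, a)}{\partial w_{i,b}} = \phi_i(\vec s)\,\big(\mathbf{1}[a = b] - \pi(\vec s, b)\big).$$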
For some algorithms, we will also need to learn a state value function $v(\vec s)$. Once again, we will use basis functions: $$v(\vec s) = \sum_{i=1}^N \theta_{i} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right).$$ For the value function, there are $N$ parameters to learn: the weights $\theta_i$.
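In vector notation, with $\vec\phi(\vec s) = (\phi_1(\vec s), \dots, \phi_N(\vec s))$, we simply have $v(\vec s) = \vec\theta^{\,T} \vec\phi(\vec s)$ and $q(\vec s, a) = \vec w_a^{\,T} \vec\phi(\vec s)$, so that $\nabla_\theta v(\vec s) = \vec\phi(\vec s)$; this is why the compute_value_grad helper below simply returns the interpolation weights.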
The standard policy gradient only uses the policy function and the gradient of the log-policy $\nabla_{w} \ln \pi(\vec s, a)$. Here is the pseudocode:
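In symbols, after an episode of length $T$, the update applied for every step $t$ (consistent with the reinforce implementation further below) is $$G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}, \qquad w \leftarrow w + \alpha \, \gamma^t \, G_t \, \nabla_w \ln \pi(\vec s_t, a_t),$$ where $\alpha$ is the action learning rate.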
We can subtract a baseline in the policy gradient algorithm. As seen in the lecture, that should improve the speed of learning. We will use the value function as the baseline, and we will also need its gradient to update it.
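Concretely (a sketch consistent with the reinforce_with_baseline implementation below), the return is compared to the baseline, $\delta_t = G_t - v(\vec s_t)$, and both sets of weights are updated: $$\theta \leftarrow \theta + \alpha_v \, \delta_t \, \nabla_\theta v(\vec s_t), \qquad w \leftarrow w + \alpha \, \gamma^t \, \delta_t \, \nabla_w \ln \pi(\vec s_t, a_t).$$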
In the previous two algorithms, we can only perform the update at the end of the episode. The disadvantages are that we need to wait longer before we can learn, and that we use more memory to store all the past states, actions and rewards. With the one-step actor-critic algorithm, the updates are instead performed after every step.
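At each step the TD error takes the place of the return (matching the one_step_actor_critic implementation below): $$\delta_t = r_{t+1} + \gamma \, v(\vec s_{t+1}) - v(\vec s_t), \qquad \theta \leftarrow \theta + \alpha_v \, \delta_t \, \nabla_\theta v(\vec s_t), \qquad w \leftarrow w + \alpha \, \gamma^t \, \delta_t \, \nabla_w \ln \pi(\vec s_t, a_t),$$ where $v(\vec s_{t+1})$ is taken to be $0$ at the end of the episode.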
We also look at the eligibility trace version of the actor-critic algorithm, also known as Advantage Actor-Critic.
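Instead of applying the instantaneous gradients directly, both updates accumulate them in eligibility traces with decay rate $\lambda$ (the trace_decay argument in the code below): $$z^\theta \leftarrow \gamma \lambda \, z^\theta + \nabla_\theta v(\vec s_t), \qquad z^w \leftarrow \gamma \lambda \, z^w + \gamma^t \, \nabla_w \ln \pi(\vec s_t, a_t),$$ followed by $\theta \leftarrow \theta + \alpha_v \, \delta_t \, z^\theta$ and $w \leftarrow w + \alpha \, \delta_t \, z^w$, with the same TD error $\delta_t$ as above.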
Your goal is to implement the algorithms explained above, and observe the following:
Here are a few important notes:
Here are a few core functions that we will use in all four algorithms. The ModelParams class is used to store the discount factor $\gamma$, the count and shape of the basis functions, and the weights $w_{i,a}$ and $\theta_i$. Please complete the code as necessary.
class ModelParams:
def __init__(self, npoints=13, gamma=0.99, stdev=0.5):
self.action_weights = np.zeros((npoints * npoints, 8))
self.value_weights = np.zeros(npoints * npoints)
# need to be VERY careful here given the large number of iterations
# if gamma is too small we'll have a hard time updating the policy correctly!
self.gamma = gamma
self.stdev = stdev
self.npoints = npoints
positions = np.linspace(-6, 6, npoints)
x_pos, y_pos = np.meshgrid(positions, positions)
self.x_pos = x_pos.flatten()
self.y_pos = y_pos.flatten()
self.w_cache = None
self.w_cache_pos = None
def get_interpolation_weights(self, position: State) -> np.ndarray:
"""
This returns an array of size N, weight for each i: exp(-||s-x_i||^2/(2\sigma^2))
position: a list of two values: first for x pos; second for y pos
"""
# We do a bit of caching to not compute this every time
if self.w_cache_pos == position:
return self.w_cache
w = np.exp(- ((self.x_pos - position[0]) ** 2 + (self.y_pos - position[1]) ** 2) / (2 * self.stdev ** 2))
self.w_cache = w
self.w_cache_pos = position
return w
def compute_value(self, position: State) -> float:
"""
This computes the v(s) function.
"""
w = self.get_interpolation_weights(position)
v = np.sum(w*self.value_weights)
return v
def get_action_values(self, position: State) -> np.ndarray:
"""
This computes [q(s,a) for all a] for a specific s.
"""
w = self.get_interpolation_weights(position)
q = np.sum(np.expand_dims(w,1)*self.action_weights, axis=0)
return q
def get_action_probabilities(self, position: State) -> np.ndarray:
"""
This computes the [\pi(s, a) for all a].
"""
q = self.get_action_values(position)
pi = softmax(q)
return pi
def get_action_probability(self, position: State, action: Action) -> float:
"""
This computes \pi(s, a).
action: an int of [0,7]
"""
return self.get_action_probabilities(position)[action]
def pick_action(self, position: State) -> Action:
"""
This randomly chooses an action following the policy.
"""
probs = self.get_action_probabilities(position)
# np.random.choice is a bit slow, so we code it by hand
        r = random.random()
        for i, p in enumerate(probs):
            r -= p
            if r < 0:
                return i
        # floating-point rounding can leave r non-negative after the loop; fall back to the last action
        return len(probs) - 1
def compute_action_log_grad(self, position: State, action: Action) -> np.ndarray:
"""
This computes \nabla_w \ln(\pi(s,a))
"""
pi = self.get_action_probabilities(position)
w = self.get_interpolation_weights(position)
        # grad[i, b] = phi_i(s) * (1[b == action] - pi(s, b))
        grad = - np.outer(w, pi)
        grad[:, action] += w
return grad
def compute_value_grad(self, position: State) -> np.ndarray:
"""
This computes \nabla_\theta v(s).
"""
w = self.get_interpolation_weights(position)
return w
def find_starting_pos(env: UMaze) -> State:
"""
This picks a random starting position inside the bounds of the environment.
"""
pos_candidate = None
while pos_candidate is None:
pos_candidate = (random.uniform(-5, 5), random.uniform(-5, 5))
# the two below are easier, and can be used for debugging
# pos_candidate = (0, 4)
# pos_candidate = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
if env.is_outside_of_area(*pos_candidate) or env.is_in_objective(*pos_candidate):
pos_candidate = None
return pos_candidate
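As a quick sanity check of the helpers above (optional; the numbers are illustrative), the action probabilities of an untrained model should sum to one, and the softmax log-policy gradient summed over actions should vanish:
params_check = ModelParams(npoints=9, stdev=0.5)
probs_check = params_check.get_action_probabilities((0.0, 4.0))
print(probs_check.sum())  # should print ~1.0
grad_check = params_check.compute_action_log_grad((0.0, 4.0), action=3)
print(np.abs(grad_check.sum(axis=1)).max())  # should print ~0.0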
Here are a few functions that provide nice visualizations of the learned value and policy functions.
def render_values(params: ModelParams):
"""
Visualises the v(s) function for all possible states.
"""
xs = np.linspace(-6, 6, 101)
ys = np.linspace(-6, 6, 101)
X, Y = np.meshgrid(xs, ys, indexing="ij")
values = np.zeros_like(X)
for i in range(101):
for j in range(101):
values[i, j] = params.compute_value((xs[i], ys[j]))
#plt.imshow(values, extent=[-6, 6, -6, 6])
cs = plt.contourf(X, Y, values, levels=10, cmap="jet")
plt.xlabel("x")
plt.ylabel("y")
cbar = plt.colorbar(cs)
cbar.set_label("Value")
plt.title("Computed values")
def render_actions(params: ModelParams):
"""
This visualises the most probable action in each state using a colored arrow.
Color of the arrow indicates probability.
Only shows a few discrete points.
"""
colormap = matplotlib.cm.get_cmap("jet")
xs = np.linspace(-5, 5, 21)
ys = np.linspace(-5, 5, 21)
for x in xs:
for y in ys:
action_probabilities = params.get_action_probabilities((x, y))
action = np.argmax(action_probabilities)
max_probability = action_probabilities[action]
plt.plot(x, y, "o", color="black", markersize=0.1)
plt.arrow(x, y, 0.3 * MOVES[action][0], 0.3 * MOVES[action][1], head_width=0.1, head_length=0.1, color=colormap(max_probability), linewidth=1)
plt.xlabel("x")
plt.ylabel("y")
cbar = plt.colorbar(matplotlib.cm.ScalarMappable(norm=matplotlib.colors.Normalize(vmin=0, vmax=1), cmap=colormap))
cbar.set_label("Probability")
plt.title("Preferred actions")
Here is a function that you can use to run a learning algorithm, and display a nice visualization of the results.
from IPython.display import Markdown
def run_algorithm(env: UMaze, params: ModelParams, algorithm, plot_values, num_episodes: int, bucket_size: int = 500, **kwargs):
"""
This function runs an algorithm for a specified number of episodes.
Its arguments are:
* env(UMaze): Instance of the environment
* params(ModelParam): Parameters given to the model.
* algorithm(function that accepts env and params as arguments and returns a Tuple[float, int]): Your algorithm.
* plot_values(bool): If True, it will plot the values using one of the helper functions above.
* num_episodes(int): Number of episodes to play.
    * bucket_size(int): Specifies how big the groups should be when computing averages, i.e. a size of 500 averages the reward for display over 500 episodes.
* Extra arguments specified like "action_rate=0.1" will be passed to the algorithm
"""
assert num_episodes % bucket_size == 0, "num_episodes must be a multiple of bucket_size"
num_buckets = num_episodes // bucket_size
rewards = np.zeros(num_buckets)
lengths = np.zeros(num_buckets)
for i in tqdm(range(num_episodes)):
random.seed(i) # set the seed to the episode id to ensure reproducibility
episode_reward, episode_length = algorithm(env, params, **kwargs)
rewards[i // bucket_size] += episode_reward / bucket_size
lengths[i // bucket_size] += episode_length / bucket_size
plot_points = (np.arange(num_buckets) + 1) * bucket_size
display(Markdown("#### %s (sigma=%.2f, npoints=%d^2)" % (algorithm.__name__, params.stdev, params.npoints)))
plt.subplot(221)
plt.plot(plot_points, rewards)
plt.title("Fraction of episodes where the goal is reached before %d iterations" % 300) # Max iterations is hardcoded to 300 by default
plt.xlabel("Episode")
plt.ylabel("Fraction")
plt.subplot(222)
plt.plot(plot_points, lengths)
plt.title("Episode length over time")
plt.xlabel("Episode")
plt.ylabel("Episode length")
plt.subplot(223)
render_actions(params)
if plot_values:
plt.subplot(224)
render_values(params)
plt.subplots_adjust(wspace=0.2, hspace=0.3)
#plt.tight_layout()
plt.gcf().set_size_inches(12, 8)
plt.show()
First, implement the REINFORCE algorithm:
def reinforce(env: UMaze, params: ModelParams, action_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
states = [env._state]
actions = []
rewards = []
iteration = 0
while iteration < max_iter and not env.end:
# Compute action
action = params.pick_action(states[-1])
actions.append(action)
# Do action
state, reward = env.do_action(action)
states.append(state)
rewards.append(reward)
iteration += 1
rewards = np.array(rewards)
for t in range(len(actions)):
G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
params.action_weights += action_rate * params.gamma**t * G * params.compute_action_log_grad(states[t], actions[t])
return sum(rewards), iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:24<00:00, 26.03it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:31<00:00, 36.87it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:10<00:00, 76.75it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:34<00:00, 106.38it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:03<00:00, 81.14it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:26<00:00, 25.90it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:56<00:00, 56.74it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:37<00:00, 102.77it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.86it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:50<00:00, 43.40it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:25<00:00, 22.46it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:17<00:00, 26.51it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:07<00:00, 78.54it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:39<00:00, 100.46it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:44<00:00, 95.34it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:04<00:00, 32.80it/s]
Now, copy/paste and modify your reinforce function above to subtract a baseline.
def reinforce_with_baseline(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
states = [env._state]
actions = []
rewards = []
iteration = 0
while iteration < max_iter and not env.end:
# Compute action
action = params.pick_action(states[-1])
actions.append(action)
# Do action
state, reward = env.do_action(action)
states.append(state)
rewards.append(reward)
iteration += 1
rewards = np.array(rewards)
for t in range(len(actions)):
G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
delta = G - params.compute_value(states[t])
params.value_weights += value_rate * delta * params.compute_value_grad(states[t])
params.action_weights += action_rate * params.gamma**t * delta * params.compute_action_log_grad(states[t], actions[t])
return sum(rewards), iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:44<00:00, 28.99it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:20<00:00, 38.32it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:52<00:00, 58.12it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:23<00:00, 69.76it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:39<00:00, 62.72it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:55<00:00, 28.12it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:29<00:00, 47.78it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:30<00:00, 66.42it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.56it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 51.98it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:39<00:00, 25.04it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:52<00:00, 28.36it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 52.03it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:35<00:00, 64.22it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:13<00:00, 75.14it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:10<00:00, 32.23it/s]
Now, implement the one-step actor-critic algorithm.
def one_step_actor_critic(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
total_reward = 0
iteration = 0
I = 1
while iteration < max_iter and not env.end:
old_state = env._state
# Compute action
action = params.pick_action(old_state)
# Do action
state, reward = env.do_action(action)
total_reward += reward
# Update
new_value = 0 if env.end else params.compute_value(state)
old_value = params.compute_value(old_state)
delta = reward + params.gamma * new_value - old_value
params.value_weights += value_rate * delta * params.compute_value_grad(old_state)
params.action_weights += action_rate * I * delta * params.compute_action_log_grad(old_state, action)
I *= params.gamma
iteration += 1
return total_reward, iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:42<00:00, 29.21it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:41<00:00, 35.51it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.42it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:25<00:00, 117.36it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.34it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:03<00:00, 27.52it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:48<00:00, 59.29it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:25<00:00, 116.44it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:08<00:00, 145.31it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:16<00:00, 130.07it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:46<00:00, 24.58it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:22<00:00, 26.14it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:58<00:00, 84.47it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:14<00:00, 134.24it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:03<00:00, 156.79it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:41<00:00, 98.09it/s]
Finally, implement the actor-critic algorithm with eligibility traces. Copy/paste and modify your one_step_actor_critic function above to use eligibility traces.
def actor_critic_with_eligibility_traces(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300, trace_decay=0.99) -> Tuple[float, int]:
"""
Run an episode and mutate the params in-place.
:return: tuple containing the (total) reward for the episode, and the number of steps
"""
env.reset()
# Pick random starting position
env._state = find_starting_pos(env)
total_reward = 0
iteration = 0
I = 1
    # eligibility traces: z_value for the value weights (theta), z_action for the action weights (w)
    z_value = np.zeros_like(params.value_weights)
    z_action = np.zeros_like(params.action_weights)
while iteration < max_iter and not env.end:
old_state = env._state
# Compute action
action = params.pick_action(old_state)
# Do action
state, reward = env.do_action(action)
total_reward += reward
# Update
new_value = 0 if env.end else params.compute_value(state)
old_value = params.compute_value(old_state)
delta = reward + params.gamma * new_value - old_value
        z_value = params.gamma * trace_decay * z_value + params.compute_value_grad(old_state)
        z_action = params.gamma * trace_decay * z_action + I * params.compute_action_log_grad(old_state, action)
        params.value_weights += value_rate * delta * z_value
        params.action_weights += action_rate * delta * z_action
I *= params.gamma
iteration += 1
return total_reward, iteration
Now, you can perform a few experiments!
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:02<00:00, 27.60it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:14<00:00, 39.26it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:25<00:00, 68.72it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:55<00:00, 86.45it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:01<00:00, 27.65it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:14<00:00, 26.74it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:10<00:00, 52.60it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:00<00:00, 82.95it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:53<00:00, 28.30it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:22<00:00, 26.17it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:00<00:00, 23.80it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:00<00:00, 27.77it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:37<00:00, 63.51it/s]
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:11<00:00, 23.19it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:27<00:00, 25.81it/s]
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:10<00:00, 27.02it/s]
Make sure you try running with various npoints and stdev values. Good values for npoints are $9$, $13$ and $19$.
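If you prefer, these sweeps can be written as a loop rather than repeating the call by hand (a sketch equivalent to the individual calls above; adjust the lists and the algorithm as needed):
for npoints in (9, 13, 19):
    for stdev in (0.25, 0.50, 0.75, 1.00, 1.50):
        run_algorithm(UMaze(), ModelParams(stdev=stdev, npoints=npoints), reinforce, plot_values=False, num_episodes=10000)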
ANSWER: REINFORCE seems to learn faster here, but produces "less pretty" action plots. It is also faster in wall-clock time, since it does not need to update the value weights, and because it learns faster its episodes tend to become shorter earlier.
Algorithms that learn a value function (e.g., REINFORCE with baseline) are expected to be more stable than plain REINFORCE, and should tolerate a higher action learning rate. Experiment with the action learning rate and see what happens.
You can adjust the learning rates by passing extra arguments to run_algorithm, for example:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000, action_rate=<insert action rate>)
Can you explain this?
# TO COMPLETE: perform some experiments
run_algorithm(UMaze(), ModelParams(stdev=0.5, npoints=13), reinforce, plot_values=False, num_episodes=10000, action_rate=0.5)
run_algorithm(UMaze(), ModelParams(stdev=0.5, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000, action_rate=0.5)
ANSWER: When we increase the action learning rate to $0.5$, we find that REINFORCE with baseline learns quickly (faster than REINFORCE) and remains stable. In general, when the action learning rate is tuned separately for REINFORCE and REINFORCE with baseline, REINFORCE with baseline should learn faster.
ANSWER: When stdev is small, eligibility traces speed up learning; when stdev is larger, they do not make much of a difference.
ANSWER: Actor-critic seems to learn faster than REINFORCE with baseline.
For each value of npoints, try running with various standard deviations.
What happens when stdev is too small?
Can you explain this?
ANSWER: The algorithm takes longer to learn, because when stdev is small the weights of each grid point need to be learned almost independently. In the extreme case where stdev is very small, positions outside the support of the basis functions always have an action and state value of $0$: the representation is poor, hence the learning is poor.
What happens when stdev is too big?
Can you explain this?
ANSWER: We are unable to capture local variations because each basis function affects too many points. The learning becomes unstable.
Does the optimal stdev depend on npoints?
Can you explain this?
ANSWER: Yes, as npoints increases the spacing between the basis function centers decreases, hence the optimal stdev decreases too.