Environment 2¶

Policy Gradient algorithms¶

Introduction¶

In this notebook, the goal is to implement four policy gradient algorithms: the regular policy gradient (REINFORCE), the policy gradient with a subtracted baseline (REINFORCE with baseline), the one-step actor-critic method, and the actor-critic method with eligibility traces (advantage actor-critic). Our agent will play a game in a continuous state space, for which we will use a basis function representation to keep the number of parameters finite.

Let's start with a few imports:

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from environments.environment2 import *

import matplotlib
import matplotlib.cm
import matplotlib.colors
import numpy as np
import random
from tqdm import tqdm
from scipy.special import *

# reasonable plot size
plt.rcParams['figure.dpi'] = 100

The environment¶

The environment is a 2D grid, with a goal in the center, and a "U"-shaped barrier. Our goal is to learn how to reach the objective in the center regardless of where we start. In particular, we will study how the agent learns to go around the barrier when starting at the bottom of the grid.

  • The grid is 10x10, going from $(-5, -5)$ to $(5, 5)$.
  • The objective is a circle of radius $0.5$ centered at $(0, 0)$, that must be reached for the agent to win.
  • The barrier is a "U"-shaped wall of width $1$ centered at $(0, 0)$.
  • The agent is not allowed to exit the grid, or enter the wall.

Here is what the grid looks like: (the objective is in green, the walls are in red).

In [3]:
env = UMaze()
env.render()

Action and reward modeling¶

We assume that the state space is continuous, i.e. that the agent can be at any allowed position in the grid. At each position, we can move a fixed distance $0.2$ horizontally, vertically, or diagonally. Hence, there are $8$ possible actions:

  • Action $0$: move by $0.2 \cdot (0, 1)$.
  • Action $1$: move by $0.2 \cdot (\sqrt{2}/2, \sqrt{2}/2)$.
  • Action $2$: move by $0.2 \cdot (1, 0)$.
  • Action $3$: move by $0.2 \cdot (\sqrt{2}/2, -\sqrt{2}/2)$.
  • Action $4$: move by $0.2 \cdot (0, -1)$.
  • Action $5$: move by $0.2 \cdot (-\sqrt{2}/2, -\sqrt{2}/2)$.
  • Action $6$: move by $0.2 \cdot (-1, 0)$.
  • Action $7$: move by $0.2 \cdot (-\sqrt{2}/2, \sqrt{2}/2)$.

We grant a reward of $1$ when the agent reaches the objective. Actions that would cause the agent to hit a wall are ignored.

Note: the agent is not penalized when it makes a wrong move, otherwise it might become scared of walls and get stuck below the U.
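For reference, the plotting helpers further below use a MOVES table exported by the environment module. Spelled out from the action list above, it would look like the following sketch (unit direction vectors, with the $0.2$ step size applied separately; the name MOVE_DIRECTIONS is only illustrative, not the environment's actual constant):

# The 8 move directions listed above, as unit vectors; each actual step
# scales these by the 0.2 step size. The environment exports its own MOVES
# table, so this sketch is for reference only.
SQ2 = np.sqrt(2) / 2
MOVE_DIRECTIONS = np.array([
    ( 0.0,  1.0),   # action 0: up
    ( SQ2,  SQ2),   # action 1: up-right
    ( 1.0,  0.0),   # action 2: right
    ( SQ2, -SQ2),   # action 3: down-right
    ( 0.0, -1.0),   # action 4: down
    (-SQ2, -SQ2),   # action 5: down-left
    (-1.0,  0.0),   # action 6: left
    (-SQ2,  SQ2),   # action 7: up-left
])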

Policy and value functions¶

To model our policy function for the continuous state space, we first need a way to assign a value to each state-action pair. For that, we will use a grid of $N$ Gaussian basis functions with a fixed standard deviation $\sigma$. We will study how these two parameters influence the speed and quality of learning.

Let's define the action value function $q(\vec s, a)$ as follows: $$q(\vec s, a) = \sum_{i=1}^N w_{i,a} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$$ where $\vec x_i$ denotes the position of the $i$-th basis function. For our policy function $\pi$, we will use a softmax policy: $$\pi(\vec s, a) = \frac{\exp {q(\vec s, a)}}{\sum_{b=0}^7 \exp {q(\vec s, b)}}.$$ Our goal is to learn the $N \times 8$ weights $w_{i, a}$.

For some algorithms, we will also need to learn a state value function $v(\vec s)$. Once again, we will use basis functions: $$v(\vec s) = \sum_{i=1}^N \theta_{i} \exp \left(- \frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right).$$ For the value function, there are $N$ parameters to learn: the weights $\theta_i$.
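For reference, the gradients needed by the algorithms below follow directly from these definitions. Writing $\phi_i(\vec s) = \exp\left(-\frac{||\vec s - \vec x_i||^2}{2\sigma^2}\right)$ for the activation of the $i$-th basis function, we have $$\frac{\partial \ln \pi(\vec s, a)}{\partial w_{i,b}} = \phi_i(\vec s)\left(\mathbf{1}[b = a] - \pi(\vec s, b)\right) \qquad \text{and} \qquad \frac{\partial v(\vec s)}{\partial \theta_i} = \phi_i(\vec s).$$ These are exactly the quantities computed by compute_action_log_grad and compute_value_grad in the ModelParams class below.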

Policy gradient (REINFORCE)¶

The standard policy gradient only uses the policy function and the gradient of the log-policy $\nabla_{w} \ln \pi(\vec s, a)$. Here is the pseudocode:

(Pseudocode figure: REINFORCE)
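In update form, this amounts to looping over the steps of a finished episode and applying, for each step $t$ with discounted return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ (this is the update performed by the reinforce implementation further below): $$w \leftarrow w + \alpha_w \, \gamma^t \, G_t \, \nabla_w \ln \pi(\vec s_t, a_t).$$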

Policy gradient with baseline subtracted (REINFORCE with baseline)¶

We can subtract a baseline in the policy gradient algorithm. As seen in the lecture, that should improve the speed of learning. We will use the value function as the baseline, and we will also need its gradient to update it.

(Pseudocode figure: REINFORCE with baseline)
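In update form, the only change is that the return is replaced by its difference with the baseline, and the baseline itself is updated towards the return (matching the reinforce_with_baseline implementation further below): $$\delta_t = G_t - v(\vec s_t), \qquad \theta \leftarrow \theta + \alpha_v \, \delta_t \, \nabla_\theta v(\vec s_t), \qquad w \leftarrow w + \alpha_w \, \gamma^t \, \delta_t \, \nabla_w \ln \pi(\vec s_t, a_t).$$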

One-step actor-critic¶

In the previous two algorithms, we can only perform the update at the end of the episode. The disadvantages are that we need to wait longer before we can learn, and we also use more memory to store all the past states, actions and rewards. With the one-step actor-critic algorithm, the updates are instead performed in real time.

(Pseudocode figure: one-step actor-critic)
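In update form, at every step the episode return is replaced by the one-step bootstrapped TD error (matching the one_step_actor_critic implementation further below): $$\delta_t = r_{t+1} + \gamma \, v(\vec s_{t+1}) - v(\vec s_t),$$ with $v(\vec s_{t+1}) = 0$ at a terminal state, and then $$\theta \leftarrow \theta + \alpha_v \, \delta_t \, \nabla_\theta v(\vec s_t), \qquad w \leftarrow w + \alpha_w \, I \, \delta_t \, \nabla_w \ln \pi(\vec s_t, a_t), \qquad I \leftarrow \gamma I,$$ where $I$ starts at $1$ and accumulates the discount $\gamma^t$.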

Actor-critic with eligibility traces¶

We also look at the eligibility trace version of the actor-critic algorithm, also known as Advantage Actor-Critic.

(Pseudocode figure: actor-critic with eligibility traces)
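In update form, each parameter vector now has an eligibility trace that accumulates its gradients, and the TD error $\delta_t$ from the one-step case is applied to the traces instead of to the instantaneous gradients (matching the actor_critic_with_eligibility_traces implementation further below, where $\lambda$ is the trace_decay argument): $$z_\theta \leftarrow \gamma \lambda \, z_\theta + \nabla_\theta v(\vec s_t), \qquad z_w \leftarrow \gamma \lambda \, z_w + I \, \nabla_w \ln \pi(\vec s_t, a_t),$$ $$\theta \leftarrow \theta + \alpha_v \, \delta_t \, z_\theta, \qquad w \leftarrow w + \alpha_w \, \delta_t \, z_w.$$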

Implementation¶

Your goal is to implement the algorithms explained above, and observe the following:

  • How the number of basis functions influences the speed of learning.
  • How the standard deviation of the basis functions influences the speed of learning.
  • How the choice of algorithm influences the speed of learning.

Here are a few important notes:

  • In each episode, we pick the starting position at random.
  • The discount factor $\gamma$ needs to be very close to $1$ due to the high number of steps per episode. $0.99$ is a good choice.
  • To avoid wasting too much time in the early episodes, we enforce a hard limit on the number of steps per episode. $300$ is a good limit.

Core model functions¶

Here are a few core functions that we will use in all of the algorithms. The ModelParams class is used to store the discount factor $\gamma$, the count and shape of the basis functions, and the weights $w_{i,a}$ and $\theta_i$. Please complete the code as necessary.

In [4]:
class ModelParams:
    def __init__(self, npoints=13, gamma=0.99, stdev=0.5):
        self.action_weights = np.zeros((npoints * npoints, 8))
        self.value_weights = np.zeros(npoints * npoints)
        # need to be VERY careful here given the large number of iterations
        # if gamma is too small we'll have a hard time updating the policy correctly!
        self.gamma = gamma
        self.stdev = stdev

        self.npoints = npoints
        positions = np.linspace(-6, 6, npoints)  
        x_pos, y_pos = np.meshgrid(positions, positions)
        self.x_pos = x_pos.flatten()
        self.y_pos = y_pos.flatten()

        self.w_cache = None
        self.w_cache_pos = None

    def get_interpolation_weights(self, position: State) -> np.ndarray:
        """
        This returns an array of size N, weight for each i: exp(-||s-x_i||^2/(2\sigma^2))
        position: a list of two values: first for x pos; second for y pos
        """
        # We do a bit of caching to not compute this every time
        if self.w_cache_pos == position:
            return self.w_cache

        w = np.exp(- ((self.x_pos - position[0]) ** 2 + (self.y_pos - position[1]) ** 2) / (2 * self.stdev ** 2))
        self.w_cache = w
        self.w_cache_pos = position
        return w

    def compute_value(self, position: State) -> float:
        """
        This computes the v(s) function.
        """
        w = self.get_interpolation_weights(position)
        v = np.sum(w*self.value_weights)
        return v
    
    def get_action_values(self, position: State) -> np.ndarray:
        """
        This computes [q(s,a) for all a] for a specific s.
        """
        w = self.get_interpolation_weights(position)
        q =  np.sum(np.expand_dims(w,1)*self.action_weights, axis=0)
        return q
    
    def get_action_probabilities(self, position: State) -> np.ndarray:
        """
        This computes the [\pi(s, a) for all a].
        """
        q = self.get_action_values(position)
        pi = softmax(q)
        return pi
    
    def get_action_probability(self, position: State, action: Action) -> float:
        """
        This computes \pi(s, a).
        action: an int of [0,7]
        """
        return self.get_action_probabilities(position)[action]

    def pick_action(self, position: State) -> Action:
        """
        This randomly chooses an action following the policy.
        """
        probs = self.get_action_probabilities(position)
        # np.random.choice is a bit slow, so we code it by hand
        r = random.random()
        for i, p in enumerate(probs):
            r -= p
            if r < 0:
                return i
        # guard against floating-point rounding leaving r slightly >= 0 at the end
        return len(probs) - 1

    def compute_action_log_grad(self, position: State, action: Action) -> np.ndarray:
        """
        This computes \nabla_w \ln(\pi(s,a))
        """
        pi = self.get_action_probabilities(position)
        w = self.get_interpolation_weights(position)
        # grad[i, b] = w_i * (1[b == action] - pi(s, b))
        grad = - np.outer(w, pi)
        grad[:,action] += w

        return grad
        

    def compute_value_grad(self, position: State) -> np.ndarray:
        """
        This computes \nabla_\theta v(s).
        """
        w = self.get_interpolation_weights(position)
        return w
        

def find_starting_pos(env: UMaze) -> State:
    """
    This picks a random starting position inside the bounds of the environment.
    """
    pos_candidate = None
    while pos_candidate is None:
        pos_candidate = (random.uniform(-5, 5), random.uniform(-5, 5))
        # the two below are easier, and can be used for debugging
#         pos_candidate = (0, 4)
#         pos_candidate = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
        if env.is_outside_of_area(*pos_candidate) or env.is_in_objective(*pos_candidate):
            pos_candidate = None
    return pos_candidate
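As a quick sanity check (optional; assuming the cell above has been run), a freshly created ModelParams should give a uniform policy and a zero value everywhere, since all weights start at zero:

params_check = ModelParams()
probs = params_check.get_action_probabilities((0.0, 4.0))
assert np.isclose(probs.sum(), 1.0)                   # the softmax policy is normalised
assert np.allclose(probs, 1 / 8)                      # uniform over the 8 actions before learning
assert params_check.compute_value((0.0, 4.0)) == 0.0  # value weights start at zero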

Plots and helpers¶

Here are a few functions to provide nice visualization for the learned value and policy functions.

In [5]:
def render_values(params: ModelParams):
    """
    Visualises the v(s) function for all possible states.
    """
    xs = np.linspace(-6, 6, 101)
    ys = np.linspace(-6, 6, 101)
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    values = np.zeros_like(X)
    for i in range(101):
        for j in range(101):
            values[i, j] = params.compute_value((xs[i], ys[j]))
    #plt.imshow(values, extent=[-6, 6, -6, 6])
    cs = plt.contourf(X, Y, values, levels=10, cmap="jet")
    plt.xlabel("x")
    plt.ylabel("y")
    cbar = plt.colorbar(cs)
    cbar.set_label("Value")
    plt.title("Computed values")

def render_actions(params: ModelParams):
    """
    This visualises the most probable action in each state using a colored arrow. 
    Color of the arrow indicates probability.
    Only shows a few discrete points.
    """
    colormap = matplotlib.cm.get_cmap("jet")

    xs = np.linspace(-5, 5, 21)
    ys = np.linspace(-5, 5, 21)
    for x in xs:
        for y in ys:
            action_probabilities = params.get_action_probabilities((x, y))
            action = np.argmax(action_probabilities)
            max_probability = action_probabilities[action]
            plt.plot(x, y, "o", color="black", markersize=0.1)
            plt.arrow(x, y, 0.3 * MOVES[action][0], 0.3 * MOVES[action][1], head_width=0.1, head_length=0.1, color=colormap(max_probability), linewidth=1)

    plt.xlabel("x")
    plt.ylabel("y")
    cbar = plt.colorbar(matplotlib.cm.ScalarMappable(norm=matplotlib.colors.Normalize(vmin=0, vmax=1), cmap=colormap))
    cbar.set_label("Probability")
    plt.title("Preferred actions")

Here is a function that you can use to run a learning algorithm, and display a nice visualization of the results.

In [6]:
from IPython.display import Markdown

def run_algorithm(env: UMaze, params: ModelParams, algorithm, plot_values, num_episodes: int, bucket_size: int = 500, **kwargs):
    """
    This function runs an algorithm for a specified number of episodes.
    Its arguments are:
    * env(UMaze): Instance of the environment
    * params(ModelParam): Parameters given to the model.
    * algorithm(function that accepts env and params as arguments and returns a Tuple[float, int]): Your algorithm.
    * plot_values(bool): If True, it will plot the values using one of the helper functions above.
    * num_episodes(int): Number of episodes to play.
    * bucket_size(int): Specifies how big groups should be when computing averages, i.e. a size of 500 averages the reward for display over 500 episodes.
    * Extra arguments specified like "action_rate=0.1" will be passed to the algorithm
    """
    assert num_episodes % bucket_size == 0, "num_episodes must be a multiple of bucket_size"

    num_buckets = num_episodes // bucket_size
    rewards = np.zeros(num_buckets)
    lengths = np.zeros(num_buckets)
    for i in tqdm(range(num_episodes)):
        random.seed(i) # set the seed to the episode id to ensure reproducibility
        episode_reward, episode_length = algorithm(env, params, **kwargs)
        rewards[i // bucket_size] += episode_reward / bucket_size
        lengths[i // bucket_size] += episode_length / bucket_size
    plot_points = (np.arange(num_buckets) + 1) * bucket_size

    display(Markdown("#### %s (sigma=%.2f, npoints=%d^2)" % (algorithm.__name__, params.stdev, params.npoints)))
    plt.subplot(221)
    plt.plot(plot_points, rewards)
    plt.title("Fraction of episodes where the goal is reached before %d iterations" % 300) # Max iterations is hardcoded to 300 by default
    plt.xlabel("Episode")
    plt.ylabel("Fraction")

    plt.subplot(222)
    plt.plot(plot_points, lengths)
    plt.title("Episode length over time")
    plt.xlabel("Episode")
    plt.ylabel("Episode length")

    plt.subplot(223)
    render_actions(params)

    if plot_values:
        plt.subplot(224)
        render_values(params)

    plt.subplots_adjust(wspace=0.2, hspace=0.3)
    #plt.tight_layout()
    plt.gcf().set_size_inches(12, 8)
    plt.show()

REINFORCE¶

First, implement the REINFORCE algorithm:

In [7]:
def reinforce(env: UMaze, params: ModelParams, action_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    states = [env._state]
    actions = []
    rewards = []
    iteration = 0
    while iteration < max_iter and not env.end:
        # Compute action
        action = params.pick_action(states[-1])
        actions.append(action)
        # Do action
        state, reward = env.do_action(action)
        states.append(state)
        rewards.append(reward)
        iteration += 1
    
    rewards = np.array(rewards)
    for t in range(len(actions)):
        G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
        params.action_weights += action_rate * params.gamma**t * G * params.compute_action_log_grad(states[t], actions[t])
    
    return sum(rewards), iteration

Now, you can perform a few experiments!

In [8]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:24<00:00, 26.03it/s]

reinforce (sigma=0.25, npoints=9^2)¶

In [9]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:31<00:00, 36.87it/s]

reinforce (sigma=0.50, npoints=9^2)¶

In [10]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:10<00:00, 76.75it/s]

reinforce (sigma=0.75, npoints=9^2)¶

In [11]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:34<00:00, 106.38it/s]

reinforce (sigma=1.00, npoints=9^2)¶

In [12]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:03<00:00, 81.14it/s]

reinforce (sigma=1.50, npoints=9^2)¶

In [13]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:26<00:00, 25.90it/s]

reinforce (sigma=0.25, npoints=13^2)¶

In [14]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:56<00:00, 56.74it/s]

reinforce (sigma=0.50, npoints=13^2)¶

In [15]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:37<00:00, 102.77it/s]

reinforce (sigma=0.75, npoints=13^2)¶

In [16]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.86it/s]

reinforce (sigma=1.00, npoints=13^2)¶

In [17]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:50<00:00, 43.40it/s]

reinforce (sigma=1.50, npoints=13^2)¶

In [18]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:25<00:00, 22.46it/s]

reinforce (sigma=0.15, npoints=19^2)¶

In [19]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:17<00:00, 26.51it/s]

reinforce (sigma=0.25, npoints=19^2)¶

In [20]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:07<00:00, 78.54it/s]

reinforce (sigma=0.50, npoints=19^2)¶

In [21]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:39<00:00, 100.46it/s]

reinforce (sigma=0.75, npoints=19^2)¶

In [22]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:44<00:00, 95.34it/s]

reinforce (sigma=1.00, npoints=19^2)¶

In [23]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce, plot_values=False, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:04<00:00, 32.80it/s]

reinforce (sigma=1.50, npoints=19^2)¶

REINFORCE with baseline¶

Now, copy/paste and modify your reinforce function above to subtract a baseline.

In [26]:
def reinforce_with_baseline(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    states = [env._state]
    actions = []
    rewards = []
    iteration = 0
    while iteration < max_iter and not env.end:
        # Compute action
        action = params.pick_action(states[-1])
        actions.append(action)
        # Do action
        state, reward = env.do_action(action)
        states.append(state)
        rewards.append(reward)
        iteration += 1
    
    rewards = np.array(rewards)
    for t in range(len(actions)):
        G = np.sum(np.power(params.gamma, np.arange(t, len(actions))-t)*rewards[t:])
        delta = G - params.compute_value(states[t])
        params.value_weights += value_rate * delta * params.compute_value_grad(states[t])
        params.action_weights += action_rate * params.gamma**t * delta * params.compute_action_log_grad(states[t], actions[t])
    
    return sum(rewards), iteration
    

Now, you can perform a few experiments!

In [28]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:44<00:00, 28.99it/s]

reinforce_with_baseline (sigma=0.25, npoints=9^2)¶

In [29]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:20<00:00, 38.32it/s]

reinforce_with_baseline (sigma=0.50, npoints=9^2)¶

In [30]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:52<00:00, 58.12it/s]

reinforce_with_baseline (sigma=0.75, npoints=9^2)¶

In [31]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:23<00:00, 69.76it/s]

reinforce_with_baseline (sigma=1.00, npoints=9^2)¶

In [32]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:39<00:00, 62.72it/s]

reinforce_with_baseline (sigma=1.50, npoints=9^2)¶

In [33]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:55<00:00, 28.12it/s]

reinforce_with_baseline (sigma=0.25, npoints=13^2)¶

In [34]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:29<00:00, 47.78it/s]

reinforce_with_baseline (sigma=0.50, npoints=13^2)¶

In [35]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:30<00:00, 66.42it/s]

reinforce_with_baseline (sigma=0.75, npoints=13^2)¶

In [36]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.56it/s]

reinforce_with_baseline (sigma=1.00, npoints=13^2)¶

In [37]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 51.98it/s]

reinforce_with_baseline (sigma=1.50, npoints=13^2)¶

In [38]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:39<00:00, 25.04it/s]

reinforce_with_baseline (sigma=0.15, npoints=19^2)¶

In [39]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:52<00:00, 28.36it/s]

reinforce_with_baseline (sigma=0.25, npoints=19^2)¶

In [40]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:12<00:00, 52.03it/s]

reinforce_with_baseline (sigma=0.50, npoints=19^2)¶

In [41]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:35<00:00, 64.22it/s]

reinforce_with_baseline (sigma=0.75, npoints=19^2)¶

In [42]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:13<00:00, 75.14it/s]

reinforce_with_baseline (sigma=1.00, npoints=19^2)¶

In [43]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), reinforce_with_baseline, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:10<00:00, 32.23it/s]

reinforce_with_baseline (sigma=1.50, npoints=19^2)¶

One-step actor-critic¶

Now, implement the one-step actor-critic algorithm.

In [44]:
def one_step_actor_critic(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    total_reward = 0
    iteration = 0
    I = 1
    while iteration < max_iter and not env.end:
        old_state = env._state
        # Compute action
        action = params.pick_action(old_state)
        # Do action
        state, reward = env.do_action(action)
        total_reward += reward
        # Update
        new_value = 0 if env.end else params.compute_value(state)
        old_value = params.compute_value(old_state)

        delta = reward + params.gamma * new_value - old_value
        params.value_weights += value_rate * delta * params.compute_value_grad(old_state)
        params.action_weights += action_rate * I * delta * params.compute_action_log_grad(old_state, action)
        
        I *= params.gamma
        iteration += 1

    return total_reward, iteration

Now, you can perform a few experiments!

In [45]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:42<00:00, 29.21it/s]

one_step_actor_critic (sigma=0.25, npoints=9^2)¶

In [46]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:41<00:00, 35.51it/s]

one_step_actor_critic (sigma=0.50, npoints=9^2)¶

In [47]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:04<00:00, 80.42it/s]

one_step_actor_critic (sigma=0.75, npoints=9^2)¶

In [48]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:25<00:00, 117.36it/s]

one_step_actor_critic (sigma=1.00, npoints=9^2)¶

In [49]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:30<00:00, 110.34it/s]

one_step_actor_critic (sigma=1.50, npoints=9^2)¶

In [50]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:03<00:00, 27.52it/s]

one_step_actor_critic (sigma=0.25, npoints=13^2)¶

In [51]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:48<00:00, 59.29it/s]

one_step_actor_critic (sigma=0.50, npoints=13^2)¶

In [52]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:25<00:00, 116.44it/s]

one_step_actor_critic (sigma=0.75, npoints=13^2)¶

In [53]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:08<00:00, 145.31it/s]

one_step_actor_critic (sigma=1.00, npoints=13^2)¶

In [54]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:16<00:00, 130.07it/s]

one_step_actor_critic (sigma=1.50, npoints=13^2)¶

In [55]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:46<00:00, 24.58it/s]

one_step_actor_critic (sigma=0.15, npoints=19^2)¶

In [56]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:22<00:00, 26.14it/s]

one_step_actor_critic (sigma=0.25, npoints=19^2)¶

In [57]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:58<00:00, 84.47it/s]

one_step_actor_critic (sigma=0.50, npoints=19^2)¶

In [58]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:14<00:00, 134.24it/s]

one_step_actor_critic (sigma=0.75, npoints=19^2)¶

In [59]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|████████████████████████████████████| 10000/10000 [01:03<00:00, 156.79it/s]

one_step_actor_critic (sigma=1.00, npoints=19^2)¶

In [60]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), one_step_actor_critic, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:41<00:00, 98.09it/s]

one_step_actor_critic (sigma=1.50, npoints=19^2)¶

Actor-critic with eligibility traces¶

Finally, implement the actor-critic algorithm with eligibility traces. Copy/paste and modify your one_step_actor_critic function above to use eligibility traces.

In [61]:
def actor_critic_with_eligibility_traces(env: UMaze, params: ModelParams, action_rate = 5e-2, value_rate = 5e-2, max_iter = 300, trace_decay=0.99) -> Tuple[float, int]:
    """
    Run an episode and mutate the params in-place.
    :return: tuple containing the (total) reward for the episode, and the number of steps
    """
    env.reset()
    # Pick random starting position
    env._state = find_starting_pos(env)
    total_reward = 0
    iteration = 0
    I = 1
    # Eligibility traces: z_theta for the value weights theta, z_w for the action weights w
    z_theta = np.zeros_like(params.value_weights)
    z_w = np.zeros_like(params.action_weights)
    while iteration < max_iter and not env.end:
        old_state = env._state
        # Compute action
        action = params.pick_action(old_state)
        # Do action
        state, reward = env.do_action(action)
        total_reward += reward
        # Update
        new_value = 0 if env.end else params.compute_value(state)
        old_value = params.compute_value(old_state)
        delta = reward + params.gamma * new_value - old_value
        z_theta = params.gamma * trace_decay * z_theta + params.compute_value_grad(old_state)
        z_w = params.gamma * trace_decay * z_w + I * params.compute_action_log_grad(old_state, action)
        params.value_weights += value_rate * delta * z_theta
        params.action_weights += action_rate * delta * z_w

        I *= params.gamma
        iteration += 1

    return total_reward, iteration  

Now, you can perform a few experiments!

In [62]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:02<00:00, 27.60it/s]

actor_critic_with_eligibility_traces (sigma=0.25, npoints=9^2)¶

In [63]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [04:14<00:00, 39.26it/s]

actor_critic_with_eligibility_traces (sigma=0.50, npoints=9^2)¶

In [64]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:25<00:00, 68.72it/s]

actor_critic_with_eligibility_traces (sigma=0.75, npoints=9^2)¶

In [65]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [01:55<00:00, 86.45it/s]

actor_critic_with_eligibility_traces (sigma=1.00, npoints=9^2)¶

In [66]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=9), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:01<00:00, 27.65it/s]

actor_critic_with_eligibility_traces (sigma=1.50, npoints=9^2)¶

In [67]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:14<00:00, 26.74it/s]

actor_critic_with_eligibility_traces (sigma=0.25, npoints=13^2)¶

In [68]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [03:10<00:00, 52.60it/s]

actor_critic_with_eligibility_traces (sigma=0.50, npoints=13^2)¶

In [69]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:00<00:00, 82.95it/s]

actor_critic_with_eligibility_traces (sigma=0.75, npoints=13^2)¶

In [70]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [05:53<00:00, 28.30it/s]

actor_critic_with_eligibility_traces (sigma=1.00, npoints=13^2)¶

In [71]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=13), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:22<00:00, 26.17it/s]

actor_critic_with_eligibility_traces (sigma=1.50, npoints=13^2)¶

In [72]:
run_algorithm(UMaze(), ModelParams(stdev=0.15, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:00<00:00, 23.80it/s]

actor_critic_with_eligibility_traces (sigma=0.15, npoints=19^2)¶

In [73]:
run_algorithm(UMaze(), ModelParams(stdev=0.25, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:00<00:00, 27.77it/s]

actor_critic_with_eligibility_traces (sigma=0.25, npoints=19^2)¶

In [74]:
run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [02:37<00:00, 63.51it/s]

actor_critic_with_eligibility_traces (sigma=0.50, npoints=19^2)¶

In [75]:
run_algorithm(UMaze(), ModelParams(stdev=0.75, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [07:11<00:00, 23.19it/s]

actor_critic_with_eligibility_traces (sigma=0.75, npoints=19^2)¶

In [76]:
run_algorithm(UMaze(), ModelParams(stdev=1.00, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:27<00:00, 25.81it/s]

actor_critic_with_eligibility_traces (sigma=1.00, npoints=19^2)¶

In [77]:
run_algorithm(UMaze(), ModelParams(stdev=1.50, npoints=19), actor_critic_with_eligibility_traces, plot_values=True, num_episodes=10000)
100%|█████████████████████████████████████| 10000/10000 [06:10<00:00, 27.02it/s]

actor_critic_with_eligibility_traces (sigma=1.50, npoints=19^2)¶

Questions¶

Make sure you try running with various npoints and stdev values. Good values for npoints are $9$, $13$ and $19$.

Compare the various algorithms¶

So far, we left the learning rates at their default value of 5e-2. How do REINFORCE and REINFORCE with baseline compare?¶

ANSWER: REINFORCE seems to learn faster, but it produces noisier ("less pretty") action plots. It also runs faster, both because it does not need to update the value weights and because, learning earlier, its episodes tend to become shorter sooner.

Algorithms with a value function (e.g., REINFORCE with baseline) are expected to be more stable than REINFORCE, and should allow a higher action learning rate. Experiment with the action learning rate and see what happens.

You can adjust the learning rates by passing extra arguments to run_algorithm, for example:

run_algorithm(UMaze(), ModelParams(stdev=0.50, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000, action_rate=<insert action rate>)

What happens when you increase the action learning rate?¶

Can you explain this?

In [27]:
# TO COMPLETE: perform some experiments
run_algorithm(UMaze(), ModelParams(stdev=0.5, npoints=13), reinforce, plot_values=False, num_episodes=10000, action_rate=0.5)
run_algorithm(UMaze(), ModelParams(stdev=0.5, npoints=13), reinforce_with_baseline, plot_values=True, num_episodes=10000, action_rate=0.5)
100%|█████████████████████████████████████| 10000/10000 [02:05<00:00, 79.43it/s]

reinforce (sigma=0.50, npoints=13^2)¶

100%|█████████████████████████████████████| 10000/10000 [05:35<00:00, 29.79it/s]

reinforce_with_baseline (sigma=0.50, npoints=13^2)¶

ANSWER: When we increase the action learning rate to 0.5, REINFORCE with baseline learns quickly (faster than REINFORCE) and remains stable. In general, when the action learning rate is tuned separately for REINFORCE and REINFORCE with baseline, REINFORCE with baseline should learn faster.

How do the two actor-critic algorithms compare?¶

ANSWER: When stdev is small, eligibility traces speed up learning; when stdev is larger, they do not make much difference.

How does REINFORCE with baseline compare to the actor-critic algorithms?¶

ANSWER: Actor-critic seems to learn faster than REINFORCE with baseline.

For each value of npoints: try running with various standard deviations.¶

What happens when stdev is too small?¶

Can you explain this?

ANSWER: The algorithm takes longer to learn, because when stdev is small the weights of the different grid points must be learned almost independently. In the extreme case of a very small stdev, points outside the support of the basis functions always have an action or state value of 0, so the representation is poor and the learning is poor as well.

What happens when stdev is too big?¶

Can you explain this?

ANSWER: We are unable to capture local variations because each basis function affects too many points. The learning becomes unstable.

Does the optimal value of stdev depend on npoints?¶

Can you explain this?

ANSWER: Yes, as npoints increases, the spacing between points decreases, hence the optimal stdev decreases too.
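As a rough illustration (assuming the basis grid spans $-6$ to $6$ as in ModelParams), the spacing between neighbouring basis centres is $12 / (\text{npoints} - 1)$, i.e. about $1.5$, $1.0$ and $0.67$ for npoints $= 9$, $13$ and $19$, and a good stdev is on the order of that spacing:

for npoints in (9, 13, 19):
    spacing = 12 / (npoints - 1)  # distance between neighbouring basis centres
    print(f"npoints={npoints:2d}: basis spacing = {spacing:.2f}")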