{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CJzeB-IUSMIZ"
   },
   "source": [
    "# Miniproject: Landing on the Moon\n",
    "\n",
    "## Introduction\n",
    "\n",
    "### Description\n",
    "\n",
    "Traditionally, reinforcement learning has operated on \"tabular\" state spaces, e.g. \"State 1\", \"State 2\", \"State 3\" etc. However, many important and interesting reinforcement learning problems (like moving robot arms or playing Atari games) are based on either continuous or very high-dimensional state spaces (like robot joint angles or pixels). Deep neural networks constitute one method for learning a value function or policy from continuous and high-dimensional observations. \n",
    "\n",
    "In this miniproject, you will teach an agent to play the Lunar Lander game from [OpenAI Gym](https://gym.openai.com/envs/LunarLander-v2/). The agent needs to learn how to land a lunar module safely on the surface of the moon. The state space is 8-dimensional and (mostly) continuous, consisting of the X and Y coordinates, the X and Y velocity, the angle, and the angular velocity of the lander, and two booleans indicating whether the left and right leg of the lander have landed on the moon.\n",
    "\n",
    "The agent gets a reward of +100 for landing safely and -100 for crashing. In addition, it receives \"shaping\" rewards at every step. It receives positive rewards for moving closer to [0,0], decreasing in velocity, shifting to an upright angle and touching the lander legs on the moon. It receives negative rewards for moving away from the landing site, increasing in velocity, turning sideways, taking the lander legs off the moon and for using fuel (firing the thrusters). The best score an agent can achieve in an episode is about +250.\n",
    "\n",
    "There are two versions of the task: one with discrete controls and one with continuous controls but we'll only work with the discrete version. In the discrete version, the agent can take one of four actions at each time step: [do nothing, fire engines left, fire engines right, fire engines down]. \n",
    "\n",
    "We will use Policy Gradient approaches (using the REINFORCE rule) to learn the task. As you remember, in standard supervised learning tasks (e.g. image classification), the network generates a probability distribution over the outputs, and is trained to maximize the probability of a specific target output given an observation (input). In Policy Gradient methods, the network generates a probability distribution over actions, and is trained to maximize expected future rewards given an observation.\n",
    "\n",
    "### Questions\n",
    "**Question 1**. Suppose that you are designing the environment rewards yourself. Why do you think it is a good idea to have rewards in addition to the +100 reward for safe landing (e.g. for moving closer / further from [0, 0], for touching the lander legs on the moon)? One might say that if we only have a final reward, the agent will still be able to learn how to reach it. What will be the problem here?\n",
    "\n",
    "**Answer**:\n",
    "\n",
    "\n",
    "**Question 2**. Now suppose you decide to give the agent a small reward if it moves closer to the landing point but you forget to penalize it when it moves away from it. What kinds of strange behaviour you may observe from the trained agent?\n",
    "\n",
    "**Answer**:\n",
    "\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "- You will need to install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and run the following commands:\n",
    "\n",
    "  - conda create -n cs456env python=3.7.2\n",
    "  - source activate cs456env\n",
    "  - pip install gym[Box2D]\n",
    "  - pip install ipykernel\n",
    "  - pip install tensorflow\n",
    "  - pip install keras\n",
    "  - pip install matplotlib\n",
    "  \n",
    "- You should know the concepts of \"policy\", \"policy gradient\", \"REINFORCE\", \"REINFORCE with baseline\". If you want to start and haven't seen these yet in class, read Sutton & Barto (2018) Chapter 13 (13.1-13.4).\n",
    "\n",
    "### What you will learn\n",
    "\n",
    "- You will learn how to implement a policy gradient neural network using the REINFORCE algorithm.\n",
    "- You will learn how to implement baselines, including a learned value network.\n",
    "- You will learn how to analyze the performance of an RL algorithm.\n",
    "\n",
    "### Notes \n",
    "- Reinforcement learning is noisy! Normally one should average over multiple random seeds with the same parameters to really see the impact of a change to the model, but we won't do this due to time constraints. However, you should be able to see learning over time with every approach. If you don't see any improvement, or very unstable learning, double-check your model and try adjusting the learning rate.\n",
    "\n",
    "- You may sometimes see `AssertionError: IsLocked() = False` after restarting your code. To fix this, reinitialize the environments by running the Gym Setup code below.\n",
    "\n",
    "- You will not be marked on the episode movies. Please delete these movies before uploading your code.\n",
    "\n",
    "### Evaluation criteria\n",
    "\n",
    "The miniproject is marked out of 18, with a further mark breakdown in each question:\n",
    "- Exercise 1: 7 points\n",
    "- Exercise 2: 3 points\n",
    "- Exercise 3: 3 points\n",
    "- Exercise 4: 5 points\n",
    "\n",
    "We may perform random tests of your code but will not rerun the whole notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 17
    },
    "id": "HIJH-Ns0SMIp",
    "outputId": "19cf5bce-072c-436b-a9fb-3de7b726d6aa"
   },
   "outputs": [],
   "source": [
    "%%javascript\n",
    "IPython.OutputArea.prototype._should_scroll = function(lines) {\n",
    "    return false;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JvVREM-USMIr"
   },
   "source": [
    "### Your Names\n",
    "\n",
    "Before you start, please enter your sciper number(s) in the field below; they are used to load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-03-09T09:08:24.514461Z",
     "start_time": "2018-03-09T09:08:24.506410Z"
    },
    "id": "IrgMhC4bSMIr"
   },
   "outputs": [],
   "source": [
    "sciper = {'student_1': 0, \n",
    "          'student_2': 0}\n",
    "seed = sciper['student_1']+sciper['student_2']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "seB6IjU1SMIr"
   },
   "source": [
    "## Setup\n",
    "\n",
    "### Dependencies and constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SCLi0Uz9iDXX",
    "outputId": "f7d9f963-8a42-47f2-8888-d4dd1ee20ef9"
   },
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import logging\n",
    "from matplotlib.animation import FuncAnimation\n",
    "from IPython.display import HTML, clear_output\n",
    "from gym.envs.box2d.lunar_lander import heuristic\n",
    "\n",
    "import keras\n",
    "import tensorflow as tf\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense, Lambda\n",
    "from keras.optimizers import Adam\n",
    "from keras import backend as K\n",
    "\n",
    "np.random.seed(seed)\n",
    "tf.random.set_seed(seed*2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d0tAYCl4SMIs"
   },
   "source": [
    "### Gym Setup\n",
    "\n",
    "Here we load the Reinforcement Learning environments from Gym.\n",
    "\n",
    "We limit each episode to 500 steps so that we can train faster. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3RarYKAOSMIs"
   },
   "outputs": [],
   "source": [
    "gym.logger.setLevel(logging.ERROR)\n",
    "discrete_env = gym.make('LunarLander-v2')\n",
    "discrete_env._max_episode_steps = 500\n",
    "discrete_env.seed(seed*3)\n",
    "gym.logger.setLevel(logging.WARN)\n",
    "\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.figsize'] = 12, 8\n",
    "plt.rcParams[\"animation.html\"] = \"jshtml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3bee0anSSMIt"
   },
   "source": [
    "### Utilities\n",
    "\n",
    "We include a function that lets you visualize an \"episode\" (i.e. a series of observations resulting from the actions that the agent took in the environment).\n",
    "\n",
    "As well, we will use the `Results` class (a wrapper around a python dictionary) to store, save, load and plot your results. You can save your results to disk with `results.save('filename')` and reload them with `Results(filename='filename')`. Use `results.pop(experiment_name)` to delete an old experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "e2aJVaASSMIt"
   },
   "outputs": [],
   "source": [
    "def AddValue(output_size, value):\n",
    "    return Lambda(lambda x: x + value, output_shape=(output_size,))\n",
    "\n",
    "def render(episode, env):\n",
    "    \n",
    "    fig = plt.figure()\n",
    "    img = plt.imshow(env.render(mode='rgb_array'))\n",
    "    plt.axis('off')\n",
    "\n",
    "    def animate(i):\n",
    "        img.set_data(episode[i])\n",
    "        return img,\n",
    "\n",
    "    anim = FuncAnimation(fig, animate, frames=len(episode), interval=24, blit=True)\n",
    "    html = HTML(anim.to_jshtml())\n",
    "    \n",
    "    plt.close(fig)\n",
    "    !rm None0000000.png\n",
    "    \n",
    "    return html\n",
    "\n",
    "class Results(dict):\n",
    "    \n",
    "    def __init__(self, *args, **kwargs):\n",
    "        if 'filename' in kwargs:\n",
    "            data = np.load(kwargs['filename'])\n",
    "            super().__init__(data)\n",
    "        else:\n",
    "            super().__init__(*args, **kwargs)\n",
    "        self.new_key = None\n",
    "        self.plot_keys = None\n",
    "        self.ylim = None\n",
    "        \n",
    "    def __setitem__(self, key, value):\n",
    "        super().__setitem__(key, value)\n",
    "        self.new_key = key\n",
    "\n",
    "    def plot(self, window):\n",
    "        clear_output(wait=True)\n",
    "        for key in self:\n",
    "            #Ensure latest results are plotted on top\n",
    "            if self.plot_keys is not None and key not in self.plot_keys:\n",
    "                continue\n",
    "            elif key == self.new_key:\n",
    "                continue\n",
    "            self.plot_smooth(key, window)\n",
    "        if self.new_key is not None:\n",
    "            self.plot_smooth(self.new_key, window)\n",
    "        plt.xlabel('Episode')\n",
    "        plt.ylabel('Reward')\n",
    "        plt.legend(loc='lower right')\n",
    "        if self.ylim is not None:\n",
    "            plt.ylim(self.ylim)\n",
    "        plt.show()\n",
    "        \n",
    "    def plot_smooth(self, key, window):\n",
    "        if len(self[key]) == 0:\n",
    "            plt.plot([], [], label=key)\n",
    "            return None\n",
    "        y = np.convolve(self[key], np.ones((window,))/window, mode='valid')\n",
    "        x = np.linspace(window/2, len(self[key]) - window/2, len(y))\n",
    "        plt.plot(x, y, label=key)\n",
    "        \n",
    "    def save(self, filename='results'):\n",
    "        np.savez(filename, **self)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d3Cqp4T3SMIt"
   },
   "source": [
    "### Test runs\n",
    "\n",
    "To get an idea of how the environment works, we'll plot an episode resulting from random actions at each point in time, and a \"perfect\" episode using a specially-designed function to land safely within the yellow flags. \n",
    "\n",
    "Please remove these plots before submitting the miniproject to reduce the file size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SXRlD9qFSMIu"
   },
   "outputs": [],
   "source": [
    "def run_fixed_episode(env, policy):\n",
    "    frames = []\n",
    "    observation = env.reset()\n",
    "    done = False\n",
    "    while not done:\n",
    "        frames.append(env.render(mode='rgb_array'))\n",
    "        action = policy(env, observation)\n",
    "        observation, reward, done, info = env.step(action)\n",
    "    return frames\n",
    "    \n",
    "def random_policy(env, observation):\n",
    "    return env.action_space.sample()\n",
    "\n",
    "def heuristic_policy(env, observation):\n",
    "    return heuristic(env.unwrapped, observation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 690
    },
    "id": "gaFRxXOFSMIu",
    "outputId": "5bfb8bd6-8713-4d38-acfb-9fd19aeb5684"
   },
   "outputs": [],
   "source": [
    "episode = run_fixed_episode(discrete_env, random_policy)\n",
    "render(episode, discrete_env)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 690
    },
    "id": "kBdD4gsfSMIv",
    "outputId": "fac694a0-c813-4025-d217-12cdde8d913a"
   },
   "outputs": [],
   "source": [
    "episode = run_fixed_episode(discrete_env, heuristic_policy)\n",
    "render(episode, discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "L-26a5CoSMIy"
   },
   "source": [
    "## Experiment Loop\n",
    "\n",
    "This is the method we will call to setup an experiment. Reinforcement learning usually operates on an Observe-Decide-Act cycle, as you can see below.\n",
    "\n",
    "You don't need to add anything here; you will be working directly on the RL agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NnBvklKsSMIy"
   },
   "outputs": [],
   "source": [
    "num_episodes = 3000\n",
    "\n",
    "def run_experiment(experiment_name, env, num_episodes, policy_learning_rate=0.001, value_learning_rate=0.001, \n",
    "                   baseline=None, entropy_cost=0, max_ent_cost=0, num_layers=3):\n",
    "\n",
    "    #Initiate the learning agent\n",
    "    agent = RLAgent(n_obs=env.observation_space.shape[0], \n",
    "                    action_space=env.action_space,\n",
    "                    policy_learning_rate=policy_learning_rate, \n",
    "                    value_learning_rate=value_learning_rate, \n",
    "                    discount=0.99, \n",
    "                    baseline=baseline, \n",
    "                    entropy_cost=entropy_cost, \n",
    "                    max_ent_cost=max_ent_cost,\n",
    "                    num_layers=num_layers)\n",
    "\n",
    "    rewards = []\n",
    "    all_episode_frames = []\n",
    "    step = 0\n",
    "    for episode in range(1, num_episodes+1):\n",
    "    \n",
    "        #Update results plot and occasionally store an episode movie\n",
    "        episode_frames = None\n",
    "        if episode % 10 == 0:\n",
    "            results[experiment_name] = np.array(rewards)\n",
    "            results.plot(10)\n",
    "        if episode % 500 == 0:\n",
    "            episode_frames = []\n",
    "            \n",
    "        #Reset the environment to a new episode\n",
    "        observation = env.reset()\n",
    "        episode_reward = 0\n",
    "\n",
    "        while True:\n",
    "        \n",
    "            if episode_frames is not None:\n",
    "                episode_frames.append(env.render(mode='rgb_array'))\n",
    "\n",
    "            # 1. Decide on an action based on the observations\n",
    "            action = agent.decide(observation)\n",
    "\n",
    "            # 2. Take action in the environment\n",
    "            next_observation, reward, done, info = env.step(action)\n",
    "            episode_reward += reward\n",
    "\n",
    "            # 3. Store the information returned from the environment for training\n",
    "            agent.observe(observation, action, reward)\n",
    "\n",
    "            # 4. When we reach a terminal state (\"done\"), use the observed episode to train the network\n",
    "            if done:\n",
    "                rewards.append(episode_reward)\n",
    "                if episode_frames is not None:\n",
    "                    all_episode_frames.append(episode_frames)\n",
    "                agent.train()\n",
    "                break\n",
    "\n",
    "            # Reset for next step\n",
    "            observation = next_observation\n",
    "            step += 1\n",
    "            \n",
    "    return all_episode_frames, agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eKjbSbixSMI1"
   },
   "source": [
    "## The Agent\n",
    "\n",
    "Here we give the outline of a python class that will represent the reinforcement learning agent (along with its decision-making network). We'll modify this class to add additional methods and functionality throughout the course of the miniproject.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sJk4sMGuSMI4"
   },
   "outputs": [],
   "source": [
    "class RLAgent(object):\n",
    "    \n",
    "    def __init__(self, n_obs, action_space, policy_learning_rate, value_learning_rate, \n",
    "                 discount, baseline=None, entropy_cost=0, max_ent_cost=0, num_layers=3):\n",
    "\n",
    "        #We need the state and action dimensions to build the network\n",
    "        self.n_obs = n_obs  \n",
    "        self.n_act = action_space.n\n",
    "        \n",
    "        self.plr = policy_learning_rate\n",
    "        self.vlr = value_learning_rate\n",
    "        self.gamma = discount\n",
    "        self.entropy_cost = entropy_cost\n",
    "        self.max_ent_cost = max_ent_cost\n",
    "        self.num_layers = num_layers\n",
    "        \n",
    "        <add code here>\n",
    "        \n",
    "\n",
    "        #These lists stores the cumulative observations for this episode\n",
    "        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []\n",
    "\n",
    "        #Build the keras network\n",
    "        self._build_network()\n",
    "\n",
    "    def observe(self, state, action, reward):\n",
    "        \"\"\" This function takes the observations the agent received from the environment and stores them\n",
    "            in the lists above.\"\"\"\n",
    "        pass\n",
    "        \n",
    "    def decide(self, state):\n",
    "        \"\"\" This function feeds the observed state to the network, which returns a distribution\n",
    "            over possible actions. Sample an action from the distribution and return it.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def train(self):\n",
    "        \"\"\" When this function is called, the accumulated episode observations, actions and discounted rewards\n",
    "            should be fed into the network and used for training. Use the _get_returns function to first turn \n",
    "            the episode rewards into discounted returns. \n",
    "            Apply simple or adaptive baselines if needed, depending on parameters.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def _get_returns(self):\n",
    "        \"\"\" This function should process self.episode_rewards and return the discounted episode returns\n",
    "            at each step in the episode. Hint: work backwards.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def _build_network(self):\n",
    "        \"\"\" This function should build the network that can then be called by decide and train. \n",
    "            The network takes observations as inputs and has a policy distribution as output.\"\"\"\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7n-UmwhaSMI6"
   },
   "source": [
    "## Exercise 1: REINFORCE with simple baseline\n",
    "\n",
    "### Description\n",
    "\n",
    "Implement the REINFORCE Policy Gradient algorithm using a deep neural network as a function approximator.\n",
    "\n",
    "1. Implement the `observe` method of the RLAgent above.\n",
    "2. Implement the `_build_network` method. Your network should take the 8-dimensional state space as input and output a softmax distribution over the 4 discrete actions. It should have 3 hidden layers with 16 units each with ReLU activations. Use the REINFORCE loss function. HINT: Keras has a built-in \"categorical cross-entropy\" loss, and a `sample_weight` argument in fit/train_on_batch. Consider how these could be used together.\n",
    "3. Implement the `decide`, `train` and `_get_returns` methods using the inputs and outputs of your network. In `train`, implement a baseline based on a moving average (over episodes) of the mean returns (over trials of one episode); it should only be in effect when the agent is constructed with the `use_simple_baseline` keyword. Also, use `train_on_batch` to form one minibatch from all the experiences in an episode. Hint: see Question 2) below.\n",
    "4. Try a few learning rates and pick the best one (the default for Adam is a good place to start). Run the functions below and include the resulting plots, with and without the baseline, for your chosen learning rate. \n",
    "5. Answer the questions below in max. 1-2 sentence(s).\n",
    "\n",
    "WARNING: Running any experiments with the same names (first argument in run_experiment) will cause your results to be overwritten. \n",
    "\n",
    "**Mark breakdown: 7 points total**\n",
    "- 5 points for implementing and plotting basic REINFORCE with reasonable performance (i.e. a positive score) and answering the questions below.\n",
    "- 2 points for implementing and plotting the simple baseline with reasonable performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H_tbHOSJSMI8"
   },
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 497
    },
    "id": "YgBnDR_1SMI8",
    "outputId": "00b91432-dd4f-484b-bfdd-4f1cdd337524"
   },
   "outputs": [],
   "source": [
    "#Supply a filename here to load results from disk\n",
    "results = Results()\n",
    "policy_learning_rate = 0.002\n",
    "_, _ = run_experiment(\"REINFORCE\", discrete_env, num_episodes, policy_learning_rate)\n",
    "episodes, _ = run_experiment(\"REINFORCE (with baseline)\", discrete_env, num_episodes, policy_learning_rate, \n",
    "                             baseline='simple')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gD1LNjY7SMI9",
    "outputId": "6b4df87b-5ae2-462c-cfee-9b4e58200bd6"
   },
   "outputs": [],
   "source": [
    "render(episodes[-1], discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VQoNVTl0SMI-"
   },
   "source": [
    "**Question 1**: We have at least three posibilities of picking the action: i) sample an action according to the softmax distribution, ii) select action with max action probability and iii) use an epsilon-greedy strategy. What is the difference between these strategies and which one(s) is(are) preferable during training and which one(s) is(are) preferable during testing?. \n",
    "\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "j_VgNlZsSMI-"
   },
   "source": [
    "**Question 2**: In the train method above we throw away the data from an episode after we use it to train the network (make sure that you do that). Why is it not a good idea to keep the old episodes and train the policy network on both old and new data? (Note: Reusing data can still be possible but requires modifications to the REINFORCE algorithm that we are using).\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ViA-XGNvSMI_"
   },
   "source": [
    "## Exercise 2: Adaptive baseline\n",
    "### Description\n",
    "\n",
    "Add a second neural network to your model that learns an observations-dependent adaptive baseline and subtracts it from your discounted returns.\n",
    "\n",
    "1. Modify the `_build_network` function of RLAgent to create a second \"value network\" when `adaptive` is passed for the baseline argument. The value network should have the same or similar structure as the policy network, without the softmax at the output.\n",
    "3. In addition to training your policy network, train the value network on the Mean-Squared Error compared to the discounted returns.\n",
    "4. Train your policy network on $R - b(s)$, i.e. the returns minus the adaptive baseline (the output of the value network). Your implementation should allow for a different learning rate for the value and policy network.\n",
    "5. Try a few learning rates and plot all your best results together (without baseline, with simple baseline, with adaptive baseline). You may or may not be able to improve on the simple baseline! Return the trained model to use it in the next exercise.\n",
    "\n",
    "TECHNICAL NOTE: Some textbooks may refer to this approach as \"Actor-Critic\", where the policy network is the \"Actor\" and the value network is the \"Critic\". Sutton and Barto (2018) suggest that Actor-Critic only applies when the discounted returns are bootstrapped from the value network output, as you saw in class. This can introduce instability in learning that needs to be addressed with more advanced techniques, so we won't use it for this miniproject. You can read more about state-of-the-art Actor-Critic approaches here: https://arxiv.org/pdf/1602.01783.pdf\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 3 points for implementing and plotting the adaptive baseline with the other two conditions, with reasonable performance (i.e. at least similar to the performance in Exercise 1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jw4J1tWtSMI_"
   },
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NyQu57nfSMI_",
    "outputId": "028b5260-17f7-44e2-a332-4e6013dc4248",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "value_learning_rate = 0.002\n",
    "episodes, d_model = run_experiment(\"REINFORCE (adaptive baseline)\", discrete_env, num_episodes, policy_learning_rate, \n",
    "                                   value_learning_rate, baseline='adaptive')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1OMufH8KSMI_",
    "outputId": "7e86cf26-efd9-4ddc-b498-8ab0d4f0c675"
   },
   "outputs": [],
   "source": [
    "render(episodes[-1], discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "thlnQsckSMJC"
   },
   "source": [
    "## Exercise 3: Visualizing the Value Function\n",
    "\n",
    "### Description\n",
    "\n",
    "Ideally, our value network should have learned to predict the relative values across the input space. We can test this by plotting the value prediction for different observations.\n",
    "\n",
    "1. Write a function to plot the value network prediction across [x,y] space for given (constant) values of the other state variables. X is always in [-1,1], and Y generally lies in [-0.2,1], where the landing pad is at [0,0]. (`plt.imshow`, `plt.title`, and `plt.colorbar` can be useful)\n",
    "2. Plot (with titles specifying the state variable combinations) the values for 5-6 combinations of the other 6 state variables, including [0,0,0,0,0,0]. The X and Y velocity are generally within [-1,1], the angle is in [-pi,pi] and the angular velocity lies roughly within [-3,3]. The last two inputs indicating whether the legs have touched the ground are 0 (False) or 1 (True). Include two combinations with (one of the) state variables out of these ranges. Use the same color bar limits across the graphs so that they can be compared easily. \n",
    "3. Answer the question below in max. 2-3 sentence(s).\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 3 points for the plots of the value function and answering the question below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vZqhrWmLSMJD"
   },
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Vhg0znkOSMJE"
   },
   "source": [
    "**Question**: Does your value map for the state variables combination [0,0,0,0,0,0] make sense? What about the value maps for the combinations with state variables out of the ranges above?\n",
    "\n",
    "**Answer**:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4: Comparing Architectures\n",
    "### Descrition\n",
    "\n",
    "Choosing a good neural network architecture is always a tricky question - on one hand, you want a complex architecture that is flexible enough to be able to solve the task, and on the other hand, you want to train your network as fast as possible and to not overuse your computational power. In the previous sections, we asked you to create a network with 3 hidden layers which you saw that is able to successfully solve the task and play the game. What happens if we do the same with 1 or 2 hidden layers? In this exercise, we ask you to look into the effect of the architecture and to compare different models with each other.\n",
    "\n",
    "1. Include an extra parameter `num_layers` in the RLAgent class (by default it is equal to 3).\n",
    "2. Change the `_build_network` function so that it creates a policy and value networks with the required number of layers.\n",
    "3. Compare (on the same axes) the resulting plots for num_layers = 1, 2, 3."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DIaeIPwISMJF"
   },
   "source": [
    "## For your Interest.."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fxGtxNzPSMJF"
   },
   "source": [
    "The code you've written above can be easily adapted for other environments in Gym. If you like, try playing around with different environments and network structures!"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "miniproject3_2019-Solutions.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "CS456",
   "language": "python",
   "name": "cs456"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}