{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Miniproject 3: Landing on the Moon\n",
    "\n",
    "## Introduction\n",
    "\n",
    "### Description\n",
    "\n",
    "Traditionally, reinforcement learning has operated on \"tabular\" state spaces, e.g. \"State 1\", \"State 2\", \"State 3\" etc. However, many important and interesting reinforcement learning problems (like moving robot arms or playing Atari games) are based on either continuous or very high-dimensional state spaces (like robot joint angles or pixels). Deep neural networks constitute one method for learning a value function or policy from continuous and high-dimensional observations. \n",
    "\n",
    "In this miniproject, you will teach an agent to play the Lunar Lander game from OpenAI Gym. The agent needs to learn how to land a lunar module safely on the surface of the moon. The state space is 8-dimensional and (mostly) continuous, consisting of the X and Y coordinates, the X and Y velocity, the angle, and the angular velocity of the lander, and two booleans indicating whether the left and right leg of the lander have landed on the moon.\n",
    "\n",
    "The agent gets a reward of +100 for landing safely and -100 for crashing. In addition, it receives \"shaping\" rewards at every step. It receives positive rewards for moving closer to [0,0], decreasing in velocity, shifting to an upright angle and touching the lander legs on the moon. It receives negative rewards for moving away from the landing site, increasing in velocity, turning sideways, taking the lander legs off the moon and for using fuel (firing the thrusters). The best score an agent can achieve in an episode is about +250.\n",
    "\n",
    "There are two versions of the task: one with discrete controls and one with continuous controls but we'll only work with the discrete version. In the discrete version, the agent can take one of four actions at each time step: [do nothing, fire engines left, fire engines right, fire engines down]. \n",
    "\n",
    "We will use Policy Gradient approaches (using the REINFORCE rule) to learn the task. In the previous miniproject, the network generates a probability distribution over the outputs, and is trained to maximize the probability of a specific target output given an observation (input). In Policy Gradient methods, the network generates a probability distribution over actions, and is trained to maximize expected future rewards given an observation.\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "- You need to install [OpenAI Gym](https://gym.openai.com/docs/) and Box2D in addition to [tensorflow](https://www.tensorflow.org/install/) and [keras](https://keras.io/) in you environment. To do so just use the following commands:\n",
    "  - source activate cs456env\n",
    "  - pip install gym\n",
    "  - pip install box2d-py\n",
    "  \n",
    "- You should know the concepts of \"policy\", \"policy gradient\", \"REINFORCE\", \"REINFORCE with baseline\". If you want to start and haven't seen these yet in class, read Sutton & Barto (2018) Chapter 13 (13.1-13.4).\n",
    "\n",
    "### What you will learn\n",
    "\n",
    "- You will learn how to implement a policy gradient neural network using the REINFORCE algorithm.\n",
    "- You will learn how to implement baselines, including a learned value network.\n",
    "- You will learn how to regularize a policy using entropy.\n",
    "\n",
    "### Notes \n",
    "- Reinforcement learning is noisy! Normally one should average over multiple random seeds with the same parameters to really see the impact of a change to the model, but we won't do this due to time constraints. However, you should be able to see learning over time with every approach. If you don't see any improvement, or very unstable learning, double-check your model and try adjusting the learning rate.\n",
    "\n",
    "- You may sometimes see \"AssertionError: IsLocked() = False\" after restarting your code. To fix this, reinitialize the environments by running the Gym Setup code below.\n",
    "\n",
    "- You will not be marked on the episode movies. Please delete these movies before uploading your code.\n",
    "\n",
    "### Evaluation criteria\n",
    "\n",
    "The miniproject is marked out of 18, with a further mark breakdown in each question:\n",
    "- Exercise 1: 7 points\n",
    "- Exercise 2: 3 points\n",
    "- Exercise 3: 3 points\n",
    "- Exercise 4: 5 points\n",
    "\n",
    "We may perform random tests of your code but will not rerun the whole notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "IPython.OutputArea.prototype._should_scroll = function(lines) {\n",
    "    return false;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your Names\n",
    "\n",
    "Before you start, please enter your sciper number(s) in the field below; they are used to load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-03-09T09:08:24.514461Z",
     "start_time": "2018-03-09T09:08:24.506410Z"
    }
   },
   "outputs": [],
   "source": [
    "sciper = {'student_1': 0, \n",
    "          'student_2': 0}\n",
    "seed = sciper['student_1']+sciper['student_2']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Dependencies and constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-03-09T09:09:16.113721Z",
     "start_time": "2018-03-09T09:09:16.100520Z"
    }
   },
   "outputs": [],
   "source": [
    "import gym\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import logging\n",
    "from matplotlib.animation import FuncAnimation\n",
    "from IPython.display import HTML, clear_output\n",
    "from gym.envs.box2d.lunar_lander import heuristic\n",
    "\n",
    "import keras\n",
    "import tensorflow as tf\n",
    "from tensorflow.contrib.distributions import Beta\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense, Lambda\n",
    "from keras.optimizers import Adam\n",
    "from keras import backend as K\n",
    "\n",
    "np.random.seed(seed)\n",
    "tf.set_random_seed(seed*2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gym Setup\n",
    "\n",
    "Here we load the Reinforcement Learning environments from Gym.\n",
    "\n",
    "We limit each episode to 500 steps so that we can train faster. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gym.logger.setLevel(logging.ERROR)\n",
    "discrete_env = gym.make('LunarLander-v2')\n",
    "discrete_env._max_episode_steps = 500\n",
    "discrete_env.seed(seed*3)\n",
    "gym.logger.setLevel(logging.WARN)\n",
    "\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.figsize'] = 12, 8\n",
    "plt.rcParams[\"animation.html\"] = \"jshtml\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Utilities\n",
    "\n",
    "We include a function that lets you visualize an \"episode\" (i.e. a series of observations resulting from the actions that the agent took in the environment).\n",
    "\n",
    "As well, we will use the \"Results\" class (a wrapper around a python dictionary) to store, save, load and plot your results. You can save your results to disk with results.save('filename') and reload them with Results(filename='filename'). Use results.pop(experiment_name) to delete an old experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def AddValue(output_size, value):\n",
    "    return Lambda(lambda x: x + value, output_shape=(output_size,))\n",
    "\n",
    "def render(episode, env):\n",
    "    \n",
    "    fig = plt.figure()\n",
    "    img = plt.imshow(env.render(mode='rgb_array'))\n",
    "    plt.axis('off')\n",
    "\n",
    "    def animate(i):\n",
    "        img.set_data(episode[i])\n",
    "        return img,\n",
    "\n",
    "    anim = FuncAnimation(fig, animate, frames=len(episode), interval=24, blit=True)\n",
    "    html = HTML(anim.to_jshtml())\n",
    "    \n",
    "    plt.close(fig)\n",
    "    !rm None0000000.png\n",
    "    \n",
    "    return html\n",
    "\n",
    "class Results(dict):\n",
    "    \n",
    "    def __init__(self, *args, **kwargs):\n",
    "        if 'filename' in kwargs:\n",
    "            data = np.load(kwargs['filename'])\n",
    "            super().__init__(data)\n",
    "        else:\n",
    "            super().__init__(*args, **kwargs)\n",
    "        self.new_key = None\n",
    "        self.plot_keys = None\n",
    "        self.ylim = None\n",
    "        \n",
    "    def __setitem__(self, key, value):\n",
    "        super().__setitem__(key, value)\n",
    "        self.new_key = key\n",
    "\n",
    "    def plot(self, window):\n",
    "        clear_output(wait=True)\n",
    "        for key in self:\n",
    "            #Ensure latest results are plotted on top\n",
    "            if self.plot_keys is not None and key not in self.plot_keys:\n",
    "                continue\n",
    "            elif key == self.new_key:\n",
    "                continue\n",
    "            self.plot_smooth(key, window)\n",
    "        if self.new_key is not None:\n",
    "            self.plot_smooth(self.new_key, window)\n",
    "        plt.xlabel('Episode')\n",
    "        plt.ylabel('Reward')\n",
    "        plt.legend(loc='lower right')\n",
    "        if self.ylim is not None:\n",
    "            plt.ylim(self.ylim)\n",
    "        plt.show()\n",
    "        \n",
    "    def plot_smooth(self, key, window):\n",
    "        if len(self[key]) == 0:\n",
    "            plt.plot([], [], label=key)\n",
    "            return None\n",
    "        y = np.convolve(self[key], np.ones((window,))/window, mode='valid')\n",
    "        x = np.linspace(window/2, len(self[key]) - window/2, len(y))\n",
    "        plt.plot(x, y, label=key)\n",
    "        \n",
    "    def save(self, filename='results'):\n",
    "        np.savez(filename, **self)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test runs\n",
    "\n",
    "To get an idea of how the environment works, we'll plot an episode resulting from random actions at each point in time, and a \"perfect\" episode using a specially-designed function to land safely within the yellow flags. \n",
    "\n",
    "Please remove these plots before submitting the miniproject to reduce the file size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_fixed_episode(env, policy):\n",
    "    frames = []\n",
    "    observation = env.reset()\n",
    "    done = False\n",
    "    while not done:\n",
    "        frames.append(env.render(mode='rgb_array'))\n",
    "        action = policy(env, observation)\n",
    "        observation, reward, done, info = env.step(action)\n",
    "    return frames\n",
    "    \n",
    "def random_policy(env, observation):\n",
    "    return env.action_space.sample()\n",
    "\n",
    "def heuristic_policy(env, observation):\n",
    "    return heuristic(env.unwrapped, observation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "episode = run_fixed_episode(discrete_env, random_policy)\n",
    "render(episode, discrete_env)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "episode = run_fixed_episode(discrete_env, heuristic_policy)\n",
    "render(episode, discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experiment Loop\n",
    "\n",
    "This is the method we will call to setup an experiment. Reinforcement learning usually operates on an Observe-Decide-Act cycle, as you can see below.\n",
    "\n",
    "You don't need to add anything here; you will be working directly on the RL agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 3000\n",
    "\n",
    "def run_experiment(experiment_name, env, num_episodes, policy_learning_rate = 0.001, value_learning_rate = 0.001, \n",
    "                   baseline=None, entropy_cost = 0, max_ent_cost = 0):\n",
    "\n",
    "    #Initiate the learning agent\n",
    "    agent = RLAgent(n_obs = env.observation_space.shape[0], action_space = env.action_space,\n",
    "                    policy_learning_rate = policy_learning_rate, value_learning_rate = value_learning_rate, \n",
    "                    discount=0.99, baseline = baseline, entropy_cost = entropy_cost, max_ent_cost = max_ent_cost)\n",
    "\n",
    "    rewards = []\n",
    "    all_episode_frames = []\n",
    "    step = 0\n",
    "    for episode in range(1, num_episodes+1):\n",
    "    \n",
    "        #Update results plot and occasionally store an episode movie\n",
    "        episode_frames = None\n",
    "        if episode % 10 == 0:\n",
    "            results[experiment_name] = np.array(rewards)\n",
    "            results.plot(10)\n",
    "        if episode % 500 == 0:\n",
    "            episode_frames = []\n",
    "            \n",
    "        #Reset the environment to a new episode\n",
    "        observation = env.reset()\n",
    "        episode_reward = 0\n",
    "\n",
    "        while True:\n",
    "        \n",
    "            if episode_frames is not None:\n",
    "                episode_frames.append(env.render(mode='rgb_array'))\n",
    "\n",
    "            # 1. Decide on an action based on the observations\n",
    "            action = agent.decide(observation)\n",
    "\n",
    "            # 2. Take action in the environment\n",
    "            next_observation, reward, done, info = env.step(action)\n",
    "            episode_reward += reward\n",
    "\n",
    "            # 3. Store the information returned from the environment for training\n",
    "            agent.observe(observation, action, reward)\n",
    "\n",
    "            # 4. When we reach a terminal state (\"done\"), use the observed episode to train the network\n",
    "            if done:\n",
    "                rewards.append(episode_reward)\n",
    "                if episode_frames is not None:\n",
    "                    all_episode_frames.append(episode_frames)\n",
    "                agent.train()\n",
    "                break\n",
    "\n",
    "            # Reset for next step\n",
    "            observation = next_observation\n",
    "            step += 1\n",
    "            \n",
    "    return all_episode_frames, agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Agent\n",
    "\n",
    "Here we give the outline of a python class that will represent the reinforcement learning agent (along with its decision-making network). We'll modify this class to add additional methods and functionality throughout the course of the miniproject.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RLAgent(object):\n",
    "    \n",
    "    def __init__(self, n_obs, action_space, policy_learning_rate, value_learning_rate, \n",
    "                 discount, baseline = None, entropy_cost = 0, max_ent_cost = 0):\n",
    "\n",
    "        #We need the state and action dimensions to build the network\n",
    "        self.n_obs = n_obs  \n",
    "        self.n_act = action_space.n\n",
    "        \n",
    "        self.gamma = discount\n",
    "        self.moving_baseline = None\n",
    "\n",
    "        #Fill in the rest of the agent parameters to use in the methods below\n",
    "        self.policy_learning_rate = ...\n",
    "        self.value_learning_rate = ...\n",
    "        \n",
    "        self.use_baseline = ...\n",
    "        self.use_adaptive_baseline = ...\n",
    "        if baseline == 'adaptive':\n",
    "            ...\n",
    "        elif baseline == 'simple':\n",
    "            ...\n",
    "\n",
    "        #These lists stores the cumulative observations for this episode\n",
    "        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []\n",
    "\n",
    "        #Build the keras network\n",
    "        self._build_network()\n",
    "\n",
    "    def observe(self, state, action, reward):\n",
    "        \"\"\" This function takes the observations the agent received from the environment and stores them\n",
    "            in the lists above. \"\"\"\n",
    "        pass\n",
    "        \n",
    "    def decide(self, state):\n",
    "        \"\"\" This function feeds the observed state to the network, which returns a distribution\n",
    "            over possible actions. Sample an action from the distribution and return it.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def train(self):\n",
    "        \"\"\" When this function is called, the accumulated observations, actions and discounted rewards from the\n",
    "            current episode should be fed into the network and used for training. Use the _get_returns function \n",
    "            to first turn the episode rewards into discounted returns. \"\"\"\n",
    "        pass\n",
    "\n",
    "    def _get_returns(self):\n",
    "        \"\"\" This function should process self.episode_rewards and return the discounted episode returns\n",
    "            at each step in the episode, then optionally apply a baseline. Hint: work backwards.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def _build_network(self):\n",
    "        \"\"\" This function should build the network that can then be called by decide and train. \n",
    "            The network takes observations as inputs and has a policy distribution as output.\"\"\"\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: REINFORCE with simple baseline\n",
    "\n",
    "### Description\n",
    "\n",
    "Implement the REINFORCE Policy Gradient algorithm using a deep neural network as a function approximator.\n",
    "\n",
    "1. Implement the \"observe\" method of the RLAgent above.\n",
    "2. Implement the \"_build_network\" method. Your network should take the 8-dimensional state space as input and output a softmax distribution over the 4 discrete actions. It should have 3 hidden layers with about 16 units each with ReLU activations. Use the REINFORCE loss function. HINT: Keras has a built-in \"categorical cross-entropy\" loss, and a \"sample_weight\" argument in fit/train_on_batch. Consider how these could be used together.\n",
    "3. Implement the \"decide\", \"train\" and \"_get_returns\" methods using the inputs and outputs of your network. In \"_get_returns\", implement a baseline based on a moving average of the returns; it should only be in effect when the agent is constructed with the \"use_baseline\" keyword. In \"train\", use train_on_batch to form one minibatch from all the experiences in an episode. \n",
    "4. Try a few learning rates and pick the best one (the default for Adam is a good place to start). Run the functions below and include the resulting plots, with and without the baseline, for your chosen learning rate. \n",
    "5. Answer the question below in max. 1-2 sentence(s).\n",
    "\n",
    "WARNING: Running any experiments with the same names (first argument in run_experiment) will cause your results to be overwritten. \n",
    "\n",
    "**Mark breakdown: 7 points total**\n",
    "- 5 points for implementing and plotting basic REINFORCE with reasonable performance (i.e. a positive score) and answering the questions below.\n",
    "- 2 points for implementing and plotting the simple baseline with reasonable performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Supply a filename here to load results from disk\n",
    "results = Results()\n",
    "policy_learning_rate = None\n",
    "_, _ = run_experiment(\"REINFORCE\", discrete_env, num_episodes, learning_rate)\n",
    "episodes, _ = run_experiment(\"REINFORCE (with baseline)\", discrete_env, num_episodes, policy_learning_rate, \n",
    "                             baseline='simple')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "render(episodes[-1], discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1**: Why is it better to sample an action from the softmax distribution rather than just pick the action with highest probability? \n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2**: In the train method above we throw away the data from an episode after we use it to train the network (make sure that you do that). Why is it not a good idea to keep the old episodes in our data and train the policy network on both old and new data? (Note: Reusing data can still be possible but requires modifications to the REINFORCE algorithm we are using).\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: Adaptive baseline\n",
    "### Description\n",
    "\n",
    "Add a second neural network to your model that learns an observations-dependent adaptive baseline and subtracts it from your discounted returns.\n",
    "\n",
    "1. Modify the \"_build_network\" function of RLAgent to create a second \"value network\" when \"adaptive\" is passed for the baseline argument. The value network should have the same or similar structure as the policy network, without the softmax at the output.\n",
    "3. In addition to training your policy network, train the value network on the Mean-Squared Error compared to the adjusted returns.\n",
    "4. Train your policy network on R - b(s), i.e. the returns minus the adaptive baseline (the output of the value network). Your implementation should allow for a different learning rate for the value and policy network.\n",
    "5. Try a few learning rates and plot all your best results together (without baseline, with simple baseline, with adaptive baseline). You may or may not be able to improve on the simple baseline! Return the trained model to use it in the next exercise.\n",
    "\n",
    "TECHNICAL NOTE: Some textbooks may refer to this approach as \"Actor-Critic\", where the policy network is the \"Actor\" and the value network is the \"Critic\". Sutton and Barto (2018) suggest that Actor-Critic only applies when the discounted returns are bootstrapped from the value network output, as you saw in class. This can introduce instability in learning that needs to be addressed with more advanced techniques, so we won't use it for this miniproject. You can read more about state-of-the-art Actor-Critic approaches here: https://arxiv.org/pdf/1602.01783.pdf\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 3 points for implementing and plotting the adaptive baseline with the other two conditions, with reasonable performance (i.e. at least similar to the performance in Exercise 1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "value_learning_rate = None\n",
    "episodes, d_model = run_experiment(\"REINFORCE (adaptive baseline)\", discrete_env, num_episodes, policy_learning_rate, \n",
    "                                   value_learning_rate, baseline='adaptive')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "render(episodes[-1], discrete_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: Visualizing the Value Function\n",
    "\n",
    "### Description\n",
    "\n",
    "Ideally, our value network should have learned to predict the relative values across the input space. We can test this by plotting the value prediction for different observations.\n",
    "\n",
    "1. Write a function to plot the value network prediction across [x,y] space for given (constant) values of the other state variables. X is always in [-1,1], and Y generally lies in [-0.2,1], where the landing pad is at [0,0]. (plt.imshow, plt.title, and plt.colorbar can be useful)\n",
    "2. Plot (with titles specifying the state variable combinations) the values for 5-6 combinations of the other 6 state variables, including [0,0,0,0,0,0]. The X and Y velocity are generally within [-1,1], the angle is in [-pi,pi] and the angular velocity lies roughly within [-3,3]. The last two inputs indicating whether the legs have touched the ground are 0 (False) or 1 (True). Include two combinations with (one of the) state variables out of these ranges. Use the same color bar limits across the graphs so that they can be compared easily. \n",
    "3. Answer the question below in max. 2-3 sentence(s).\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 2 points for the plots of the value function.\n",
    "- 1 point for answering the question below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question**: Does your value map for the state variables combination [0,0,0,0,0,0] make sense? What about the value maps for the combinations with state variables out of the ranges above?\n",
    "\n",
    "**Answer**:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4: Regularizing the Policy\n",
    "### Description\n",
    "\n",
    "In reinforcement learning one faces the exploration/exploitation dilemma. To maximize rewards, the agent needs to \"exploit\", or take the actions that it thinks will lead to the highest reward. However, doing this too early in training can result in a poor policy; the agent needs to adequately \"explore\" the state space in order to learn which policy is actually best at each state.\n",
    "\n",
    "In Q-learning, this is often accomplished by having the agent take random actions with a certain probability early in training. In Policy Gradient, this randomness is built-in because the agent is sampling from a policy distribution. We can improve exploration by making sure that the probability doesn't become too concentrated on certain actions early in training. In other words, we should try to regularize the information entropy (or uncertainty) of the policy distribution.\n",
    "\n",
    "1. Include a regularizer on the output of your policy network based on the entropy of the categorical distribution. You can do this by writing a function that calculates the entropy of your policy weighted by self.entropy_cost (using functions from Keras.backend, imported above), and passing it to your output layer with the activity_regularizer keyword. HINT: Keras will treat this as an additional loss function; make sure that it has the right sign!\n",
    "2. Try several values for the entropy cost below to see if you can improve performance. Plot them along with the adaptive baseline result (where the entropy cost was 0). Good costs will likely be on the order of 0.001; include one trial where the entropy cost was too high.\n",
    "\n",
    "Entropy regularization is related to but distinct from Maximum Entropy Reinforcement Learning (MERL). In the MERL framework the objective function is modified so that the agent tries to maximize reward and entropy of the policy along the full trajectory. \n",
    "\n",
    "3. Add entropy to reward when computing the episode returns (Hint: An entropy term should be added to the reward at every time step. This can be done simply by subtracting the log of the probability of the action taken at that time step (weighted by self.max_ent_cost) from the reward at that step. \n",
    "4. Try several values for the maximum entropy cost below to see if you can improve performance. Plot them along with the adaptive baseline result (where the maximum entropy cost was 0). Include one trial where the entropy cost was too high. (Do not use entropy regularization in this question, i.e. keep entropy_cost at 0).\n",
    "\n",
    "**Mark breakdown: 5 points total**\n",
    "- 3 points for plotting the results with entropy regularization with several costs, with reasonable performance.\n",
    "- 2 points for plotting the results with maximum entropy objective with several costs, with reasonable performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entropy_costs = []\n",
    "policy_learning_rate = None\n",
    "value_learning_rate = None\n",
    "results = Results()\n",
    "results.plot_keys = [\"REINFORCE (adaptive baseline)\"]\n",
    "for cost in entropy_costs:\n",
    "    experiment_name = \"REINFORCE (adaptive, entropy cost: %s)\" % str(cost)\n",
    "    results.plot_keys.append(experiment_name)\n",
    "    _, _ = run_experiment(experiment_name, discrete_env, num_episodes, policy_learning_rate, value_learning_rate, baseline='adaptive', \n",
    "                          entropy_cost=cost)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_ent_costs = []\n",
    "policy_learning_rate = None\n",
    "value_learning_rate = None\n",
    "results = Results()\n",
    "results.plot_keys = [\"REINFORCE (adaptive baseline)\"]\n",
    "for cost in max_ent_costs:\n",
    "    experiment_name = \"REINFORCE (adaptive, maximum entropy cost: %s)\" % str(cost)\n",
    "    results.plot_keys.append(experiment_name)\n",
    "    _, _ = run_experiment(experiment_name, discrete_env, num_episodes, policy_learning_rate, value_learning_rate, baseline='adaptive', \n",
    "                          max_ent_cost=cost)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## For your Interest.."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code you've written above can be easily adapted for other environments in Gym. If you like, try playing around with different environments and network structures!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}