{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Miniproject | Learning to play Pong with Deep Reinforcement Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![PongUrl](https://pygame-learning-environment.readthedocs.io/en/latest/_images/pong.gif \"pong\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "### Description\n",
    "\n",
    "Traditionally, reinforcement learning has operated on \"tabular\" state spaces, e.g. \"State 1\", \"State 2\", \"State 3\" etc. However, many important and interesting reinforcement learning problems (like moving robot arms or playing Atari games) are based on either continuous or very high-dimensional state spaces (like robot joint angles or pixels). Deep neural networks constitute one method for learning a value function or policy from continuous and high-dimensional observations. \n",
    "\n",
    "In this miniproject, you will teach an agent to play the game [Pong](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pong.html) from the PyGame Learning Environment. While it is possible to learn the task directly from screen pixel values as done by DQN on the Atari games, here we consider a simpler low-dimensional state space. The agent needs to control a paddle to hit a ball and drive it past it's opponent's paddle which is controlled by the computer. The state space is 7-dimensional and continuous, and consists of the following state variables:\n",
    "- player paddle's y position.\n",
    "- player paddle's velocity.\n",
    "- cpu paddle's y position.\n",
    "- ball's x position.\n",
    "- ball's y position.\n",
    "- ball's x velocity.\n",
    "- ball's y velocity.\n",
    "\n",
    "The agent can take one of two actions, accelerate up or accelerate down. The agent gets a reward of +1 when it scores a point, i.e, when it drives the ball past the computer-controlled paddle, and a reward of -1 when it loses a point. We define a game (or episode) to be over when either the agent (or the computer) scores 7 points, after which a new game is started. Because the episodes become very long once the agent learns to compete with its opponent, we stop training when the agent wins a certain number of games in a row (20 by default). You can change this parameter if you wish to further train and improve your agent.\n",
    "\n",
    "We will use Policy Gradient approaches to learn the task. In supervised learning tasks, the network generates a probability distribution over the outputs, and is trained to maximize the probability of a specific target output given an observation. In Policy Gradient methods, the network generates a probability distribution over actions, and is trained to maximize expected future rewards given an observation.\n",
    "\n",
    "You should remember that reinforcement learning is noisy! You may get different results from one trial to another, and sometimes simpler approaches will outperform more complicated ones. If you don't see any improvement, or unstable learning, double-check your model and try adjusting the learning rate.\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "- You should have set up the CS456 virtual conda environment and installed the dependencies as described in the document [Miniprojects | Environment Setup and XOR exercise](https://moodle.epfl.ch/pluginfile.php/2024974/mod_resource/content/6/env_setup.pdf) published on moodle. You should launch this notebook from this environment, i.e. using the CS456 python 3 kernel.\n",
    "- For this miniproject you will also need to install [Pygame](https://www.pygame.org/wiki/about), Pillow, and the [PyGame Learning Environment](https://pygame-learning-environment.readthedocs.io/en/latest/user/home.html). To do so, the easiest workflow to follow is:\n",
    "    - source activate ann_env\n",
    "    - pip3 install Pillow\n",
    "    - pip3 install pygame\n",
    "    - git clone https://github.com/ntasfi/PyGame-Learning-Environment\n",
    "    - cd PyGame-Learning-Environment\n",
    "    - pip3 install -e .\n",
    "\n",
    "- You should know the concepts of \"policy\", \"policy gradient\", \"REINFORCE\", \"REINFORCE with baseline\". If you want to start and haven't seen these yet in class, read Sutton & Barto (2018) Chapter 13 (13.1-13.4).\n",
    "\n",
    "### What you will learn\n",
    "\n",
    "- You will learn how to implement a policy gradient neural network using the REINFORCE algorithm.\n",
    "- You will learn how to implement baselines with a learned value network.\n",
    "- You will learn to be more patient :) Some fits may take your computer quite a bit of time; run them over night (or on an external server). If you have access to a GPU you can also use the [gpu support of tensorflow 2](https://www.tensorflow.org/install/gpu) to speed up simulations.\n",
    "\n",
    "### Notes \n",
    "- Reinforcement learning is noisy! Normally one should average over multiple random seeds with the same parameters to really see the impact of a change to the model, but we won't do this due to time constraints. However, you should be able to see learning over time with every approach. If you don't see any improvement, or very unstable learning, double-check your model and try adjusting the learning rate.\n",
    "\n",
    "### Evaluation criteria\n",
    "\n",
    "The evaluation is (mostly) based on the figures you submit and your answers to the questions. Provide clear and concise answers respecting the indicated maximum length. Keep your code tidy, organised and commented to allow us (and yourself) to understand what is going on. All plots must have axes labels as well as legends and a title where needed.\n",
    "\n",
    "**The submitted notebook must be run by you!** We will only do random tests of your code and not re-run the full notebook. There will be fraud detection sessions at the end of the semester.\n",
    "\n",
    "### Your names\n",
    "\n",
    "**Before you start**: please enter your full name(s) in the field below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "student1 = \"Firstname Lastname\"\n",
    "student2 = \"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "IPython.OutputArea.prototype._should_scroll = function(lines) {\n",
    "    return false;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Dependencies and constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"   # No cuda available on personal laptop\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"\"\n",
    "\n",
    "os.putenv('SDL_VIDEODRIVER', 'fbcon') # settings for pygame display\n",
    "os.environ[\"SDL_VIDEODRIVER\"] = \"dummy\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import keras\n",
    "import tensorflow as tf\n",
    "from keras.models import Sequential, load_model, Model\n",
    "from keras.layers import Dense\n",
    "from keras.optimizers import Adam\n",
    "from keras import backend as K\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.animation import FuncAnimation\n",
    "from IPython.display import HTML, clear_output\n",
    "import matplotlib\n",
    "matplotlib.rcParams['animation.embed_limit'] = 2**128\n",
    "\n",
    "from ple import PLE\n",
    "from ple.games.pong import Pong\n",
    "import pygame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Game setup\n",
    "\n",
    "Here we load the Pong Reinforcement Learning environment from PLE. We limit each game (episode) to 7 points so that we can train faster.\n",
    "\n",
    "We also define a preprocessing function `process_state` that normalizes the state values to have maximum norms close to 1. Feature normalization has been seen to generally help neural networks learn faster, and is common practice in deep reinforcement learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NORMALIZE_FACTORS = np.array([48, 50, 48, 64, 48, 50, 50])\n",
    "\n",
    "def process_state(state):\n",
    "    state = np.array(list(state.values()))\n",
    "    state /= NORMALIZE_FACTORS\n",
    "    \n",
    "    return state"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting up the game environment, refer to the PLE docs if you want to know the details\n",
    "game = Pong(MAX_SCORE=7)\n",
    "game_env = PLE(game, fps=30, display_screen=False, state_preprocessor=process_state, reward_values = {\"win\": 0, \"loss\": 0})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Utilities\n",
    "\n",
    "We include a function that lets you visualize an \"episode\" (i.e. a series of observations resulting from the actions that the agent took in the environment).\n",
    "\n",
    "As well, we will use the `Results` class (a wrapper around a python dictionary) to store, save, load and plot your results. You can save your results to disk with results.save('filename') and reload them with Results(filename='filename'). Use results.pop(experiment_name) to delete an old experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def render(episode):\n",
    "    \n",
    "    fig = plt.figure()\n",
    "    img = plt.imshow(np.transpose(episode[0],[1,0,2]))\n",
    "    plt.axis('off')\n",
    "\n",
    "    def animate(i):\n",
    "        img.set_data(np.transpose(episode[i],[1,0,2]))\n",
    "        return img,\n",
    "\n",
    "    anim = FuncAnimation(fig, animate, frames=len(episode), interval=24, blit=True)\n",
    "    html = HTML(anim.to_jshtml())\n",
    "    \n",
    "    plt.close(fig)\n",
    "    !rm None0000000.png\n",
    "    \n",
    "    return html\n",
    "\n",
    "\n",
    "class Results(dict):\n",
    "    \n",
    "    def __init__(self, *args, **kwargs):\n",
    "        if 'filename' in kwargs:\n",
    "            data = np.load(kwargs['filename'])\n",
    "            super(Results, self).__init__(data)\n",
    "        else:\n",
    "            super(Results, self).__init__(*args, **kwargs)\n",
    "        self.new_key = None\n",
    "        self.plot_keys = None\n",
    "        self.ylim = None\n",
    "        \n",
    "    def __setitem__(self, key, value):\n",
    "        super().__setitem__(key, value)\n",
    "        self.new_key = key\n",
    "\n",
    "    def plot(self, window):\n",
    "        clear_output(wait=True)\n",
    "        for key in self:\n",
    "            #Ensure latest results are plotted on top\n",
    "            if self.plot_keys is not None and key not in self.plot_keys:\n",
    "                continue\n",
    "            elif key == self.new_key:\n",
    "                continue\n",
    "            self.plot_smooth(key, window)\n",
    "        if self.new_key is not None:\n",
    "            self.plot_smooth(self.new_key, window)\n",
    "        plt.xlabel('Episode')\n",
    "        plt.ylabel('Reward')\n",
    "        plt.legend(loc='lower right')\n",
    "        if self.ylim is not None:\n",
    "            plt.ylim(self.ylim)\n",
    "        plt.show()\n",
    "        \n",
    "    def plot_smooth(self, key, window):\n",
    "        if len(self[key]) == 0:\n",
    "            plt.plot([], [], label=key)\n",
    "            return None\n",
    "        y = np.convolve(self[key], np.ones((window,))/window, mode='valid')\n",
    "        x = np.linspace(window/2, len(self[key]) - window/2, len(y))\n",
    "        plt.plot(x, y, label=key)\n",
    "        \n",
    "    def save(self, filename='results'):\n",
    "        np.savez(filename, **self)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test runs\n",
    "\n",
    "To get an idea of how the environment works, we'll plot an episode resulting from random actions at each point in time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_fixed_episode(env, policy):\n",
    "    frames = []\n",
    "    env.reset_game()\n",
    "    done = False\n",
    "    while not done:\n",
    "        observation = env.getGameState()\n",
    "        action = policy(env, observation)\n",
    "        frames.append(env.getScreenRGB())\n",
    "        reward = env.act(action)\n",
    "        done = env.game_over()\n",
    "    return frames\n",
    "    \n",
    "def random_policy(env, observation):\n",
    "    return random.sample(env.getActionSet(), 1)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "episode = run_fixed_episode(game_env, random_policy)\n",
    "render(episode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also define a function to run an episode with the policies that you will be training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_fixed_episode_learned(env, policy):\n",
    "    frames = []\n",
    "    env.reset_game()\n",
    "    done = False\n",
    "    while not done:\n",
    "        observation = env.getGameState()\n",
    "        action_idx = policy.decide(observation)\n",
    "        action = env.getActionSet()[action_idx]\n",
    "        frames.append(env.getScreenRGB())\n",
    "        reward = env.act(action)\n",
    "        done = env.game_over()\n",
    "    return frames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experiment Loop\n",
    "\n",
    "This is the method we will call to setup an experiment. Reinforcement learning usually operates on an Observe-Decide-Act cycle, as you can see below.\n",
    "\n",
    "You don't need to add anything here; you will be working directly on the RL agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 5000\n",
    "\n",
    "def run_experiment(experiment_name, env, num_episodes, reward_shaping=False, \n",
    "                   policy_learning_rate=0.001, value_learning_rate = 0.001, \n",
    "                   baseline=None, fileNamePolicy=None, fileNameValue=None, verbose=False, stopping_criterion=20):\n",
    "\n",
    "    env.init()\n",
    "    discount_factor = 0.99\n",
    "    \n",
    "    #Initiate the learning agent\n",
    "    agent = RLAgent(n_obs = env.getGameStateDims()[0], policy_learning_rate = policy_learning_rate, value_learning_rate = value_learning_rate, \n",
    "                    discount=discount_factor, baseline=baseline, fileNamePolicy=fileNamePolicy, fileNameValue=fileNameValue)\n",
    "\n",
    "    rewards = []\n",
    "    all_episode_frames = []\n",
    "    \n",
    "    points_won = 0\n",
    "    games_won = 0\n",
    "    win_streak = 0\n",
    "    \n",
    "    for episode in range(1, num_episodes+1):\n",
    "    \n",
    "        #Update results plot and occasionally store an episode movie\n",
    "        episode_frames = None\n",
    "        if episode % 10 == 0:\n",
    "            results[experiment_name] = np.array(rewards)\n",
    "            results.plot(10)\n",
    "            if verbose:\n",
    "                print(\"Number of games won: \" + str(int(games_won)))\n",
    "                print(\"Number of points won: \" + str(int(points_won)))\n",
    "        if episode % 500 == 0:\n",
    "            episode_frames = []\n",
    "            \n",
    "        #Reset the environment for a new episode\n",
    "        env.reset_game()\n",
    "            \n",
    "        observation = env.getGameState()\n",
    "        \n",
    "        player_points = 0\n",
    "        opponent_points = 0\n",
    "        \n",
    "        episode_steps = 0\n",
    "        episode_reward = 0\n",
    "\n",
    "        while True:\n",
    "        \n",
    "            if episode_frames is not None:\n",
    "                episode_frames.append(env.getScreenRGB())\n",
    "\n",
    "            # 1. Decide on an action based on the observations\n",
    "            action_idx = agent.decide(observation)\n",
    "            # convert action index into commands expected by the game environment\n",
    "            action = game_env.getActionSet()[action_idx]\n",
    "\n",
    "            # 2. Take action in the environment\n",
    "            raw_reward = env.act(action)\n",
    "            next_observation = env.getGameState()\n",
    "            \n",
    "            if raw_reward > 0:\n",
    "                points_won += raw_reward\n",
    "                player_points += raw_reward\n",
    "            elif raw_reward < 0:\n",
    "                opponent_points += np.abs(raw_reward)\n",
    "            \n",
    "            episode_steps += 1\n",
    "            \n",
    "            # 3. Reward shaping            \n",
    "            if reward_shaping:\n",
    "                auxiliary_reward = reward_design(observation)\n",
    "                reward = raw_reward + auxiliary_reward\n",
    "            else:\n",
    "                reward = raw_reward\n",
    "    \n",
    "            episode_reward += reward\n",
    "\n",
    "            # 4. Store the information returned from the environment for training\n",
    "            agent.observe(observation, action_idx, reward)\n",
    "\n",
    "            # 5. When we reach a terminal state (\"done\"), use the observed episode to train the network\n",
    "            done = env.game_over() # Check if game is over\n",
    "            if done:\n",
    "                rewards.append(episode_reward)\n",
    "                agent.train()\n",
    "                \n",
    "                # Some diagnostics\n",
    "                if verbose:\n",
    "                    print(\"Game score: \" + str(int(player_points)) + \"-\" + str(int(opponent_points)) + \" over \"\n",
    "                          + str(episode_steps) + \" frames\")\n",
    "                \n",
    "                # Calculating the win streak (number of consecutive games won)\n",
    "                if player_points > opponent_points:\n",
    "                    print(\"Won a game at episode \" + str(episode) + \"!\")\n",
    "                    games_won += 1\n",
    "                    win_streak += 1\n",
    "                else:\n",
    "                    win_streak = 0\n",
    "                    \n",
    "                if episode_frames is not None:\n",
    "                    all_episode_frames.append(episode_frames)                    \n",
    "                \n",
    "                break\n",
    "\n",
    "            # Reset for next step\n",
    "            observation = next_observation\n",
    "        \n",
    "        # Stop if you won enough consecutive games\n",
    "        if win_streak == stopping_criterion:\n",
    "            break\n",
    "            \n",
    "    return all_episode_frames, agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One could also think to introduce a \"shaping\" reward to encourage the agent to return the ball in the start of training. This could be done in many ways. Here is one function that achieves this by giving the agent a tiny reward for every frame that the ball is moving towards it's opponent's side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reward_design(observation):    \n",
    "    ball_vel = observation[5]\n",
    "    \n",
    "    auxiliary_reward = 0\n",
    "    if ball_vel>0:\n",
    "        auxiliary_reward = 1e-3\n",
    "    return auxiliary_reward"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Agent\n",
    "\n",
    "Here we give the outline of a python class that will represent the reinforcement learning agent (along with its decision-making network). \n",
    "\n",
    "**Throughout the course of the miniproject you will be modifying this class to add additional methods and functionality.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RLAgent(object):\n",
    "    \n",
    "    def __init__(self, n_obs, policy_learning_rate, value_learning_rate, \n",
    "                 discount, baseline=None, fileNamePolicy=None, fileNameValue=None):\n",
    "\n",
    "        #We need the state and action dimensions to build the network\n",
    "        self.n_obs = n_obs  \n",
    "        self.n_act = 1\n",
    "        \n",
    "        self.gamma = discount\n",
    "        \n",
    "        self.use_baseline = baseline is not None\n",
    "        self.use_adaptive_baseline = baseline == 'adaptive'\n",
    "\n",
    "        #Fill in the rest of the agent parameters to use in the methods below\n",
    "        \n",
    "        # TODO\n",
    "\n",
    "        #These lists stores the observations for this episode\n",
    "        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []\n",
    "\n",
    "        #Build the keras network\n",
    "        self.fileNamePolicy = fileNamePolicy\n",
    "        self.fileNameValue = fileNameValue\n",
    "        self._build_network()\n",
    "\n",
    "        \n",
    "    def observe(self, state, action, reward):\n",
    "        \"\"\" This function takes the observations the agent received from the environment and stores them\n",
    "            in the lists above. \"\"\"\n",
    "        # TODO\n",
    "            \n",
    "    def _get_returns(self):\n",
    "        \"\"\" This function should process self.episode_rewards and return the discounted episode returns\n",
    "            at each step in the episode, then optionally apply a baseline. Hint: work backwards.\"\"\"\n",
    "        # TODO\n",
    "        \n",
    "        return returns\n",
    "    \n",
    "    def _build_network(self):\n",
    "        \"\"\" This function should build the network that can then be called by decide and train. \n",
    "            The network takes observations as inputs and has a policy distribution as output.\"\"\"\n",
    "        # TODO\n",
    "        \n",
    "    def decide(self, state):\n",
    "        \"\"\" This function feeds the observed state to the network, which returns a distribution\n",
    "            over possible actions. Sample an action from the distribution and return it.\"\"\"\n",
    "        # TODO\n",
    "\n",
    "    def train(self):\n",
    "        \"\"\" When this function is called, the accumulated observations, actions and discounted rewards from the\n",
    "            current episode should be fed into the network and used for training. Use the _get_returns function \n",
    "            to first turn the episode rewards into discounted returns. \"\"\"\n",
    "        # TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: REINFORCE with simple baseline\n",
    "\n",
    "Implement the REINFORCE Policy Gradient algorithm using a deep neural network as a function approximator.\n",
    "\n",
    "1. Implement the `observe` method of the RLAgent above.\n",
    "2. Implement the `_build_network` method. Your network should take the 7-dimensional state space as input and output a bernoulli distribution over the 2 discrete actions. It should have 3 hidden layers with about 32 units each with ReLU activations. Use the REINFORCE loss function. HINT: Keras has a built-in `binary_crossentropy` loss, and a `sample_weight` argument in fit/train_on_batch. Consider how these could be used together to implement the REINFORCE loss function.\n",
    "3. Implement the `decide`, `train` and `_get_returns` methods using the inputs and outputs of your network. In `_get_returns`, implement a baseline based on a moving average of the returns; it should only be in effect when the agent is constructed with the `baseline` keyword. In `train`, use `train_on_batch` to form one minibatch from all the experiences in an episode. \n",
    "4. Try a few learning rates and pick the best one (the default for Adam is a good place to start). Run the functions below and include the resulting plots, with and without the baseline, for your chosen learning rate.\n",
    "5. Answer the question in the markdown cell below in max. 1-2 sentence(s) each.\n",
    "\n",
    "WARNING: Running any experiments with the same names (first argument in run_experiment) will cause your results to be overwritten. \n",
    "\n",
    "**Mark breakdown: 8 points total**\n",
    "- 4 points for implementing and plotting basic REINFORCE with reasonable performance (i.e. a positive score).\n",
    "- 2 points for implementing and plotting the simple baseline with reasonable performance.\n",
    "- 2 points for answering the questions below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ADDITIONAL NOTES: \n",
    "1. Positive rewards are very sparse in this task. If you're finding it hard to tell if your agent is learning at all, you have the option of giving the agent the additional \"shaping\" reward mentioned above by setting the `reward_shaping` argument to True when calling `run_experiment`. After your implementation is working, you can take off these \"training wheels\", and train only with the game rewards.\n",
    "2. Consider normalizing the calculated returns (eg., dividing by the standard deviation) before using them for training your network. This is a trick which is known to usually boost empirical performance by keeping the gradients in a smaller range.\n",
    "3. You can set the argument `verbose` to `True` (when calling `run_experiment`) to get some useful diagnostics of your agent's current performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = Results()\n",
    "policy_learning_rate = 1e-3\n",
    "_, basic_reinforce_policy = run_experiment(\"REINFORCE (with reward shaping)\", game_env, num_episodes, reward_shaping=True,\n",
    "                                           policy_learning_rate=policy_learning_rate, stopping_criterion=30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = Results()\n",
    "policy_learning_rate = 1e-3\n",
    "_, basic_reinforce_policy = run_experiment(\"REINFORCE\", game_env, num_episodes, reward_shaping=False,\n",
    "                                           policy_learning_rate=policy_learning_rate, stopping_criterion=30)\n",
    "_, baseline_reinforce_policy = run_experiment(\"REINFORCE (with baseline)\", game_env, num_episodes, reward_shaping=False,\n",
    "                                              policy_learning_rate=policy_learning_rate, stopping_criterion=30,\n",
    "                                              baseline='simple')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can save intermediate results to avoid rerunning experiments\n",
    "results.save('results_exercise_1')\n",
    "\n",
    "# You can also save your learned networks (you will have to adapt this to your naming of the network modules)\n",
    "basic_reinforce_policy.policy_net.save('exercise_1_basic_reinforce_policy_net.h5')\n",
    "baseline_reinforce_policy.policy_net.save('exercise_1_baseline_reinforce_policy_net.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "episode = run_fixed_episode_learned(game_env, baseline_reinforce_policy)\n",
    "render(episode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1**: Why is it better to sample an action from the bernoulli distribution rather than just pick the action with highest probability? \n",
    "\n",
    "**Answer**: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2**: In the train method above we throw away the data from an episode after we use it to train the network (make sure that you do that). Why is it not a good idea to keep the old episodes in our data and train the policy network on both old and new data? (Note: Reusing data can still be possible but requires modifications to the REINFORCE algorithm we are using).\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 3**: In the reward_design function above, we give the agent a positive reward of 0.001 for every frame that the ball is moving in the favorable direction. One may think that such a manipulation will change the optimal policy in a way that the agent tries to maximize the length of each episode instead of necessarily winning the most points. Why is this not case? Can you give a general criteria for the maximum amount of this positive reward such that the optimal policy does not change?\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 4**: Suppose a third action was available (eg. the `DO NOTHING` action that is actually available in this environment but we have excluded for this miniproject). What modifications would you need to make to your implementation to handle this possibility?\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: Adaptive baseline\n",
    "\n",
    "Add a second neural network to your model that learns an observation-dependent adaptive baseline and subtracts it from your discounted returns.\n",
    "\n",
    "1. Modify the `_build_network` function of RLAgent to create a second \"value network\" when `adaptive` is passed for the baseline argument. The value network should have the same or similar structure as the policy network, without the sigmoid at the output.\n",
    "3. In addition to training your policy network, train the value network on the Mean-Squared Error compared to the adjusted returns.\n",
    "4. Train your policy network on R - b(s), i.e. the returns minus the adaptive baseline (the output of the value network). Your implementation should allow for a different learning rate for the value and policy network.\n",
    "5. Try a few learning rates and plot all your best results together (without baseline, with simple baseline, with adaptive baseline). You may or may not be able to improve on the simple baseline! Return the trained model to use it in the next exercise.\n",
    "\n",
    "TECHNICAL NOTE: Some textbooks may refer to this approach as \"Actor-Critic\", where the policy network is the \"Actor\" and the value network is the \"Critic\". Sutton and Barto (2018) suggest that Actor-Critic only applies when the discounted returns are bootstrapped from the value network output, as you saw in class. This can introduce instability in learning that needs to be addressed with more advanced techniques, so we won't use it for this miniproject. You can read more about state-of-the-art Actor-Critic approaches here: https://arxiv.org/pdf/1602.01783.pdf\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 3 points for implementing and plotting the adaptive baseline with the other two conditions, with reasonable performance (i.e. at least similar to the performance in Exercise 1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "policy_learning_rate = 1e-3\n",
    "value_learning_rate = 1e-3\n",
    "\n",
    "_, adaptive_policy = run_experiment(\"REINFORCE (adaptive baseline)\", game_env, num_episodes, reward_shaping=False,\n",
    "                                    policy_learning_rate=policy_learning_rate, \n",
    "                                    value_learning_rate=value_learning_rate, stopping_criterion=30,\n",
    "                                    baseline='adaptive', verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results.save('results_exercise_2')\n",
    "adaptive_policy.policy_net.save('exercise_2_policy_net.h5')\n",
    "adaptive_policy.value_net.save('exercise_2_value_net.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "episode = run_fixed_episode_learned(game_env, adaptive_policy)\n",
    "render(episode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: Value Function Visualization\n",
    "\n",
    "Ideally, our value network should have learned to predict the relative values across the input space. We can test this by plotting the value prediction for different observations.\n",
    "\n",
    "1. Write a function to plot the value network prediction across [x,y] space (all possible ball positions) for given (constant) values of the other state variables. All the position state variables always lie in [0,1], with the agent on the left side of the screen (i.e, at X=0 with Y=0.5 being its middle position). (`plt.imshow`, `plt.title` and `plt.colorbar` can be useful)\n",
    "2. Plot (with titles specifying the state variable combinations) the values for 5-6 combinations of the other 5 state variables. The ball and player velocities are generally within [-1,1] but could lie outside this range. Use the same color bar limits across the graphs so that they can be compared easily. \n",
    "3. Answer the question in the markdown cell below in max. 2-3 sentence(s) each.\n",
    "\n",
    "**Mark breakdown: 3 points total**\n",
    "- 2 points for the plots of the value function.\n",
    "- 1 point for answering the question below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1**: Does your value map make sense for cases where the paddle is moving towards where the ball is going to be? How about when it is moving away from the ball's expected position? Why or why not?\n",
    "\n",
    "**Answer**:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2**: It is likely that your value network learns a reasonable value map for some regimes of the state-space while for other combinations of state values, the value map makes little intuitive sense. What could be the reason why some regions are better learned by your network?\n",
    "\n",
    "**Answer**:"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "miniproject3_2019.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}