{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Miniproject 2: Chatbot\n", "\n", "## Introduction\n", "\n", "### Description\n", "\n", "Developing a model employing ANN on real-world data requires going through several major steps, each of which with \n", "important design choices that directly impact the final results. \n", "In this project, we guide you through these choices starting from a large database of \n", "[conversations](http://parl.ai/downloads/personachat/personachat.tgz) to a functional chatbot. \n", "\n", "### Prerequisites\n", "\n", "- You should have a running installation of [tensorflow](https://www.tensorflow.org/install/) and [keras](https://keras.io/).\n", "- You should know the concepts \"recurrent neural networks\", \"LSTM\", \"training and validation data\", \"overfitting\" and \"early stopping\".\n", "\n", "### What you will learn\n", "\n", "- You will be guided through a data processing procedure and understand the importance of design choices in ANN modeling\n", "- You will learn how to define recurrent neural networks in keras and fit them to data.\n", "- You will be guided through a prototyping procedure for the application of deep learning to a specific domain.\n", "- You will get in contact with concepts discussed in the lecture, like \"overfitting\", \"LSTM network\", and \"Generative model\".\n", "- You will learn to be more patient :) Some fits may take your computer quite a bit of time; run them over night and make sure you save (and load) your data and models.\n", "\n", "### Evaluation criteria\n", "\n", "The evaluation is (mostly) based on the figures you submit and your answer sentences. \n", "We will only do random tests of your code and not re-run the full notebook. \n", "Please ensure that your notebook is fully executed before handing it in. \n", "\n", "### Submission \n", "\n", "You should submit your notebook through the Moodle page submission tool. You should work in teams of two people and each member should submit the same notebook to Moodle.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions and imports\n", "\n", "For your convenience we import some libraries and provide some functions below. Fill in your names, sciper numbers and run the following cell." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names = {'student_1': \"John Doe\",\n", " 'student_2': \"Ann Onymous\"}\n", "\n", "sciper = {'student_1': 999999, \n", " 'student_2': 888888}\n", "\n", "seed = sciper['student_1']+sciper['student_2']\n", "\n", "%matplotlib inline\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import os, sys\n", "import copy\n", "\n", "plt.rcParams['font.size'] = 28\n", "plt.style.use('ggplot')\n", "plt.rcParams[\"axes.grid\"] = False\n", "c = plt.rcParams['axes.prop_cycle'].by_key()['color']\n", "\n", "import keras\n", "from keras.models import Model, load_model\n", "from keras.layers import Input, Masking, TimeDistributed, Dense, Concatenate, Dropout, LSTM, GRU, SimpleRNN, Bidirectional, Embedding, BatchNormalization\n", "from keras.optimizers import Adam\n", "from keras.utils import np_utils\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.callbacks import ModelCheckpoint, EarlyStopping\n", "\n", "\n", "def getRawDataFromFile(datapath=\"data/personachat/\", file=\"train_both_revised.txt\"):\n", " \n", " f = open(datapath+file)\n", "\n", " conversations = []\n", " current_conversation = []\n", " \n", " for l, line in enumerate(f):\n", " #print(l, line)\n", " if \"persona:\" in line:\n", " if len(current_conversation) > 1:\n", " conversations.append(current_conversation)\n", " current_conversation = [] \n", " continue\n", "\n", " #remove numberings\n", " processed_line = line.split(' ')\n", " processed_line = \" \".join(processed_line[1:])\n", " line = processed_line\n", " #print(line)\n", "\n", " conv = line.split('\\t') \n", " q = conv[0]\n", " a = conv[1]\n", " current_conversation.append(q)\n", " current_conversation.append(a)\n", " \n", " return conversations " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Data visualization and preprocessing\n", "\n", "Here we will process and visualize the data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse raw data \n", "\n", "Download the dataset on http://parl.ai/downloads/personachat/personachat.tgz. Unpack it and add it to your project folder. Read and run the getRawDataFromFile function (if needed, modify the default path). It extracts the conversations.\n", "\n", "**Output** Display two randomly selected conversations. [1 pt]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract word tokens\n", "\n", "Let's start looking at our data. \n", "\n", "**Code** Compute the set of unique words (dictionary) in all sentences along with the number of occurences of each of these words. HINT: each word is separated by a space character, use the python string.split(' ') function to separate words. Consider punctuations as 'words'. [1 pt]\n", "\n", "**Figure** In a bar plot, show the first 75 most frequent words (x-axis) and their number of occurences (y-axis). [1 pt]\n", "\n", "**Figure** In another bar plot, show the 75 least frequent words (x-axis) and their number of occurences (y-axis). [1 pt] \n", "\n", "**Figure** In a log-log scale, plot the sorted word index (x-axis) vs their respective count (y-axis). [1 pt]\n", "\n", "**Question** Relate the sorted word count distribution with Zipf's law.\n", "Argue using the log-log plot. [1 pt]\n", "\n", "**Answer**\n", "\n", "**Question** How many words appear only once in the entire dataset? 
[1 pt]\n", "\n", "**Answer**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering\n", "\n", "We suggest to filter your data by removing sentences containing rare words. \n", "\n", "\n", "**Code** To achieve that, you should create a new dataset where sentences containing rare words (words that occur less than N times in the dataset) are removed. Keep at least 50'000 sentences (depending on your computing power, you can keep more). \n", "HINT: Start by updating the dictionary accordingly and then remove any sentence that contains at least a single word that is not in the dictionary of words. [2 pts]\n", "\n", "**Question**: How much did you reduce the number of unique words with your rare event suppression procedure? [1 pt]\n", " \n", "**Answer**: \n", "\n", "**Question**: How many sentences are in your filtered and original dataset? [1 pt]\n", "\n", "**Answer**:\n", "\n", "**Question**: What is the impact on learning and generalization of removing sentences with rare words from your dataset? [2 pt]\n", "\n", "**Answer**: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization and padding\n", "\n", "Now you will transform our filtered data into a format that is understandable by an ANN. To achieve that, you should transform words to integers, where single integers in the range [1,size of the dictionary] are mapped to single words in your dictionary. This process is commonly named 'tokenization'. In addition, we will keep the value 0 to a specific artificial word 'PADD' that will be used to account for the variable length of sentences and add to each sentence a 'START' and an 'END' word. \n", "\n", "**Code** Start by adding the three artificial words to your dictionary (list of possible tokens) and then translate every sentences to a list of integers. \n", "HINT: use the Python List index() method. [2 pts]\n", "\n", "**Figure** Use the violinplot to show the density of tokenized sentences length. [1pt]\n", "\n", "**Code** From this figure, select a maximum number (=maxlen) of tokens for which most of the sentences have less. Padd (and eventually truncate) all sentences with the 'PADD' token (value 0 in the integer representation) until all tokenized sentences have the same length (maxlen).\n", "HINT: use the pad_sequences function from keras.preprocessing.sequence [2 pts]\n", "\n", "**Code** Check that you can recover the original sentence. Randomly select two sentences from your integer and padded representation and translate them back using your dictionary. [1 pt]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving\n", "\n", "Now is a good time to save your data (end of processing). 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Saving\n", "\n", "Now is a good time to save your data (end of processing). Example code using the pickle library is shown below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pickle\n",
 "\n",
 "# save\n",
 "with open(\"data.pkl\", \"wb\") as file:\n",
 "    pickle.dump([filtered_sentences, dictionary, tokens], file)\n",
 "\n",
 "# load\n",
 "with open(\"data.pkl\", \"rb\") as file:\n",
 "    [filtered_sentences, dictionary, tokens] = pickle.load(file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building and training generative models of language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RNN vs LSTM vs GRU \n", "\n", "Build, train and compare generative models of language based on RNNs with different recurrent units (SimpleRNN, GRU and LSTM). \n", "\n", "The target of the network will be to approximate the word transition probabilities Pr(word[n+1]|H[n]), with H[n]=f(word[:n]) being the hidden state of the network. \n", "\n", "**Code** Complete the proposed model (using the Keras functional API rather than the Sequential model, for more flexibility). Be sure to understand each line. The embedding layer transforms an integer into a dense vector. This will be the input to the recurrent network: each sentence is mapped to a sequence of vectors, each representing a single word. You can then design your own readout(s) and output layers. By default, use the proposed meta-parameters. You can adapt them if you have more or less computing power (32 epochs should take around 30 minutes). [2 pts]\n", "\n", "**Question** How will your networks deal with the artificial word 'PADD' that you added at the end of each sentence? [2 pts]\n", "\n", "**Answer**\n", "\n", "**Code** Then train three different networks with the same architecture but different recurrent units (SimpleRNN, GRU and LSTM). Save the learning history (training/validation loss and accuracy for each epoch) as well as the models. [1 pt]\n", "\n", "**Question** How can you use this network to approximate the word transition probabilities? What will be the inputs and targets of the network at each batch? Give the input/output tensor dimensions. [2 pts]\n", "\n", "**Answer**\n", "\n", "**Figure** Show the learning curves (training and validation loss) for the different recurrent units. [1 pt]\n", "\n", "**Figure** Show the learning curves (training and validation accuracy) for the different recurrent units. [1 pt]\n", "\n", "**Question:** Which recurrent unit yields the best validation accuracy? Which is the fastest learner? [1 pt]\n", "\n", "**Answer**: \n", "\n", "**Question:** Do you observe an overfitting effect? Where and in which case? Give a possible explanation. [1 pt] \n", "\n", "**Answer**: \n", "\n", "**Question:** Suggest one option for modifying your dataset to decrease overfitting. [1 pt]\n", "\n", "**Answer**: \n", "\n", "**Question:** Suggest one option for modifying your network to decrease overfitting. [1 pt]\n", "\n", "**Answer**: \n", "\n", "**Question:** Suggest one option for modifying the training modalities to counter overfitting. [1 pt] \n", "\n", "**Answer**: " ] },
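{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to set up the next-word prediction targets, as a sketch: the inputs are the padded sequences without their last token, and the targets are one-hot encodings of the same sequences shifted by one step. The names `X` and `T` match those used in the suggested model cell below; the construction itself is an assumption, not the only valid choice.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: X is assumed to be the (num_sentences, maxlen) array of\n",
 "# integer-coded, padded sentences built above\n",
 "# the input at step n is word[n]; the target is word[n+1], one-hot encoded\n",
 "T = np_utils.to_categorical(X, num_classes=len(tokens))\n",
 "\n",
 "# the model will be fed X[:, :-1] and trained against T[:, 1:], i.e.\n",
 "# (batch, maxlen-1) integers -> (batch, maxlen-1, len(tokens)) probabilities\n",
 "print(X[:, :-1].shape, T[:, 1:].shape)" ] },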
[1 pt] \n", "\n", "**Answer**: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Meta-parameters\n", "embedding_size = 128\n", "hidden_size = 64\n", "dropout = 0.\n", "recurrent_dropout = 0.\n", "\n", "batch_size = 64\n", "epochs = 32\n", "validation_split = 0.2\n", "\n", "dataset_cut = -1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "I = {}\n", "E = {}\n", "H = {}\n", "R = {}\n", "Y = {}\n", "models = {}\n", "logs = {}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "#Model suggestion\n", "\n", "I['RNN'] = Input(shape=(maxlen-1,), name=\"input\")\n", "E['RNN'] = Embedding(len(tokens), embedding_size, mask_zero=True, name=\"embedding\")\n", "\n", "#your network here\n", "H['RNN'] = #... Recurrent layer(s)\n", "\n", "R['RNN'] = #... Readout\n", "Y['RNN'] = #... Output\n", "\n", "models['RNN'] = Model(inputs = [I['RNN']], outputs = [Y['RNN']])\n", "models['RNN'].compile(\n", " loss='categorical_crossentropy', \n", " optimizer=Adam(),\n", " metrics=['acc'])\n", "models['RNN'].summary()\n", "\n", "print(X[:,:-1].shape, T[:,1:].shape)\n", "logs['RNN'] = models['RNN'].fit({'input': X[:dataset_cut,:-1]}, {'output': T[:dataset_cut,1:]}, \n", " epochs=epochs, \n", " validation_split=validation_split, \n", " batch_size=batch_size).history\n", "\n", "#save\n", "with open(\"RNNmodel_\"+str(embedding_size)+'_'+str(hidden_size)+\"_log.pkl\", \"wb\") as file:\n", " pickle.dump(logs['RNN'], file)\n", "models['RNN'].save(\"RNNmodel_\"+str(embedding_size)+'_'+str(hidden_size))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#load\n", "with open(\"RNNmodel_\"+str(embedding_size)+'_'+str(hidden_size)+\"_log.pkl\", \"rb\") as file:\n", " RNNmodel_log = pickle.load(file)\n", "RNNmodel = load_model(\"RNNmodel_\"+str(embedding_size)+'_'+str(hidden_size))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimal network size\n", "\n", "Compare the learning curves for three networks with 64 (previous exercise), 128 and 256 GRUs (single layer) and one with two hidden layers of 64 GRUs. \n", "\n", "**Code** Build and train the networks. Apply EarlyStopping (monitor='val_acc', min_delta=0.001, patience=2). Use transfer learning, do not train from scratch your embedding layer, rather re-use the embedding layer from your best performing network in the last exercise. 
[4 pts]\n", "\n", "**Figure** Show the learning curves (training and validation loss) for the four models. [1 pt]\n", "\n", "**Figure** Show the learning curves (training and validation accuracy) for the four models. [1 pt]\n", "\n", "**Question** List and briefly explain the differences in the learning curves for the different models? [2 pts]\n", "\n", "**Answer**\n", "\n", "**Question** What effect had EarlyStopping? Give one advantage and one drawback. [2 pts]\n", "\n", "**Answer**\n", "\n", "**Question** What is your best model? Why? [1 pt]\n", "\n", "**Answer**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate sentences\n", "\n", "Now you will generate new sentences from your best performing model.\n", "\n", "**Code** To achieve that, use the provided sample function below to generate new sentences from your model. You should start by constructing a sentence that starts with the 'START' artificial word and all other words being the 'PADD' artificial word. Then sample the first word from the corresponding probabilities given by your model. Add this word to the sentence and continue like this until you sample the 'END' artificial word or the maximum sentence length. [2 pts]\n", "\n", "**Code** Generate 10 sentences for different sampling temperature in [0., 0.25, 0.5, 0.75, 1., 1.5., 2.]. [1 pt]\n", "\n", "**7 Figures** For each temperature, use matplotlib imshow to plot the probablities of every word in one generated sentence (and only these words) at each time step. y-axis should be the words that are present in the sentence. x-axis the timesteps and the imshow value the probabilities given by the model for all words in the sentence at each timestep. Use the a colormap where 0 is white, e.g. cmap='Greys'. [2 pts]\n", "\n", "**Code** Finally, seed your model with two different beginnings of max 4 words and let it generate 10 possible continuations (use sampling temperature of 1.). [2 pts]\n", "\n", "**Question** What is the effect of sampling temperature on the generated sentences? [1 pt]\n", "\n", "**Answer**\n", "\n", "**Question** In terms of sampling a probability distribution, what does a sampling temperature of 0 corresponds to? [1 pt] \n", "\n", "**Answer**\n", "\n", "**Question** In terms of sampling a probability distribution, what does a sampling temperature of 1. corresponds to? [1 pt] \n", "\n", "**Answer**\n", "\n", "**Question** In terms of sampling a probability distribution, what does a very high sampling temperature corresponds to? [1 pt]\n", "\n", "**Answer**\n", "\n", "**Question** Based on the plotted word probabilities, explain how a sentence is generated. [2 pts]\n", "\n", "**Answer**\n", "\n", "**Question** Do you observe timesteps with more than one word with non-zero probability? How do these probable words relate in terms of language? 
[1 pt]\n", "\n", "**Answer**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sample(preds, temperature=1.):\n", " # helper function to sample an index from a probability array\n", " if temperature == 0.:\n", " return np.argmax(preds)\n", " preds = np.asarray(preds).astype('float64')\n", " preds = np.log(preds) / temperature\n", " exp_preds = np.exp(preds)\n", " preds = exp_preds / np.sum(exp_preds)\n", " probas = np.random.multinomial(1, preds, 1)\n", " return np.argmax(probas)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word embedding visualization\n", "\n", "Here, you are asked to visualize the embedding layer. \n", "\n", "**Code** To do that, project in 2D the embedding vectors for different words. Use t-SNE, a projection that conserve the neighborhood relationships between vectors. HINT: Build a Keras model that takes as input a list of words and outputs a list of vector embeddings as learned by your best performing model. Use t-SNE dimensionality reduction (from sklearn.manifold import TSNE). [2 pts]\n", "\n", "**Figure** Plot the projection of the first 200 most frequent words in a 2D plot. On the plot, write the words. [2 pt] \n", "\n", "**Question** Do you observe clusters of words with similar meaning or role in language? Report three of them here. [1 pt]\n", "\n", "**Answer**\n", "\n", "**Question** Why is having similar vector representation for similar words a good approach for such models? Explain using the example clusters from before and argue in terms of prediction accuracy and/or generalization. [2 pts]\n", "\n", "**Answer**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chatbot\n", "\n", "Finally, you will construct a model with which you can chat. The network will take as input a sentence and output a response.\n", "\n", "**Code** For that, you should go back to your original data and construct a new dataset containing pairs of sentences, where each pair is a sentence and its answer. Be careful to not include any pair of sentences that contains words not present in your filtered dictionary. [2 pts]\n", "\n", "**Code** You should then tokenize, padd, truncate each sentence. Only the answers need the 'START' and 'END' artificial words. [1 pt]\n", "\n", "We provide you with a possible model, you are welcome to change it. This model uses an LSTM layer to encode the first sentence (the context). The final state of this LSTM layer is transfered to initialize the state of a decoder LSTM layer from which the answer sentence will be generated. \n", "\n", "**Code** Train your chatbot model on your dataset. 
[1 pt]\n", "\n", "**Code** Adapt your sentence generation code from before so that you can generate an answer given a context sentence from your model. [2 pts] \n", "\n", "**Code** After training, randomly select 10 context-answers pairs from your data and show both the real answer (the one from the data) and the generated one for two different sampling temperatures (e.g. 0.5 and 1.0). [2 pts]\n", "\n", "**Question** How similar are the generated answers and the real ones? Does your model provide probable answers (given the dataset)? Report here one good and one bad example. [2 pts]\n", "\n", "**Answer**\n", "\n", "**Question** Which sampling temperature gives better answers? why? [2 pts]\n", "\n", "**Answer**\n", "\n", "**Question** Would it be good if your model was able to reproduce exactly each real answer? Why? [1 pt]\n", "\n", "**Answer**\n", "\n", "**Code** Entertain yourself with your model. Write some code to chat with your bot, let it discuss with itself, ... be creative! [2 **bonus** pts]\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "context = Input(shape=(maxlen-2,), name=\"input_context\")\n", "shared_embedding = E['GRU']\n", "context_embedding = shared_embedding(context)\n", "\n", "encoder_y, encoder_h, encoder_c = LSTM(hidden_size, \n", " return_sequences=False,\n", " return_state=True,\n", " stateful=False,\n", " dropout=dropout,\n", " recurrent_dropout=recurrent_dropout,\n", " go_backwards=True,\n", " name=\"encoder\")(context_embedding)\n", "\n", "answer = Input(shape=(maxlen-1,), name=\"input_answer\")\n", "answer_embedding = shared_embedding(answer)\n", "\n", "decoder_input = answer_embedding\n", "decoder = LSTM(hidden_size, \n", " return_sequences=True,\n", " stateful=False,\n", " dropout=dropout,\n", " recurrent_dropout=recurrent_dropout,\n", " name=\"decoder\")(answer_embedding, initial_state=[encoder_h, encoder_c])\n", "# decoder2 = LSTM(hidden_size, \n", "# return_sequences=True,\n", "# stateful=False,\n", "# dropout=dropout,\n", "# recurrent_dropout=recurrent_dropout,\n", "# name=\"decoder2\")(decoder)\n", "\n", "R = TimeDistributed(Dense(embedding_size, activation='relu'), name='readout')(decoder)\n", "Y = TimeDistributed(Dense(len(tokens), activation='softmax'), name='output')(R)\n", "\n", "Chatbot = Model(inputs = [context, answer], outputs = [Y])\n", "Chatbot.compile(\n", " loss='categorical_crossentropy', \n", " optimizer=Adam(),\n", " metrics=['acc'])\n", "Chatbot.summary()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "cs456env", "language": "python", "name": "cs456env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }