How to deal with the text dataset in RNN project

by Luca Viano -
Number of replies: 3

Dear TAs,

We noticed that the text provided as the dataset for Project 2 contains only a couple of passages that make sense on their own, i.e. there is a question, a coherent answer, and then an abrupt transition to another topic.

With this in mind, for the RNN vs. LSTM vs. GRU part, we were wondering how to build the training set X and the labels T (in the notation used in the notebook) for the recurrent networks: should X contain all the sequences obtained from a sliding window over the whole text (ignoring the changes of topic), with the following word as the label, or should we apply the same procedure only within each consecutive question-answer pair?

Best,

Luca


In reply to Luca Viano

Re: How to deal with the text dataset in RNN project

by Florian François Colombo -

Hi,

In the first part you are asked to build a generative model of sentences. Only in the last (ChatBot) part are you asked to reprocess your data so as to obtain question-answer pairs.

Consequently, the provided processing code for the first part only extracts all sentences from the dataset (without caring about their order), and the model you build should be trained to model the joint probability of the ordered sequence of words in each of these sentences (independently).

Later, as explained in the instructions, you will need to pair sentences together as they appear in the conversation dataset.

As a simplified example, imagine a conversation with 5 sentences S1, S2, S3, S4, S5. In the first part your data would be [S1, S2, S3, S4, S5], while in the second part your data would be the pairs of sentences [S1-S2, S2-S3, S3-S4, S4-S5].
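In case it helps, the two layouts above can be sketched in Python (the sentence names S1..S5 are just the placeholders from the example, not the actual data):

```python
# Placeholder sentences from the simplified example above.
sentences = ["S1", "S2", "S3", "S4", "S5"]

# First part: each sentence is an independent training example.
part1_data = sentences

# ChatBot part: consecutive sentence pairs as they appear in the conversation.
part2_data = [(a, b) for a, b in zip(sentences, sentences[1:])]
print(part2_data)  # [('S1', 'S2'), ('S2', 'S3'), ('S3', 'S4'), ('S4', 'S5')]
```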

Hoping it helps,

Florian

In reply to Florian François Colombo

Re: How to deal with the text dataset in RNN project

by Joachim Jacques Koerfer -

Hi,

I understood the ChatBot part and it makes sense. For the sentence-generation part, if I understood correctly, on each sentence we should move a sliding window (which gives X) and take the word right after the window as the target (the next word in T). For example:

"this is a meaningful sentence , with lots and lots of words ." with a sliding window of size 5 would result in:


X (sliding window of 5)           Next word T
this is a meaningful sentence     ,
is a meaningful sentence ,        with
a meaningful sentence , with      lots
meaningful sentence , with lots   and
sentence , with lots and          lots
, with lots and lots              of
with lots and lots of             words
lots and lots of words            .

It would make sense to have this kind of structure in order to learn the transition probability of the next word W[n+1] given the hidden state H[n]. But in the provided code, the input should have a size of maxlen - 1, which is the whole sentence minus the END part of the tokenized sentence. For me it doesn't make much sense to try to predict only the end of the sentence when we are looking for the transition probabilities. Is there something wrong in my reasoning?

Best,

Joachim Koerfer

In reply to Joachim Jacques Koerfer

Re: How to deal with the text dataset in RNN project

by Florian François Colombo -
Your question might find its answer here: https://moodlearchive.epfl.ch/2018-2019/mod/forum/discuss.php?d=17853

The suggested implementation does not involve a sliding window but rather a prediction (of the next word token) at each timestep.

X: word[0], ..., word[n], ..., word[maxlen-1]
T: word[1], ..., word[n+1], ..., word[maxlen]
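Concretely, that shift-by-one construction might look like this (the token list below is a made-up illustration, not the actual preprocessing code):

```python
# Hypothetical tokenized sentence, including START/END markers.
tokens = ["START", "this", "is", "a", "sentence", "END"]

X = tokens[:-1]  # word[0] ... word[maxlen-1]
T = tokens[1:]   # word[1] ... word[maxlen]

# At timestep n the network reads X[n] and is trained to output T[n] = X[n+1],
# i.e. it predicts the next word at every timestep, not just at the end.
```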

Best,
Florian