Constants And Required Functions

Data Preprocessing

Required Directories Creation

Dataset Download and Extraction

Listing Directories

Data Preprocessing or Transformation

Exploratory Data Analysis

Distribution of Question and Answer Length

Distribution of Number of Words in Question and Answer

Carry out some additional cleaning and formatting for applying seq2seq

preprocessor

Carry out pre-processing using the preprocessor library:
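If the preprocessor library referenced here is the tweet-preprocessor package, a cleaning pass could look roughly like the sketch below; the clean_text helper and the chosen options are illustrative assumptions, not the notebook's exact code.

```python
# Sketch: clean question/answer text with tweet-preprocessor (pip install tweet-preprocessor).
import preprocessor as p

# Strip URLs, mentions, hashtags, emojis, and smileys from the raw text.
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.HASHTAG, p.OPT.EMOJI, p.OPT.SMILEY)

def clean_text(text: str) -> str:  # hypothetical helper name
    """Remove social-media noise from a single question or answer string."""
    return p.clean(text)

print(clean_text("Check this out https://example.com #awesome @user :)"))
```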

Detecting whether the text data is proper English or not

spacy-langdetect
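A minimal sketch of how spacy-langdetect can flag non-English rows, assuming a spaCy 2.x-style pipeline (the add_pipe call takes a string name in spaCy 3); the is_english helper and the 0.9 threshold are illustrative assumptions.

```python
# Sketch: filter out rows whose text is not detected as English.
import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

def is_english(text: str, threshold: float = 0.9) -> bool:
    """Return True when the detector is reasonably confident the text is English."""
    doc = nlp(text)
    lang = doc._.language  # e.g. {'language': 'en', 'score': 0.99}
    return lang["language"] == "en" and lang["score"] >= threshold
```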

Save the Processed Data for Future Use

Dataset and DataLoader Creation

The Seq2Seq algorithm trains a denoising auto-encoder over sequences. It takes as input a sequence like $\boldsymbol{Q} = \boldsymbol{q}_1, \boldsymbol{q}_2, \ldots, \boldsymbol{q}_T$, and generates a new sequence $\boldsymbol{A} = \boldsymbol{a}_1, \boldsymbol{a}_2, \ldots, \boldsymbol{a}_{T'}$ as output. These sequences need not be the same (i.e., $\boldsymbol{q}_j \neq \boldsymbol{a}_j$) and may have different lengths, so $T \neq T'$. This training approach is called a denoising auto-encoder because the question sequence $Q$ gets mapped to some related answer sequence $A$, as if $Q$ were a "noisy" version of $A$. There is no one-to-one relationship between question and answer. This approach works on sequences that have a temporal component: one word follows another to give the sentence its meaning. As a result, such models usually involve RNNs, which are good at learning sequential data. The EncoderRNN takes in $Q$ and produces a final hidden state activation $\boldsymbol{h}_T$, which the AttentionDecoderRNN takes as input to produce the new sequence $A$.

In this project, I have created question and answer pairs and a vocabulary for the seq2seq model from the dataset chosen.
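As a rough sketch of what building such a vocabulary can look like (the special-token names, the min_freq cut-off, and the pairs variable are assumptions for illustration, not the notebook's exact code):

```python
# Sketch: build a word-level vocabulary over cleaned (question, answer) pairs.
from collections import Counter

PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"  # assumed special tokens

def build_vocab(pairs, min_freq=2):
    """pairs: iterable of (question, answer) strings, already cleaned and lower-cased."""
    counts = Counter()
    for q, a in pairs:
        counts.update(q.split())
        counts.update(a.split())
    itos = [PAD, SOS, EOS, UNK] + [w for w, c in counts.items() if c >= min_freq]
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos

def encode(sentence, stoi):
    """Map a sentence to token ids, appending <eos> so the decoder knows where to stop."""
    return [stoi.get(w, stoi[UNK]) for w in sentence.split()] + [stoi[EOS]]
```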

Features of the vocabulary:

collate takes inputs that are of different lengths and creates one larger batch out of them. It returns a set of nested tuples, $((\boldsymbol{Q}, \boldsymbol{A}), \boldsymbol{A})$. This is required because seq2seq_model needs both $\boldsymbol{Q}$ and $\boldsymbol{A}$ during training, while the train_seq2seq function expects $(\text{input}, \text{output})$ tuples:

$$(\underbrace{(\boldsymbol{Q}, \boldsymbol{A})}_{\text{input}}, \underbrace{\boldsymbol{A}}_{\text{output}})$$
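A minimal sketch of a collate function that produces this nested structure, assuming each dataset item is already a pair of token-id lists and that pad_idx comes from the vocabulary:

```python
# Sketch: pad variable-length (question, answer) pairs into one batch of ((Q, A), A).
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def make_collate(pad_idx):
    def collate(batch):
        questions = [torch.tensor(q, dtype=torch.long) for q, _ in batch]
        answers   = [torch.tensor(a, dtype=torch.long) for _, a in batch]
        Q = pad_sequence(questions, batch_first=True, padding_value=pad_idx)
        A = pad_sequence(answers,   batch_first=True, padding_value=pad_idx)
        # Input is (Q, A) because the model also sees the answer during training;
        # the target is A on its own.
        return (Q, A), A
    return collate

# loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=make_collate(pad_idx))
```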

Question & Answer ChatBot (QABot) Model Creation

Attention

$$\boldsymbol{\alpha} = \text{sm}(\text{score}(F(\boldsymbol{x}_1)), \text{score}(F(\boldsymbol{x}_2)), \ldots, \text{score}(F(\boldsymbol{x}_T)))$$

$$\boldsymbol{z} = \sum_{i=1}^T \alpha_i \cdot \underbrace{F(\boldsymbol{x}_i)}_{\boldsymbol{h}_i}$$

Dot Score

$$ \text{score}(\boldsymbol{h}_{t}, \bar{\boldsymbol{h}}) = \frac{\boldsymbol{h}_{t}^{\top} \bar{\boldsymbol{h}}}{\sqrt{H}} $$
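A minimal sketch of these two formulas in PyTorch, with the scaled dot score plugged in as the scoring function; the tensor shapes are assumptions for illustration:

```python
# Sketch: scaled dot-product scores followed by a softmax-weighted context vector.
import torch
import torch.nn.functional as F

def dot_score(h, h_bar):
    # h:     (batch, T, H)  encoder states h_1 .. h_T
    # h_bar: (batch, H)     current decoder hidden state
    H = h.size(-1)
    return torch.bmm(h, h_bar.unsqueeze(2)).squeeze(2) / (H ** 0.5)  # (batch, T)

def attend(h, h_bar):
    alpha = F.softmax(dot_score(h, h_bar), dim=1)      # attention weights α
    z = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # context z = Σ α_i · h_i
    return z, alpha
```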

My implementation of the Attention Mechanism

Encoder

Decoder

Seq2Seq

The heart of the chatbot is a sequence-to-sequence (seq2seq) model. The goal of a seq2seq model is to take a variable-length question sequence as input and return a variable-length answer sequence as output.
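At a high level the pieces fit together roughly as in the sketch below; the class and argument names are assumptions, not the notebook's actual implementation:

```python
# Sketch: a seq2seq wrapper that routes the question through the encoder and
# lets the attention decoder generate the answer.
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder  # consumes Q, returns all hidden states and h_T
        self.decoder = decoder  # attends over the encoder states while emitting A

    def forward(self, Q, A=None, max_len=20):
        enc_states, h_T = self.encoder(Q)
        # During training A is supplied (teacher forcing); at inference time the
        # decoder feeds its own predictions back in for up to max_len steps.
        return self.decoder(enc_states, h_T, target=A, max_len=max_len)
```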

Components:

Create Seq2seq ChatBot Model

Optimizer and Loss Function

About training a neural network for chatbots

Let's denote the input feature as $\boldsymbol{x}$ and the model as $f(\cdot)$. If there is a label associated with $\boldsymbol{x}$, we denote it as $y$. Our model takes in $\boldsymbol{x}$ and produces a prediction $\hat{y}$, so $\hat{y} = f(\boldsymbol{x})$. The model needs to adjust some parameters to provide better predictions. If $\Theta$ denotes all the parameters of the model, then $\hat{y} = f_\Theta(\boldsymbol{x})$ expresses that the model's behavior depends on the value of its parameters $\Theta$, also known as the "state" of the model.

Our goal for training is to minimize the loss function, which quantifies just how badly the model is doing at predicting the ground truth $y$. If $y$ is the goal and $\hat{y}$ is the prediction, the loss function is denoted by $\ell(y, \hat{y})$. For a training set with $N$ examples, the objective is:

$$\min_{\Theta} \sum_{i=1}^N \ell(f_\Theta(\boldsymbol{x}^{(i)}), y^{(i)}) $$

The summation ($\sum_{i=1}^N$) goes over all $N$ pairs of input ($\boldsymbol{x}^{(i)}$) and output ($y^{(i)}$), measuring just how badly ($\ell(\cdot,\cdot)$) the model is doing. To create the best possible model, $\Theta$ is adjusted using gradient descent. If $\Theta_k$ is the current state of our model, which needs to improve, then the next state $\Theta_{k+1}$, which hopefully reduces the loss of the model, is given by:

$$\Theta_{k+1} = \Theta_k - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{\Theta}\ell(f_{\Theta_k}(\boldsymbol{x}^{(i)}), y^{(i)})$$

The above equation is the mathematical representation of gradient descent. We follow the gradient ($\nabla$) to tell us how to adjust $\Theta$. Because PyTorch provides APIs for automatic differentiation, we can easily compute $\nabla_{\Theta}$ without keeping track of everything inside $\Theta$. $\eta$ is the learning rate, or step size.
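As a self-contained illustration of this update rule on a toy one-parameter model (not the chatbot itself), using PyTorch autograd to obtain the gradient:

```python
# Sketch: one manual gradient-descent step, Θ_{k+1} = Θ_k − η · ∇_Θ ℓ.
import torch

eta = 0.1                                        # learning rate η
theta = torch.tensor(2.0, requires_grad=True)    # current parameters Θ_k
x, y = torch.tensor(3.0), torch.tensor(7.0)      # one (input, label) pair

y_hat = theta * x                                # prediction ŷ = f_Θ(x)
loss = (y_hat - y) ** 2                          # loss ℓ(ŷ, y)
loss.backward()                                  # autograd fills theta.grad with ∇_Θ ℓ

with torch.no_grad():
    theta -= eta * theta.grad                    # the gradient-descent update
theta.grad.zero_()                               # clear the gradient before the next step
```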

For training we need:

  1. A model $f(\cdot)$ to compute $f_\Theta(\boldsymbol{x}^{(i)})$, which is the seq2seq chatbot model created above.
  2. PyTorch stores gradients in a mutable data structure. To start from a clean state before using it, I have used optimizer.zero_grad().
  3. The loss function $\ell(\cdot,\cdot)$ is used to compute the loss.
  4. loss.backward() is used to compute the gradient $\nabla_{\Theta}$.
  5. optimizer.step() is used to update all the parameters, performing the gradient descent update
$$\Theta_{k+1} = \Theta_k - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{\Theta}\ell(f_{\Theta_k}(\boldsymbol{x}^{(i)}), y^{(i)})$$
  6. Finally, I have computed the loss and attention scores for generating my graph plots. (These steps are put together in the sketch below.)
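A minimal sketch of a training step that combines the points above; model, loader, and pad_idx are assumed to come from the earlier sections, and the loss/optimizer choices are illustrative:

```python
# Sketch: one epoch of training with cross-entropy loss over the answer tokens.
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)            # ℓ(·,·), ignoring padding
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # performs the Θ update

model.train()
for (Q, A), target in loader:                 # batches from the collate function
    optimizer.zero_grad()                     # 2. reset stored gradients
    logits = model((Q, A))                    # 1. ŷ = f_Θ(x), shape (batch, T', vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                   target.reshape(-1))        # 3. compute the loss
    loss.backward()                           # 4. compute ∇_Θ
    optimizer.step()                          # 5. Θ_{k+1} = Θ_k − η · ∇_Θ
```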

Plot Losses During Training

Loading Best Model

Evaluation: Test Results

Plot Prediction With Attention

colormap