Constants And Required Functions

Data Preprocessing

Required Directories Creation

Listing Directories

Chatbot Seq2Seq Model using Transformer Combined with GPT2

Motivation

Benefits of using a Transformer

Seq2Seq Model with Transformer, DistilBert Tokenizer and GPT2 Fine Tuning

The heart of the chatbot is a sequence-to-sequence (seq2seq) model. The goal of a seq2seq model is to take a variable-length question sequence as input and return a variable-length answer sequence as output.

Components :

BERT : "Bidirectional Encoder Representations from Transformers"

DistilBERT : "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter."

GPT-2 : "Generative Pretrained Transformer 2"

  • GPT2LMHeadModel : "The GPT2 Model transformer with a language modeling head on top."

Dataset and DataLoader Creation


Some information on BERT and the other algorithms, as explained in their documentation, with references to the papers read:


Transformer models in general are computationally expensive. Hence, for a faster run, I have used GPT2LMHeadModel (i.e. distilgpt2) for decoding one step at a time. It is used together with GPT2Tokenizer (i.e. distilgpt2) so that the conversion of tensors back into strings follows exactly the same GPT2 tokenization that was used when the original GPT2LMHeadModel was trained.

The embedding size used by GPT2LMHeadModel is taken from GPT_MODEL.config.n_embd.

From GPT_TOKENIZER I take vocab_size, bos_token_id, and eos_token_id as the vocabulary size, the start-of-sentence token, and the end-of-sentence token; the end-of-sentence token is also used for padding.
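A minimal sketch of how these objects can be loaded and the values above read, assuming the Hugging Face transformers package; the constant names (HIDDEN_SIZE, SOS_TOKEN, etc.) are illustrative placeholders, not necessarily the notebook's own names.

```python
# Illustrative loading of distilgpt2 with the Hugging Face `transformers` package.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

GPT_MODEL = GPT2LMHeadModel.from_pretrained("distilgpt2")
GPT_TOKENIZER = GPT2Tokenizer.from_pretrained("distilgpt2")

HIDDEN_SIZE = GPT_MODEL.config.n_embd        # size of the GPT2 embeddings
VOCAB_SIZE = GPT_TOKENIZER.vocab_size        # vocabulary size
SOS_TOKEN = GPT_TOKENIZER.bos_token_id       # start-of-sentence token id
EOS_TOKEN = GPT_TOKENIZER.eos_token_id       # end-of-sentence token id
PAD_TOKEN = EOS_TOKEN                        # EOS is reused for padding
```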

Positional Encoding

$$\tilde{\boldsymbol{h}}_t = \boldsymbol{h}_t + P(t)$$

$$\tilde{\boldsymbol{h}}_t = \boldsymbol{h}_t \cdot \sqrt{D} + P(t)$$
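A minimal sketch of the second variant above (embeddings scaled by $\sqrt{D}$ plus a sinusoidal positional term $P(t)$) in PyTorch; the class and its defaults are illustrative and not necessarily the notebook's exact implementation.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds sinusoidal position information P(t) to the scaled embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))                        # (1, max_len, d_model)
        self.d_model = d_model

    def forward(self, h):
        # h: (batch, seq_len, d_model);  h_tilde = h * sqrt(D) + P(t)
        return h * math.sqrt(self.d_model) + self.pe[:, : h.size(1)]
```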

EncoderTransformer

About Transformers and Multi-Head Attention...

$$\text{score}(\boldsymbol{h}_{t}, \bar{\boldsymbol{h}}) = \frac{\boldsymbol{h}_{t}^{\top} \bar{\boldsymbol{h}}}{\sqrt{D}}, \qquad \text{or in matrix form:}\quad \frac{Q K^{\top}}{\sqrt{D}}$$

$$\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{D}}\right) V$$

$$\text{head}_{i} = \text{Attention}\left(Q W_{i}^{Q},\; K W_{i}^{K},\; V W_{i}^{V}\right)$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}\left(\text{head}_{1}, \ldots, \text{head}_{H}\right) W^{O}$$
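For reference, a minimal sketch of the scaled dot-product attention formula above in PyTorch; the EncoderTransformer in the notebook presumably relies on PyTorch's built-in transformer layers, so this function is only illustrative.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) V."""
    D = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(D)       # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention weights per query
    return weights @ V                                    # (..., len_q, D)
```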


DecoderTransformer

  • GPT2LMHeadModel : "The GPT2 Model transformer with a language modeling head on top."
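A minimal sketch of the one-token-at-a-time decoding described earlier, assuming GPT_MODEL and GPT_TOKENIZER are the distilgpt2 objects loaded above; greedy argmax selection is used here purely for illustration, and `.logits` assumes a recent transformers version.

```python
import torch

@torch.no_grad()
def greedy_decode(prompt_ids, max_new_tokens=40):
    """Decode one token per step with GPT2LMHeadModel until EOS or the length limit."""
    generated = prompt_ids                                  # (1, seq_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = GPT_MODEL(generated).logits                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == GPT_TOKENIZER.eos_token_id:
            break
    return GPT_TOKENIZER.decode(generated[0], skip_special_tokens=True)
```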

Fine Tuning of GPT2 in Seq2Seq Model

Components :

BERT : "Bidirectional Encoder Representations from Transformers"

DistilBERT : "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter."

  • GPT2LMHeadModel : "The GPT2 Model transformer with a language modeling head on top."

Weight Initialization and Freezing of the BERT and GPT Parameters
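A minimal sketch of the freezing step, assuming bert_model and GPT_MODEL are the pretrained DistilBERT and distilgpt2 modules and seq2seq_model is the combined GPTSeq2Seq model; all three variable names are illustrative.

```python
# Freeze the pretrained weights so that only the newly added Seq2Seq layers are trained.
# `bert_model`, `GPT_MODEL` and `seq2seq_model` are assumed to exist as described above.
for param in bert_model.parameters():
    param.requires_grad = False

for param in GPT_MODEL.parameters():
    param.requires_grad = False

# Only the parameters that still require gradients are handed to the optimizer.
trainable_params = [p for p in seq2seq_model.parameters() if p.requires_grad]
```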

Optimizer And Loss Function

About training a neural network for chatbots

Let's denote $\boldsymbol{x}$ as the input feature and $f(\cdot)$ as the model. If there is a label associated with $\boldsymbol{x}$, we denote it as $y$. The model takes in $\boldsymbol{x}$ and produces a prediction $\hat{y}$, i.e. $\hat{y} = f(\boldsymbol{x})$. The model needs to adjust some parameters to provide better predictions, thus generating a better model. If $\Theta$ denotes all the parameters of the model, then $\hat{y} = f_\Theta(\boldsymbol{x})$ expresses that the model's behavior depends on the value of its parameters $\Theta$, also known as the "state" of the model.

Our goal in training is to minimize the loss function, which quantifies just how badly the model is doing at predicting the ground truth $y$. If $y$ is the target and $\hat{y}$ is the prediction, the loss function is denoted by $\ell(y, \hat{y})$. For a training set with $N$ examples, the objective is:

$$\min_{\Theta} \sum_{i=1}^N \ell(f_\Theta(\boldsymbol{x}^{(i)}), y^{(i)}) $$

The summation ($\sum_{i=1}^N$) goes over all $N$ pairs of input ($\boldsymbol{x}^{(i)}$) and output ($y^{(i)}$) and measures just how badly ($\ell(\cdot,\cdot)$) the model is doing. To create the best possible model, $\Theta$ is adjusted using gradient descent. If $\Theta_k$ is the current state of the model, then the next state $\Theta_{k+1}$, which hopefully reduces the loss, is given by:

$$\Theta_{k+1} = \Theta_k - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{\Theta}\ell(f_{\Theta_k}(\boldsymbol{x}^{(i)}), y^{(i)})$$

The above equation is the mathematical representation of gradient descent. We follow the gradient ($\nabla$) to tell us how to adjust $\Theta$. As PyTorch provides APIs for automatic differentiation, we can easily compute $\nabla_{\Theta}$ and don't have to keep track of everything inside $\Theta$. $\eta$ is the learning rate, or step size.

For training we need the following (a minimal training-loop sketch follows this list):

  1. A model $f(\cdot)$ to compute $f_\Theta(\boldsymbol{x}^{(i)})$, which I have done by creating my GPTSeq2Seq model for the chatbot.
  2. PyTorch stores gradients in a mutable data structure. To set a clean state before using it, I have used optimizer.zero_grad().
  3. The loss function $\ell(\cdot,\cdot)$ is used to compute the loss.
  4. loss.backward() is used to compute the gradient $\nabla_{\Theta}$.
  5. optimizer.step() is used to update all parameters and to perform
$$\Theta_{k+1} = \Theta_k - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_{\Theta}\ell(f_{\Theta_k}(\boldsymbol{x}^{(i)}), y^{(i)})$$
  6. Finally, I have computed the losses used to generate the graph plots.
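A minimal training-loop sketch tying the steps above together; seq2seq_model, train_loader, num_epochs, and the AdamW/CrossEntropyLoss choices are illustrative stand-ins for the notebook's actual objects.

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN)          # loss l(y_hat, y)
optimizer = optim.AdamW(
    [p for p in seq2seq_model.parameters() if p.requires_grad], lr=1e-4
)

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for questions, answers in train_loader:                      # x^(i), y^(i)
        optimizer.zero_grad()                                     # (2) clean gradient state
        logits = seq2seq_model(questions, answers)                # (1) f_Theta(x^(i))
        loss = criterion(logits.view(-1, logits.size(-1)),        # (3) compute the loss
                         answers.view(-1))
        loss.backward()                                           # (4) compute gradients
        optimizer.step()                                          # (5) update Theta
        epoch_loss += loss.item()                                 # (6) track loss for plotting
    print(f"epoch {epoch}: mean loss {epoch_loss / len(train_loader):.4f}")
```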

Plot Losses During Training

Evaluation : Test Results

Evaluation Metrics for the GPT2 Fine-Tuned Seq2Seq Model

BLEU Score

F1 Score

ROUGE-L Score

ROUGE-L

"Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans.

"Given two sequences X and Y, the longest common subsequence (LCS) of X and recall reflects the proportion of words in X (reference summary sentence) that are also present in Y (candidate summary sentence); while unigram precision is the proportion of words in Y that are also in X. Unigram recall and precision count all cooccurring words regardless their orders; while ROUGE-L counts only in-sequence co-occurrences."

ROUGE-L is one of the ROUGE measures. It is calculated by taking into account the longest common subsequence (LCS) between two sequences, and it counts only in-sequence co-occurrences.
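A minimal sketch of computing ROUGE-L and BLEU for a single reference/candidate pair, assuming the rouge_score and nltk packages; the notebook may compute these metrics with different libraries, and the example sentences are made up for illustration.

```python
# Illustrative metric computation with the `rouge_score` and `nltk` packages.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thank you"      # ideal (human) answer
candidate = "i am doing fine thank you"      # model-generated answer

# ROUGE-L: based on the longest common subsequence between reference and candidate.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"]
print(rouge_l.precision, rouge_l.recall, rouge_l.fmeasure)

# BLEU: n-gram precision with a brevity penalty, smoothed for short sentences.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(bleu)
```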

Chat

Start Conversation with the Bot

Write TSV