Sequence to Sequence Modelling using Attention Mechanism in Machine Translation

Md. Wazir Ali
Aug 12, 2021

Table of Contents:-

Introduction to Natural Language Processing

Deep Learning Based Approaches and RNNs

Sequence to Sequence Modelling

Machine Translation — An application of Natural Language Processing

Encoder — Decoder Architecture

Performance Metric

Teacher Forcing

Code for Encoder — Decoder

Attention Mechanism

Types of Attention

Code for Attention Layer

Github Repo

References

Introduction to Natural Language Processing

Natural Language Processing, often abbreviated as NLP, is one of the most important and fastest growing fields of Artificial Intelligence and Machine Learning, concerned with the interactions between computers and human languages. It deals mainly with textual data in any language. This field has found tons of applications, a few of them being as under:-

  1. Parts of Speech Tagging
  2. Text Classification
  3. Machine Translation
  4. Image Captioning
  5. Text Summarization

Let’s have a brief overview of each of these problems, one by one.

  1. Parts of Speech Tagging- In this problem, each word has to be classified as one of the parts of speech, such as a verb, noun or adjective. We need labelled training data consisting of words tagged as nouns, adjectives or verbs, based on which the NLP model learns to tag a word as a noun, verb or adjective.
  2. Text Classification- In this kind of problem, a given text is assigned to one of a set of categories based on its content. The text could be movie reviews and the categories could be ratings from 1 to 5, 1 being the lowest and 5 the highest. Another example could be reviews from customers dining at a restaurant, with the categories positive or negative depending on the food and service they received. A third example could be newspaper or magazine articles classified into various categories based on the underlying themes of the text, such as politics, sports or entertainment; if labelled data is available this is a supervised learning problem, otherwise it is broadly categorized as Topic Modelling.
  3. Machine Translation- In this problem, a given text in one language is converted into its equivalent meaning in a different language. This involves training the model on sentences in one language paired with their corresponding outputs in the desired language. Examples include translation from English to Hindi, English to Spanish, English to French, English to Italian, French to Spanish, Italian to French and so on. Other examples of machine translation include decoding secret messages, since it can be used to decode data that follows some pattern.
  4. Image Captioning- In this type of problem, the input is an image and the output is a text describing what is in the image. This also involves training the model with a huge amount of data, where each input image is paired with a text describing its contents.
  5. Text Summarization- In this type of problem, a very lengthy news article from a newspaper or a magazine is condensed into a summary of, say, 50-60 words that is easier for readers to interpret and understand.

There are many techniques for solving the problems stated above, but in this blog we will restrict ourselves to deep learning based solutions for one of these applications, which is the main focus of the blog: Machine Translation.

Deep Learning Based Approaches and RNNs

Deep learning based approaches are the state of the art for solving NLP problems involving text data. We know that a piece of text or a written sentence has a particular context and can be interpreted as a combination of words written in a particular sequence.

Coming to the deep learning solution for these kinds of problems, we know that a simple MLP needs a fixed-size set of inputs.

In our case, we have a sentence as input, which is a combination of words. We give the individual words as inputs to the MLP in the form of tokenized data converted into embeddings.

There is a problem with the above approach.

We have sentences of varying lengths. Therefore, to accommodate this, we would have to change the length of the input layer of our neural network with every sentence.

One solution to the varying input length problem is to zero pad all the inputs to the length of the longest sentence.
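As a concrete illustration, here is a minimal sketch of tokenization and zero padding, assuming the Keras preprocessing utilities; the sentences and the vocabulary size are just placeholders.

```python
# A minimal sketch of tokenizing sentences and zero padding them to the
# length of the longest sentence. Sentences and vocabulary size are
# illustrative placeholders.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["i like tea", "machine translation is an interesting problem"]

tokenizer = Tokenizer(num_words=10000)        # keep the 10,000 most frequent words
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)   # words -> integer ids

# Pad every sequence with zeros up to the length of the longest sentence.
padded = pad_sequences(sequences, padding="post")
print(padded)
# [[1 2 3 0 0 0]
#  [4 5 6 7 8 9]]   (the exact ids depend on the fitted vocabulary)
```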

But this has a drawback as well.

What happens when we have a test or query sentence which is longer than the longest sentence in the training data?

Recurrent Neural Networks come to the rescue in such cases, where we have sequential data in which a combination of words in a particular sequence makes up a sentence.

A recurrent neural network is a set of neurons that takes as input a word of the sentence along with a cell state and produces an output y. This is repeated at each time step, i.e. for each word: the same set of neurons takes as its first input the next word in the sentence, and as its second input the cell state from the previous time step.

Pictorially, we can view RNN as:-

Fully Connected Recurrent Neural Network

The above recurrent neural network can be viewed as under:-

An unrolled Recurrent Neural Network

The inputs X and h are given at each time step t. X_0 and h_0 are the inputs at time step 0. Similarly, X_1 and h_1 are the inputs at time step 1, and so on.

Now let’s see what happens inside the box labelled A in the diagram above. The box A is a feed forward neural network which takes as input the word at time t, represented by X_t in the form of numbers, which are word embeddings. The other input to the network is the cell state, or hidden state, from the previous time step t-1, which is h_(t-1).

Mathematically, the formula for the current cell state h_t is given by:-

h_t = f(W_hh · h_(t-1) + W_xh · X_t)

where f is an activation function, namely the hyperbolic tangent or tanh.

After applying the activation function f, we get:-

h_t = tanh(W_hh · h_(t-1) + W_xh · X_t)

Here h is the hidden state vector, W_hh is the weight applied to the previous hidden state h_(t-1), W_xh is the weight applied to the current input X_t, and tanh is the activation function, a non-linearity which squashes the output into the range [-1, 1].

Every time step t has an output given by:-

Y_t = W_hy · h_t

where Y_t is the output at time step t and W_hy is the weight at the output.

Each time step of an RNN could be visualized as under:-

Please note that at every time step the output of the tanh function, h_t, is multiplied by a weight to produce the output, and that this same tanh output is passed on to the next time step as the hidden state or cell state.
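To make the recurrence concrete, here is a minimal NumPy sketch of one forward pass of a vanilla RNN over a short sequence; the dimensions and random weights are purely illustrative, not trained values.

```python
# A minimal NumPy sketch of a vanilla RNN unrolled over a sequence.
# Dimensions and random weights are illustrative, not trained values.
import numpy as np

embedding_dim, hidden_dim, output_dim = 8, 16, 4

W_xh = np.random.randn(hidden_dim, embedding_dim) * 0.01   # input  -> hidden
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.01      # hidden -> hidden
W_hy = np.random.randn(output_dim, hidden_dim) * 0.01      # hidden -> output

def rnn_forward(inputs):
    """inputs: list of word embeddings x_t, each of shape (embedding_dim,)."""
    h = np.zeros(hidden_dim)                # h_0, the initial hidden state
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # h_t = tanh(W_hh.h_(t-1) + W_xh.X_t)
        outputs.append(W_hy @ h)            # Y_t = W_hy.h_t
    return outputs, h                       # per-step outputs and final hidden state

sentence = [np.random.randn(embedding_dim) for _ in range(5)]   # 5 dummy word vectors
ys, h_final = rnn_forward(sentence)
```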

Sequence to Sequence Modelling

Sequence to Sequence Modelling is a special case wherein we input a whole sequence of words and we get a whole sequence of words as output.

In this case, we are not concerned with a single output at a single time step; rather, we are concerned with the whole sequence of words given as input. Only after the whole input sequence has been processed does the output sequence of words come out, one word at every time step.

Problem of Long-Term Dependencies

In the case of a very long sequence, the output sequence depends on the whole sequence of words in the input. A simple RNN does not handle this dependency effectively: because of the vanishing gradient problem, the weights associated with the initial time steps of the input sequence barely get updated during training, and there is no way to carry the hidden state information from the early time steps through to the later time steps of a sentence. So, during training, the output sequence of words does not capture the essence of the whole input sequence.
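As a toy illustration of why this happens (this is not the exact gradient of an RNN), assume the gradient reaching an early time step is roughly a product of per-step factors, each smaller than 1; the product then shrinks geometrically with the number of steps.

```python
# A toy illustration of the vanishing gradient problem, not the exact RNN
# gradient. If each time step contributes a factor below 1, the gradient
# reaching an early word decays geometrically with the distance.
per_step_factor = 0.9            # assumed magnitude of each step's contribution
for steps_back in [1, 5, 10, 50, 100]:
    gradient_scale = per_step_factor ** steps_back
    print(f"{steps_back:3d} steps back -> gradient scaled by {gradient_scale:.2e}")
# At 100 steps the scale is about 2.7e-05, so early words barely influence learning.
```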

To deal with the above problem, the simple RNN is replaced by either an LSTM (Long Short Term Memory) unit or a GRU (Gated Recurrent Unit). These units can handle long-term dependencies in sequence to sequence modelling, where the output sequence of words depends on the input sequence of words.

Pictorially, an LSTM unit can be visualized as under:-

Repeating units of a Long Short Term Memory Cell

Similarly, a single unit of GRU can be visualized as under:-

A single cell of a GRU(Gated Recurrent Unit)

Please read more about LSTMs and GRUs and their internal functioning here.
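As a quick reference, here is a minimal Keras sketch showing that the recurrent layer can simply be swapped between SimpleRNN, LSTM and GRU while the rest of the model stays the same; the vocabulary size and layer sizes are illustrative placeholders.

```python
# A minimal Keras sketch: the recurrent layer below can be swapped between
# SimpleRNN, LSTM and GRU without touching the rest of the model.
# Vocabulary size and layer sizes are illustrative placeholders.
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense
from tensorflow.keras.models import Sequential

vocab_size, embedding_dim, hidden_units = 10000, 128, 256

model = Sequential([
    Embedding(vocab_size, embedding_dim),
    LSTM(hidden_units),          # try SimpleRNN(hidden_units) or GRU(hidden_units) instead
    Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```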

In the case of Sequence to Sequence Modelling, as per the diagrams above, we do not use the output h_t at every input time step. Instead, the whole sequence of inputs is processed, and then the LSTM unit is unrolled over further time steps to produce the sequence of outputs.

We will see an architecture which deals with this kind of case in the Encoder — Decoder section below.

Machine Translation- An Application of Natural Language Processing

In this section, we discuss a special application of Natural Language Processing, namely Machine Translation. This application of NLP mainly deals with converting a sentence in one language into its equivalent sentence in another language. It also covers cases such as encoding a sequence of words to be used as a code word in the army.

In this application, we train our neural network (units recurring over time steps, each unit being an LSTM or a GRU) with the input being the sentence that needs to be encoded into a sequence of code words or converted into a sentence in a different language. The output for each sentence is the equivalent sentence in another language, or a coded sentence denoting a secret code in military/defence operations. Once the weights of the LSTM units are fixed and the network has finished learning the mapping from the input to the output sentence, the model can predict the output sentence for any input sentence in the same language, or the coded output sentence for any input sentence.

Now, let’s have a look at the skeletal structure or architecture which does this task of Machine Translation.

An Encoder — Decoder architecture showing Italian to English translation

Typically, the architecture which performs the task of Machine Translation is known as the Encoder — Decoder architecture. In this architecture, the Encoder is a simple RNN or an LSTM unit (depending on the length of the input sentence) unrolled over the time steps, and the Decoder is correspondingly a simple RNN or an LSTM unit which decodes the information passed by the encoder, extracted from the input sentence, and predicts each word of the output sentence one by one.

Encoder — Decoder Architecture

This is a special type of architecture which is used for the task of Machine Translation. In this architecture, as the name suggests, the encoder part encodes the input sequence, in other words it extracts the information from the input sequence of words, and the decoder part takes the essence of the information encoded from the whole input sentence and tries to decode it into the output sentence, emitting one word at each time instant.
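To make the idea concrete before the deep dive, here is a minimal sketch of the encoder-decoder wiring, assuming the Keras functional API; the vocabulary sizes, dimensions and variable names are illustrative, and the full code appears in the Code for Encoder — Decoder section later in this article.

```python
# A minimal sketch of the encoder-decoder wiring with LSTMs.
# Vocabulary sizes, dimensions and names are illustrative placeholders.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 8000, 8000, 256

# Encoder: read the source sentence and keep only its final states.
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generate the target sentence, starting from the encoder's states.
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb, initial_state=[state_h, state_c])
dec_outputs = Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```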

Let’s deep dive into this architecture.
