Sequence to Sequence Learning with Neural Networks

26 Sep 2019

The paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure. A straightforward application of the Long Short-Term Memory (LSTM) architecture can solve general sequence-to-sequence problems: one multilayered LSTM reads the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and a second LSTM decodes the target sequence from that vector. The second LSTM is essentially a recurrent neural network language model, except that it is conditioned on the input sequence. The LSTM's ability to learn from data with long-range temporal dependencies makes it a natural choice for this application, given the considerable time lag between the inputs and their corresponding outputs.
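As a rough illustration of the encoder-decoder idea described above, here is a minimal sketch in PyTorch. This is not the paper's implementation; the module name, layer sizes, and batch-first layout are assumptions made only for illustration.

```python
# Minimal encoder-decoder sketch (illustrative; sizes are arbitrary assumptions).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512, num_layers=4):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        # One LSTM reads the input sequence, one timestep at a time...
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        # ...and a second LSTM extracts the output sequence from its final state.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # The encoder's final hidden/cell states summarize the whole input.
        _, (h, c) = self.encoder(self.src_embed(src_tokens))
        # The decoder is a conditional language model initialized with that summary.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), (h, c))
        # Per-timestep scores over the target vocabulary.
        return self.proj(dec_out)
```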

The Model

The goal of the LSTM is to estimate the conditional probability $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where $(x_1, \ldots, x_T)$ is an input sequence and $(y_1, \ldots, y_{T'})$ is its corresponding output sequence, whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \ldots, x_T)$, given by the last hidden state of the LSTM, and then computing the probability of $y_1, \ldots, y_{T'}$ with a standard LSTM language model formulation whose initial hidden state is set to the representation $v$:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$$
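To make the factorization concrete, the sketch below scores a target sequence under a model like the one sketched earlier, summing per-step log-probabilities under teacher forcing. The names, shapes, and the assumption that `tgt_tokens` starts with a begin-of-sentence token are illustrative, not from the paper.

```python
# Hypothetical scoring sketch: log p(y_1..y_T' | x) = sum_t log p(y_t | v, y_<t).
import torch
import torch.nn.functional as F

def sequence_log_prob(model, src_tokens, tgt_tokens):
    # Teacher forcing: feed y_<t and score the true next token y_t at each step.
    logits = model(src_tokens, tgt_tokens[:, :-1])     # (batch, T'-1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    targets = tgt_tokens[:, 1:].unsqueeze(-1)          # (batch, T'-1, 1)
    # Gather the log-probability of each reference token and sum over time.
    return log_probs.gather(-1, targets).squeeze(-1).sum(dim=-1)
```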

In the equation above, each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ distribution is represented with a softmax over all the words in the vocabulary. Each sentence is required to end with a special end-of-sentence symbol <EOS>, which enables the model to define a distribution over sequences of all possible lengths. The actual model differs from this description in three important ways: it uses two different LSTMs for the input and output sequences, it uses deep LSTMs with four layers, and it reverses the order of the words in the input sentence (but not the output sentence).
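For intuition on how the <EOS> symbol lets the model choose the output length, here is a greedy decoding sketch against the hypothetical model above. The paper itself decodes with a left-to-right beam search; greedy search is shown only because it is simpler, and `bos_id`, `eos_id`, and `max_len` are assumptions.

```python
# Greedy decoding sketch (the paper uses beam search; this is a simplification).
import torch

@torch.no_grad()
def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    # Encode the source once into the fixed-dimensional state (h, c).
    _, state = model.encoder(model.src_embed(src_tokens))
    ys = [bos_id]
    for _ in range(max_len):
        inp = model.tgt_embed(torch.tensor([[ys[-1]]]))   # last generated token
        out, state = model.decoder(inp, state)
        next_id = model.proj(out[:, -1]).argmax(dim=-1).item()
        ys.append(next_id)
        if next_id == eos_id:  # <EOS> ends generation, so length is variable
            break
    return ys[1:]
```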

Takeaways

Disclosure

Most of the content here is taken directly from the paper. This post is meant to be a single place for notes on the papers I read. Read the original paper here.