# 8.9. Long Short Term Memory (LSTM)

The challenge to address long-term information preservation and short-term input skipping in latent variable models has existed for a long time. One of the earliest approaches to address this was the LSTM by Hochreiter and Schmidhuber, 1997. It shares many of the properties of the Gated Recurrent Unit (GRU) and predates it by almost two decades. Its design is slightly more complex.

Arguably it is inspired by logic gates of a computer. To control a
memory cell we need a number of gates. One gate is needed to read out
the entries from the cell (as opposed to reading any other cell). We
will refer to this as the *output* gate. A second gate is needed to
decide when to read data into the cell. We refer to this as the *input*
gate. Lastly, we need a mechanism to reset the contents of the cell,
governed by a *forget* gate. The motivation for such a design is the
same as before, namely to be able to decide when to remember and when to
ignore inputs into the latent state via a dedicated mechanism. Let’s see
how this works in practice.

## 8.9.1. Gated Memory Cells

Three gates are introduced in LSTMs: the input gate, the forget gate, and the output gate. In addition to that we introduce memory cells that take the same shape as the hidden state. Strictly speaking this is just a fancy version of a hidden state, custom engineered to record additional information.

### 8.9.1.1. Input Gates, Forget Gates and Output Gates

Just like with GRUs, the data feeding into the LSTM gates is the input at the current time step \(\mathbf{X}_t\) and the hidden state of the previous time step \(\mathbf{H}_{t-1}\). These inputs are processed by a fully connected layer and a sigmoid activation function to compute the values of input, forget and output gates. As a result, the three gate elements all have a value range of \([0,1]\).

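We assume there are \(h\) hidden units and that the minibatch is of size \(n\). Thus the input is \(\mathbf{X}_t \in \mathbb{R}^{n \times d}\) (number of examples: \(n\), number of inputs: \(d\)) and the hidden state of the last time step is \(\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}\). Correspondingly the gates are defined as follows: the input gate is \(\mathbf{I}_t \in \mathbb{R}^{n \times h}\), the forget gate is \(\mathbf{F}_t \in \mathbb{R}^{n \times h}\), and the output gate is \(\mathbf{O}_t \in \mathbb{R}^{n \times h}\). They are calculated as follows:

\[
\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o).
\end{aligned}
\]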

\(\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}\) and \(\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}\) are weight parameters and \(\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}\) are bias parameters.
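
The gate computation can be sketched in a few lines. The following is a minimal illustration that assumes NumPy in place of MXNet and uses toy sizes (`n`, `d`, `h`) and random inputs chosen only for this example; the helper `gate_params` is hypothetical. It checks that each gate has shape \(n \times h\) and that all entries lie between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, h = 4, 5, 3  # toy sizes (assumed): batch size, inputs, hidden units

X_t = rng.normal(size=(n, d))     # input at the current time step
H_prev = rng.normal(size=(n, h))  # hidden state of the previous time step

def gate_params():
    """Hypothetical helper: one (W_x, W_h, b) triple for a single gate."""
    return (rng.normal(scale=0.01, size=(d, h)),
            rng.normal(scale=0.01, size=(h, h)),
            np.zeros((1, h)))

W_xi, W_hi, b_i = gate_params()  # input gate
W_xf, W_hf, b_f = gate_params()  # forget gate
W_xo, W_ho, b_o = gate_params()  # output gate

# The bias of shape (1, h) is broadcast over the n rows of the minibatch.
I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)
F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)
O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)

print(I_t.shape, F_t.shape, O_t.shape)  # (4, 3) (4, 3) (4, 3)
```

With weights this small the gate values all sit close to 0.5; training is what moves them towards 0 or 1.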

### 8.9.1.2. Candidate Memory Cell

Next we design a memory cell. Since we haven’t specified the action of
the various gates yet, we first introduce a *candidate* memory cell
\(\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}\). Its
computation is similar to that of the three gates described above, but
it uses a \(\tanh\) function with a value range of \([-1, 1]\) as
activation function. This leads to the following equation at time step
\(t\):

\[\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c).\]

Here \(\mathbf{W}_{xc} \in \mathbb{R}^{d \times h}\) and \(\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}\) are weights and \(\mathbf{b}_c \in \mathbb{R}^{1 \times h}\) is a bias.

### 8.9.1.3. Memory Cell

In GRUs we had a single mechanism to govern input and forgetting. Here we have two gates: the input gate \(\mathbf{I}_t\), which governs how much new data we take into account via \(\tilde{\mathbf{C}}_t\), and the forget gate \(\mathbf{F}_t\), which determines how much of the old memory cell content \(\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}\) we retain. Using the same pointwise multiplication trick as before, we arrive at the following update equation:

\[\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t.\]

If the forget gate is always approximately 1 and the input gate is always approximately 0, the past memory cells will be saved over time and passed to the current time step. This design was introduced to alleviate the vanishing gradient problem and to better capture long-range dependencies in time series. We thus arrive at the following flow diagram.
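
To make the retention behaviour concrete, here is a small scalar sketch. It assumes the elementwise update `C = F * C + I * C_tilda` used in the implementation later in this section; the function name `cell_update` and the numeric values are illustrative only.

```python
def cell_update(c_prev, c_tilde, f, i):
    """One memory cell update for a single scalar entry."""
    return f * c_prev + i * c_tilde

c0 = 0.8  # arbitrary starting cell value (assumed)

# Forget gate at 1, input gate at 0: the cell content is carried unchanged.
retained = c0
for _ in range(100):
    retained = cell_update(retained, c_tilde=-1.0, f=1.0, i=0.0)

# Forget gate slightly below 1: the old content decays geometrically.
decayed = c0
for _ in range(100):
    decayed = cell_update(decayed, c_tilde=-1.0, f=0.9, i=0.0)

# Forget gate at 0, input gate at 1: the cell is overwritten by the candidate.
overwritten = cell_update(c0, c_tilde=-1.0, f=0.0, i=1.0)

print(retained, decayed, overwritten)
```

After 100 steps the first value is still 0.8, the second has shrunk by a factor of \(0.9^{100}\), and the third equals the candidate value.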

### 8.9.1.4. Hidden States

Lastly we need to define how to compute the hidden state \(\mathbf{H}_t \in \mathbb{R}^{n \times h}\). This is where the output gate comes into play. In the LSTM it is simply a gated version of the \(\tanh\) of the memory cell:

\[\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t).\]

This ensures that the values of \(\mathbf{H}_t\) are always in the interval \([-1, 1]\). Whenever the output gate is \(1\) we effectively pass all memory information through to the predictor, whereas for an output gate of \(0\) we retain all information only within the memory cell and perform no further processing. The figure below has a graphical illustration of the data flow.
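
A scalar sketch of the hidden state computation, assuming the rule `H = O * tanh(C)` from the implementation later in this section. The function name `hidden_state` and the gate values are illustrative only. Note that the memory cell itself is not bounded: when the gates stay open it can accumulate beyond 1, while the hidden state stays within \([-1, 1]\).

```python
import math

def hidden_state(o, c):
    """Hidden state for a single scalar entry: gated tanh of the memory cell."""
    return o * math.tanh(c)

# Accumulate the cell for 5 steps with F = 1, I = 1 and a candidate value of 1.
c = 0.0
for _ in range(5):
    c = 1.0 * c + 1.0 * 1.0

h_open = hidden_state(1.0, c)    # output gate fully open
h_closed = hidden_state(0.0, c)  # output gate fully closed

print(c, h_open, h_closed)
```

Here the cell value reaches 5.0, the open-gate hidden state is just below 1, and the closed-gate hidden state is 0 even though the cell still holds its content.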

## 8.9.2. Implementation from Scratch

Now it’s time to implement an LSTM. We begin with a model built from
scratch. As with the experiments in the previous sections we first need
to load the data. We use *The Time Machine* for this.

```
import sys
sys.path.insert(0, '..')
import d2l
from mxnet import nd, init
from mxnet.gluon import rnn
corpus_indices, vocab = d2l.load_data_time_machine()
```

### 8.9.2.1. Initialize Model Parameters

Next we need to define and initialize the model parameters. As
previously, the hyperparameter `num_hiddens` defines the number of
hidden units. We initialize the weights from a Gaussian distribution
with standard deviation \(0.01\) and set the biases to \(0\).

```
num_inputs, num_hiddens, num_outputs = len(vocab), 256, len(vocab)
ctx = d2l.try_gpu()

def get_params():
    def _one(shape):
        return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)

    def _three():
        return (_one((num_inputs, num_hiddens)),
                _one((num_hiddens, num_hiddens)),
                nd.zeros(num_hiddens, ctx=ctx))

    W_xi, W_hi, b_i = _three()  # Input gate parameters
    W_xf, W_hf, b_f = _three()  # Forget gate parameters
    W_xo, W_ho, b_o = _three()  # Output gate parameters
    W_xc, W_hc, b_c = _three()  # Candidate cell parameters
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = nd.zeros(num_outputs, ctx=ctx)
    # Attach gradients
    params = [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc,
              b_c, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
```

## 8.9.3. Define the Model

The state initialization function of the LSTM needs to return, in addition to the hidden state, a memory cell with a value of \(0\) and a shape of (batch size, number of hidden units). Hence we get the following state initialization.

```
def init_lstm_state(batch_size, num_hiddens, ctx):
    return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx),
            nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx))
```

The actual model is defined just as we discussed before, with three gates and an auxiliary memory cell. Note that only the hidden state is passed on to the output layer. The memory cells do not participate in the output computation directly.

```
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = nd.sigmoid(nd.dot(X, W_xi) + nd.dot(H, W_hi) + b_i)
        F = nd.sigmoid(nd.dot(X, W_xf) + nd.dot(H, W_hf) + b_f)
        O = nd.sigmoid(nd.dot(X, W_xo) + nd.dot(H, W_ho) + b_o)
        C_tilda = nd.tanh(nd.dot(X, W_xc) + nd.dot(H, W_hc) + b_c)
        C = F * C + I * C_tilda
        H = O * C.tanh()
        Y = nd.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H, C)
```
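
For readers who want to check the shapes without an MXNet installation, the following is a rough NumPy mirror of the `lstm` function above. It is a sketch under stated assumptions, not part of the `d2l` package: the name `lstm_numpy`, the helper `three`, the toy sizes, and the random one-hot inputs are all made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_numpy(inputs, state, params):
    """NumPy mirror of the lstm function; inputs is a list of (n, d) arrays."""
    (W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o,
     W_xc, W_hc, b_c, W_hq, b_q) = params
    H, C = state
    outputs = []
    for X in inputs:
        I = sigmoid(X @ W_xi + H @ W_hi + b_i)
        F = sigmoid(X @ W_xf + H @ W_hf + b_f)
        O = sigmoid(X @ W_xo + H @ W_ho + b_o)
        C_tilde = np.tanh(X @ W_xc + H @ W_hc + b_c)
        C = F * C + I * C_tilde  # memory cell update
        H = O * np.tanh(C)       # hidden state
        outputs.append(H @ W_hq + b_q)
    return outputs, (H, C)

rng = np.random.default_rng(0)
n, d, h, q, T = 2, 7, 4, 7, 3  # toy sizes: batch, inputs, hidden, outputs, steps

def three():
    """Hypothetical helper: weights and bias for one gate or the candidate."""
    return [rng.normal(scale=0.01, size=(d, h)),
            rng.normal(scale=0.01, size=(h, h)),
            np.zeros(h)]

params = (three() + three() + three() + three()
          + [rng.normal(scale=0.01, size=(h, q)), np.zeros(q)])
# Random one-hot rows stand in for encoded characters.
inputs = [np.eye(d)[rng.integers(0, d, size=n)] for _ in range(T)]
state = (np.zeros((n, h)), np.zeros((n, h)))

outputs, (H, C) = lstm_numpy(inputs, state, params)
print(len(outputs), outputs[0].shape, H.shape, C.shape)  # 3 (2, 7) (2, 4) (2, 4)
```

One output of shape (batch size, number of outputs) is produced per time step, while the returned state keeps the shape (batch size, number of hidden units) for both the hidden state and the memory cell.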

### 8.9.3.1. Training and Prediction

As in the previous section, we use only adjacent sampling during model training. After setting the hyperparameters, we train the model and generate a 50-character string of text for each of the prefixes “traveller” and “time traveller”.

```
num_epochs, num_steps, batch_size, lr, clipping_theta = 100, 35, 32, 3, 1
prefixes = ['traveller', 'time traveller']
d2l.train_and_predict_rnn(lstm, get_params, init_lstm_state, num_hiddens,
                          corpus_indices, vocab, ctx, False, num_epochs,
                          num_steps, lr, clipping_theta, batch_size, prefixes)
```

```
epoch 25, perplexity 13.628729, time 18.27 sec
epoch 50, perplexity 10.273247, time 18.13 sec
- traveller an an the the the the the the the the the the the
- time traveller an an the the the the the the the the the the the
epoch 75, perplexity 8.065465, time 18.16 sec
epoch 100, perplexity 5.959430, time 17.89 sec
- traveller the the time traveller the the time traveller the
- time traveller the the time traveller the the time traveller the
```

## 8.9.4. Concise Implementation

In Gluon, we can call the `LSTM` class in the `rnn` module directly
to instantiate the model.

```
lstm_layer = rnn.LSTM(num_hiddens)
model = d2l.RNNModel(lstm_layer, len(vocab))
d2l.train_and_predict_rnn_gluon(model, num_hiddens, corpus_indices, vocab,
                                ctx, num_epochs*5, num_steps, lr,
                                clipping_theta, batch_size, prefixes)
```

```
epoch 125, perplexity 4.453068, time 4.52 sec
epoch 250, perplexity 1.101829, time 4.54 sec
- traveller smiled. 'are you sure we can move freely in space
- time traveller smiled round at us. then, still smiling faight yo
epoch 375, perplexity 1.036125, time 4.53 sec
epoch 500, perplexity 1.024646, time 4.46 sec
- traveller smiled. 'are you sure we can move freely in space
- time traveller smiled. 'are you sure we can move freely in space
```

## 8.9.5. Summary

- LSTMs have three types of gates: input, forget and output gates which control the flow of information.
- The hidden layer output of LSTM includes hidden states and memory cells. Only hidden states are passed into the output layer. Memory cells are entirely internal.
- LSTMs can help cope with vanishing and exploding gradients due to long range dependencies and short-range irrelevant data.
- In many cases LSTMs perform slightly better than GRUs but they are more costly to train and execute due to the larger latent state size.
- LSTMs are the prototypical latent variable autoregressive model with nontrivial state control. Many variants thereof have been proposed over the years, e.g. multiple layers, residual connections, different types of regularization.
- Training LSTMs and other sequence models is quite costly due to the long dependency of the sequence. Later we will encounter alternative models such as transformers that can be used in some cases.
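
As a rough illustration of the cost comparison, the sketch below counts the parameters of the recurrent layer only, excluding the output layer. It assumes one weight pair and one bias per gate or candidate, as in the from-scratch implementation in this section; library implementations may parameterize the biases differently. The function names and the input size `d=28` are hypothetical.

```python
def block_params(d, h):
    """Parameters of one affine block: W_x (d x h), W_h (h x h), bias (h)."""
    return d * h + h * h + h

def count_params(d, h):
    """Recurrent-layer parameter counts for a given input and hidden size."""
    block = block_params(d, h)
    return {
        'RNN': 1 * block,   # a single hidden state update
        'GRU': 3 * block,   # reset gate, update gate, candidate state
        'LSTM': 4 * block,  # input, forget, output gates, candidate cell
    }

counts = count_params(d=28, h=256)
for name, value in counts.items():
    print(name, value)
# RNN 72960
# GRU 218880
# LSTM 291840
```

On top of the extra parameters, the LSTM carries two state tensors of shape (batch size, number of hidden units) between time steps, the hidden state and the memory cell, where the GRU and the plain RNN carry one.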

## 8.9.6. Exercises

- Adjust the hyperparameters. Observe and analyze the impact on runtime, perplexity, and the generated output.
- How would you need to change the model to generate proper words as opposed to sequences of characters?
- Compare the computational cost for GRUs, LSTMs and regular RNNs for a given hidden dimension. Pay special attention to training and inference cost.
- Since the candidate memory cells ensure that the value range is between -1 and 1 using the tanh function, why does the hidden state need to use the tanh function again to ensure that the output value range is between -1 and 1?
- Implement an LSTM for time series prediction rather than character sequences.

## 8.9.7. References

[1] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.