# 8.8. Gated Recurrent Units (GRU)¶

In the previous section we discussed how gradients are calculated in a recurrent neural network. In particular we found that long products of matrices can lead to vanishing or divergent gradients. Let’s briefly think about what such gradient anomalies mean in practice:

- We might encounter a situation where an early observation is highly
significant for predicting all future observations. Consider the
somewhat contrived case where the first observation contains a
checksum and the goal is to discern whether the checksum is correct
at the end of the sequence. In this case the influence of the first
token is vital. We would like to have some mechanism for storing
vital early information in a
*memory cell*. Without such a mechanism we will have to assign a very large gradient to this observation, since it affects all subsequent observations. - We might encounter situations where some symbols carry no pertinent
observation. For instance, when parsing a webpage there might be
auxiliary HTML code that is irrelevant for the purpose of assessing
the sentiment conveyed on the page. We would like to have some
mechanism for
*skipping such symbols*in the latent state representation. - We might encounter situations where there is a logical break between
parts of a sequence. For instance there might be a transition between
chapters in a book, a transition between a bear and a bull market for
securities, etc.; In this case it would be nice to have a means of
*resetting*our internal state representation.

A number of methods have been proposed to address this. One of the earliest is the Long Short Term Memory (LSTM) [24] which we will discuss in Section 8.9. The Gated Recurrent Unit (GRU) [10] is a slightly more streamlined variant that often offers comparable performance and is significantly faster to compute. See also [12] for more details. Due to its simplicity we start with the GRU.

## 8.8.2. Implementation from Scratch¶

To gain a better understanding of the model let us implement a GRU from scratch.

### 8.8.2.1. Reading the Data Set¶

We begin by reading *The Time Machine* corpus that we used in
Section 8.5. The code for reading the data set is
given below:

```
import d2l
from mxnet import nd
from mxnet.gluon import rnn
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
```

### 8.8.2.2. Initialize Model Parameters¶

The next step is to initialize the model parameters. We draw the weights
from a Gaussian with variance \(0.01\) and set the bias to
\(0\). The hyper-parameter `num_hiddens`

defines the number of
hidden units. We instantiate all terms relating to update and reset gate
and the candidate hidden state itself. Subsequently we attach gradients
to all parameters.

```
def get_params(vocab_size, num_hiddens, ctx):
num_inputs = num_outputs = vocab_size
normal = lambda shape : nd.random.normal(scale=0.01, shape=shape, ctx=ctx)
three = lambda : (normal((num_inputs, num_hiddens)),
normal((num_hiddens, num_hiddens)),
nd.zeros(num_hiddens, ctx=ctx))
W_xz, W_hz, b_z = three() # Update gate parameter
W_xr, W_hr, b_r = three() # Reset gate parameter
W_xh, W_hh, b_h = three() # Candidate hidden state parameter
# Output layer parameters
W_hq = normal((num_hiddens, num_outputs))
b_q = nd.zeros(num_outputs, ctx=ctx)
# Create gradient
params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
for param in params:
param.attach_grad()
return params
```

### 8.8.2.3. Define the Model¶

Now we will define the hidden state initialization function
`init_gru_state`

. Just like the `init_rnn_state`

function defined in
Section 8.5, this function returns a tuple composed
of an NDArray with a shape (batch size, number of hidden units) and with
all values set to 0.

```
def init_gru_state(batch_size, num_hiddens, ctx):
return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )
```

Now we are ready to define the actual model. Its structure is the same as the basic RNN cell, just that the update equations are more complex.

```
def gru(inputs, state, params):
W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
H, = state
outputs = []
for X in inputs:
Z = nd.sigmoid(nd.dot(X, W_xz) + nd.dot(H, W_hz) + b_z)
R = nd.sigmoid(nd.dot(X, W_xr) + nd.dot(H, W_hr) + b_r)
H_tilda = nd.tanh(nd.dot(X, W_xh) + nd.dot(R * H, W_hh) + b_h)
H = Z * H + (1 - Z) * H_tilda
Y = nd.dot(H, W_hq) + b_q
outputs.append(Y)
return nd.concat(*outputs, dim=0), (H,)
```

### 8.8.2.4. Training and Prediction¶

Training and prediction work in exactly the same manner as before.

```
vocab_size, num_hiddens, ctx = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, ctx, get_params,
init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
```

```
Perplexity 1.1, 16974 tokens/sec on gpu(0)
time traveller for so it will be convenient to speak of him was
traveller it s against reason said filby what reason said
```

## 8.8.3. Concise Implementation¶

In Gluon, we can directly call the `GRU`

class in the `rnn`

module.
This encapsulates all the configuration details that we made explicit
above. The code is significantly faster as it uses compiled operators
rather than Python for many details that we spelled out in detail
before.

```
gru_layer = rnn.GRU(num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
```

```
Perplexity 1.1, 201636 tokens/sec on gpu(0)
time traveller it s against reason said filby what reason said
traveller it s against reason said filby what reason said
```

## 8.8.4. Summary¶

- Gated recurrent neural networks are better at capturing dependencies for time series with large time step distances.
- Reset gates help capture short-term dependencies in time series.
- Update gates help capture long-term dependencies in time series.
- GRUs contain basic RNNs as their extreme case whenever the reset gate is switched on. They can ignore sequences as needed.

## 8.8.5. Exercises¶

- Compare runtimes, perplexity and the extracted strings for
`rnn.RNN`

and`rnn.GRU`

implementations with each other. - Assume that we only want to use the input for time step \(t'\) to predict the output at time step \(t > t'\). What are the best values for reset and update gates for each time step?
- Adjust the hyper-parameters and observe and analyze the impact on running time, perplexity, and the written lyrics.
- What happens if you implement only parts of a GRU? That is, implement a recurrent cell that only has a reset gate. Likewise, implement a recurrent cell only with an update gate.