.. _sec_transformer:

The Transformer Architecture
============================

We have compared CNNs, RNNs, and self-attention in
:numref:`subsec_cnn-rnn-self-attention`. Notably, self-attention enjoys
both parallel computation and the shortest maximum path length. It is
therefore appealing to design deep architectures around self-attention.
Unlike earlier self-attention models that still rely on RNNs for input
representations
:cite:`Cheng.Dong.Lapata.2016,Lin.Feng.Santos.ea.2017,Paulus.Xiong.Socher.2017`,
the Transformer model is based solely on attention mechanisms, without
any convolutional or recurrent layer
:cite:`Vaswani.Shazeer.Parmar.ea.2017`. Though originally proposed for
sequence-to-sequence learning on text data, Transformers have become
pervasive in a wide range of modern deep learning applications,
including language, vision, speech, and reinforcement learning.


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   import math
   import pandas as pd
   import torch
   from torch import nn
   from d2l import torch as d2l


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   import math
   import pandas as pd
   from mxnet import autograd, init, np, npx
   from mxnet.gluon import nn
   from d2l import mxnet as d2l

   npx.set_np()


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   import math
   import jax
   import pandas as pd
   from flax import linen as nn
   from jax import numpy as jnp
   from d2l import jax as d2l


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   import numpy as np
   import pandas as pd
   import tensorflow as tf
   from d2l import tensorflow as d2l


Model
-----

As an instance of the encoder–decoder architecture, the overall
architecture of the Transformer is presented in
:numref:`fig_transformer`. As we can see, the Transformer is composed of
an encoder and a decoder. In contrast to Bahdanau attention for
sequence-to-sequence learning in :numref:`fig_s2s_attention_details`,
positional encodings are added to the input (source) and output (target)
sequence embeddings before they are fed into the encoder and the
decoder, both of which stack modules based on self-attention.

.. _fig_transformer:

.. figure:: ../img/transformer.svg
   :width: 320px

   The Transformer architecture.

Now we provide an overview of the Transformer architecture in
:numref:`fig_transformer`. At a high level, the Transformer encoder is a
stack of multiple identical layers, where each layer has two sublayers
(each denoted as :math:`\textrm{sublayer}`). The first is a multi-head
self-attention pooling and the second is a positionwise feed-forward
network. Specifically, in the encoder self-attention, queries, keys, and
values are all from the outputs of the previous encoder layer. Inspired
by the ResNet design of :numref:`sec_resnet`, a residual connection is
employed around each of the two sublayers. In the Transformer, for any
input :math:`\mathbf{x} \in \mathbb{R}^d` at any position of the
sequence, we require that
:math:`\textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d` so that the
residual connection
:math:`\mathbf{x} + \textrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d` is
feasible. This addition from the residual connection is immediately
followed by layer normalization :cite:`Ba.Kiros.Hinton.2016`. As a
result, the Transformer encoder outputs a :math:`d`-dimensional vector
representation for each position of the input sequence.

The Transformer decoder is also a stack of multiple identical layers
with residual connections and layer normalizations.
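
The residual connection followed by layer normalization described above
can be illustrated with a minimal pure-Python sketch for a single
position. It is not the implementation used later in this section: the
names ``layer_norm``, ``add_norm``, and ``toy_sublayer`` are
hypothetical, the learnable scale and shift of layer normalization are
omitted, and ``eps`` is an assumed small constant for numerical
stability.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one d-dimensional vector to zero mean and (near) unit variance.

    Assumption: the learnable scale and shift parameters are left out.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for a single position."""
    y = sublayer(x)
    # The residual addition is only feasible if the sublayer preserves d.
    assert len(y) == len(x), "sublayer must map R^d to R^d"
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
out = add_norm(x, toy_sublayer)
print(len(out))  # the output keeps the input dimension d
```

Because the sublayer maps :math:`\mathbb{R}^d` to :math:`\mathbb{R}^d`,
the output has the same dimension as the input, which is what allows
identical layers to be stacked.
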
In addition to the two sublayers described for the encoder, the decoder
inserts a third sublayer, known as the encoder–decoder attention,
between these two. In the encoder–decoder attention, queries are from
the outputs of the decoder’s self-attention sublayer, and the keys and
values are from the Transformer encoder outputs. In the decoder
self-attention, queries, keys, and values are all from the outputs of
the previous decoder layer. However, each position in the decoder is
allowed to attend only to positions in the decoder up to and including
that position. This *masked* attention preserves the autoregressive
property, ensuring that the prediction only depends on those output
tokens that have been generated.

We have already described and implemented multi-head attention based on
scaled dot products in :numref:`sec_multihead-attention` and positional
encoding in :numref:`subsec_positional-encoding`. In the following, we
will implement the rest of the Transformer model.

.. _subsec_positionwise-ffn:

Positionwise Feed-Forward Networks
----------------------------------

The positionwise feed-forward network transforms the representation at
every sequence position using the same MLP. This is why we call it
*positionwise*. In the implementation below, the input ``X`` with shape
(batch size, number of time steps or sequence length in tokens, number
of hidden units or feature dimension) will be transformed by a two-layer
MLP into an output tensor of shape (batch size, number of time steps,
``ffn_num_outputs``).
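
To make the word *positionwise* concrete, here is a framework-free
sketch using nested Python lists. It is only an illustration under
stated assumptions: the helper names (``dense``, ``relu``,
``positionwise_ffn``, ``rand_matrix``), the name ``ffn_num_hiddens``,
the ReLU activation, and all sizes are hypothetical choices, not taken
from the text above.

```python
import random


def dense(x, W, b):
    """Affine map for one vector: x W + b, with W of shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def relu(x):
    """Elementwise ReLU (assumed activation between the two layers)."""
    return [max(0.0, v) for v in x]


def positionwise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP independently at every position.

    X is a nested list of shape (batch size, number of time steps, d).
    """
    return [[dense(relu(dense(x, W1, b1)), W2, b2) for x in seq] for seq in X]


# Hypothetical sizes chosen only for illustration.
d, ffn_num_hiddens, ffn_num_outputs = 4, 8, 6
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W1, b1 = rand_matrix(d, ffn_num_hiddens), [0.0] * ffn_num_hiddens
W2, b2 = rand_matrix(ffn_num_hiddens, ffn_num_outputs), [0.0] * ffn_num_outputs

# Two sequences of three time steps; every position holds the same vector.
X = [[[1.0] * d for _ in range(3)] for _ in range(2)]
Y = positionwise_ffn(X, W1, b1, W2, b2)

# Shape is (batch size, number of time steps, ffn_num_outputs).
print(len(Y), len(Y[0]), len(Y[0][0]))
# Identical inputs at different positions give identical outputs,
# because the same weights are shared across all positions.
print(all(y == Y[0][0] for seq in Y for y in seq))
```

Only the last dimension changes, from ``d`` to ``ffn_num_outputs``; the
batch and time-step dimensions pass through untouched.
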