.. _chapter_resnet:
Residual Networks (ResNet)
==========================
As we design increasingly deep networks, it becomes imperative to
understand how adding layers changes the complexity and
expressiveness of the network. Even more important is the ability to
design networks where adding layers makes them more expressive rather
than just different. To make some progress we need a bit of theory.
Function Classes
----------------
Consider :math:`\mathcal{F}`, the class of functions that a specific
network architecture (together with learning rates and other
hyperparameter settings) can reach. That is, for all
:math:`f \in \mathcal{F}` there exists some set of parameters :math:`W`
that can be obtained through training on a suitable dataset. Let’s
assume that :math:`f^*` is the function that we really would like to
find. If it’s in :math:`\mathcal{F}`, we’re in good shape but typically
we won’t be quite so lucky. Instead, we will try to find some
:math:`f^*_\mathcal{F}` which is our best bet within
:math:`\mathcal{F}`. For instance, we might try finding it by solving
the following optimization problem:
.. math:: f^*_\mathcal{F} := \mathop{\mathrm{argmin}}_f L(X, Y, f) \text{ subject to } f \in \mathcal{F}
It is only reasonable to assume that if we design a different and more
powerful architecture :math:`\mathcal{F}'` we should arrive at a better
outcome. In other words, we would expect that :math:`f^*_{\mathcal{F}'}`
is ‘better’ than :math:`f^*_{\mathcal{F}}`. However, if
:math:`\mathcal{F} \not\subseteq \mathcal{F}'` there is no guarantee
that this should even happen. In fact, :math:`f^*_{\mathcal{F}'}` might
well be worse. This is a situation that we often encounter in practice:
adding layers doesn’t only make the network more expressive, it also
changes it in ways that are sometimes hard to predict. The picture below
illustrates this in slightly abstract terms.
.. figure:: ../img/functionclasses.svg
Left: non-nested function classes. The distance may in fact increase
as the complexity increases. Right: with nested function classes this
does not happen.
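
The effect of nesting can be reproduced on a toy problem. The sketch
below is only an illustration under assumptions of our own: the data, the
linear model :math:`f(x) = ax + b` and the three finite parameter grids
(``small``, ``nested``, ``non_nested``) are hypothetical and have nothing
to do with ResNet itself. It picks the best function in each class by
exhaustive search, mirroring the constrained minimization above.

.. code:: python

    from itertools import product

    # Toy data generated by the target function f*(x) = 2x + 1 (an
    # assumption made for this illustration only).
    xs = [0, 1, 2, 3]
    ys = [2 * x + 1 for x in xs]


    def loss(a, b):
        """Mean squared error of the linear model f(x) = a * x + b."""
        return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


    def best_loss(function_class):
        """Smallest loss attainable within a finite class of (a, b) pairs."""
        return min(loss(a, b) for a, b in function_class)


    # Each class is a finite set of (a, b) parameter pairs.
    small = set(product([0, 1], [0]))
    nested = set(product([0, 1, 2], [0, 1]))           # contains `small`
    non_nested = set(product([-2, -1], [3, 4, 5, 6]))  # larger, but disjoint

    for name, cls in [('small', small), ('nested', nested),
                      ('non_nested', non_nested)]:
        print(name, 'size:', len(cls), 'best loss:', best_loss(cls))

In this toy setting the nested class can only match or improve on the
best loss of ``small``, whereas the larger but non-nested class ends up
with a worse best loss.
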
Only if larger function classes contain the smaller ones are we
guaranteed that enlarging them does not reduce the expressive power
of the network. This is the question that
:cite:`He.Zhang.Ren.ea.2016` considered when
working on very deep computer vision models. At the heart of ResNet is
the idea that every additional layer should contain the identity
function as one of its elements. This means that if we can train the
newly-added layer into an identity mapping
:math:`f(\mathbf{x}) = \mathbf{x}`, the new model will be as effective
as the original model. Because the new model may also find a better fit
to the training data set, the added layer might make it easier to reduce
the training error. Even better, the identity function rather than the
null function :math:`f(\mathbf{x}) = 0` should be the simplest function
within a layer.
These considerations are rather profound but they led to a surprisingly
simple solution, a residual block. With it,
:cite:`He.Zhang.Ren.ea.2016` won the ImageNet Visual Recognition
Challenge in 2015. The design had a profound influence on how to build
deep neural networks.
Residual Blocks
---------------
Let us focus on a local part of a neural network, as depicted below. Denote the
input by :math:`\mathbf{x}`. We assume that the ideal mapping we want to
obtain by learning is :math:`f(\mathbf{x})`, to be used as the input to
the activation function. The portion within the dotted-line box in the
left image must directly fit the mapping :math:`f(\mathbf{x})`. This can
be tricky if we don’t need that particular layer and we would much
rather retain the input :math:`\mathbf{x}`. The portion within the
dotted-line box in the right image now only needs to parametrize the
*deviation* from the identity, since we return
:math:`\mathbf{x} + f(\mathbf{x})`. In practice, the residual mapping is
often easier to optimize: if the identity is the desired mapping, we only
need to set :math:`f(\mathbf{x}) = 0`. The right image in the figure below
illustrates the basic residual block of ResNet. Similar architectures were
later proposed for sequence models, which we will study later.
.. figure:: ../img/residual-block.svg
The difference between a regular block (left) and a residual block
(right). In the latter case, we can short-circuit the convolutions.
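
A minimal sketch of this argument in plain Python follows. The helper
names (``relu``, ``linear``, ``regular_block``, ``residual_block``) are
hypothetical, and small fully connected layers without bias or batch
normalization stand in for the convolutions; that simplification is ours.
With all weights set to zero, the regular block outputs zeros, while the
residual block passes a non-negative input through unchanged.

.. code:: python

    def relu(v):
        """Element-wise ReLU on a list of floats."""
        return [max(0.0, t) for t in v]


    def linear(W, v):
        """Matrix-vector product, with W given as a list of rows."""
        return [sum(w * t for w, t in zip(row, v)) for row in W]


    def regular_block(W1, W2, x):
        """Two layers that must fit the desired mapping directly."""
        return relu(linear(W2, relu(linear(W1, x))))


    def residual_block(W1, W2, x):
        """The same two layers, but their output is added to the input."""
        f_x = linear(W2, relu(linear(W1, x)))
        return relu([f + t for f, t in zip(f_x, x)])


    zeros = [[0.0] * 3 for _ in range(3)]
    x = [0.5, 0.0, 2.0]  # non-negative, e.g. the output of an earlier ReLU

    print(regular_block(zeros, zeros, x))
    print(residual_block(zeros, zeros, x))

The regular block would have to learn a nontrivial set of weights to
reproduce ``x``, whereas the residual block does so with the simplest
possible parameters.
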
ResNet follows VGG’s full :math:`3\times 3` convolutional layer design.
The residual block has two :math:`3\times 3` convolutional layers with
the same number of output channels. Each convolutional layer is followed
by a batch normalization layer and a ReLU activation function. Then, we
skip these two convolution operations and add the input directly before
the final ReLU activation function. This kind of design requires that
the output of the two convolutional layers be of the same shape as the
input, so that they can be added together. If we want to change the
number of channels or the stride, we need to introduce an additional
:math:`1\times 1` convolutional layer to transform the input into the
desired shape for the addition operation. Let’s have a look at the code
below.

.. code:: python

    import d2l
    from mxnet import gluon, nd
    from mxnet.gluon import nn

    # Save to the d2l package.
    class Residual(nn.Block):
        def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
            super(Residual, self).__init__(**kwargs)
            self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                                   strides=strides)
            self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
            if use_1x1conv:
                self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                       strides=strides)
            else:
                self.conv3 = None
            self.bn1 = nn.BatchNorm()
            self.bn2 = nn.BatchNorm()

        def forward(self, X):
            Y = nd.relu(self.bn1(self.conv1(X)))
            Y = self.bn2(self.conv2(Y))
            if self.conv3:
                X = self.conv3(X)
            return nd.relu(Y + X)

This code generates two types of blocks: one where we add the input
directly to the output before applying the ReLU nonlinearity
(``use_1x1conv=False``), and one where we first adjust the channels and
resolution of the input by means of a :math:`1 \times 1` convolution
(``use_1x1conv=True``). The diagram below illustrates this:
.. figure:: ../img/resnet-block.svg
Left: regular ResNet block; Right: ResNet block with 1x1 convolution
Now let us look at a situation where the input and output are of the
same shape.
.. code:: python
blk = Residual(3)
blk.initialize()
X = nd.random.uniform(shape=(4, 3, 6, 6))
blk(X).shape
.. parsed-literal::
:class: output
(4, 3, 6, 6)
We also have the option to halve the output height and width while
increasing the number of output channels.
.. code:: python
blk = Residual(6, use_1x1conv=True, strides=2)
blk.initialize()
blk(X).shape
.. parsed-literal::
:class: output
(4, 6, 3, 3)
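
The two output shapes above can be checked by hand. The sketch below
assumes the usual output-size formula for a convolution without dilation,
:math:`\lfloor (n + 2p - k)/s \rfloor + 1`; the helpers ``conv_out_size``
and ``residual_out_sizes`` are hypothetical names and not part of MXNet.
It shows why the :math:`1\times 1` convolution on the shortcut uses the
same stride as the first :math:`3\times 3` convolution.

.. code:: python

    def conv_out_size(n, kernel_size, strides=1, padding=0):
        """Output height or width of a convolution on an input of size n."""
        return (n + 2 * padding - kernel_size) // strides + 1


    def residual_out_sizes(n, strides):
        """Sizes produced by the main path and by a 1x1 shortcut."""
        # conv1: 3x3 convolution that carries the stride
        main = conv_out_size(n, kernel_size=3, strides=strides, padding=1)
        # conv2: 3x3 convolution with stride 1
        main = conv_out_size(main, kernel_size=3, strides=1, padding=1)
        # conv3: 1x1 convolution on the shortcut, same stride as conv1
        shortcut = conv_out_size(n, kernel_size=1, strides=strides)
        return main, shortcut


    print(residual_out_sizes(6, strides=1))
    print(residual_out_sizes(6, strides=2))

With ``strides=2`` both paths map a height and width of 6 to 3, in line
with the ``(4, 6, 3, 3)`` output above. Without the :math:`1\times 1`
convolution the shortcut would keep size 6 and could not be added to the
main path.
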
ResNet Model
------------
The first two layers of ResNet are the same as those of the GoogLeNet we
described before: the :math:`7\times 7` convolutional layer with 64
output channels and a stride of 2 is followed by the :math:`3\times 3`
maximum pooling layer with a stride of 2. The difference is the batch
normalization layer added after each convolutional layer in ResNet.

.. code:: python

    net = nn.Sequential()
    net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
            nn.BatchNorm(), nn.Activation('relu'),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))

GoogLeNet uses four blocks made up of Inception blocks. ResNet instead
uses four modules, each of which consists of several residual blocks
with the same number of output channels. The number of channels in the
first module is the same as the number of its input channels. Since a
maximum pooling layer with a stride of 2 has already been used, the
first module does not need to reduce the height and width. In the first
residual block of each subsequent module, the number of channels is
doubled compared with that of the previous module, and the height and
width are halved.
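Now, we implement this module. Note that the first module is handled
specially: it keeps both the number of channels and the resolution, so
its first residual block needs neither a stride of 2 nor a
:math:`1\times 1` convolution.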

.. code:: python

    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(Residual(num_channels))
        return blk

Then, we add all the residual blocks to ResNet. Here, two residual
blocks are used for each module.

.. code:: python

    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))

Finally, just like GoogLeNet, we add a global average pooling layer,
followed by the fully connected layer output.
.. code:: python
net.add(nn.GlobalAvgPool2D(), nn.Dense(10))
Each module contains 4 convolutional layers (excluding the
:math:`1\times 1` convolutional layers), which gives 16 across the four
modules. Together with the first convolutional layer and the final fully
connected layer, there are 18 layers in total. Therefore, this model is
commonly known as ResNet-18.
By configuring different numbers of channels and residual blocks in the
module, we can create different ResNet models, such as the deeper
152-layer ResNet-152. Although the main architecture of ResNet is
similar to that of GoogLeNet, ResNet’s structure is simpler and easier
to modify. All these factors have resulted in the rapid and widespread
use of ResNet. Below is a diagram of the full ResNet-18.
.. figure:: ../img/ResNetFull.svg
ResNet 18
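
The layer count can be written as a small function of the configuration.
This is only a bookkeeping sketch: ``resnet_depth`` is a hypothetical
helper, and the configurations other than ``[2, 2, 2, 2]`` are taken, to
the best of our reading, from Table 1 of :cite:`He.Zhang.Ren.ea.2016`.
The deeper ResNet-152 uses bottleneck blocks with three convolutions
each, hence ``convs_per_residual=3``.

.. code:: python

    def resnet_depth(residuals_per_module, convs_per_residual=2):
        """Count weighted layers, ignoring 1x1 convolutions on shortcuts."""
        convs = sum(residuals_per_module) * convs_per_residual
        # One initial convolution plus one final fully connected layer.
        return 1 + convs + 1


    print(resnet_depth([2, 2, 2, 2]))                         # model above
    print(resnet_depth([3, 4, 6, 3]))                         # ResNet-34
    print(resnet_depth([3, 8, 36, 3], convs_per_residual=3))  # ResNet-152

The first call corresponds to the four ``resnet_block(..., 2)`` modules
built above.
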
Before training ResNet, let us observe how the input shape changes
between different modules in ResNet. As in all previous architectures,
the resolution decreases while the number of channels increases up until
the point where a global average pooling layer aggregates all features.

.. code:: python

    X = nd.random.uniform(shape=(1, 1, 224, 224))
    net.initialize()
    for layer in net:
        X = layer(X)
        print(layer.name, 'output shape:\t', X.shape)

.. parsed-literal::
:class: output
conv5 output shape: (1, 64, 112, 112)
batchnorm4 output shape: (1, 64, 112, 112)
relu0 output shape: (1, 64, 112, 112)
pool0 output shape: (1, 64, 56, 56)
sequential1 output shape: (1, 64, 56, 56)
sequential2 output shape: (1, 128, 28, 28)
sequential3 output shape: (1, 256, 14, 14)
sequential4 output shape: (1, 512, 7, 7)
pool1 output shape: (1, 512, 1, 1)
dense0 output shape: (1, 10)
Data Acquisition and Training
-----------------------------
We train ResNet on the Fashion-MNIST data set, just like before. The
only change is that the learning rate is decreased again, because of the
more complex architecture.
.. code:: python
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch5(net, train_iter, test_iter, num_epochs, lr)
.. parsed-literal::
:class: output
loss 0.013, train acc 0.997, test acc 0.899
4752.9 exampes/sec on gpu(0)
.. image:: output_resnet_a18853_17_1.svg
Summary
-------
- Residual blocks allow for a parametrization relative to the identity
function :math:`f(\mathbf{x}) = \mathbf{x}`.
- Adding residual blocks increases the function complexity in a
well-defined manner.
- We can train an effective deep neural network by having residual
blocks pass data forward across layers through skip connections.
- ResNet had a major influence on the design of subsequent deep neural
networks, both convolutional and sequential in nature.
Exercises
---------
1. Refer to Table 1 in :cite:`He.Zhang.Ren.ea.2016` to implement
different variants.
2. For deeper networks, ResNet introduces a “bottleneck” architecture to
reduce model complexity. Try to implement it.
3. In subsequent versions of ResNet, the authors changed the
“convolution, batch normalization, and activation” architecture to
the “batch normalization, activation, and convolution” architecture.
Make this improvement yourself. See Figure 1 in
:cite:`He.Zhang.Ren.ea.2016*1` for details.
4. Prove that if :math:`\mathbf{x}` is generated by a ReLU, the ResNet
block does indeed include the identity function.
5. Why can’t we just increase the complexity of functions without bound,
even if the function classes are nested?