# 8.6. Residual Networks (ResNet) and ResNeXt

As we design increasingly deep networks, it becomes imperative to understand how adding layers can increase the complexity and expressiveness of the network. Even more important is the ability to design networks where adding layers makes them strictly more expressive rather than just different. To make some progress we need a bit of mathematics.

## 8.6.1. Function Classes

Consider \(\mathcal{F}\), the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach. That is, for all \(f \in \mathcal{F}\) there exists some set of parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset. Let’s assume that \(f^*\) is the “truth” function that we really would like to find. If it is in \(\mathcal{F}\), we are in good shape but typically we will not be quite so lucky. Instead, we will try to find some \(f^*_\mathcal{F}\) which is our best bet within \(\mathcal{F}\). For instance, given a dataset with features \(\mathbf{X}\) and labels \(\mathbf{y}\), we might try finding it by solving the following optimization problem:

\[f^*_\mathcal{F} \stackrel{\mathrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}.\]

We know that regularization [Morozov, 2012, Tikhonov & Arsenin, 1977] may control complexity of \(\mathcal{F}\) and achieve consistency, so a larger size of training data generally leads to better \(f^*_\mathcal{F}\). It is only reasonable to assume that if we design a different and more powerful architecture \(\mathcal{F}'\) we should arrive at a better outcome. In other words, we would expect that \(f^*_{\mathcal{F}'}\) is “better” than \(f^*_{\mathcal{F}}\). However, if \(\mathcal{F} \not\subseteq \mathcal{F}'\) there is no guarantee that this should even happen. In fact, \(f^*_{\mathcal{F}'}\) might well be worse. As illustrated by Fig. 8.6.1, for non-nested function classes, a larger function class does not always move closer to the “truth” function \(f^*\). For instance, on the left of Fig. 8.6.1, though \(\mathcal{F}_3\) is closer to \(f^*\) than \(\mathcal{F}_1\), \(\mathcal{F}_6\) moves away and there is no guarantee that further increasing the complexity can reduce the distance from \(f^*\). With nested function classes where \(\mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_6\) on the right of Fig. 8.6.1, we can avoid the aforementioned issue from the non-nested function classes.

Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network. For deep neural networks, if we can train the newly-added layer into an identity function \(f(\mathbf{x}) = \mathbf{x}\), the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.
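
To make the nesting argument concrete, here is a minimal sketch in plain Python. The toy dataset, the three hand-made function classes, and the helper names `mse` and `best_loss` are illustrative assumptions rather than anything from the text. It brute-forces the best function in each class and compares the resulting losses.

```python
# Toy illustration of nested vs. non-nested function classes.
# All names and the dataset are hypothetical and chosen only for illustration.

def mse(f, xs, ys):
    """Mean squared error of function f on the dataset (xs, ys)."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def best_loss(function_class, xs, ys):
    """Loss of the best function in a finite function class (brute-force search)."""
    return min(mse(f, xs, ys) for f in function_class)

# Data generated by the "truth" function f*(x) = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

grid = [0.0, 1.0, 2.0, 3.0, 4.0]
# F1: constant functions.
constants = [lambda x, b=b: b for b in grid]
# F2: affine functions; a = 0 recovers every constant, so F1 is nested in F2.
affine = [lambda x, a=a, b=b: a * x + b for a in (0.0, 1.0, 2.0) for b in grid]
# F': steep quadratics; it has more members than F1 but does not contain F1.
quadratics = [lambda x, a=a: a * x * x for a in (2.0, 3.0, 4.0, 5.0, 6.0, 7.0)]

loss_constants = best_loss(constants, xs, ys)
loss_affine = best_loss(affine, xs, ys)
loss_quadratics = best_loss(quadratics, xs, ys)
print(loss_constants, loss_affine, loss_quadratics)
```

Because every constant function is also a member of the affine class, the affine class can never do worse on this data. The quadratic class carries no such guarantee, and in this toy setup its best member fits worse than the best constant.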

This is the question that [He et al., 2016a] considered when
working on very deep computer vision models. At the heart of their
proposed *residual network* (*ResNet*) is the idea that every additional
layer should more easily contain the identity function as one of its
elements. These considerations are rather profound but they led to a
surprisingly simple solution, a *residual block*. With it, ResNet won
the ImageNet Large Scale Visual Recognition Challenge in 2015. The
design had a profound influence on how to build deep neural networks.

## 8.6.2. Residual Blocks

Let’s focus on a local part of a neural network, as depicted in
Fig. 8.6.2. Denote the input by \(\mathbf{x}\).
We assume that the desired underlying mapping we want to obtain by
learning is \(f(\mathbf{x})\), to be used as input to the activation
function on the top. On the left, the portion within the dotted-line box
must directly learn the mapping \(f(\mathbf{x})\). On the right, the
portion within the dotted-line box needs to learn the *residual mapping*
\(f(\mathbf{x}) - \mathbf{x}\), which is how the residual block
derives its name. If the identity mapping
\(f(\mathbf{x}) = \mathbf{x}\) is the desired underlying mapping,
the residual mapping is easier to learn: we only need to push the
weights and biases of the upper weight layer (e.g., fully connected
layer and convolutional layer) within the dotted-line box to zero. The
right figure illustrates the *residual block* of ResNet, where the solid
line carrying the layer input \(\mathbf{x}\) to the addition
operator is called a *residual connection* (or *shortcut connection*).
With residual blocks, inputs can forward propagate faster through the
residual connections across layers. In fact, the residual block can be
thought of as a special case of the multi-branch Inception block: it has
two branches, one of which is the identity mapping.
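
The claim that zero weights yield the identity can be checked with a small sketch on plain Python lists. The helpers `linear`, `relu`, `plain_block`, and `residual_block` are hypothetical stand-ins for a single weight layer, not library functions, and the input is assumed to be nonnegative (as it would be after a preceding ReLU).

```python
# Toy comparison of a plain block and a residual block (hypothetical helpers).

def linear(W, b, x):
    """Compute W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, vi) for vi in v]

def plain_block(W, b, x):
    """The weight layer must produce the desired mapping f(x) directly."""
    return relu(linear(W, b, x))

def residual_block(W, b, x):
    """The weight layer produces the residual g(x); the output is relu(g(x) + x)."""
    g = linear(W, b, x)
    return relu([gi + xi for gi, xi in zip(g, x)])

x = [1.0, 2.0, 3.0]                      # nonnegative, so the final ReLU keeps it
W_zero = [[0.0] * 3 for _ in range(3)]   # all weights pushed to zero
b_zero = [0.0] * 3                       # all biases pushed to zero

plain_out = plain_block(W_zero, b_zero, x)
residual_out = residual_block(W_zero, b_zero, x)
print(plain_out, residual_out)
```

With all parameters at zero, the plain block collapses to the zero vector, whereas the residual block passes the input through unchanged.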

ResNet follows VGG’s full \(3\times 3\) convolutional layer design. The residual block has two \(3\times 3\) convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation function. This kind of design requires that the output of the two convolutional layers has to be of the same shape as the input, so that they can be added together. If we want to change the number of channels, we need to introduce an additional \(1\times 1\) convolutional layer to transform the input into the desired shape for the addition operation. Let’s have a look at the code below.

```
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


class Residual(nn.Module):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)
```

```
from mxnet import init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()


class Residual(nn.Block):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return npx.relu(Y + X)
```

```
import tensorflow as tf
from d2l import tensorflow as d2l


class Residual(tf.keras.Model):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, padding='same',
                                            kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                            padding='same')
        self.conv3 = None
        if use_1x1conv:
            self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                                strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return tf.keras.activations.relu(Y)
```

This code generates two types of networks: one where we add the input to
the output before applying the ReLU nonlinearity whenever
`use_1x1conv=False`, and one where we adjust channels and resolution
by means of a \(1 \times 1\) convolution before adding.
Fig. 8.6.3 illustrates this.

Now let’s look at a situation where the input and output are of the same shape, so that the \(1 \times 1\) convolution is not needed.

```
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape
```

```
torch.Size([4, 3, 6, 6])
```

```
blk = Residual(3)
blk.initialize()
X = np.random.randn(4, 3, 6, 6)
blk(X).shape
```

```
(4, 3, 6, 6)
```

```
blk = Residual(3)
X = tf.random.normal((4, 6, 6, 3))
Y = blk(X)
Y.shape
```

```
TensorShape([4, 6, 6, 3])
```

We also have the option to halve the output height and width while
increasing the number of output channels. Since the input shape is
changed, `use_1x1conv=True` is specified.

```
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
```

```
torch.Size([4, 6, 3, 3])
```

```
blk = Residual(6, use_1x1conv=True, strides=2)
blk.initialize()
blk(X).shape
```

```
(4, 6, 3, 3)
```

```
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
```

```
TensorShape([4, 3, 3, 6])
```
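
The shapes above can be reproduced by hand with the usual convolution output-size formula, \(\lfloor (n + 2p - k)/s \rfloor + 1\). The sketch below is plain Python; the helper names `conv_out_size` and `residual_shapes` are hypothetical, and the formula follows the explicit-padding convention of the PyTorch and MXNet variants (the TensorFlow variant uses `'same'` padding, which happens to give the same sizes here).

```python
def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

def residual_shapes(n, strides):
    """Return (main_path_size, shortcut_size) for a residual block on an n x n input."""
    main = conv_out_size(n, kernel_size=3, stride=strides, padding=1)  # conv1
    main = conv_out_size(main, kernel_size=3, stride=1, padding=1)     # conv2
    shortcut = conv_out_size(n, kernel_size=1, stride=strides)         # 1x1 conv3
    return main, shortcut

same = residual_shapes(6, strides=1)    # both paths keep the 6 x 6 resolution
halved = residual_shapes(6, strides=2)  # both paths shrink to 3 x 3
print(same, halved)
```

In both cases the main path and the shortcut end up with matching sizes, which is exactly what the addition inside the block requires.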

## 8.6.3. ResNet Model

The first two layers of ResNet are the same as those of the GoogLeNet we described before: the \(7\times 7\) convolutional layer with 64 output channels and a stride of 2 is followed by the \(3\times 3\) max-pooling layer with a stride of 2. The difference is the batch normalization layer added after each convolutional layer in ResNet.

```
class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
```

```
class ResNet(d2l.Classifier):
    def b1(self):
        net = nn.Sequential()
        net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
                nn.BatchNorm(), nn.Activation('relu'),
                nn.MaxPool2D(pool_size=3, strides=2, padding=1))
        return net
```

```
class ResNet(d2l.Classifier):
    def b1(self):
        return tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(64, kernel_size=7, strides=2,
                                   padding='same'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation('relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2,
                                      padding='same')])
```

GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules made up of residual blocks, each of which uses several residual blocks with the same number of output channels. The number of channels in the first module is the same as the number of input channels. Since a max-pooling layer with a stride of 2 has already been used, it is not necessary to reduce the height and width. In the first residual block for each of the subsequent modules, the number of channels is doubled compared with that of the previous module, and the height and width are halved.

Now, we implement this module. Note that the first module is handled specially: it neither halves the height and width nor changes the number of channels.

```
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)
```

```
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
```

```
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = tf.keras.models.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
```
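
The branching logic in `block` can be easier to follow when written out as a plan. The following framework-free sketch lists, for every residual block, whether it downsamples. The helper `plan_blocks` is hypothetical, and the `arch` tuple of `(num_residuals, num_channels)` pairs is assumed to be the four-module configuration with two residual blocks per module.

```python
def plan_blocks(arch):
    """List (module_index, num_channels, use_1x1conv, strides) for every residual block."""
    plan = []
    for i, (num_residuals, num_channels) in enumerate(arch):
        first_block = (i == 0)
        for j in range(num_residuals):
            # Only the first residual block of modules 2, 3, ... downsamples.
            downsample = (j == 0 and not first_block)
            plan.append((i + 1, num_channels, downsample, 2 if downsample else 1))
    return plan

# Assumed configuration: four modules with two residual blocks each.
arch = ((2, 64), (2, 128), (2, 256), (2, 512))
plan = plan_blocks(arch)
for entry in plan:
    print(entry)
```

Only three of the eight residual blocks use the \(1\times 1\) convolution with a stride of 2, namely the first block of each module after the first.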

Then, we add all the modules to ResNet. Here, two residual blocks are used for each module. Finally, just like GoogLeNet, we add a global average pooling layer, followed by the fully connected layer output.

```
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
```

```
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential()
    self.net.add(self.b1())
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    self.net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    self.net.initialize(init.Xavier())
```

```
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = tf.keras.models.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    self.net.add(tf.keras.models.Sequential([
        tf.keras.layers.GlobalAvgPool2D(),
        tf.keras.layers.Dense(units=num_classes)]))
```

There are 4 convolutional layers in each module (excluding the \(1\times 1\) convolutional layer). Together with the first \(7\times 7\) convolutional layer and the final fully connected layer, there are 18 layers in total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers of channels and residual blocks in the module, we can create different ResNet models, such as the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of GoogLeNet, ResNet’s structure is simpler and easier to modify. All these factors have resulted in the rapid and widespread use of ResNet. Fig. 8.6.4 depicts the full ResNet-18.
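
The layer count above is simple arithmetic, sketched here in plain Python. The helper `resnet_depth` is hypothetical. The ResNet-34 block counts (3, 4, 6, 3) are an assumption taken from Table 1 of [He et al., 2016a] and apply only to variants built from two-convolution residual blocks.

```python
def resnet_depth(arch, convs_per_block=2):
    """Count weighted layers: first conv + convs inside residual blocks + final dense layer.

    The 1x1 convolutions on the shortcut paths are not counted, by convention.
    """
    block_convs = sum(num_residuals * convs_per_block for num_residuals, _ in arch)
    return 1 + block_convs + 1

resnet18_arch = ((2, 64), (2, 128), (2, 256), (2, 512))
# Assumed block counts for ResNet-34 (Table 1 of He et al., 2016a).
resnet34_arch = ((3, 64), (4, 128), (6, 256), (3, 512))

depth18 = resnet_depth(resnet18_arch)
depth34 = resnet_depth(resnet34_arch)
print(depth18, depth34)
```

Each module with two residual blocks contributes four \(3\times 3\) convolutions, so four such modules plus the first convolution and the output layer give 18 layers.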

Before training ResNet, let’s observe how the input shape changes across different modules in ResNet. As in all the previous architectures, the resolution decreases while the number of channels increases up until the point where a global average pooling layer aggregates all features.

```
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
```

```
Sequential output shape: torch.Size([1, 64, 24, 24])
Sequential output shape: torch.Size([1, 64, 24, 24])
Sequential output shape: torch.Size([1, 128, 12, 12])
Sequential output shape: torch.Size([1, 256, 6, 6])
Sequential output shape: torch.Size([1, 512, 3, 3])
Sequential output shape: torch.Size([1, 10])
```

```
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
```

```
Sequential output shape: (1, 64, 24, 24)
Sequential output shape: (1, 64, 24, 24)
Sequential output shape: (1, 128, 12, 12)
Sequential output shape: (1, 256, 6, 6)
Sequential output shape: (1, 512, 3, 3)
GlobalAvgPool2D output shape: (1, 512, 1, 1)
Dense output shape: (1, 10)
```

```
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 96, 96, 1))
```

```
Sequential output shape: (1, 24, 24, 64)
Sequential output shape: (1, 24, 24, 64)
Sequential output shape: (1, 12, 12, 128)
Sequential output shape: (1, 6, 6, 256)
Sequential output shape: (1, 3, 3, 512)
Sequential output shape: (1, 10)
```
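
The printed shapes can also be traced without any framework. This sketch, with the hypothetical helpers `conv_out_size` and `trace_resnet`, applies the output-size formula \(\lfloor (n + 2p - k)/s \rfloor + 1\) to the first two layers and to the downsampling convolution of each later module. It assumes square inputs and the explicit paddings used in the PyTorch and MXNet variants.

```python
def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (n + 2 * padding - kernel_size) // stride + 1

def trace_resnet(arch, size):
    """Return [(channels, height_or_width), ...] after the first two layers and each module."""
    size = conv_out_size(size, kernel_size=7, stride=2, padding=3)  # 7x7 convolution
    size = conv_out_size(size, kernel_size=3, stride=2, padding=1)  # 3x3 max-pooling
    shapes = [(64, size)]
    for i, (_, num_channels) in enumerate(arch):
        if i > 0:  # every module but the first halves height and width
            size = conv_out_size(size, kernel_size=3, stride=2, padding=1)
        shapes.append((num_channels, size))
    return shapes

shapes = trace_resnet(((2, 64), (2, 128), (2, 256), (2, 512)), size=96)
print(shapes)
```

For a \(96 \times 96\) input this yields resolutions 24, 24, 12, 6, and 3 with 64, 64, 128, 256, and 512 channels, matching the summaries above.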

## 8.6.4. Training

We train ResNet on the Fashion-MNIST dataset, just like before.

```
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)
```

```
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)
```

```
trainer = d2l.Trainer(max_epochs=10)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
with d2l.try_gpu():
    model = ResNet18(lr=0.01)
    trainer.fit(model, data)
```

## 8.6.5. ResNeXt

Recall from Fig. 8.6.3 that each ResNet block simply stacks
layers between residual connections. This design can be varied by
replacing stacked layers with concatenated parallel transformations,
leading to ResNeXt [Xie et al., 2017]. Unlike the
*variety of* transformations in multi-branch Inception blocks, ResNeXt
adopts the *same* transformation in all branches, thus minimizing manual
design efforts in each branch.

The left dotted box in Fig. 8.6.5 depicts the added
concatenated parallel transformation strategy in ResNeXt. More
concretely, an input with \(c\) channels is first split into
\(g\) groups via \(g\) branches of \(1 \times 1\)
convolutions followed by \(3 \times 3\) convolutions, all with
\(b/g\) output channels. Concatenating these \(g\) outputs
results in \(b\) output channels, leading to “bottlenecked” (when
\(b < c\)) network width inside the dashed box. This output is
restored to the original \(c\) channels of the input via the final
\(1 \times 1\) convolution right before it is summed with the residual
connection. Notably, the left dotted box is equivalent to the much
*simplified* right dotted box in Fig. 8.6.5, where we
only need to specify that the \(3 \times 3\) convolution is a *group
convolution* with \(g\) groups. In fact, the group convolution dates
back to the idea of distributing the AlexNet model over two GPUs due to
limited GPU memory at that time
[Krizhevsky et al., 2012].
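
One way to see why the two dotted boxes agree on the \(3 \times 3\) stage, and why grouping is cheaper than an ordinary convolution, is to count weights. The plain-Python sketch below does so; the helper `conv_weights` and the values of \(b\) and \(g\) are hypothetical choices for illustration, and biases are ignored.

```python
def conv_weights(c_in, c_out, kernel_size, groups=1):
    """Number of weights (no bias) in a convolution with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return kernel_size * kernel_size * (c_in // groups) * c_out

b, g = 32, 4  # hypothetical bottleneck width and number of groups

# g parallel 3x3 branches, each mapping b/g channels to b/g channels.
branches = g * conv_weights(b // g, b // g, kernel_size=3)
# One 3x3 group convolution with g groups over all b channels.
grouped = conv_weights(b, b, kernel_size=3, groups=g)
# An ordinary (dense) 3x3 convolution over all b channels.
dense = conv_weights(b, b, kernel_size=3)
print(branches, grouped, dense)
```

The branch view and the group-convolution view have the same number of weights, and both need a factor of \(g\) fewer weights than the dense convolution.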

The following implementation of the `ResNeXtBlock` class treats
`groups` (\(b/g\) in Fig. 8.6.5) as an argument, that is, as the number
of channels per group, so that given `bot_channels` (\(b\) in
Fig. 8.6.5) bottleneck channels, the \(3 \times 3\) group convolution
will have `bot_channels//groups` groups. Similar to the residual block
implementation in Section 8.6.2, the residual connection is generalized
with a \(1 \times 1\) convolution (`conv4`), where setting
`use_1x1conv=True, strides=2` halves the input height and width.
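
Since the `groups` argument is easy to misread, the following sketch spells out the arithmetic described above. The helper `resnext_widths` is hypothetical; it only mirrors how `bot_channels` and the number of groups are derived from the constructor arguments.

```python
def resnext_widths(num_channels, groups, bot_mul):
    """Return (bot_channels, number_of_groups, channels_per_group) for the 3x3 convolution."""
    bot_channels = int(round(num_channels * bot_mul))
    # `groups` is the width of each group, so the group count is the quotient.
    num_groups = bot_channels // groups
    return bot_channels, num_groups, bot_channels // num_groups

widths = resnext_widths(32, 16, 1)          # same arguments as ResNeXtBlock(32, 16, 1)
narrow = resnext_widths(64, 16, 0.5)        # a bottleneck with b < c
print(widths, narrow)
```

With `num_channels=32`, `groups=16`, and `bot_mul=1`, the \(3 \times 3\) convolution therefore operates on 32 bottleneck channels split into 2 groups of 16 channels each.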

```
class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1,
                                   stride=1)
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                   stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)
```

```
class ResNeXtBlock(nn.Block):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul,
                 use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.Conv2D(bot_channels, kernel_size=1, padding=0,
                               strides=1)
        self.conv2 = nn.Conv2D(bot_channels, kernel_size=3, padding=1,
                               strides=strides,
                               groups=bot_channels//groups)
        self.conv3 = nn.Conv2D(num_channels, kernel_size=1, padding=0,
                               strides=1)
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()
        self.bn3 = nn.BatchNorm()
        if use_1x1conv:
            self.conv4 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
            self.bn4 = nn.BatchNorm()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = npx.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return npx.relu(Y + X)
```

```
class ResNeXtBlock(tf.keras.Model):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = tf.keras.layers.Conv2D(bot_channels, 1, strides=1)
        self.conv2 = tf.keras.layers.Conv2D(bot_channels, 3, strides=strides,
                                            padding="same",
                                            groups=bot_channels//groups)
        self.conv3 = tf.keras.layers.Conv2D(num_channels, 1, strides=1)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.bn3 = tf.keras.layers.BatchNormalization()
        if use_1x1conv:
            self.conv4 = tf.keras.layers.Conv2D(num_channels, 1,
                                                strides=strides)
            self.bn4 = tf.keras.layers.BatchNormalization()
        else:
            self.conv4 = None

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = tf.keras.activations.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return tf.keras.activations.relu(Y + X)
```

In the following case (`use_1x1conv=False, strides=1`), the input and
output are of the same shape.

```
blk = ResNeXtBlock(32, 16, 1)
X = torch.randn(4, 32, 96, 96)
blk(X).shape
```

```
torch.Size([4, 32, 96, 96])
```

```
blk = ResNeXtBlock(32, 16, 1)
blk.initialize()
X = np.random.randn(4, 32, 96, 96)
blk(X).shape
```

```
(4, 32, 96, 96)
```

```
blk = ResNeXtBlock(32, 16, 1)
X = tf.random.normal((4, 96, 96, 32))
Y = blk(X)
Y.shape
```

```
TensorShape([4, 96, 96, 32])
```

Alternatively, setting `use_1x1conv=True, strides=2` halves the output
height and width.

```
blk = ResNeXtBlock(32, 16, 1, use_1x1conv=True, strides=2)
blk(X).shape
```

```
torch.Size([4, 32, 48, 48])
```

```
blk = ResNeXtBlock(32, 16, 1, use_1x1conv=True, strides=2)
blk.initialize()
blk(X).shape
```

```
(4, 32, 48, 48)
```

```
blk = ResNeXtBlock(32, 16, 1, use_1x1conv=True, strides=2)
X = tf.random.normal((4, 96, 96, 32))
Y = blk(X)
Y.shape
```

```
TensorShape([4, 48, 48, 32])
```

## 8.6.6. Summary and Discussion

Nested function classes are desirable. Learning an additional layer in
deep neural networks as an identity function (though this is an extreme
case) should be made easy. The residual mapping can learn the identity
function more easily, for example by pushing the parameters in the weight
layer to zero. We can train an effective *deep* neural network by using
residual blocks. Inputs can forward propagate faster through the residual
connections across layers.

Before residual connections, bypassing paths with gating units were introduced to effectively train highway networks with over 100 layers [Srivastava et al., 2015]. Using identity functions as bypassing paths, ResNets performed remarkably well on multiple computer vision tasks. Residual connections had a major influence on the design of subsequent deep neural networks, both convolutional and sequential in nature. As we will introduce later, the transformer architecture [Vaswani et al., 2017] adopts residual connections (together with other design choices) and is pervasive in areas as diverse as language, vision, speech, and reinforcement learning. A key advantage of the ResNeXt design is that increasing the number of groups leads to sparser connections (i.e., lower computational complexity) within the block, thus enabling an increase of network width to achieve a better tradeoff between FLOPs and accuracy. ResNeXt-ification has proved appealing in later convolutional network designs, such as the RegNet model [Radosavovic et al., 2020] and the ConvNeXt architecture [Liu et al., 2022]. We will apply the ResNeXt block later in this chapter.

## 8.6.7. Exercises

What are the major differences between the Inception block in Fig. 8.4.1 and the residual block? After removing some paths in the Inception block, how are they related to each other?

Refer to Table 1 in the ResNet paper [He et al., 2016a] to implement different variants.

For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model complexity. Try to implement it.

In subsequent versions of ResNet, the authors changed the “convolution, batch normalization, and activation” structure to the “batch normalization, activation, and convolution” structure. Make this improvement yourself. See Figure 1 in [He et al., 2016b] for details.

Why can’t we just increase the complexity of functions without bound, even if the function classes are nested?