# 4.1. Softmax Regression¶ Open the notebook in SageMaker Studio Lab

In Section 3.1, we introduced linear regression, working through implementations from scratch in Section 3.4 and again using high-level APIs of a deep learning framework in Section 3.5 to do the heavy lifting.

Regression is the hammer we reach for when we want to answer *how much?*
or *how many?* questions. If you want to predict the number of dollars
(price) at which a house will be sold, or the number of wins a baseball
team might have, or the number of days that a patient will remain
hospitalized before being discharged, then you are probably looking for
a regression model. However, even within regression models, there are
important distinctions. For instance, the price of a house will never be
negative and changes might often be *relative* to its baseline price. As
such, it might be more effective to regress on the logarithm of the
price. Likewise, the number of days a patient spends in hospital is a
*discrete nonnegative* random variable. As such, least mean squares
might not be an ideal approach either. This sort of time-to-event
modeling comes with a host of other complications that are dealt with in
a specialized subfield called *survival modeling*.

The point here is not to overwhelm you but just to let you know that
there is a lot more to estimation than simply minimizing squared errors.
And more broadly, there’s a lot more to supervised learning than
regression. In this section, we focus on *classification* problems where
we put aside *how much?* questions and instead focus on *which
category?* questions.

Does this email belong in the spam folder or the inbox?

Is this customer more likely to sign up or not to sign up for a subscription service?

Does this image depict a donkey, a dog, a cat, or a rooster?

Which movie is Aston most likely to watch next?

Which section of the book are you going to read next?

Colloquially, machine learning practitioners overload the word
*classification* to describe two subtly different problems: (i) those
where we are interested only in hard assignments of examples to
categories (classes); and (ii) those where we wish to make soft
assignments, i.e., to assess the probability that each category applies.
The distinction tends to get blurred, in part, because often, even when
we only care about hard assignments, we still use models that make soft
assignments.

Even more, there are cases where more than one label might be true. For instance, a news article might simultaneously cover the topics of entertainment, business, and space flight, but not the topics of medicine or sports. Thus, categorizing it into one of the above categories on their own would not be very useful. This problem is commonly known as multi-label classification. See [Tsoumakas & Katakis, 2007] for an overview and [Huang et al., 2015] for an effective algorithm when tagging images.

## 4.1.1. Classification¶

To get our feet wet, let’s start with a simple image classification problem. Here, each input consists of a \(2\times2\) grayscale image. We can represent each pixel value with a single scalar, giving us four features \(x_1, x_2, x_3, x_4\). Further, let’s assume that each image belongs to one among the categories “cat”, “chicken”, and “dog”.

Next, we have to choose how to represent the labels. We have two obvious
choices. Perhaps the most natural impulse would be to choose
\(y \in \{1, 2, 3\}\), where the integers represent
\(\{\text{dog}, \text{cat}, \text{chicken}\}\) respectively. This is
a great way of *storing* such information on a computer. If the
categories had some natural ordering among them, say if we were trying
to predict
\(\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}\),
then it might even make sense to cast this as an ordinal
regression problem
and keep the labels in this format. See
[Moon et al., 2010] for an overview of different types of
ranking loss functions and [Beutel et al., 2014] for
a Bayesian approach that addresses responses with more than one mode.

In general, classification problems do not come with natural orderings
among the classes. Fortunately, statisticians long ago invented a simple
way to represent categorical data: the *one-hot encoding*. A one-hot
encoding is a vector with as many components as we have categories. The
component corresponding to a particular instance’s category is set to 1
and all other components are set to 0. In our case, a label \(y\)
would be a three-dimensional vector, with \((1, 0, 0)\)
corresponding to “cat”, \((0, 1, 0)\) to “chicken”, and
\((0, 0, 1)\) to “dog”:

### 4.1.1.1. Linear Model¶

In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class. To address classification with linear models, we will need as many affine functions as we have outputs. Strictly speaking, we only need one fewer, since the last category has to be the difference between \(1\) and the sum of the other categories but for reasons of symmetry we use a slightly redundant parametrization. Each output corresponds to its own affine function. In our case, since we have 4 features and 3 possible output categories, we need 12 scalars to represent the weights (\(w\) with subscripts), and 3 scalars to represent the biases (\(b\) with subscripts). This yields:

The corresponding neural network diagram is shown in
Fig. 4.1.1. Just as in linear regression, we use a
single-layer neural network. And since the calculation of each output,
\(o_1, o_2\), and \(o_3\), depends on all inputs, \(x_1\),
\(x_2\), \(x_3\), and \(x_4\), the output layer can also be
described as a *fully connected layer*.

For a more concise notation we use vectors and matrices: \(\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}\) is much better suited for mathematics and code. Note that we have gathered all of our weights into a \(3 \times 4\) matrix and all biases \(\mathbf{b} \in \mathbb{R}^3\) in a vector.

### 4.1.1.2. The Softmax¶

Assuming a suitable loss function, we could try, directly, to minimize the difference between \(\mathbf{o}\) and the labels \(\mathbf{y}\). While it turns out that treating classification as a vector-valued regression problem works surprisingly well, it is nonetheless lacking in the following ways:

There is no guarantee that the outputs \(o_i\) sum up to \(1\) in the way we expect probabilities to behave.

There is no guarantee that the outputs \(o_i\) are even nonnegative, even if their outputs sum up to \(1\), or that they do not exceed \(1\).

Both aspects render the estimation problem difficult to solve and the solution very brittle to outliers. For instance, if we assume that there is a positive linear dependency between the number of bedrooms and the likelihood that someone will buy a house, the probability might exceed \(1\) when it comes to buying a mansion! As such, we need a mechanism to “squish” the outputs.

There are many ways we might to accomplish this goal. For instance, we could assume that the outputs \(\mathbf{o}\) are corrupted versions of \(\mathbf{y}\), where the corruption occurs by means of adding noise \(\mathbf{\epsilon}\) drawn from a normal distribution. In other words, \(\mathbf{y} = \mathbf{o} + \mathbf{\epsilon}\), where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\). This is the so-called probit model, first introduced by Fechner [Fechner, 1860]. While appealing, it doesn’t work quite as well or lead to a particularly nice optimization problem, when compared to the softmax.

Another way to accomplish this goal (and to ensure nonnegativity) is to
use an exponential function \(P(y = i) \propto \exp o_i\). This does
indeed satisfy the requirement that the conditional class probability
increases with increasing \(o_i\), it is monotonic, and all
probabilities are nonnegative. We can then transform these values so
that they add up to \(1\) by dividing each by their sum. This
process is called *normalization*. Putting these two pieces together
gives us the *softmax* function:

Note that the largest coordinate of \(\mathbf{o}\) corresponds to the most likely class according to \(\hat{\mathbf{y}}\). Moreover, because the softmax operation preserves the ordering among its arguments, we do not need to compute the softmax to determine which class has been assigned the highest probability.

The idea of a softmax dates back to Gibbs, who adapted ideas from physics [Gibbs, 1902]. Dating even further back, Boltzmann, the father of modern thermodynamics, used this trick to model a distribution over energy states in gas molecules. In particular, he discovered that the prevalence of a state of energy in a thermodynamic ensemble, such as the molecules in a gas, is proportional to \(\exp(-E/kT)\). Here, \(E\) is the energy of a state, \(T\) is the temperature, and \(k\) is the Boltzmann constant. When statisticians talk about increasing or decreasing the “temperature” of a statistical system, they refer to changing \(T\) in order to favor lower or higher energy states. Following Gibbs’ idea, energy equates to error. Energy-based models [Ranzato et al., 2007] use this point of view when describing problems in deep learning.

### 4.1.1.3. Vectorization¶

To improve computational efficiency, we vectorize calculations in minibatches of data. Assume that we are given a minibatch \(\mathbf{X} \in \mathbb{R}^{n \times d}\) of \(n\) features with dimensionality (number of inputs) \(d\). Moreover, assume that we have \(q\) categories in the output. Then the weights satisfy \(\mathbf{W} \in \mathbb{R}^{d \times q}\) and the bias satisfies \(\mathbf{b} \in \mathbb{R}^{1\times q}\).

This accelerates the dominant operation into a matrix-matrix product
\(\mathbf{X} \mathbf{W}\). Moreover, since each row in
\(\mathbf{X}\) represents a data example, the softmax operation
itself can be computed *rowwise*: for each row of \(\mathbf{O}\),
exponentiate all entries and then normalize them by the sum. Note,
though, that care must be taken to avoid exponentiating and taking
logarithms of large numbers, since this can cause numerical overflow or
underflow. Deep learning frameworks take care of this automatically.

## 4.1.2. Loss Function¶

Now that we have a mapping from features \(\mathbf{x}\) to probabilities \(\mathbf{\hat{y}}\), we need a way to optimize the accuracy of this mapping. We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error loss in Section 3.1.3.

### 4.1.2.1. Log-Likelihood¶

The softmax function gives us a vector \(\hat{\mathbf{y}}\), which we can interpret as (estimated) conditional probabilities of each class, given any input \(\mathbf{x}\), such as \(\hat{y}_1\) = \(P(y=\text{cat} | \mathbf{x})\). In the following we assume that for a dataset with features \(\mathbf{X}\) the labels \(\mathbf{Y}\) are represented using a one-hot encoding label vector. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:

We are allowed to use the factorization since we assume that each label is drawn independently from its respective distribution \(P(\mathbf{y}|\mathbf{x}^{(i)})\). Since maximizing the product of terms is awkward, we take the negative logarithm to obtain the equivalent problem of minimizing the negative log-likelihood:

where for any pair of label \(\mathbf{y}\) and model prediction \(\hat{\mathbf{y}}\) over \(q\) classes, the loss function \(l\) is

For reasons explained later on, the loss function in
(4.1.8) is commonly called the *cross-entropy
loss*. Since \(\mathbf{y}\) is a one-hot vector of length \(q\),
the sum over all its coordinates \(j\) vanishes for all but one
term. Note that the loss \(l(\mathbf{y}, \hat{\mathbf{y}})\) is
bounded from below by \(0\) whenever \(\hat{y}\) is a
probability vector: no single entry is larger than \(1\), hence
their negative logarithm cannot be lower than \(0\);
\(l(\mathbf{y}, \hat{\mathbf{y}}) = 0\) only if we predict the
actual label with *certainty*. This can never happen for any finite
setting of the weights because taking a softmax output towards \(1\)
requires taking the corresponding input \(o_i\) to infinity (or all
other outputs \(o_j\) for \(j \neq i\) to negative infinity).
Even if our model could assign an output probability of \(0\), any
error made when assigning such high confidence would incur infinite loss
(\(-\log 0 = \infty\)).

### 4.1.2.2. Softmax and Cross-Entropy Loss¶

Since the softmax function and the corresponding cross-entropy loss are so common, it is worth understanding a bit better how they are computed. Plugging (4.1.3) into the definition of the loss in (4.1.8) and using the definition of the softmax we obtain:

To understand a bit better what is going on, consider the derivative with respect to any logit \(o_j\). We get

In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the gradient was the difference between the observation \(y\) and estimate \(\hat{y}\). This is not coincidence. In any exponential family model, the gradients of the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.

Now consider the case where we observe not just a single outcome but an
entire distribution over outcomes. We can use the same representation as
before for the label \(\mathbf{y}\). The only difference is that
rather than a vector containing only binary entries, say
\((0, 0, 1)\), we now have a generic probability vector, say
\((0.1, 0.2, 0.7)\). The math that we used previously to define the
loss \(l\) in (4.1.8) still works out fine,
just that the interpretation is slightly more general. It is the
expected value of the loss for a distribution over labels. This loss is
called the *cross-entropy loss* and it is one of the most commonly used
losses for classification problems. We can demystify the name by
introducing just the basics of information theory. In a nutshell, it
measures the number of bits to encode what we see \(\mathbf{y}\)
relative to what we predict that should happen \(\hat{\mathbf{y}}\).
We provide a very basic explanation in the following. For further
details on information theory see [Cover & Thomas, 1999] or
[MacKay & MacKay, 2003].

## 4.1.3. Information Theory Basics¶

Many deep learning papers use intuition and terms from information
theory. To make sense of them, we need some common language. This is a
survival guide. *Information theory* deals with the problem of encoding,
decoding, transmitting, and manipulating information (also known as
data).

### 4.1.3.1. Entropy¶

The central idea in information theory is to quantify the amount of
information contained in data. This places a limit on our ability to
compress data. For a distribution \(P\) its *entropy* is defined as:

One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution \(P\), we need at least \(H[P]\) “nats” to encode it [Shannon, 1948]. If you wonder what a “nat” is, it is the equivalent of bit but when using a code with base \(e\) rather than one with base 2. Thus, one nat is \(\frac{1}{\log(2)} \approx 1.44\) bit.

### 4.1.3.2. Surprisal¶

You might be wondering what compression has to do with prediction. Imagine that we have a stream of data that we want to compress. If it is always easy for us to predict the next token, then this data is easy to compress. Take the extreme example where every token in the stream always takes the same value. That is a very boring data stream! And not only it is boring, but it is also easy to predict. Because they are always the same, we do not have to transmit any information to communicate the contents of the stream. Easy to predict, easy to compress.

However if we cannot perfectly predict every event, then we might
sometimes be surprised. Our surprise is greater when we assigned an
event lower probability. Claude Shannon settled on
\(\log \frac{1}{P(j)} = -\log P(j)\) to quantify one’s *surprisal*
at observing an event \(j\) having assigned it a (subjective)
probability \(P(j)\). The entropy defined in
(4.1.11) is then the *expected surprisal* when
one assigned the correct probabilities that truly match the
data-generating process.

### 4.1.3.3. Cross-Entropy Revisited¶

So if entropy is the level of surprise experienced by someone who knows
the true probability, then you might be wondering, what is
cross-entropy? The cross-entropy *from* \(P\) *to* \(Q\),
denoted \(H(P, Q)\), is the expected surprisal of an observer with
subjective probabilities \(Q\) upon seeing data that was actually
generated according to probabilities \(P\). This is given by
\(H(P, Q) \stackrel{\mathrm{def}}{=} \sum_j - P(j) \log Q(j)\). The
lowest possible cross-entropy is achieved when \(P=Q\). In this
case, the cross-entropy from \(P\) to \(Q\) is
\(H(P, P)= H(P)\).

In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.

## 4.1.4. Summary and Discussion¶

In this section, we encountered the first nontrivial loss function,
allowing us to optimize over *discrete* output spaces. Key in its design
was that we took a probabilistic approach, treating discrete categories
as instances of draws from a probability distribution. As a side effect,
we encountered the softmax, a convenient activation function that
transforms outputs of an ordinary neural network layer into valid
discrete probability distributions. We saw that the derivative of the
cross entropy loss when combined with softmax behaves very similarly to
the derivative of squared error, namely by taking the difference between
the expected behavior and its prediction. And, while we were only able
to scratch the very surface of it, we encountered exciting connections
to statistical physics and information theory.

While this is enough to get you on your way, and hopefully enough to whet your appetite, we hardly dived deep here. Among other things, we skipped over computational considerations. Specifically, for any fully connected layer with \(d\) inputs and \(q\) outputs, the parameterization and computational cost is \(\mathcal{O}(dq)\), which can be prohibitively high in practice. Fortunately, this cost of transforming \(d\) inputs into \(q\) outputs can be reduced through approximation and compression. For instance Deep Fried Convnets [Yang et al., 2015] uses a combination of permutations, Fourier transforms, and scaling to reduce the cost from quadratic to log-linear. Similar techniques work for more advanced structural matrix approximations [Sindhwani et al., 2015]. Lastly, we can use Quaternion-like decompositions to reduce the cost to \(\mathcal{O}(\frac{dq}{n})\), again if we are willing to trade off a small amount of accuracy for computational and storage cost [Zhang et al., 2021a] based on a compression factor \(n\). This is an active area of research. What makes it challenging is that we do not necessarily strive for the most compact representation or the smallest number of floating point operations but rather for the solution that can be executed most efficiently on modern GPUs.

## 4.1.5. Exercises¶

We can explore the connection between exponential families and the softmax in some more depth.

Compute the second derivative of the cross-entropy loss \(l(\mathbf{y},\hat{\mathbf{y}})\) for the softmax.

Compute the variance of the distribution given by \(\mathrm{softmax}(\mathbf{o})\) and show that it matches the second derivative computed above.

Assume that we have three classes which occur with equal probability, i.e., the probability vector is \((\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\).

What is the problem if we try to design a binary code for it?

Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode \(n\) observations jointly?

When encoding signals transmitted over a physical wire, engineers don’t always use binary codes. For instance, PAM-3 uses three signal levels \(\{-1, 0, 1\}\) as opposed to two levels \(\{0, 1\}\). How many ternary units do you need to transmit an integer in the range \(\{0, \ldots, 7\}\)? Why might this be a better idea in terms of electronics?

The Bradley-Terry model uses a logistic model to capture preferences. For a user to choose between apples and oranges one assumes scores \(o_{\mathrm{apple}}\) and \(o_{\mathrm{orange}}\). Our requirements are that larger scores should lead to a higher likelihood in choosing the associated item and that the item with the largest score is the most likely one to be chosen [Bradley & Terry, 1952].

Prove that the softmax satisfies this requirement.

What happens if you want to allow for a default option of choosing neither apples nor oranges? Hint: now the user has 3 choices.

Softmax derives its name from the following mapping: \(\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))\).

Prove that \(\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)\).

How small can you make the difference between both functions? Hint: without loss of generality you can set \(b = 0\) and \(a \geq b\).

Prove that this holds for \(\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)\), provided that \(\lambda > 0\).

Show that for \(\lambda \to \infty\) we have \(\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)\).

What does the soft-min look like?

Extend this to more than two numbers.

The function \(g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i\) is sometimes also referred to as the log-partition function.

Prove that the function is convex. Hint: to do so, use the fact that the first derivative amounts to the probabilities from the softmax function and show that the second derivative is the variance.

Show that \(g\) is translation invariant, i.e., \(g(\mathbf{x} + b) = g(\mathbf{x})\).

What happens if some of the coordinates \(x_i\) are very large? What happens if they’re all very small?

Show that if we choose \(b = \mathrm{max}_i x_i\) we end up with a numerically stable implementation.

Assume that we have some probability distribution \(P\). Suppose we pick another distribution \(Q\) with \(Q(i) \propto P(i)^\alpha\) for \(\alpha > 0\).

Which choice of \(\alpha\) corresponds to doubling the temperature? Which choice corresponds to halving it?

What happens if we let the temperature converge to \(0\)?

What happens if we let the temperature converge to \(\infty\)?