.. _sec_naive_bayes:
Naive Bayes
===========
Throughout the previous sections, we learned about the theory of
probability and random variables. To put this theory to work, let’s
introduce the *naive Bayes* classifier. This uses nothing but
probabilistic fundamentals to allow us to perform classification of
digits.
Learning is all about making assumptions. If we want to classify a new
data example that we have never seen before we have to make some
assumptions about which data examples are similar to each other. The
naive Bayes classifier, a popular and remarkably clear algorithm,
assumes all features are independent from each other to simplify the
computation. In this section, we will apply this model to recognize
characters in images.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import math
import torch
import torchvision
from d2l import torch as d2l
d2l.use_svg_display()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import math
from mxnet import gluon, np, npx
from d2l import mxnet as d2l
npx.set_np()
d2l.use_svg_display()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import math
import tensorflow as tf
from d2l import tensorflow as d2l
d2l.use_svg_display()
.. raw:: html
.. raw:: html
Optical Character Recognition
-----------------------------
MNIST :cite:`LeCun.Bottou.Bengio.ea.1998` is one of widely used
datasets. It contains 60,000 images for training and 10,000 images for
validation. Each image contains a handwritten digit from 0 to 9. The
task is classifying each image into the corresponding digit.
Gluon provides a ``MNIST`` class in the ``data.vision`` module to
automatically retrieve the dataset from the Internet. Subsequently,
Gluon will use the already-downloaded local copy. We specify whether we
are requesting the training set or the test set by setting the value of
the parameter ``train`` to ``True`` or ``False``, respectively. Each
image is a grayscale image with both width and height of :math:`28` with
shape (:math:`28`,\ :math:`28`,\ :math:`1`). We use a customized
transformation to remove the last channel dimension. In addition, the
dataset represents each pixel by an unsigned :math:`8`-bit integer. We
quantize them into binary features to simplify the problem.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data_transform = torchvision.transforms.Compose([
torchvision.transforms.ToTensor(),
lambda x: torch.floor(x * 255 / 128).squeeze(dim=0)
])
mnist_train = torchvision.datasets.MNIST(
root='./temp', train=True, transform=data_transform, download=True)
mnist_test = torchvision.datasets.MNIST(
root='./temp', train=False, transform=data_transform, download=True)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./temp/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 115752065.81it/s]
Extracting ./temp/MNIST/raw/train-images-idx3-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 5234904.66it/s]
Extracting ./temp/MNIST/raw/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 43715298.68it/s]Extracting ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 21501725.47it/s]
Extracting ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def transform(data, label):
return np.floor(data.astype('float32') / 128).squeeze(axis=-1), label
mnist_train = gluon.data.vision.MNIST(train=True, transform=transform)
mnist_test = gluon.data.vision.MNIST(train=False, transform=transform)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Downloading /opt/mxnet/datasets/mnist/train-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-images-idx3-ubyte.gz...
Downloading /opt/mxnet/datasets/mnist/train-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-labels-idx1-ubyte.gz...
[22:05:00] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
Downloading /opt/mxnet/datasets/mnist/t10k-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-images-idx3-ubyte.gz...
Downloading /opt/mxnet/datasets/mnist/t10k-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-labels-idx1-ubyte.gz...
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
((train_images, train_labels), (
test_images, test_labels)) = tf.keras.datasets.mnist.load_data()
# Original pixel values of MNIST range from 0-255 (as the digits are stored as
# uint8). For this section, pixel values that are greater than 128 (in the
# original image) are converted to 1 and values that are less than 128 are
# converted to 0. See section 18.9.2 and 18.9.3 for why
train_images = tf.floor(tf.constant(train_images / 128, dtype = tf.float32))
test_images = tf.floor(tf.constant(test_images / 128, dtype = tf.float32))
train_labels = tf.constant(train_labels, dtype = tf.int32)
test_labels = tf.constant(test_labels, dtype = tf.int32)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
.. raw:: html
.. raw:: html
We can access a particular example, which contains the image and the
corresponding label.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image, label = mnist_train[2]
image.shape, label
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(torch.Size([28, 28]), 4)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image, label = mnist_train[2]
image.shape, label
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
((28, 28), array(4, dtype=int32))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image, label = train_images[2], train_labels[2]
image.shape, label.numpy()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(TensorShape([28, 28]), 4)
.. raw:: html
.. raw:: html
Our example, stored here in the variable ``image``, corresponds to an
image with a height and width of :math:`28` pixels.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image.shape, image.dtype
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(torch.Size([28, 28]), torch.float32)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image.shape, image.dtype
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
((28, 28), dtype('float32'))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
image.shape, image.dtype
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(TensorShape([28, 28]), tf.float32)
.. raw:: html
.. raw:: html
Our code stores the label of each image as a scalar. Its type is a
:math:`32`-bit integer.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
label, type(label)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(4, int)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
label, type(label), label.dtype
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(array(4, dtype=int32), mxnet.numpy.ndarray, dtype('int32'))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
label.numpy(), label.dtype
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(4, tf.int32)
.. raw:: html
.. raw:: html
We can also access multiple examples at the same time.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
images = torch.stack([mnist_train[i][0] for i in range(10, 38)], dim=0)
labels = torch.tensor([mnist_train[i][1] for i in range(10, 38)])
images.shape, labels.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(torch.Size([28, 28, 28]), torch.Size([28]))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
images, labels = mnist_train[10:38]
images.shape, labels.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
((28, 28, 28), (28,))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
images = tf.stack([train_images[i] for i in range(10, 38)], axis=0)
labels = tf.constant([train_labels[i].numpy() for i in range(10, 38)])
images.shape, labels.shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(TensorShape([28, 28, 28]), TensorShape([28]))
.. raw:: html
.. raw:: html
Let’s visualize these examples.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_75_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_78_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_81_0.svg
.. raw:: html
.. raw:: html
The Probabilistic Model for Classification
------------------------------------------
In a classification task, we map an example into a category. Here an
example is a grayscale :math:`28\times 28` image, and a category is a
digit. (Refer to :numref:`sec_softmax` for a more detailed
explanation.) One natural way to express the classification task is via
the probabilistic question: what is the most likely label given the
features (i.e., image pixels)? Denote by :math:`\mathbf x\in\mathbb R^d`
the features of the example and :math:`y\in\mathbb R` the label. Here
features are image pixels, where we can reshape a :math:`2`-dimensional
image to a vector so that :math:`d=28^2=784`, and labels are digits. The
probability of the label given the features is
:math:`p(y \mid \mathbf{x})`. If we are able to compute these
probabilities, which are :math:`p(y \mid \mathbf{x})` for
:math:`y=0, \ldots,9` in our example, then the classifier will output
the prediction :math:`\hat{y}` given by the expression:
.. math:: \hat{y} = \mathrm{argmax} \> p(y \mid \mathbf{x}).
Unfortunately, this requires that we estimate
:math:`p(y \mid \mathbf{x})` for every value of
:math:`\mathbf{x} = x_1, ..., x_d`. Imagine that each feature could take
one of :math:`2` values. For example, the feature :math:`x_1 = 1` might
signify that the word apple appears in a given document and
:math:`x_1 = 0` would signify that it does not. If we had :math:`30`
such binary features, that would mean that we need to be prepared to
classify any of :math:`2^{30}` (over 1 billion!) possible values of the
input vector :math:`\mathbf{x}`.
Moreover, where is the learning? If we need to see every single possible
example in order to predict the corresponding label then we are not
really learning a pattern but just memorizing the dataset.
The Naive Bayes Classifier
--------------------------
Fortunately, by making some assumptions about conditional independence,
we can introduce some inductive bias and build a model capable of
generalizing from a comparatively modest selection of training examples.
To begin, let’s use Bayes theorem, to express the classifier as
.. math:: \hat{y} = \mathrm{argmax}_y \> p(y \mid \mathbf{x}) = \mathrm{argmax}_y \> \frac{p( \mathbf{x} \mid y) p(y)}{p(\mathbf{x})}.
Note that the denominator is the normalizing term :math:`p(\mathbf{x})`
which does not depend on the value of the label :math:`y`. As a result,
we only need to worry about comparing the numerator across different
values of :math:`y`. Even if calculating the denominator turned out to
be intractable, we could get away with ignoring it, so long as we could
evaluate the numerator. Fortunately, even if we wanted to recover the
normalizing constant, we could. We can always recover the normalization
term since :math:`\sum_y p(y \mid \mathbf{x}) = 1`.
Now, let’s focus on :math:`p( \mathbf{x} \mid y)`. Using the chain rule
of probability, we can express the term :math:`p( \mathbf{x} \mid y)` as
.. math:: p(x_1 \mid y) \cdot p(x_2 \mid x_1, y) \cdot ... \cdot p( x_d \mid x_1, ..., x_{d-1}, y).
By itself, this expression does not get us any further. We still must
estimate roughly :math:`2^d` parameters. However, if we assume that *the
features are conditionally independent of each other, given the label*,
then suddenly we are in much better shape, as this term simplifies to
:math:`\prod_i p(x_i \mid y)`, giving us the predictor
.. math:: \hat{y} = \mathrm{argmax}_y \> \prod_{i=1}^d p(x_i \mid y) p(y).
If we can estimate :math:`p(x_i=1 \mid y)` for every :math:`i` and
:math:`y`, and save its value in :math:`P_{xy}[i, y]`, here
:math:`P_{xy}` is a :math:`d\times n` matrix with :math:`n` being the
number of classes and :math:`y\in\{1, \ldots, n\}`, then we can also use
this to estimate :math:`p(x_i = 0 \mid y)`, i.e.,
.. math::
p(x_i = t_i \mid y) =
\begin{cases}
P_{xy}[i, y] & \textrm{for } t_i=1 ;\\
1 - P_{xy}[i, y] & \textrm{for } t_i = 0 .
\end{cases}
In addition, we estimate :math:`p(y)` for every :math:`y` and save it in
:math:`P_y[y]`, with :math:`P_y` a :math:`n`-length vector. Then, for
any new example :math:`\mathbf t = (t_1, t_2, \ldots, t_d)`, we could
compute
.. math:: \begin{aligned}\hat{y} &= \mathrm{argmax}_ y \ p(y)\prod_{i=1}^d p(x_t = t_i \mid y) \\ &= \mathrm{argmax}_y \ P_y[y]\prod_{i=1}^d \ P_{xy}[i, y]^{t_i}\, \left(1 - P_{xy}[i, y]\right)^{1-t_i}\end{aligned}
:label: eq_naive_bayes_estimation
for any :math:`y`. So our assumption of conditional independence has
taken the complexity of our model from an exponential dependence on the
number of features :math:`\mathcal{O}(2^dn)` to a linear dependence,
which is :math:`\mathcal{O}(dn)`.
Training
--------
The problem now is that we do not know :math:`P_{xy}` and :math:`P_y`.
So we need to estimate their values given some training data first. This
is *training* the model. Estimating :math:`P_y` is not too hard. Since
we are only dealing with :math:`10` classes, we may count the number of
occurrences :math:`n_y` for each of the digits and divide it by the
total amount of data :math:`n`. For instance, if digit 8 occurs
:math:`n_8 = 5,800` times and we have a total of :math:`n = 60,000`
images, the probability estimate is :math:`p(y=8) = 0.0967`.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.stack([mnist_train[i][0] for i in range(len(mnist_train))], dim=0)
Y = torch.tensor([mnist_train[i][1] for i in range(len(mnist_train))])
n_y = torch.zeros(10)
for y in range(10):
n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([0.0987, 0.1124, 0.0993, 0.1022, 0.0974, 0.0904, 0.0986, 0.1044, 0.0975,
0.0992])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, Y = mnist_train[:] # All training examples
n_y = np.zeros((10))
for y in range(10):
n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([0.09871667, 0.11236667, 0.0993 , 0.10218333, 0.09736667,
0.09035 , 0.09863333, 0.10441667, 0.09751666, 0.09915 ])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = train_images
Y = train_labels
n_y = tf.Variable(tf.zeros(10))
for y in range(10):
n_y[y].assign(tf.reduce_sum(tf.cast(Y == y, tf.float32)))
P_y = n_y / tf.reduce_sum(n_y)
P_y
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
Now on to slightly more difficult things :math:`P_{xy}`. Since we picked
black and white images, :math:`p(x_i \mid y)` denotes the probability
that pixel :math:`i` is switched on for class :math:`y`. Just like
before we can go and count the number of times :math:`n_{iy}` such that
an event occurs and divide it by the total number of occurrences of
:math:`y`, i.e., :math:`n_y`. But there is something slightly troubling:
certain pixels may never be black (e.g., for well cropped images the
corner pixels might always be white). A convenient way for statisticians
to deal with this problem is to add pseudo counts to all occurrences.
Hence, rather than :math:`n_{iy}` we use :math:`n_{iy}+1` and instead of
:math:`n_y` we use :math:`n_{y}+2` (since there are two possible values
pixel :math:`i` can take - it can either be black or white). This is
also called *Laplace Smoothing*. It may seem ad-hoc, however it can be
motivated from a Bayesian point-of-view by a Beta-binomial model.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n_x = torch.zeros((10, 28, 28))
for y in range(10):
n_x[y] = torch.tensor(X.numpy()[Y.numpy() == y].sum(axis=0))
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)
d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_99_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n_x = np.zeros((10, 28, 28))
for y in range(10):
n_x[y] = np.array(X.asnumpy()[Y.asnumpy() == y].sum(axis=0))
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)
d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_102_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n_x = tf.Variable(tf.zeros((10, 28, 28)))
for y in range(10):
n_x[y].assign(tf.cast(tf.reduce_sum(
X.numpy()[Y.numpy() == y], axis=0), tf.float32))
P_xy = (n_x + 1) / tf.reshape((n_y + 2), (10, 1, 1))
d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_105_0.svg
.. raw:: html
.. raw:: html
By visualizing these :math:`10\times 28\times 28` probabilities (for
each pixel for each class) we could get some mean looking digits.
Now we can use :eq:`eq_naive_bayes_estimation` to predict a new
image. Given :math:`\mathbf x`, the following functions computes
:math:`p(\mathbf x \mid y)p(y)` for every :math:`y`.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def bayes_pred(x):
x = x.unsqueeze(0) # (28, 28) -> (1, 28, 28)
p_xy = P_xy * x + (1 - P_xy)*(1 - x)
p_xy = p_xy.reshape(10, -1).prod(dim=1) # p(x|y)
return p_xy * P_y
image, label = mnist_test[0]
bayes_pred(image)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def bayes_pred(x):
x = np.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = P_xy * x + (1 - P_xy)*(1 - x)
p_xy = p_xy.reshape(10, -1).prod(axis=1) # p(x|y)
return np.array(p_xy) * P_y
image, label = mnist_test[0]
bayes_pred(image)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def bayes_pred(x):
x = tf.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = P_xy * x + (1 - P_xy)*(1 - x)
p_xy = tf.math.reduce_prod(tf.reshape(p_xy, (10, -1)), axis=1) # p(x|y)
return p_xy * P_y
image, label = train_images[0], train_labels[0]
bayes_pred(image)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
This went horribly wrong! To find out why, let’s look at the per pixel
probabilities. They are typically numbers between :math:`0.001` and
:math:`1`. We are multiplying :math:`784` of them. At this point it is
worth mentioning that we are calculating these numbers on a computer,
hence with a fixed range for the exponent. What happens is that we
experience *numerical underflow*, i.e., multiplying all the small
numbers leads to something even smaller until it is rounded down to
zero. We discussed this as a theoretical issue in
:numref:`sec_maximum_likelihood`, but we see the phenomena clearly
here in practice.
As discussed in that section, we fix this by use the fact that
:math:`\log a b = \log a + \log b`, i.e., we switch to summing
logarithms. Even if both :math:`a` and :math:`b` are small numbers, the
logarithm values should be in a proper range.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
underflow: 0.0
logarithm is normal: -1805.2267129073316
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
underflow: 0.0
logarithm is normal: -1805.2267129073316
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*tf.math.log(a).numpy())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
underflow: 0.0
logarithm is normal: -1805.2267379760742
.. raw:: html
.. raw:: html
Since the logarithm is an increasing function, we can rewrite
:eq:`eq_naive_bayes_estimation` as
.. math:: \hat{y} = \mathrm{argmax}_y \ \log P_y[y] + \sum_{i=1}^d \Big[t_i\log P_{xy}[x_i, y] + (1-t_i) \log (1 - P_{xy}[x_i, y]) \Big].
We can implement the following stable version:
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
log_P_xy = torch.log(P_xy)
log_P_xy_neg = torch.log(1 - P_xy)
log_P_y = torch.log(P_y)
def bayes_pred_stable(x):
x = x.unsqueeze(0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = p_xy.reshape(10, -1).sum(axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([-268.9725, -301.7044, -245.1951, -218.8738, -193.4570, -206.0909,
-292.5226, -114.6257, -220.3313, -163.1784])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
log_P_xy = np.log(P_xy)
log_P_xy_neg = np.log(1 - P_xy)
log_P_y = np.log(P_y)
def bayes_pred_stable(x):
x = np.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = p_xy.reshape(10, -1).sum(axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([-268.97253, -301.7044 , -245.19514, -218.87384, -193.45703,
-206.09088, -292.52264, -114.62566, -220.33133, -163.17842])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
log_P_xy = tf.math.log(P_xy)
log_P_xy_neg = tf.math.log(1 - P_xy)
log_P_y = tf.math.log(P_y)
def bayes_pred_stable(x):
x = tf.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = tf.math.reduce_sum(tf.reshape(p_xy, (10, -1)), axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
We may now check if the prediction is correct.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
py.argmax(dim=0) == label
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor(True)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Convert label which is a scalar tensor of int32 dtype to a Python scalar
# integer for comparison
py.argmax(axis=0) == int(label)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array(True)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.argmax(py, axis=0, output_type = tf.int32) == label
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
If we now predict a few validation examples, we can see the Bayes
classifier works pretty well.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def predict(X):
return [bayes_pred_stable(x).argmax(dim=0).type(torch.int32).item()
for x in X]
X = torch.stack([mnist_test[i][0] for i in range(18)], dim=0)
y = torch.tensor([mnist_test[i][1] for i in range(18)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_159_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def predict(X):
return [bayes_pred_stable(x).argmax(axis=0).astype(np.int32) for x in X]
X, y = mnist_test[:18]
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_162_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def predict(X):
return [tf.argmax(
bayes_pred_stable(x), axis=0, output_type = tf.int32).numpy()
for x in X]
X = tf.stack([train_images[i] for i in range(10, 38)], axis=0)
y = tf.constant([train_labels[i].numpy() for i in range(10, 38)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_165_0.svg
.. raw:: html
.. raw:: html
Finally, let’s compute the overall accuracy of the classifier.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.stack([mnist_test[i][0] for i in range(len(mnist_test))], dim=0)
y = torch.tensor([mnist_test[i][1] for i in range(len(mnist_test))])
preds = torch.tensor(predict(X), dtype=torch.int32)
float((preds == y).sum()) / len(y) # Validation accuracy
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
0.8427
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = mnist_test[:]
preds = np.array(predict(X), dtype=np.int32)
float((preds == y).sum()) / len(y) # Validation accuracy
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
0.8427
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = test_images
y = test_labels
preds = tf.constant(predict(X), dtype=tf.int32)
# Validation accuracy
tf.reduce_sum(tf.cast(preds == y, tf.float32)).numpy() / len(y)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
0.8427
.. raw:: html
.. raw:: html
Modern deep networks achieve error rates of less than :math:`0.01`. The
relatively poor performance is due to the incorrect statistical
assumptions that we made in our model: we assumed that each and every
pixel are *independently* generated, depending only on the label. This
is clearly not how humans write digits, and this wrong assumption led to
the downfall of our overly naive (Bayes) classifier.
Summary
-------
- Using Bayes’ rule, a classifier can be made by assuming all observed
features are independent.
- This classifier can be trained on a dataset by counting the number of
occurrences of combinations of labels and pixel values.
- This classifier was the gold standard for decades for tasks such as
spam detection.
Exercises
---------
1. Consider the dataset :math:`[[0,0], [0,1], [1,0], [1,1]]` with labels
given by the XOR of the two elements :math:`[0,1,1,0]`. What are the
probabilities for a Naive Bayes classifier built on this dataset.
Does it successfully classify our points? If not, what assumptions
are violated?
2. Suppose that we did not use Laplace smoothing when estimating
probabilities and a data example arrived at testing time which
contained a value never observed in training. What would the model
output?
3. The naive Bayes classifier is a specific example of a Bayesian
network, where the dependence of random variables are encoded with a
graph structure. While the full theory is beyond the scope of this
section (see :cite:t:`Koller.Friedman.2009` for full details),
explain why allowing explicit dependence between the two input
variables in the XOR model allows for the creation of a successful
classifier.
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html