.. _sec_use_gpu: GPUs ==== In :numref:`tab_intro_decade`, we illustrated the rapid growth of computation over the past two decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers great opportunities but it also suggests that there was significant demand for such performance. In this section, we begin to discuss how to harness this computational performance for your research. First by using a single GPU and at a later point, how to use multiple GPUs and multiple servers (with multiple GPUs). Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure you have at least one NVIDIA GPU installed. Then, download the `NVIDIA driver and CUDA `__ and follow the prompts to set the appropriate path. Once these preparations are complete, the ``nvidia-smi`` command can be used to view the graphics card information. .. raw:: html
.. raw:: html
In PyTorch, every array has a device; we often refer it as a *context*. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU. .. raw:: html
.. raw:: html
You might have noticed that a MXNet tensor looks almost identical to a NumPy ``ndarray``. But there are a few crucial differences. One of the key features that distinguishes MXNet from NumPy is its support for diverse hardware devices. In MXNet, every array has a context. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU. Next, we need to confirm that the GPU version of MXNet is installed. If a CPU version of MXNet is already installed, we need to uninstall it first. For example, use the ``pip uninstall mxnet`` command, then install the corresponding MXNet version according to your CUDA version. Assuming you have CUDA 10.0 installed, you can install the MXNet version that supports CUDA 10.0 via ``pip install mxnet-cu100``. .. raw:: html
.. raw:: html
To run the programs in this section, you need at least two GPUs. Note that this might be extravagant for most desktop computers but it is easily available in the cloud, e.g., by using the AWS EC2 multi-GPU instances. Almost all other sections do *not* require multiple GPUs, but here we simply wish to illustrate data flow between different devices. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python import torch from torch import nn from d2l import torch as d2l .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python from mxnet import np, npx from mxnet.gluon import nn from d2l import mxnet as d2l npx.set_np() .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python import jax from flax import linen as nn from jax import numpy as jnp from d2l import jax as d2l .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python import tensorflow as tf from d2l import tensorflow as d2l .. raw:: html
.. raw:: html
Computing Devices ----------------- We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors are created in the main memory and then the CPU is used for calculations. .. raw:: html
.. raw:: html
In PyTorch, the CPU and GPU can be indicated by ``torch.device('cpu')`` and ``torch.device('cuda')``. It should be noted that the ``cpu`` device means all physical CPUs and memory. This means that PyTorch’s calculations will try to use all CPU cores. However, a ``gpu`` device only represents one card and the corresponding memory. If there are multiple GPUs, we use ``torch.device(f'cuda:{i}')`` to represent the :math:`i^\textrm{th}` GPU (:math:`i` starts at 0). Also, ``gpu:0`` and ``gpu`` are equivalent. .. raw:: latex \diilbookstyleinputcell .. code:: python def cpu(): #@save """Get the CPU device.""" return torch.device('cpu') def gpu(i=0): #@save """Get a GPU device.""" return torch.device(f'cuda:{i}') cpu(), gpu(), gpu(1) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (device(type='cpu'), device(type='cuda', index=0), device(type='cuda', index=1)) .. raw:: html
.. raw:: html
In MXNet, the CPU and GPU can be indicated by ``cpu()`` and ``gpu()``. It should be noted that ``cpu()`` (or any integer in the parentheses) means all physical CPUs and memory. This means that MXNet’s calculations will try to use all CPU cores. However, ``gpu()`` only represents one card and the corresponding memory. If there are multiple GPUs, we use ``gpu(i)`` to represent the :math:`i^\textrm{th}` GPU (:math:`i` starts from 0). Also, ``gpu(0)`` and ``gpu()`` are equivalent. .. raw:: latex \diilbookstyleinputcell .. code:: python def cpu(): #@save """Get the CPU device.""" return npx.cpu() def gpu(i=0): #@save """Get a GPU device.""" return npx.gpu(i) cpu(), gpu(), gpu(1) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (cpu(0), gpu(0), gpu(1)) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def cpu(): #@save """Get the CPU device.""" return jax.devices('cpu')[0] def gpu(i=0): #@save """Get a GPU device.""" return jax.devices('gpu')[i] cpu(), gpu(), gpu(1) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (CpuDevice(id=0), gpu(id=0), gpu(id=1)) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def cpu(): #@save """Get the CPU device.""" return tf.device('/CPU:0') def gpu(i=0): #@save """Get a GPU device.""" return tf.device(f'/GPU:{i}') cpu(), gpu(), gpu(1) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (, , ) .. raw:: html
.. raw:: html
We can query the number of available GPUs. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def num_gpus(): #@save """Get the number of available GPUs.""" return torch.cuda.device_count() num_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output 2 .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def num_gpus(): #@save """Get the number of available GPUs.""" return npx.num_gpus() num_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output 2 .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def num_gpus(): #@save """Get the number of available GPUs.""" try: return jax.device_count('gpu') except: return 0 # No GPU backend found num_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output 2 .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def num_gpus(): #@save """Get the number of available GPUs.""" return len(tf.config.experimental.list_physical_devices('GPU')) num_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output 2 .. raw:: html
.. raw:: html
Now we define two convenient functions that allow us to run code even if the requested GPUs do not exist. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def try_gpu(i=0): #@save """Return gpu(i) if exists, otherwise return cpu().""" if num_gpus() >= i + 1: return gpu(i) return cpu() def try_all_gpus(): #@save """Return all available GPUs, or [cpu(),] if no GPU exists.""" return [gpu(i) for i in range(num_gpus())] try_gpu(), try_gpu(10), try_all_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (device(type='cuda', index=0), device(type='cpu'), [device(type='cuda', index=0), device(type='cuda', index=1)]) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def try_gpu(i=0): #@save """Return gpu(i) if exists, otherwise return cpu().""" if num_gpus() >= i + 1: return gpu(i) return cpu() def try_all_gpus(): #@save """Return all available GPUs, or [cpu(),] if no GPU exists.""" return [gpu(i) for i in range(num_gpus())] try_gpu(), try_gpu(10), try_all_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (gpu(0), cpu(0), [gpu(0), gpu(1)]) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def try_gpu(i=0): #@save """Return gpu(i) if exists, otherwise return cpu().""" if num_gpus() >= i + 1: return gpu(i) return cpu() def try_all_gpus(): #@save """Return all available GPUs, or [cpu(),] if no GPU exists.""" return [gpu(i) for i in range(num_gpus())] try_gpu(), try_gpu(10), try_all_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (gpu(id=0), CpuDevice(id=0), [gpu(id=0), gpu(id=1)]) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python def try_gpu(i=0): #@save """Return gpu(i) if exists, otherwise return cpu().""" if num_gpus() >= i + 1: return gpu(i) return cpu() def try_all_gpus(): #@save """Return all available GPUs, or [cpu(),] if no GPU exists.""" return [gpu(i) for i in range(num_gpus())] try_gpu(), try_gpu(10), try_all_gpus() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output (, , [, ]) .. raw:: html
.. raw:: html
Tensors and GPUs ---------------- .. raw:: html
.. raw:: html
By default, tensors are created on the CPU. We can query the device where the tensor is located. .. raw:: latex \diilbookstyleinputcell .. code:: python x = torch.tensor([1, 2, 3]) x.device .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output device(type='cpu') .. raw:: html
.. raw:: html
By default, tensors are created on the CPU. We can query the device where the tensor is located. .. raw:: latex \diilbookstyleinputcell .. code:: python x = np.array([1, 2, 3]) x.ctx .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output [22:01:52] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output cpu(0) .. raw:: html
.. raw:: html
By default, tensors are created on the GPU/TPU if they are available, else CPU is used if not available. We can query the device where the tensor is located. .. raw:: latex \diilbookstyleinputcell .. code:: python x = jnp.array([1, 2, 3]) x.device() .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output gpu(id=0) .. raw:: html
.. raw:: html
By default, tensors are created on the GPU/TPU if they are available, else CPU is used if not available. We can query the device where the tensor is located. .. raw:: latex \diilbookstyleinputcell .. code:: python x = tf.constant([1, 2, 3]) x.device .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output '/job:localhost/replica:0/task:0/device:GPU:0' .. raw:: html
.. raw:: html
It is important to note that whenever we want to operate on multiple terms, they need to be on the same device. For instance, if we sum two tensors, we need to make sure that both arguments live on the same device—otherwise the framework would not know where to store the result or even how to decide where to perform the computation. Storage on the GPU ~~~~~~~~~~~~~~~~~~ There are several ways to store a tensor on the GPU. For example, we can specify a storage device when creating a tensor. Next, we create the tensor variable ``X`` on the first ``gpu``. The tensor created on a GPU only consumes the memory of this GPU. We can use the ``nvidia-smi`` command to view GPU memory usage. In general, we need to make sure that we do not create data that exceeds the GPU memory limit. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python X = torch.ones(2, 3, device=try_gpu()) X .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tensor([[1., 1., 1.], [1., 1., 1.]], device='cuda:0') .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python X = np.ones((2, 3), ctx=try_gpu()) X .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output [22:01:53] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for GPU .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output array([[1., 1., 1.], [1., 1., 1.]], ctx=gpu(0)) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python # By default JAX puts arrays to GPUs or TPUs if available X = jax.device_put(jnp.ones((2, 3)), try_gpu()) X .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output Array([[1., 1., 1.], [1., 1., 1.]], dtype=float32) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python with try_gpu(): X = tf.ones((2, 3)) X .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output .. raw:: html
.. raw:: html
Assuming that you have at least two GPUs, the following code will create a random tensor, ``Y``, on the second GPU. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y = torch.rand(2, 3, device=try_gpu(1)) Y .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tensor([[0.0022, 0.5723, 0.2890], [0.1456, 0.3537, 0.7359]], device='cuda:1') .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y = np.random.uniform(size=(2, 3), ctx=try_gpu(1)) Y .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output [22:01:54] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for GPU .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output array([[0.67478997, 0.07540122, 0.9956977 ], [0.09488854, 0.415456 , 0.11231736]], ctx=gpu(1)) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y = jax.device_put(jax.random.uniform(jax.random.PRNGKey(0), (2, 3)), try_gpu(1)) Y .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output Array([[0.57450044, 0.09968603, 0.7419659 ], [0.8941783 , 0.59656656, 0.45325184]], dtype=float32) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python with try_gpu(1): Y = tf.random.uniform((2, 3)) Y .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output .. raw:: html
.. raw:: html
Copying ~~~~~~~ If we want to compute ``X + Y``, we need to decide where to perform this operation. For instance, as shown in :numref:`fig_copyto`, we can transfer ``X`` to the second GPU and perform the operation there. *Do not* simply add ``X`` and ``Y``, since this will result in an exception. The runtime engine would not know what to do: it cannot find data on the same device and it fails. Since ``Y`` lives on the second GPU, we need to move ``X`` there before we can add the two. .. _fig_copyto: .. figure:: ../img/copyto.svg Copy data to perform an operation on the same device. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Z = X.cuda(1) print(X) print(Z) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tensor([[1., 1., 1.], [1., 1., 1.]], device='cuda:0') tensor([[1., 1., 1.], [1., 1., 1.]], device='cuda:1') .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Z = X.copyto(try_gpu(1)) print(X) print(Z) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output [[1. 1. 1.] [1. 1. 1.]] @gpu(0) [[1. 1. 1.] [1. 1. 1.]] @gpu(1) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Z = jax.device_put(X, try_gpu(1)) print(X) print(Z) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output [[1. 1. 1.] [1. 1. 1.]] [[1. 1. 1.] [1. 1. 1.]] .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python with try_gpu(1): Z = X print(X) print(Z) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tf.Tensor( [[1. 1. 1.] [1. 1. 1.]], shape=(2, 3), dtype=float32) tf.Tensor( [[1. 1. 1.] [1. 1. 1.]], shape=(2, 3), dtype=float32) .. raw:: html
.. raw:: html
Now that the data (both ``Z`` and ``Y``) are on the same GPU), we can add them up. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y + Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tensor([[1.0022, 1.5723, 1.2890], [1.1456, 1.3537, 1.7359]], device='cuda:1') But what if your variable ``Z`` already lived on your second GPU? What happens if we still call ``Z.cuda(1)``? It will return ``Z`` instead of making a copy and allocating new memory. .. raw:: latex \diilbookstyleinputcell .. code:: python Z.cuda(1) is Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output True .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y + Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output array([[1.6747899, 1.0754012, 1.9956977], [1.0948886, 1.415456 , 1.1123173]], ctx=gpu(1)) Imagine that your variable ``Z`` already lives on your second GPU. What happens if we still call ``Z.copyto(gpu(1))``? It will make a copy and allocate new memory, even though that variable already lives on the desired device. There are times where, depending on the environment our code is running in, two variables may already live on the same device. So we want to make a copy only if the variables currently live in different devices. In these cases, we can call ``as_in_ctx``. If the variable already live in the specified device then this is a no-op. Unless you specifically want to make a copy, ``as_in_ctx`` is the method of choice. .. raw:: latex \diilbookstyleinputcell .. code:: python Z.as_in_ctx(try_gpu(1)) is Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output True .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y + Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output Array([[1.5745004, 1.099686 , 1.7419659], [1.8941783, 1.5965666, 1.4532518]], dtype=float32) Imagine that your variable ``Z`` already lives on your second GPU. What happens if we still call ``Z2 = Z`` under the same device scope? It will return ``Z`` instead of making a copy and allocating new memory. .. raw:: latex \diilbookstyleinputcell .. code:: python Z2 = jax.device_put(Z, try_gpu(1)) Z2 is Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output False .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python Y + Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output Imagine that your variable ``Z`` already lives on your second GPU. What happens if we still call ``Z2 = Z`` under the same device scope? It will return ``Z`` instead of making a copy and allocating new memory. .. raw:: latex \diilbookstyleinputcell .. code:: python with try_gpu(1): Z2 = Z Z2 is Z .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output True .. raw:: html
.. raw:: html
Side Notes ~~~~~~~~~~ People use GPUs to do machine learning because they expect them to be fast. But transferring variables between devices is slow: much slower than computation. So we want you to be 100% certain that you want to do something slow before we let you do it. If the deep learning framework just did the copy automatically without crashing then you might not realize that you had written some slow code. Transferring data is not only slow, it also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be taken with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, several operations at a time are much better than many single operations interspersed in the code unless you know what you are doing. This is the case since such operations can block if one device has to wait for the other before it can do something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready when you are. Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes everything wait for Python to complete. Neural Networks and GPUs ------------------------ Similarly, a neural network model can specify devices. The following code puts the model parameters on the GPU. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net = nn.Sequential(nn.LazyLinear(1)) net = net.to(device=try_gpu()) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net = nn.Sequential() net.add(nn.Dense(1)) net.initialize(ctx=try_gpu()) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net = nn.Sequential([nn.Dense(1)]) key1, key2 = jax.random.split(jax.random.PRNGKey(0)) x = jax.random.normal(key1, (10,)) # Dummy input params = net.init(key2, x) # Initialization call .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python strategy = tf.distribute.MirroredStrategy() with strategy.scope(): net = tf.keras.models.Sequential([ tf.keras.layers.Dense(1)]) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1') .. raw:: html
.. raw:: html
We will see many more examples of how to run models on GPUs in the following chapters, simply because the models will become somewhat more computationally intensive. For example, when the input is a tensor on the GPU, the model will calculate the result on the same GPU. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net(X) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output tensor([[0.7802], [0.7802]], device='cuda:0', grad_fn=) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net(X) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output array([[0.04995865], [0.04995865]], ctx=gpu(0)) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net.apply(params, x) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output Array([-1.2849933], dtype=float32) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net(X) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output .. raw:: html
.. raw:: html
Let’s confirm that the model parameters are stored on the same GPU. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net[0].weight.data.device .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output device(type='cuda', index=0) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net[0].weight.data().ctx .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output gpu(0) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python print(jax.tree_util.tree_map(lambda x: x.device(), params)) .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output FrozenDict({ params: { layers_0: { bias: gpu(id=0), kernel: gpu(id=0), }, }, }) .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python net.layers[0].weights[0].device, net.layers[0].weights[1].device .. raw:: latex \diilbookstyleoutputcell .. parsed-literal:: :class: output ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:0') .. raw:: html
.. raw:: html
Let the trainer support GPU. .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python @d2l.add_to_class(d2l.Trainer) #@save def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0): self.save_hyperparameters() self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))] @d2l.add_to_class(d2l.Trainer) #@save def prepare_batch(self, batch): if self.gpus: batch = [a.to(self.gpus[0]) for a in batch] return batch @d2l.add_to_class(d2l.Trainer) #@save def prepare_model(self, model): model.trainer = self model.board.xlim = [0, self.max_epochs] if self.gpus: model.to(self.gpus[0]) self.model = model .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python @d2l.add_to_class(d2l.Module) #@save def set_scratch_params_device(self, device): for attr in dir(self): a = getattr(self, attr) if isinstance(a, np.ndarray): with autograd.record(): setattr(self, attr, a.as_in_ctx(device)) getattr(self, attr).attach_grad() if isinstance(a, d2l.Module): a.set_scratch_params_device(device) if isinstance(a, list): for elem in a: elem.set_scratch_params_device(device) @d2l.add_to_class(d2l.Trainer) #@save def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0): self.save_hyperparameters() self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))] @d2l.add_to_class(d2l.Trainer) #@save def prepare_batch(self, batch): if self.gpus: batch = [a.as_in_context(self.gpus[0]) for a in batch] return batch @d2l.add_to_class(d2l.Trainer) #@save def prepare_model(self, model): model.trainer = self model.board.xlim = [0, self.max_epochs] if self.gpus: model.collect_params().reset_ctx(self.gpus[0]) model.set_scratch_params_device(self.gpus[0]) self.model = model .. raw:: html
.. raw:: html
.. raw:: latex \diilbookstyleinputcell .. code:: python @d2l.add_to_class(d2l.Trainer) #@save def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0): self.save_hyperparameters() self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))] @d2l.add_to_class(d2l.Trainer) #@save def prepare_batch(self, batch): if self.gpus: batch = [jax.device_put(a, self.gpus[0]) for a in batch] return batch .. raw:: html
.. raw:: html
In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples. Summary ------- We can specify devices for storage and calculation, such as the CPU or GPU. By default, data is created in the main memory and then uses the CPU for calculations. The deep learning framework requires all input data for calculation to be on the same device, be it CPU or the same GPU. You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy ``ndarray``) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs. Exercises --------- 1. Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small number of calculations? 2. How should we read and write model parameters on the GPU? 3. Measure the time it takes to compute 1000 matrix–matrix multiplications of :math:`100 \times 100` matrices and log the Frobenius norm of the output matrix one result at a time. Compare it with keeping a log on the GPU and transferring only the final result. 4. Measure how much time it takes to perform two matrix–matrix multiplications on two GPUs at the same time. Compare it with computing in in sequence on one GPU. Hint: you should see almost linear scaling. .. raw:: html
.. raw:: html
`Discussions `__ .. raw:: html
.. raw:: html
`Discussions `__ .. raw:: html
.. raw:: html
`Discussions `__ .. raw:: html
.. raw:: html
`Discussions `__ .. raw:: html
.. raw:: html