.. _chapter_numerical_stability:
Numerical Stability and Initialization
======================================
In the past few sections, each model that we implemented required
initializing our parameters according to some specified distribution.
However, until now, we glossed over the details, taking the
initialization hyperparameters for granted. You might even have gotten
the impression that these choices are not especially important. However,
the choice of initialization scheme plays a significant role in neural
network learning, and can prove essential to maintaining numerical
stability. Moreover, these choices can be tied up in interesting ways
with the choice of the activation function. Which nonlinear activation
function we choose, and how we decide to initialize our parameters, can
play a crucial role in making the optimization algorithm converge
rapidly. Failure to be mindful of these issues can lead to either
exploding or vanishing gradients. In this section, we delve into these
topics in greater detail and discuss some useful heuristics that you
may use frequently throughout your career in deep learning.
Vanishing and Exploding Gradients
---------------------------------
Consider a deep network with :math:`d` layers, input :math:`\mathbf{x}`
and output :math:`\mathbf{o}`. Each layer satisfies:
.. math:: \mathbf{h}^{t+1} = f_t (\mathbf{h}^t) \text{ and thus } \mathbf{o} = f_d \circ \ldots \circ f_1(\mathbf{x})
If all activations and inputs are vectors, we can write the gradient of
:math:`\mathbf{o}` with respect to any set of parameters
:math:`\mathbf{W}_t` associated with the function :math:`f_t` at layer
:math:`t` simply as
.. math:: \partial_{\mathbf{W}_t} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{d-1}} \mathbf{h}^d}_{:= \mathbf{M}_d} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{t}} \mathbf{h}^{t+1}}_{:= \mathbf{M}_t} \underbrace{\partial_{\mathbf{W}_t} \mathbf{h}^t}_{:= \mathbf{v}_t}.
In other words, it is the product of :math:`d-t` matrices
:math:`\mathbf{M}_d \cdot \ldots \cdot \mathbf{M}_t` and the gradient
vector :math:`\mathbf{v}_t`. What happens is similar to the situation
when we experienced numerical underflow when multiplying too many
probabilities. At the time, we were able to mitigate the problem by
switching from linear space to log-space, i.e., by shifting the problem
from the mantissa to the exponent of the numerical representation.
Unfortunately the problem outlined in the equation above is much more
serious: initially the matrices :math:`\mathbf{M}_t` may well have a
wide variety of eigenvalues. They might be small, they might be large,
and in particular, their product might well be *very large* or *very
small*. This is not (only) a problem of numerical representation but it
means that the optimization algorithm is bound to fail. It receives
gradients that are either excessively large or excessively small. As a
result the steps taken are either (i) excessively large (the *exploding
gradient* problem), in which case the parameters blow up in magnitude,
rendering the model useless, or (ii) excessively small (the *vanishing
gradient* problem), in which case the parameters hardly move at all,
and thus the learning process makes no progress.
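To get a feel for both regimes, here is a minimal sketch (not part of
the original analysis) that repeatedly multiplies a vector by random
matrices drawn at two different scales; at the smaller scale the product
collapses towards zero, while at the larger scale it blows up:

.. code:: python

    from mxnet import nd

    small, large = nd.ones(shape=(4,)), nd.ones(shape=(4,))
    for i in range(30):
        # Entries with standard deviation 0.1: the product shrinks on average
        small = nd.dot(0.1 * nd.random.normal(shape=(4, 4)), small)
        # Entries with standard deviation 1.0: the product grows on average
        large = nd.dot(1.0 * nd.random.normal(shape=(4, 4)), large)

    print('norm after 30 small-scale products', nd.norm(small).asscalar())
    print('norm after 30 large-scale products', nd.norm(large).asscalar())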
Vanishing Gradients
~~~~~~~~~~~~~~~~~~~
One major culprit in the vanishing gradient problem is the choice of
activation function :math:`\sigma` that is interleaved with the linear
operations in each layer. Historically, the sigmoid function
:math:`1/(1 + \exp(-x))` (introduced in :numref:`chapter_mlp`) was a
popular choice owing to its similarity to a thresholding function. Since
early artificial neural networks were inspired by biological neural
networks, the idea of neurons that either fire or do not fire
(biological neurons do not partially fire) seemed appealing. Let’s take
a closer look at the function to see why picking it might be problematic
vis-a-vis vanishing gradients.
.. code:: python

    %matplotlib inline
    import d2l
    from mxnet import nd, autograd

    x = nd.arange(-8.0, 8.0, 0.1)
    x.attach_grad()
    # Record the computation so that we can differentiate the sigmoid
    with autograd.record():
        y = x.sigmoid()
    y.backward()

    # Plot the sigmoid and its gradient over the input range
    d2l.plot(x, [y, x.grad], legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
.. figure:: output_numerical-stability-and-init_d5976f_1_0.svg
As we can see, the gradient of the sigmoid vanishes both when its inputs
are large and when they are small. Moreover, when we execute
backpropagation, due to the chain rule, this means that unless we are in
the Goldilocks zone, where the inputs to most of the sigmoids are in the
range of, say, :math:`[-4, 4]`, the gradients of the overall product may
vanish. When we have many layers, unless we are especially careful, we
are likely to find that our gradient is cut off at *some* layer. Before
ReLUs (:math:`\max(0, x)`) were proposed as an alternative to squashing
functions, this problem used to plague deep network training. As a
consequence, ReLUs have become the default choice when designing
activation functions in deep networks.
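To see how quickly the chain rule compounds this effect, here is a small
back-of-the-envelope sketch (not part of the original text): the
derivative of the sigmoid is :math:`\sigma(x)(1-\sigma(x))`, which never
exceeds :math:`0.25`, so a stack of many sigmoid layers multiplies the
gradient by a small factor per layer.

.. code:: python

    # Sketch: the sigmoid derivative sigma(x) * (1 - sigma(x)) is at most 0.25,
    # so d layers contribute a factor of at most 0.25**d to the gradient.
    z = nd.array([1.0])            # a pre-activation inside the "good" range
    s = z.sigmoid()
    sigmoid_grad = s * (1 - s)     # derivative of the sigmoid at z
    print('sigmoid gradient at z = 1', sigmoid_grad.asscalar())
    print('after 20 such layers     ', (sigmoid_grad ** 20).asscalar())
    # A ReLU has derivative exactly 1 for positive inputs, so no such decay occurs.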
Exploding Gradients
~~~~~~~~~~~~~~~~~~~
The opposite problem, when gradients explode, can be similarly vexing.
To illustrate this a bit better, we draw :math:`100` Gaussian random
matrices and multiply them with some initial matrix. For the scale that
we picked (the choice of the variance :math:`\sigma^2=1`), the matrix
product explodes. If this were to happen to us with a deep network, we
would have no realistic chance of getting a gradient descent optimizer
to converge.
.. code:: python

    M = nd.random.normal(shape=(4,4))
    print('A single matrix', M)
    for i in range(100):
        M = nd.dot(M, nd.random.normal(shape=(4,4)))

    print('After multiplying 100 matrices', M)
.. parsed-literal::
    :class: output

    A single matrix
    [[ 2.2122064   0.7740038   1.0434405   1.1839255 ]
     [ 1.8917114  -1.2347414  -1.771029   -0.45138445]
     [ 0.57938355 -1.856082   -1.9768796  -0.20801921]
     [ 0.2444218  -0.03716067 -0.48774993 -0.02261727]]
    After multiplying 100 matrices
    [[ 3.1575275e+20 -5.0052276e+19  2.0565092e+21 -2.3741922e+20]
     [-4.6332600e+20  7.3445046e+19 -3.0176513e+21  3.4838066e+20]
     [-5.8487235e+20  9.2711797e+19 -3.8092853e+21  4.3977330e+20]
     [-6.2947415e+19  9.9783660e+18 -4.0997977e+20  4.7331174e+19]]
Symmetry
~~~~~~~~
Another problem in deep network design is the symmetry inherent in their
parametrization. Assume that we have a deep network with one hidden
layer with two units, say :math:`h_1` and :math:`h_2`. In this case, we
could permute the weights :math:`\mathbf{W}_1` of the first layer and
likewise permute the weights of the output layer to obtain the same
function. There is nothing special differentiating the first hidden
unit from the second hidden unit. In other words, we have permutation
symmetry among the hidden units of each layer.

This is more than just a theoretical nuisance. Imagine what would happen
if we initialized all of the parameters of some layer as
:math:`\mathbf{W}_l = c` for some constant :math:`c`. In this case, the
gradients for all dimensions are identical: thus not only would each
unit take the same value, but it would receive the same update.
Stochastic gradient descent would never break the symmetry on its own
and we might never be able to realize the network’s expressive power. The
hidden layer would behave as if it had only a single unit. As an aside,
note that while SGD would not break this symmetry, dropout
regularization would!
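To make the constant-initialization scenario concrete, here is a small
sketch using the Gluon API (the network and the constant are arbitrary
choices for illustration): after initializing every weight of a layer to
the same value, every hidden unit receives exactly the same gradient
after one backward pass, so stochastic gradient descent cannot tell them
apart.

.. code:: python

    from mxnet import nd, autograd, init
    from mxnet.gluon import nn

    net = nn.Sequential()
    net.add(nn.Dense(4, in_units=3, activation='sigmoid'),  # hidden layer, 4 units
            nn.Dense(1, in_units=4))
    # Initialize every weight to the same constant: the units are indistinguishable
    net.initialize(init.Constant(0.5))

    x = nd.random.normal(shape=(2, 3))
    with autograd.record():
        loss = (net(x) ** 2).sum()
    loss.backward()

    # Every row of the hidden layer's weight gradient is identical
    print(net[0].weight.grad())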
Parameter Initialization
------------------------
One way of addressing, or at least mitigating, the issues raised above is
through careful initialization of the weight vectors. This way we can
ensure that (at least initially) the gradients do not vanish and that
they maintain a reasonable scale where the network weights do not
diverge. Additional care during optimization and suitable regularization
ensures that things never get too bad.
Default Initialization
~~~~~~~~~~~~~~~~~~~~~~
In the previous sections, e.g., in :numref:`chapter_linear_gluon`, we
used ``net.initialize(init.Normal(sigma=0.01))`` to initialize the
values of our weights. If the initialization method is not specified,
such as ``net.initialize()``, MXNet will use the default random
initialization method: each element of the weight parameter is randomly
sampled from a uniform distribution :math:`U[-0.07, 0.07]` and the bias
parameters are all set to :math:`0`. Both choices tend to work well in
practice for moderate problem sizes.
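For concreteness, the two variants mentioned above look roughly as
follows in Gluon (a minimal sketch; the layer sizes are arbitrary):

.. code:: python

    from mxnet import init
    from mxnet.gluon import nn

    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'), nn.Dense(10))

    # Explicit choice: weights drawn from a normal distribution with sigma = 0.01
    net.initialize(init.Normal(sigma=0.01), force_reinit=True)

    # Default choice: weights drawn from U[-0.07, 0.07], biases set to 0
    net.initialize(force_reinit=True)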
Xavier Initialization
~~~~~~~~~~~~~~~~~~~~~
Let’s look at the scale distribution of the activations of the hidden
units :math:`h_{i}` for some layer. They are given by
.. math:: h_{i} = \sum_{j=1}^{n_\mathrm{in}} W_{ij} x_j
The weights :math:`W_{ij}` are all drawn independently from the same
distribution. Furthermore, let’s assume that this distribution has zero
mean and variance :math:`\sigma^2` (this doesn’t mean that the
distribution has to be Gaussian, just that mean and variance need to
exist). We don’t really have much control over the inputs into the layer
:math:`x_j` but let’s proceed with the somewhat unrealistic assumption
that they also have zero mean and variance :math:`\gamma^2` and that
they’re independent of :math:`\mathbf{W}`. In this case, we can compute
mean and variance of :math:`h_i` as follows:
.. math::
\begin{aligned}
\mathbf{E}[h_i] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W_{ij} x_j] = 0 \\
\mathbf{E}[h_i^2] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij} x^2_j] \\
& = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij}] \mathbf{E}[x^2_j] \\
& = n_\mathrm{in} \sigma^2 \gamma^2
\end{aligned}
One way to keep the variance fixed is to set
:math:`n_\mathrm{in} \sigma^2 = 1`. Now consider backpropagation. There
we face a similar problem, albeit with gradients being propagated from
the top layers. That is, instead of :math:`\mathbf{W} \mathbf{x}`, we
need to deal with :math:`\mathbf{W}^\top \mathbf{g}`, where
:math:`\mathbf{g}` is the incoming gradient from the layer above. Using
the same reasoning as for forward propagation, we see that the
gradients’ variance can blow up unless
:math:`n_\mathrm{out} \sigma^2 = 1`. This leaves us in a dilemma: we
cannot possibly satisfy both conditions simultaneously. Instead, we
simply try to satisfy:
.. math::
\begin{aligned}
\frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \text{ or equivalently }
\sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}
\end{aligned}.
This is the reasoning underlying the eponymous Xavier initialization
:cite:`Glorot.Bengio.2010`. It works well enough in practice. For
Gaussian random variables, the Xavier initialization picks a normal
distribution with zero mean and variance
:math:`\sigma^2 = 2/(n_\mathrm{in} + n_\mathrm{out})`. For uniformly
distributed random variables :math:`U[-a, a]`, note that their variance
is given by :math:`a^2/3`. Plugging :math:`a^2/3` into the condition on
:math:`\sigma^2` yields that we should initialize uniformly with
:math:`U\left[-\sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}, \sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}\right]`.
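In Gluon this scheme is available as ``init.Xavier``. The following
sketch (the layer sizes are arbitrary) initializes a single layer with
it and checks that the weights respect the uniform bound and the target
variance derived above:

.. code:: python

    from mxnet import init, nd
    from mxnet.gluon import nn

    n_in, n_out = 50, 20
    layer = nn.Dense(n_out, in_units=n_in)
    # By default init.Xavier samples from U[-a, a] with a = sqrt(6 / (n_in + n_out))
    layer.initialize(init.Xavier())

    W = layer.weight.data()
    print('bound a                ', (6 / (n_in + n_out)) ** 0.5)
    print('largest |W_ij|         ', nd.abs(W).max().asscalar())
    print('empirical variance     ', (W ** 2).mean().asscalar())
    print('target 2/(n_in + n_out)', 2 / (n_in + n_out))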
Beyond
~~~~~~
The reasoning above barely scratches the surface of modern approaches to
parameter initialization. In fact, MXNet has an entire
``mxnet.initializer`` module implementing over a dozen different
heuristics. Moreover, initialization continues to be a hot area of
inquiry within research into the fundamental theory of neural network
optimization. Some of these heuristics are especially suited for when
parameters are tied (i.e., when parameters in different parts of the
network are shared), for super-resolution, sequence models, and related
problems. We recommend that the interested reader take a closer look at
what is offered as part of this module, and investigate the recent
research on parameter initialization. Perhaps you may come across a
recent clever idea and contribute its implementation to MXNet, or you
may even invent your own scheme!
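As a small taste of what the module offers (a non-exhaustive sketch),
here are two further heuristics applied to the same layer:

.. code:: python

    from mxnet import init
    from mxnet.gluon import nn

    layer = nn.Dense(10, in_units=20)
    # Orthogonal weight matrices, popular for recurrent networks
    layer.initialize(init.Orthogonal(), force_reinit=True)
    # He/MSRA initialization, tailored to ReLU activations
    layer.initialize(init.MSRAPrelu(), force_reinit=True)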
Summary
-------
- Vanishing and exploding gradients are common issues in very deep
  networks, unless great care is taken to ensure that gradients and
  parameters remain well controlled.
- Initialization heuristics are needed to ensure that at least the
initial gradients are neither too large nor too small.
- The ReLU addresses one of the vanishing gradient problems, namely
that gradients vanish for very large inputs. This can accelerate
convergence significantly.
- Random initialization is key to ensure that symmetry is broken before
optimization.
Exercises
---------
1. Can you design other cases of symmetry breaking besides the
permutation symmetry?
2. Can we initialize all weight parameters in linear regression or in
softmax regression to the same value?
3. Look up analytic bounds on the eigenvalues of the product of two
matrices. What does this tell you about ensuring that gradients are
well conditioned?
4. If we know that some terms diverge, can we fix this after the fact?
   Look at the paper on LARS by You, Gitman and Ginsburg (2017) for
   inspiration.