Basics of Optimization Problems#
In scientific machine learning, we typically have to find the minimum of an \(n\)-dimensional function. The variables we optimize over are usually the parameters of the model. The function we are optimizing could be:
- the loss function of a neural network,
- the negative log-likelihood of a probabilistic model,
- the negative log-posterior of a Bayesian model,
- the negative evidence lower bound of a variational formulation of a probabilistic model.
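To make one of these concrete: for a regression model with Gaussian noise of fixed variance \(\sigma^2\) (an assumption made here just for illustration, with \(f_\theta(x_i)\) denoting the model's prediction), the negative log-likelihood is, up to an additive constant, a scaled sum of squared errors:

\[
-\log p(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2 + \frac{N}{2}\log\left(2\pi\sigma^2\right),
\]

so minimizing it over the parameters \(\theta\) is equivalent to minimizing the familiar mean squared error.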
So, we need to learn how to solve such optimization problems. There are several complications that are particular to optimization problems in scientific machine learning:
- the number of parameters is very large (millions or billions in neural networks),
- the function we are optimizing is very complex (non-convex, non-smooth, non-differentiable),
- the function we are optimizing is very expensive to evaluate (it requires a lot of computation and memory).
We need tailor-made optimization algorithms that can deal with these complications.
Instead of me reinventing the wheel, I will just point you to the best introduction to optimization in the context of machine learning that I can think of:
Chapter 8 of Deep Learning by Goodfellow, Bengio, and Courville.
Please read it very carefully. It is part of the required reading. Here are some of the key points.
What is a stationary point?#
A stationary point is a point where the gradient is zero. It could be a local minimum, a local maximum, or a saddle point.
What is a local minimum?#
It is a stationary point where the function has a lower value than at any other point in a small neighborhood around it. The Hessian at a local minimum is positive semi-definite; that is, all of its eigenvalues are non-negative.
What is a saddle point?#
A saddle point is a stationary point that is a local minimum along some directions and a local maximum along other directions. The Hessian of a saddle point has both positive and negative eigenvalues.
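Here is a minimal sketch (my own toy example, not from the reading) that classifies a stationary point by looking at the eigenvalues of the Hessian; the helper `classify_stationary_point` and the two quadratic functions are made up for illustration.

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "degenerate: higher-order information needed"

# f(x, y) = x^2 + y^2 has a stationary point at the origin; its Hessian is diag(2, 2).
H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])
# g(x, y) = x^2 - y^2 also has a stationary point at the origin; its Hessian is diag(2, -2).
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])

print(classify_stationary_point(H_bowl))    # local minimum
print(classify_stationary_point(H_saddle))  # saddle point
```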
Good and bad local minima#
There are good and bad local minima. Bad local minima are those where the objective function (think loss function) has a high value; good local minima are those where it has a low value. Finding any good local minimum is usually good enough.
Local minima are not a problem#
Neural networks typically have many local minima and saddle points. One reason local minima are so abundant is symmetry: take any neural network, relabel (permute) its hidden neurons, and you get a different set of parameters with exactly the same loss. So many local minima are equivalent to each other. Most machine learning practitioners believe that bad local minima are not a problem in practice, especially for deep neural networks.
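A quick way to convince yourself of this symmetry is to permute the hidden units of a small network and check that its outputs do not change. The two-layer network below is a toy example of my own, not a specific architecture from the reading.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network: y = W2 @ tanh(W1 @ x + b1) + b2.
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Relabel (permute) the 5 hidden neurons.
perm = rng.permutation(5)
W1p, b1p = W1[perm], b1[perm]  # permute the rows feeding the hidden layer
W2p = W2[:, perm]              # permute the columns reading from it

x = rng.normal(size=3)
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2)))  # True
```

Since the permuted parameters give exactly the same output for every input, every local minimum comes with many equivalent copies.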
Saddle points are exponentially more common than local minima#
Pick a random loss function and think about the Hessian at a stationary point. Suppose the sign of each eigenvalue is chosen at random, like a coin flip. To get a local minimum (or maximum), every eigenvalue has to come out positive (or negative). To get a saddle point, you only need a mix of positive and negative eigenvalues. As the dimensionality of the problem increases, the probability of getting a local minimum (or maximum) becomes exponentially smaller than the probability of getting a saddle point.
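To put a number on this heuristic (under the strong assumption that each eigenvalue's sign is an independent, fair coin flip):

\[
P(\text{local minimum}) = P(\text{local maximum}) = 2^{-n},
\qquad
P(\text{saddle point}) = 1 - 2^{1-n}.
\]

For \(n = 100\), a random stationary point is a minimum or a maximum with probability of order \(10^{-30}\); it is a saddle point almost surely.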
Stochastic gradient descent escapes saddle points quickly#
Empirical evidence suggests stochastic gradient descent escapes saddle points quickly.
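Here is a minimal sketch (my own toy demonstration, not from the reading) of why gradient noise helps: on \(f(x, y) = x^2 - y^2\), gradient descent started exactly at the saddle point \((0, 0)\) never moves, while adding a little noise to the gradient pushes the iterate off the saddle and the negative-curvature direction takes over.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(p):
    """Gradient of f(x, y) = x^2 - y^2, which has a saddle point at the origin."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def descend(p0, lr=0.1, noise=0.0, steps=50):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        g = grad(p) + noise * rng.normal(size=2)  # noisy gradient mimics minibatch sampling
        p = p - lr * g
    return p

print(descend([0.0, 0.0], noise=0.0))  # stays exactly at the saddle: [0. 0.]
print(descend([0.0, 0.0], noise=0.1))  # |y| has grown: the iterate escaped along the negative-curvature direction
```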
The initialization of neural network weights should break the symmetry between neurons, avoid saturating the activation functions, and preserve the variance of activations and gradients across layers.
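One standard way to satisfy the last requirement is to scale the random initial weights by the layer sizes. The sketch below uses the Glorot (Xavier) scaling \(\sqrt{2/(n_\text{in} + n_\text{out})}\) and propagates a signal through purely linear layers (ignoring the nonlinearity to keep the arithmetic transparent); treat it as an illustrative recipe, not the only option.

```python
import numpy as np

rng = np.random.default_rng(2)
width, depth = 500, 20

def propagate(scale):
    """Push a random signal through `depth` linear layers with weight std `scale`."""
    h = rng.normal(size=width)
    for _ in range(depth):
        W = scale * rng.normal(size=(width, width))
        h = W @ h
    return h.var()

glorot_scale = np.sqrt(2.0 / (width + width))  # Glorot/Xavier scaling
print(propagate(glorot_scale))  # stays of order one
print(propagate(0.1))           # blows up: each layer multiplies the variance by ~5
print(propagate(0.01))          # collapses: each layer multiplies the variance by ~0.05
```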
Some remaining problems#
Flat regions, or plateaus, are the big problem for optimization: the gradient is nearly zero there, so progress stalls. We deal with them by using momentum and adaptive learning rates.
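A minimal sketch of the momentum idea (a toy example of mine, not the full optimizers covered in the reading): on a nearly flat slope the velocity accumulates past gradients, so the iterate travels roughly \(1/(1-\beta)\) times farther than plain gradient descent.

```python
def gd_with_momentum(grad, x0, lr=0.1, beta=0.9, steps=100):
    """Heavy-ball momentum; beta = 0 recovers plain gradient descent."""
    x, v = float(x0), 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # the velocity accumulates past gradients
        x = x + v
    return x

grad = lambda x: 1e-3  # a plateau: the slope is tiny everywhere
print(gd_with_momentum(grad, x0=0.0, beta=0.0))  # ~ -0.01, plain gradient descent crawls
print(gd_with_momentum(grad, x0=0.0, beta=0.9))  # ~ -0.09, momentum travels about 10x farther
```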
Cliff regions, where the loss surface is extremely steep, are also a problem for optimization. They are more common in recurrent neural networks. We deal with them by using gradient clipping.
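Gradient clipping by norm is simple to sketch (the threshold `max_norm = 1.0` below is an arbitrary choice for illustration): if the gradient norm exceeds the threshold, rescale the gradient so the step keeps its direction but not its dangerous length.

```python
import numpy as np

def clip_gradient(g, max_norm=1.0):
    """If the gradient norm exceeds max_norm, rescale it while keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g_cliff = np.array([250.0, -400.0])          # a huge gradient coming off a cliff
print(clip_gradient(g_cliff, max_norm=1.0))  # same direction, but norm 1: about [0.53 -0.85]
```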
Vanishing or exploding gradients#
Vanishing or exploding gradients are also a problem for optimization. They appear when gradients are multiplied through many layers (or time steps), so they can shrink toward zero or blow up exponentially with depth. We deal with them by using better activation functions and better initialization methods.
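To see where these gradients come from, consider (as an illustrative setup, with notation introduced here just for this example) a feedforward network with layers \(h_\ell = \phi(z_\ell)\) and \(z_\ell = W_\ell h_{\ell-1} + b_\ell\). Backpropagation writes the gradient with respect to an early layer as a long product:

\[
\frac{\partial \mathcal{L}}{\partial h_1}
= W_2^{T}\,\mathrm{diag}\!\big(\phi'(z_2)\big)\;
  W_3^{T}\,\mathrm{diag}\!\big(\phi'(z_3)\big)\cdots
  W_L^{T}\,\mathrm{diag}\!\big(\phi'(z_L)\big)\;
  \frac{\partial \mathcal{L}}{\partial h_L}.
\]

If the factors in this product are typically smaller than one in magnitude, the gradient shrinks exponentially with depth (vanishing); if they are typically larger than one, it grows exponentially (exploding). For the sigmoid activation, \(\phi'(z) \le 1/4\), which is one reason ReLU-like activations and variance-preserving initializations behave better.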