Basics of Optimization Problems#

In scientific machine learning, we typically have to find the minimum of a scalar function of \(n\) variables. What we are optimizing over is usually the parameters of the model. The function we are optimizing could be:

  • the loss function of a neural network,

  • the negative log-likelihood of a probabilistic model,

  • the negative log-posterior of a Bayesian model,

  • the negative evidence lower bound of a variational formulation of a probabilistic model.

So, we need to learn how to solve such optimization problems. There are several complications that are particular to optimization problems in scientific machine learning:

  • the number of parameters is very large (millions or billions in neural networks),

  • the function we are optimizing is very complex (non-convex, non-smooth, non-differentiable),

  • the function we are optimizing is very expensive to evaluate (it requires a lot of computation and memory).

We need tailor-made optimization algorithms that can deal with these complications.


Instead of reinventing the wheel, I will just point you to the best introduction to optimization in the context of machine learning that I can think of:

  • Chapter 8 of Deep Learning by Goodfellow, Bengio, and Courville.

Please read it very carefully. It is part of the required reading. Here are some of the key points.

What is a stationary point?#

A stationary point is a point where the gradient is zero. It could be a local minimum, a local maximum, or a saddle point.
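
For example (a toy function of my own, not from the chapter), we can check numerically that the gradient of \(f(x) = x_1^2 - x_2^2\) vanishes at the origin:

```python
import numpy as np

# Toy function f(x) = x1^2 - x2^2 and its gradient.
f = lambda x: x[0] ** 2 - x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])

x0 = np.zeros(2)
print(np.allclose(grad_f(x0), 0.0))  # True -> the origin is a stationary point
```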

What is a local minimum?#

It is a stationary point at which the function value is no greater than at any other point in a small neighborhood around it. The Hessian at a local minimum is positive semi-definite. That is, all its eigenvalues are non-negative.
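
As a minimal sketch (again a toy quadratic of my own), we can check that the Hessian of \(g(x) = x_1^2 + 2 x_2^2\) at the origin has only non-negative eigenvalues:

```python
import numpy as np

# Toy quadratic g(x) = x1^2 + 2 x2^2; the origin is a stationary point.
# Its Hessian is constant:
hessian = np.array([[2.0, 0.0],
                    [0.0, 4.0]])

eigvals = np.linalg.eigvalsh(hessian)
print(eigvals)                 # [2. 4.]
print(np.all(eigvals >= 0.0))  # True -> consistent with a local minimum
```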

What is a saddle point?#

A saddle point is a stationary point that is neither a local minimum nor a local maximum: moving away from it, the function decreases along some directions and increases along others. The Hessian at a saddle point has both positive and negative eigenvalues.
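
For the toy function \(f(x) = x_1^2 - x_2^2\) from above, the origin is exactly such a point: its Hessian has one positive and one negative eigenvalue.

```python
import numpy as np

# Hessian of f(x) = x1^2 - x2^2, which is constant everywhere.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(hessian)
print(eigvals)  # [-2.  2.] -> mixed signs, so the origin is a saddle point
```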

Good and bad local minima#

There are good and bad local minima. Bad local minima are the ones that have a high value of the objective function (think loss function). Good local minima are the ones that have a low value of the objective function. Finding any good local minimum is usually good enough.

Local minima are not a problem#

Neural networks typically have many local minima and saddle points. Take any neural network, permute its hidden neurons together with the corresponding weights, and you get a different point in parameter space with exactly the same loss. This means that there are many local minima that are equivalent to one another. Most machine learning practitioners believe that bad local minima are not a problem in practice, especially for deep neural networks.
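
Here is a minimal sketch of the relabeling argument (a toy one-hidden-layer network of my own): permuting the hidden neurons together with the weights attached to them leaves the output, and therefore the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network: y = W2 @ tanh(W1 @ x + b1) + b2.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Relabel (permute) the hidden neurons and the weights attached to them.
perm = rng.permutation(4)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2)))  # True
```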

Saddle points are exponentially more common than local minima#

Pick a random loss function and think about the Hessian at one of its stationary points. Suppose the sign of each eigenvalue is chosen independently at random. To get a local minimum (or maximum), all \(n\) eigenvalues must be positive (or negative), which happens with probability of order \(2^{-n}\). To get a saddle point, you only need some eigenvalues of each sign, which happens with overwhelming probability. So, the probability of a stationary point being a local minimum (or maximum) rather than a saddle point shrinks exponentially as the dimensionality of the problem increases.
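
Here is a minimal Monte Carlo sketch of this counting argument, treating the eigenvalue signs as independent fair coin flips (a caricature of a real Hessian, but it conveys the scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw the sign of each eigenvalue as an independent fair coin flip and count
# how often all n signs agree (i.e., the stationary point is a min or a max).
n_samples = 100_000
for n in [2, 5, 10, 15]:
    signs = rng.choice([-1.0, 1.0], size=(n_samples, n))
    p_min_or_max = np.mean(np.abs(signs.sum(axis=1)) == n)
    print(f"n = {n:2d}: P(min or max) ~ {p_min_or_max:.5f}  (theory: {2.0 ** (1 - n):.5f})")
```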

Stochastic gradient descent escapes saddle points quickly#

Empirical evidence suggests that stochastic gradient descent escapes saddle points quickly. The noise in the minibatch gradient estimates nudges the iterates off the saddle, and the directions of negative curvature then carry them away.
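
A tiny sketch of the effect, with artificial Gaussian noise added to the exact gradient to stand in for minibatch noise: gradient descent started exactly at the saddle of \(f(x) = x_1^2 - x_2^2\) never moves, while the noisy version escapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient of the saddle f(x) = x1^2 - x2^2.
grad_f = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])

def descend(x, lr=0.1, noise=0.0, steps=50):
    for _ in range(steps):
        x = x - lr * (grad_f(x) + noise * rng.normal(size=2))
    return x

print(descend(np.zeros(2)))              # stays at the saddle [0. 0.]
print(descend(np.zeros(2), noise=0.01))  # the x2 coordinate grows -> escapes the saddle
```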

Initialization matters#

The initialization of the neural network weights should break the symmetry between neurons (e.g., by drawing the weights at random), avoid saturating the activation functions, and roughly preserve the variance of activations and gradients from layer to layer.
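
As a sketch, here is a Glorot/Xavier-style initialization with weight variance \(2 / (n_{\text{in}} + n_{\text{out}})\): drawing the weights at random breaks the symmetry, and the scaling keeps the activation scale of order one as the signal passes through many layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Glorot/Xavier-style initialization: zero-mean weights with
# variance 2 / (fan_in + fan_out).
def glorot(fan_in, fan_out):
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return scale * rng.normal(size=(fan_out, fan_in))

# Push an input through ten tanh layers; the activation scale stays of
# order one instead of collapsing to zero or blowing up.
x = rng.normal(size=256)
for layer in range(10):
    x = np.tanh(glorot(256, 256) @ x)
    print(f"layer {layer}: std of activations = {x.std():.3f}")
```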

Some remaining problems#


Flat regions, or plateaus, where the gradient is close to zero over a wide area, are the big problem for optimization. We deal with them by using momentum and adaptive learning rates.
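
Here is a minimal sketch of the momentum update (with a toy, artificially flat gradient of my own): the velocity accumulates past gradients, so the iterates keep moving even where the instantaneous gradient is tiny.

```python
import numpy as np

# Gradient descent with classical momentum: the velocity accumulates past
# gradients, effectively amplifying small but consistent gradients by 1 / (1 - beta).
def momentum_step(x, v, grad, lr=0.01, beta=0.9):
    v = beta * v - lr * grad(x)
    return x + v, v

# Toy plateau: an almost flat slope with gradient -1e-3 everywhere.
grad = lambda x: np.full_like(x, -1e-3)

x, v = np.zeros(1), np.zeros(1)
for _ in range(200):
    x, v = momentum_step(x, v, grad)
print(x)  # ~0.019, about 10x farther than plain gradient descent (~0.002)
```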


Cliff regions, where the loss changes very rapidly over a small range of the parameters, are also a problem for optimization. They are especially common in recurrent neural networks. We deal with them using gradient clipping.
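
A minimal sketch of gradient clipping by norm (not any particular library's implementation): if the gradient norm exceeds a threshold, rescale the gradient so that its norm equals the threshold while keeping its direction.

```python
import numpy as np

# Gradient clipping by norm: if the gradient is longer than max_norm,
# rescale it so its norm equals max_norm (its direction is preserved).
def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

print(clip_by_norm(np.array([0.3, -0.4])))      # unchanged (norm 0.5)
print(clip_by_norm(np.array([300.0, -400.0])))  # [ 0.6 -0.8] (rescaled to norm 1)
```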

Vanishing or exploding gradients#

Vanishing or exploding gradients are also a problem for optimization. They arise because backpropagation multiplies the gradient by one layer Jacobian per layer, so the gradient can shrink or grow exponentially with network depth. We deal with them by using better activation functions and better initialization methods.
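
Here is a small numerical sketch (random layer Jacobians of my own choosing): multiplying the gradient by fifty Jacobians that on average shrink vectors makes it vanish, while Jacobians that on average stretch vectors make it explode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Backpropagation multiplies the gradient by one layer Jacobian per layer.
# Jacobians that shrink vectors on average make the gradient vanish;
# Jacobians that stretch vectors on average make it explode.
def grad_norm_after(depth, scale, width=64):
    g = np.ones(width)
    for _ in range(depth):
        jacobian = scale * rng.normal(size=(width, width)) / np.sqrt(width)
        g = jacobian.T @ g
    return np.linalg.norm(g)

print(grad_norm_after(depth=50, scale=0.5))  # tiny  -> vanishing gradient
print(grad_norm_after(depth=50, scale=1.5))  # huge  -> exploding gradient
```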