Deep Neural Networks#

These notes are incomplete. They merely provide a summary. Please consult Chapters 6, 7, and 8 of [Goodfellow et al., 2016] for more details.

Deep neural networks as universal function approximators#

Deep neural networks (DNNs) are function approximators that express information in hierarchical layers. You can use a DNN to approximate a function from \(d\) inputs to \(q\) outputs using some parameters \(\theta\). We typically write \(\mathbf{y} = f(\mathbf{x};\theta)\), where \(f(\mathbf{x};\theta)\) is the DNN and \(\theta\) are its parameters. We will clarify both these concepts below.

Mathematically, deep neural networks are compositions of simpler one-layer neural networks:

\[ f(\mathbf{x};\theta) = (f_L \circ f_{L-1} \circ \cdots \circ f_1)( \mathbf{x}). \]

In the simplest setting, each layer \(f_i\) is the composition of an elementwise nonlinearity with an affine (linear plus bias) transformation:

\[ f_i ( \mathbf{z} ) = h^{(i)} ( \mathbf{W}^{(i)} \mathbf{z} + \mathbf{b}^{(i)} ), \]

where \(\mathbf{W}^{(i)}\) is a matrix of parameters, \(\mathbf{b}^{(i)}\) is a vector of parameters, and \(h^{(i)}\) is a nonlinear function applied in an elementwise fashion (i.e., applied separately to each one of the inputs that are provided to it). A DNN with this structure is called a fully-connected DNN.

In deep learning terminology, the matrix \(\mathbf{W}^{(i)}\) is referred to as a weight matrix, and the vector \(\mathbf{b}^{(i)}\) is referred to as a bias. The function \(h^{(i)}(\cdot)\) is called the activation function. It is typical for all but the last layer of a DNN to have the same activation function.
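To make the structure concrete, here is a minimal NumPy sketch of a fully-connected DNN, assuming ReLU activations on all hidden layers and an identity activation on the last layer (the function names are purely illustrative):

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity h(z) = max(z, 0).
    return np.maximum(z, 0.0)

def fully_connected_dnn(x, weights, biases, hidden_activation=relu):
    # weights[i-1] and biases[i-1] play the role of W^{(i)} and b^{(i)}.
    # All layers except the last apply the elementwise activation.
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = hidden_activation(W @ z + b)
    # Final layer: identity activation (unconstrained real output).
    return weights[-1] @ z + biases[-1]

# Example: d = 3 inputs, one hidden layer with 5 neurons, q = 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
biases = [np.zeros(5), np.zeros(1)]
y = fully_connected_dnn(rng.normal(size=3), weights, biases)
```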

At the final layer, the output dimensionality and the choice of activation function are dictated by the constraints on the final output of the function \(f\). For example (a small code sketch of these output choices follows the list):

  1. If the output from \(f\) is a real number with no constraints, then \(q = 1\) and \(h^{(L)}(z) = z\) (the identity function).

  2. If the output from \(f\) is a positive real, then \(q = 1\) and \(h^{(L)}(z) = \exp(z)\).

  3. If the output from \(f\) is a probability mass function on \(K\) categories, then \(q = K\) and \(h^{(L)}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, i=1, 2, \dots, K\) (the softmax function). Don’t try to memorize this. We will revisit it in the following lecture.
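Here is a small NumPy sketch of the three output choices above; the function names are just for illustration:

```python
import numpy as np

def identity(z):
    # Case 1: unconstrained real output.
    return z

def positive(z):
    # Case 2: positive real output.
    return np.exp(z)

def softmax(z):
    # Case 3: probability mass function over K categories.
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```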

Different ways of constructing the compositional structure of \(f\) lead to different architectures such as fully connected networks (shown above), recurrent neural networks, convolutional neural networks, autoencoders, residual networks, etc.

Activation functions#

The most common activation functions include the rectified linear unit or ReLU (and its variants), the sigmoid function, the hyperbolic tangent, sinusoids, step functions, etc. We will visualize them in the hands-on activity.
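Ahead of the hands-on activity, here is a minimal NumPy sketch that evaluates a few of these activation functions on a grid (e.g., for plotting):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate a few common activations on a grid of inputs.
z = np.linspace(-3.0, 3.0, 200)
activations = {
    "ReLU": relu(z),
    "sigmoid": sigmoid(z),
    "tanh": np.tanh(z),
    "sine": np.sin(z),
}
```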

Universal approximation theorem for neural networks#

The universal approximation theorem guarantees that DNNs are good function approximators. In plain English, the (original) theorem states that if you take any reasonable activation function and build a dense neural network with it, then you can approximate any continuous function (defined on a compact input domain) arbitrarily well, provided you keep increasing the number of neurons. Recently, researchers have proven similar theorems for deep neural networks. In general, if you grow your network by adding neurons and layers, it can approximate almost anything you need. That’s one of the reasons why deep neural networks have been (re)gaining momentum recently. In practice, however, it is a bit more difficult than that.

Training regression networks - Loss function view#

Assume that you want to solve a regression problem. You have input data:

\[ \mathbf{x}_{1:n} = (\mathbf{x}_1,\dots,\mathbf{x}_n), \]

and output data:

\[ y_{1:n} = (y_1,\dots,y_n). \]

You want to use them to find the map between \(\mathbf{x}\) and \(y\) using DNNs.

Well, you start by using a DNN \(y=f(\mathbf{x};\theta)\) to represent the map from \(\mathbf{x}\) to \(y\). Here \(\theta\) are the network parameters (the layers’ weights and biases). Your problem is to fit \(\theta\) to the available data.

The simplest way forward is to follow a least-squares approach. First, define a so-called loss function:

\[ L(\theta) = \frac{1}{n}\sum_{i=1}^n\left(y_i-f(\mathbf{x}_i;\theta)\right)^2. \]

This loss function is the mean of the squared prediction errors of the DNN for a given \(\theta\). Once you have the loss function, you can fit \(\theta\) by minimizing it:

\[ \theta^* = \arg\min_\theta L(\theta). \]

However, this minimization problem does not have an analytical solution. Neither does it have a unique solution. It is a non-linear, non-convex optimization problem, and it requires special treatment. We will talk about it in a while.
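As a minimal sketch, the loss can be evaluated as follows, assuming a hypothetical callable `predict(x)` that returns \(f(\mathbf{x};\theta)\) at the current parameter values:

```python
import numpy as np

def mse_loss(predict, x_data, y_data):
    # L(theta) = (1/n) * sum_i (y_i - f(x_i; theta))^2, where predict(x)
    # evaluates f(x; theta) for the current value of theta.
    residuals = y_data - np.array([predict(x) for x in x_data])
    return np.mean(residuals ** 2)
```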

Training regression networks - Probabilistic view#

Sometimes it is not straightforward to come up with a loss function. In such situations, we can employ a probabilistic view. We must develop a likelihood function that connects the model to the observed data. So, in general, we need to come up with \( p(y_{1:n}|\mathbf{x}_{1:n},\theta) \). Then, we can fit the parameters by maximizing the log-likelihood, which is the same as minimizing the “loss” function:

\[ L(\theta) = -\log p(y_{1:n}|\mathbf{x}_{1:n},\theta). \]

This approach is going to give you the same thing as the loss function approach under the following assumptions:

  • The observations are independent (conditional on the model); and

  • The measurement noise is Gaussian with a mean given by the DNN and a constant variance.

Let’s show this. Take:

\[\begin{split}
p(y_i|\mathbf{x}_i,\theta) &= N(y_i | f(\mathbf{x}_i;\theta), \sigma^2)\\
&= \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{\left(y_i-f(\mathbf{x}_i;\theta)\right)^2}{2\sigma^2}\right\},
\end{split}\]

where \(\sigma^2\) is the measurement noise variance. Then, from independence, we have:

\[ p(y_{1:n}|\mathbf{x}_{1:n},\theta) = \prod_{i=1}^np(y_i|\mathbf{x}_i,\theta). \]

So, we should be minimizing:

\[\begin{split}
L'(\theta) &= -\log p(y_{1:n}|\mathbf{x}_{1:n},\theta)\\
&= -\sum_{i=1}^n\log p(y_i|\mathbf{x}_i,\theta)\\
&= \frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i-f(\mathbf{x}_i;\theta)\right)^2 + \text{const}.
\end{split}\]

That’s the same as the \(L(\theta)\) we had before, up to an additive constant and a positive multiplicative factor, so it has the same minimizer. The benefit of the probabilistic approach is that it allows you to be more flexible with how you model the measurement process.
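Here is a small NumPy sketch of the negative log-likelihood under these assumptions; note that it differs from the mean-squared-error loss only by the factor \(\frac{1}{2\sigma^2}\) (instead of \(\frac{1}{n}\)) and an additive constant:

```python
import numpy as np

def gaussian_neg_log_likelihood(y_data, predictions, sigma):
    # -log p(y_{1:n} | x_{1:n}, theta) for i.i.d. Gaussian measurement noise:
    # (1 / (2 sigma^2)) * sum_i (y_i - f(x_i; theta))^2 + n * log(sqrt(2 pi) * sigma).
    n = len(y_data)
    sum_sq = np.sum((y_data - predictions) ** 2)
    return 0.5 * sum_sq / sigma ** 2 + n * np.log(np.sqrt(2.0 * np.pi) * sigma)
```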

The minimization problem as a stochastic optimization problem#

As mentioned, \(L(\theta)\) is non-linear and non-convex. Classic gradient-based optimization techniques do not work well on it. They tend to get trapped in bad local minima. Adding stochasticity to the optimization algorithm helps it avoid these bad local minima. Such stochastic optimization algorithms still find local minima, but they are better ones!

Another potential problem is that \(L(\theta)\) may involve a summation over millions of observations (in the case of big data). In this regime, deterministic gradient-based optimization algorithms are also computationally inefficient. Stochastic optimization algorithms subsample the available data, allowing you to break the data down into computationally digestible batches.

Let’s first say what a stochastic optimization problem is. Then, we will show how to recast a typical \(\min L(\theta)\) problem as a stochastic optimization problem. A stochastic optimization problem is a problem of the form:

\[ \min_\theta \mathbb{E}_Z[\ell(\theta;Z)], \]

where \(\ell(\theta;Z)\) is some scalar function of \(\theta\) and the random vector \(Z\), and the expectation is taken over \(Z\). You want to minimize the expected value of \(\ell(\theta;Z)\). That’s it.

Okay. Back to our original problem. Take:

\[ L(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i-f(\mathbf{x}_i;\theta)\right)^2. \]

We need to write this as an expectation of something. But an expectation of what? Well, it will be an expectation over randomly selected batches of the observed data. This is by no means the only choice. But it is a very useful choice. Let’s see how we can do this.

First, let’s visit the observations one by one. Take \(I\) to be a Categorical random variable that picks with equal probability the index of one of the \(n\) observations, i.e.,

\[ I \sim \operatorname{Categorical}\left(\frac{1}{n},\dots,\frac{1}{n}\right). \]

Take:

\[ \ell(\theta;I) = \left(y_I-f(\mathbf{x}_I;\theta)\right)^2. \]

So, here \(Z = I\). Let’s take the expectation over \(I\) and see what it is going to give us:

\[\begin{split}
\mathbb{E}_I[\ell(\theta;I)] &= \sum_{i=1}^n p(I=i)\,\ell(\theta;i)\\
&= \sum_{i=1}^n\frac{1}{n}\left(y_i-f(\mathbf{x}_i;\theta)\right)^2\\
&= \frac{1}{n}\sum_{i=1}^n\left(y_i-f(\mathbf{x}_i;\theta)\right)^2\\
&= L(\theta).
\end{split}\]

Great! Minimizing \(L(\theta)\) is the same as minimizing \(\mathbb{E}_I[\ell(\theta;I)]\).

Let’s now do it again, using an \(m\)-sized randomly selected batch from the observed data. Take \(I_1,I_2,\dots,I_m\) to be independent and identically distributed Categoricals that choose with equal probability an index from 1 to \(n\). Then define:

\[ \ell_m(\theta;I_{1:m}) = \frac{1}{m}\sum_{j=1}^m\left(y_{I_j}-f(\mathbf{x}_{I_j};\theta)\right)^2. \]

So, here \(Z = (I_1,\dots,I_m)\). Now take the expectation of this over the \(I\)’s:

\[\begin{split}
\mathbb{E}[\ell_m(\theta;I_{1:m})] &= \mathbb{E}\left[\frac{1}{m}\sum_{j=1}^m\left(y_{I_j}-f(\mathbf{x}_{I_j};\theta)\right)^2\right]\\
&= \frac{1}{m}\sum_{j=1}^m\mathbb{E}\left[\left(y_{I_j}-f(\mathbf{x}_{I_j};\theta)\right)^2\right]\\
&= \frac{1}{m}\sum_{j=1}^m L(\theta)\\
&= \frac{m}{m}L(\theta)\\
&= L(\theta),
\end{split}\]

where we have used that \(\mathbb{E}\left[\left(y_{I_j}-f(\mathbf{x}_{I_j};\theta)\right)^2\right] = L(\theta)\), which follows from our previous analysis. Therefore, minimizing \(L(\theta)\) is the same as minimizing the expectation of \(\ell_m(\theta;I_{1:m})\).
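Here is a minimal sketch of the mini-batch estimator \(\ell_m\), again assuming a hypothetical callable `predict(x)` that returns \(f(\mathbf{x};\theta)\); averaging it over many random batches approximates \(L(\theta)\):

```python
import numpy as np

def minibatch_loss(predict, x_data, y_data, m, rng):
    # Draw I_1, ..., I_m uniformly (with replacement) from the n observation
    # indices and evaluate ell_m(theta; I_{1:m}).
    idx = rng.integers(0, len(y_data), size=m)
    residuals = y_data[idx] - np.array([predict(x) for x in x_data[idx]])
    return np.mean(residuals ** 2)
```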

The Robbins-Monro algorithm#

We have reached the point where we can discuss the simplest variant of a stochastic optimization algorithm: stochastic gradient descent, also known as the Robbins-Monro (RM) algorithm [Robbins and Monro, 1951]. It goes as follows. Take the stochastic optimization problem:

\[ \min_\theta \mathbb{E}_Z[\ell(\theta;Z)]. \]

and consider the RM algorithm:

  • Initialize \(\theta\) to \(\theta_0\)

  • Iterate:

\[ \theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta}\ell(\theta_t;z_t), \]

where \(z_t\) are independent samples of \(Z\).

In this algorithm, \(\theta_t\) evolves gradually by following a noisy gradient signal. The sequence \(\alpha_t\), known as the learning rate, is our choice. The Robbins-Monro theorem guarantees that the RM algorithm converges to a local minimum of the expectation \(\mathbb{E}_Z[\ell(\theta;Z)]\) if the learning rate satisfies the following properties:

\[ \sum_{t=1}^\infty \alpha_t = +\infty, \]

and

\[ \sum_{t=1}^\infty \alpha_t^2 < +\infty. \]

Intuitively, these properties say that the learning rate should converge to zero (an implication of the convergence of the second series) but not too fast (an implication of the divergence of the first series). Many sequences of learning rates satisfy these constraints. Here is a very commonly used one:

\[ \alpha_t = \frac{A}{(Bt + C)^\rho}, \]

with \(A\), \(B\), and \(C\) positive constants, and \(\rho\) a number greater than \(0.5\) and at most \(1\).
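Here is a minimal sketch of the Robbins-Monro iteration with this learning rate schedule; `grad_ell` and `sample_z` are hypothetical callables that return \(\nabla_\theta\ell(\theta;z)\) and an independent sample of \(Z\), respectively:

```python
import numpy as np

def robbins_monro(grad_ell, sample_z, theta0, n_iters=1000,
                  A=0.1, B=1.0, C=1.0, rho=0.75):
    # Stochastic gradient descent with the decaying learning rate
    # alpha_t = A / (B t + C)^rho, with 0.5 < rho <= 1.
    theta = np.array(theta0, dtype=float)
    for t in range(1, n_iters + 1):
        z = sample_z()                      # independent sample of Z
        alpha = A / (B * t + C) ** rho      # learning rate at iteration t
        theta = theta - alpha * grad_ell(theta, z)
    return theta
```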

Application of the Robbins-Monro algorithm to training regression networks#

The algorithm for training regression networks becomes:

\[ \theta_{t+1} = \theta_t - \alpha_t\nabla_{\theta} \frac{1}{m}\sum_{j=1}^m\left(y_{i_{tj}}-f(\mathbf{x}_{i_{tj}};\theta_t)\right)^2, \]

where \(i_{t1},\dots,i_{tm}\) are randomly selected indices of the observation data. Using properties of the gradient, you can also write this as:

\[ \theta_{t+1} = \theta_t + 2\alpha_t \frac{1}{m}\sum_{j=1}^m\left(y_{i_{tj}}-f(\mathbf{x}_{i_{tj}};\theta_t)\right)\nabla_{\theta}f(\mathbf{x}_{i_{tj}};\theta_t). \]

That’s it.

Notice that to carry out the algorithm, we need to compute \(\nabla_{\theta}f(\mathbf{x}_{i_{tj}};\theta_t)\), i.e., the gradient of the neural network output with respect to the parameters (weights and biases). This is done using the chain rule, and the resulting procedure is known as the back-propagation algorithm. We are not going to cover it. Nowadays, you don’t have to worry about derivatives. Software like PyTorch and JAX can find the derivatives for you. In the hands-on activity, I will introduce you to PyTorch.
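For concreteness, here is a minimal PyTorch sketch of the resulting training loop for the regression case, using synthetic data and a small fully-connected network (all names and hyperparameters are illustrative):

```python
import torch

# Synthetic data: n = 100 observations with d = 1 input and a scalar output.
x = torch.linspace(0.0, 1.0, 100).unsqueeze(-1)
y = torch.sin(2.0 * torch.pi * x) + 0.1 * torch.randn_like(x)

# A small fully-connected regression network f(x; theta).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 1),
)

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
m = 10  # mini-batch size

for t in range(1000):
    idx = torch.randint(0, x.shape[0], (m,))        # random batch of indices
    loss = torch.mean((y[idx] - net(x[idx])) ** 2)  # ell_m(theta; I_{1:m})
    optimizer.zero_grad()
    loss.backward()   # back-propagation computes the gradients
    optimizer.step()  # gradient step on the parameters
```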

Advanced variations of stochastic gradient descent#

The RM algorithm is the simplest stochastic optimization algorithm I could explain in a lecture. It works, but it is not the most commonly used variant. More powerful algorithms such as stochastic gradient descent with momentum, AdaGrad, and Adam (adaptive moment estimation) exist. I will show you in the hands-on activities how to use these algorithms as implemented in PyTorch, but I will not explain their details. If you want to know the details, please read Chapter 8 of [Goodfellow et al., 2016].
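In PyTorch, switching between these optimizers only changes how the optimizer is constructed; a minimal sketch:

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 20), torch.nn.ReLU(), torch.nn.Linear(20, 1)
)

# Adam (adaptive moment estimation):
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Other common choices are constructed the same way:
# torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)  # SGD with momentum
# torch.optim.Adagrad(net.parameters(), lr=1e-2)            # AdaGrad
```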