Hide code cell source
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

Tuning the Hyperparameters#

We tune the hyper-parameters of a Gaussian process using maximum marginal likelihood. See Chapter 5 of [Rasmussen and Williams, 2005] for details.

Again, you may need to install the gpytorch package.

!pip install gpytorch

Example: Gaussian process regression with fitted hyperparameters#

Let’s generate synthetic data:

Hide code cell source
np.random.seed(1234)

n = 10
X = np.random.rand(n)
sigma = 0.4
f_true = lambda x: -np.cos(np.pi * x) + np.sin(4. * np.pi * x)
Y = f_true(X) + sigma * np.random.randn(X.shape[0])

fig, ax = plt.subplots()
ax.plot(X, Y, 'kx', markersize=10, markeredgewidth=2)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
sns.despine(trim=True);
../_images/a85c06bc813ab4c4c68cb2c0899484bfd875d8d0bafc4efe9839687b21c9768c.svg

We repeat some of the code from the previous section:

Hide code cell source
def plot_1d_regression(
    x_star,
    model,
    ax=None,
    f_true=None,
    num_samples=10
):
    """Plot the posterior predictive.
    
    Arguments
    x_start  --  The test points on which to evaluate.
    model    --  The trained model.
    
    Keyword Arguments
    ax          --  An axes object to write on.
    f_true      --  The true function.
    num_samples --  The number of samples.
    """
    f_star = model(x_star)
    m_star = f_star.mean
    v_star = f_star.variance
    y_star = model.likelihood(f_star)
    yv_star = y_star.variance

    f_lower = (
        m_star - 2.0 * torch.sqrt(v_star)
    )
    f_upper = (
        m_star + 2.0 * torch.sqrt(v_star)
    )
    
    y_lower = m_star - 2.0 * torch.sqrt(yv_star)
    y_upper = m_star + 2.0 * torch.sqrt(yv_star)

    if ax is None:
        fig, ax = plt.subplots()
    
    ax.plot(model.train_inputs[0].flatten().detach(),
            model.train_targets.detach(),
            'kx',
            markersize=10,
            markeredgewidth=2,
            label='Observations'
    )

    ax.plot(
        x_star,
        m_star.detach(),
        lw=2,
        label='$m_n(x)$',
        color=sns.color_palette()[0]
    )
    
    ax.fill_between(
        x_star.flatten().detach(),
        f_lower.flatten().detach(),
        f_upper.flatten().detach(),
        alpha=0.5,
        label='$f(\mathbf{x}^*)$ 95% pred.',
        color=sns.color_palette()[0]
    )

    ax.fill_between(
        x_star.detach().flatten(),
        y_lower.detach().flatten(),
        f_lower.detach().flatten(),
        color=sns.color_palette()[1],
        alpha=0.5,
        label='$y^*$ 95% pred.'
    )
    ax.fill_between(
        x_star.detach().flatten(),
        f_upper.detach().flatten(),
        y_upper.detach().flatten(),
        color=sns.color_palette()[1],
        alpha=0.5,
        label=None
    )

    
    if f_true is not None:
        ax.plot(
            x_star,
            f_true(x_star),
            'm-.',
            label='True function'
        )
        
    if num_samples > 0:
        f_post_samples = f_star.sample(
            sample_shape=torch.Size([10])
        )
        ax.plot(
            x_star,
            f_post_samples.T.detach(),
            color="red",
            lw=0.5
        )
        # This is just to add the legend entry
        ax.plot(
            [],
            [],
            color="red",
            lw=0.5,
            label="Posterior samples"
        )
        
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')

    plt.legend(loc='best', frameon=False)
    sns.despine(trim=True);

Let’s build a Gaussian process model as usual:

Hide code cell source
import torch
import gpytorch
from gpytorch.kernels import RBFKernel, ScaleKernel

class ExactGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = ScaleKernel(RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=gpytorch.constraints.GreaterThan(0.0)
)
train_x = torch.from_numpy(X).float()
train_y = torch.from_numpy(Y).float()
model = ExactGP(train_x, train_y, likelihood)

The model has the following parameters:

  • model.mean.constant: the constant of the mean function.

  • model.covar_module.outputscale: the output scale of the kernel (variance or signal strength).

  • model.covar_module.base_kernel.lengthscale: the length scale of the kernel.

  • model.likelihood.noise: the standard deviation of the noise.

The mean constant can be any real number. The output scale, the length scale, and the noise standard deviation must be positive. Typically, one explicitly enforces the constraints by working with a transformed version of the parameters. For example, one works with its logarithm instead of the length scale.

GPyTorch automatically does this for us. If you want to learn more about this, go over over this. GPyTorch calls the transformed parameters raw_<NAME>. For example, the raw length scale is model.covar_module.base_kernel.raw_lengthscale. The raw parameters are the ones that are optimized. Here is how you can see them:

for name, param in model.named_parameters():
    print(f'Parameter name: {name:42} value = {param.item():1.2f}')
Parameter name: likelihood.noise_covar.raw_noise           value = 0.00
Parameter name: mean_module.raw_constant                   value = 0.00
Parameter name: covar_module.raw_outputscale               value = 0.00
Parameter name: covar_module.base_kernel.raw_lengthscale   value = 0.00

Here is now how we can maximize the marginal log likelihood:

# THis is necessary to initialize the model
# but it is not all we need to do
model.train()

# This is the optimizer that we will use
optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn='strong_wolfe')

# This is the loss function
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# This functions runs at each iteration of the optimizer
# and it calculates the loss and its gradients
def closure():
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    return loss

# This is the training loop
n_iter = 10
for i in range(n_iter):
    loss = optimizer.step(closure)
    if (i + 1) % 1 == 0:
        print(f'Iter {i + 1:3d}/{n_iter} - Loss: {loss.item():.3f}')

# We are done training, so we can switch to eval mode
model.eval();
Iter   1/10 - Loss: 1.195
Iter   2/10 - Loss: 0.936
Iter   3/10 - Loss: 0.936
Iter   4/10 - Loss: 0.936
Iter   5/10 - Loss: 0.936
Iter   6/10 - Loss: 0.936
Iter   7/10 - Loss: 0.936
Iter   8/10 - Loss: 0.936
Iter   9/10 - Loss: 0.936
Iter  10/10 - Loss: 0.936

It is convenient to organize this into a function that we can reuse:

def train(model, train_x, train_y, n_iter=10, lr=0.1):
    """Train the model.

    Arguments
    model   --  The model to train.
    train_x --  The training inputs.
    train_y --  The training labels.
    n_iter  --  The number of iterations.
    """
    model.train()
    optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn='strong_wolfe')
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        return loss
    for i in range(n_iter):
        loss = optimizer.step(closure)
        if (i + 1) % 1 == 0:
            print(f'Iter {i + 1:3d}/{n_iter} - Loss: {loss.item():.3f}')
    model.eval()

Here are the trained parameters:

for name, param in model.named_parameters():
    print(f'Parameter name: {name:42} value = {param.item():1.2f}')
Parameter name: likelihood.noise_covar.raw_noise           value = -2.49
Parameter name: mean_module.raw_constant                   value = -0.04
Parameter name: covar_module.raw_outputscale               value = -0.63
Parameter name: covar_module.base_kernel.raw_lengthscale   value = -3.73

In terms of the original parameters:

print(f"mean constant: {model.mean_module.constant.item():.3f}")
print(f"output scale: {model.covar_module.outputscale.item():.3f}")
print(f"length scale: {model.covar_module.base_kernel.lengthscale.item():.3f}")
print(f"noise: {model.likelihood.noise.item():.3f}")
mean constant: -0.041
output scale: 0.428
length scale: 0.024
noise: 0.080

Let’s plot the predictions:

x_star = torch.linspace(0, 1, 100)
plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)
../_images/88b1b993c6df3f4159c2667be830a8a19837b339c2834e22fb574bf001fd3fa6.svg

Admittedly, this could look better. (We can tell this only because we know the truth). The assigned length scale is too small. Also, the likelihood variance seems smaller than it is. What do we do now? You have two choices:

  • You encode some prior knowledge and repeat.

  • You add some more data and repeat.

Let’s start with prior knowledge and leave the other item for the questions section. Let’s say that we know that the noise variance. How do we encode this? Here you go:

model.likelihood.noise_covar.noise = 0.1

# Verify that the change went through
print(f"noise: {model.likelihood.noise.item():.3f}")

# To make sure that the noise is not optimized, we need to do:
model.likelihood.raw_noise.requires_grad_(False);
noise: 0.100

This makes the optimizer ignore the noise parameter. Let’s retrain:

train(model, train_x, train_y)
Iter   1/10 - Loss: 0.939
Iter   2/10 - Loss: 0.939
Iter   3/10 - Loss: 0.939
Iter   4/10 - Loss: 0.939
Iter   5/10 - Loss: 0.939
Iter   6/10 - Loss: 0.939
Iter   7/10 - Loss: 0.939
Iter   8/10 - Loss: 0.939
Iter   9/10 - Loss: 0.939
Iter  10/10 - Loss: 0.939
plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)
../_images/d3b144489aaf0f3724b7752236e5d73af312d56db44ffa8c475263ebb8a991db.svg

This looks better. But, the automatically selected length scale seems smaller than the true one. Let’s assign a prior probability density on the length scale so that it is forced to be greater. Since we are dealing with a positive parameter and we don’t know much about it, let’s assign an exponential prior with a rate of 2 (which will yield an expectation of 0.5): $\( \ell \sim \operatorname{Log-N}(0.2, 1). \)$

model.covar_module.register_prior(
    "lengthscale_prior",
    gpytorch.priors.LogNormalPrior(0.5, 1),
    lambda module: module.base_kernel.lengthscale
)

Let’s see what difference this makes:

train(model, train_x, train_y)
Iter   1/10 - Loss: 1.510
Iter   2/10 - Loss: 1.226
Iter   3/10 - Loss: 1.226
Iter   4/10 - Loss: 1.226
Iter   5/10 - Loss: 1.226
Iter   6/10 - Loss: 1.226
Iter   7/10 - Loss: 1.226
Iter   8/10 - Loss: 1.226
Iter   9/10 - Loss: 1.226
Iter  10/10 - Loss: 1.226

Now here is how you can set it:

print(f"mean constant: {model.mean_module.constant.item():.3f}")
print(f"output scale: {model.covar_module.outputscale.item():.3f}")
print(f"length scale: {model.covar_module.base_kernel.lengthscale.item():.3f}")
print(f"noise: {model.likelihood.noise.item():.3f}")
mean constant: 0.081
output scale: 0.589
length scale: 0.111
noise: 0.100
plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)
../_images/c5189b3af44474682981eac95b70f5f4bd4e73a235d029cb26e3f3cdc5cf783d.svg

That’s better, but not perfect. But remember: You don’t know what the truth is. There is only so much you can get without additional data.

Questions#

Let’s investigate what happens to the previous examples as you increase the observations.

  • Rerun everything gradually increasing the number of samples from \(n=10\) to \(n=100\). Notice that as the number of samples increases, it doesn’t matter what your prior knowledge is. As a matter of fact, for the largest number of samples you try, pick a very wrong prior probability for \(\ell\). See what happens.

  • Rerun everything with \(\sigma=0.01\) (small noise) and gradually increasing the number of samples from \(n=10\) to \(n=100\). For small noise, the model is trying to interpolate. Is it capable of detecting that the noise is small when the number of observations is limited? When does the method realize the noise is indeed small?