Tuning the Hyperparameters

MAKE_BOOK_FIGURES=False
import numpy as np
import scipy.stats as st

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white') 
    sns.set_style("ticks")
    sns.set_palette("deep")

    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,  # As requested, 10pt font
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,  # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        
        # Remove legend frame
        'legend.frameon': False,
        
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def set_notebook_style():
    plt.style.use('seaborn-v0_8-white')
    sns.set_style("ticks")
    sns.set_palette("deep")

    mpl.rcParams.update({
        # Font settings - using default sizes
        'font.family': 'serif',
        'axes.labelsize': 10,
        'axes.titlesize': 10,
        'xtick.labelsize': 9,
        'ytick.labelsize': 9,
        'legend.fontsize': 9,
        
        # Line and marker settings
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        
        # Layout settings
        'figure.constrained_layout.use': True,
        
        # Remove only top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        
        # Remove legend frame
        'legend.frameon': False,
        
        # Additional settings
        'figure.autolayout': True,
        'savefig.bbox': 'tight',
        'savefig.pad_inches': 0.1
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.
    
    Parameters:
    -----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """    
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)

def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()
else:
    set_notebook_style()

make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

Tuning the Hyperparameters#

We tune the hyper-parameters of a Gaussian process using maximum marginal likelihood. See Chapter 5 of [Rasmussen and Williams, 2005] for details.

Again, you may need to install the gpytorch package.

!pip install gpytorch

Show code cell output Hide code cell output

Requirement already satisfied: gpytorch in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (1.14)
Requirement already satisfied: jaxtyping in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from gpytorch) (0.2.33)
Requirement already satisfied: mpmath<=1.3,>=0.19 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from gpytorch) (1.3.0)
Requirement already satisfied: scikit-learn in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from gpytorch) (1.6.1)
Requirement already satisfied: scipy>=1.6.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from gpytorch) (1.12.0)
Requirement already satisfied: linear-operator>=0.6 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from gpytorch) (0.6)
Requirement already satisfied: torch>=2.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from linear-operator>=0.6->gpytorch) (2.7.1)
Requirement already satisfied: numpy<1.29.0,>=1.22.4 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from scipy>=1.6.0->gpytorch) (1.26.4)
Requirement already satisfied: filelock in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (3.18.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (4.12.2)
Requirement already satisfied: setuptools in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (73.0.1)
Requirement already satisfied: sympy>=1.13.3 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (1.13.3)
Requirement already satisfied: networkx in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (3.4.2)
Requirement already satisfied: jinja2 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (3.1.4)
Requirement already satisfied: fsspec in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from torch>=2.0->linear-operator>=0.6->gpytorch) (2024.12.0)
Requirement already satisfied: typeguard==2.13.3 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from jaxtyping->gpytorch) (2.13.3)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from jinja2->torch>=2.0->linear-operator>=0.6->gpytorch) (2.1.5)
Requirement already satisfied: joblib>=1.2.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from scikit-learn->gpytorch) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/ibilion/.pyenv/versions/3.12.5/lib/python3.12/site-packages (from scikit-learn->gpytorch) (3.5.0)

Example: Gaussian process regression with fitted hyperparameters#

Let’s generate synthetic data:

../_images/fe1739fbee9250ef6e7d4452e708c0c13c6c29c28cdc8c7ff262bcb56516e5f4.svg

We repeat some of the code from the previous section:

Show code cell source Hide code cell source

def plot_1d_regression(
    x_star,
    model,
    ax=None,
    f_true=None,
    num_samples=10
):
    """Plot the posterior predictive.
    
    Arguments
    x_start  --  The test points on which to evaluate.
    model    --  The trained model.
    
    Keyword Arguments
    ax          --  An axes object to write on.
    f_true      --  The true function.
    num_samples --  The number of samples.
    """
    f_star = model(x_star)
    m_star = f_star.mean
    v_star = f_star.variance
    y_star = model.likelihood(f_star)
    yv_star = y_star.variance

    f_lower = (
        m_star - 2.0 * torch.sqrt(v_star)
    )
    f_upper = (
        m_star + 2.0 * torch.sqrt(v_star)
    )
    
    y_lower = m_star - 2.0 * torch.sqrt(yv_star)
    y_upper = m_star + 2.0 * torch.sqrt(yv_star)

    if ax is None:
        fig, ax = plt.subplots()
    
    ax.plot(model.train_inputs[0].flatten().detach(),
            model.train_targets.detach(),
            'kx',
            markersize=10,
            markeredgewidth=2,
            label='Observations'
    )

    ax.plot(
        x_star,
        m_star.detach(),
        lw=2,
        label='$m_n(x)$',
        color=sns.color_palette()[0]
    )
    
    ax.fill_between(
        x_star.flatten().detach(),
        f_lower.flatten().detach(),
        f_upper.flatten().detach(),
        alpha=0.5,
        label=r'$f(\mathbf{x}^*)$ 95% pred.',
        color=sns.color_palette()[0]
    )

    ax.fill_between(
        x_star.detach().flatten(),
        y_lower.detach().flatten(),
        f_lower.detach().flatten(),
        color=sns.color_palette()[1],
        alpha=0.5,
        label='$y^*$ 95% pred.'
    )
    ax.fill_between(
        x_star.detach().flatten(),
        f_upper.detach().flatten(),
        y_upper.detach().flatten(),
        color=sns.color_palette()[1],
        alpha=0.5,
        label=None
    )

    
    if f_true is not None:
        ax.plot(
            x_star,
            f_true(x_star),
            'm-.',
            label='True function'
        )
        
    if num_samples > 0:
        f_post_samples = f_star.sample(
            sample_shape=torch.Size([10])
        )
        ax.plot(
            x_star,
            f_post_samples.T.detach(),
            color="red",
            lw=0.5
        )
        # This is just to add the legend entry
        ax.plot(
            [],
            [],
            color="red",
            lw=0.5,
            label="Posterior samples"
        )
        
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')

    plt.legend(loc='best', frameon=False)
    sns.despine(trim=True);

Let’s build a Gaussian process model as usual:

The model has the following parameters:

model.mean.constant: the constant of the mean function.
model.covar_module.outputscale: the output scale of the kernel (variance or signal strength).
model.covar_module.base_kernel.lengthscale: the length scale of the kernel.
model.likelihood.noise: the standard deviation of the noise.

The mean constant can be any real number. The output scale, the length scale, and the noise standard deviation must be positive. Typically, one explicitly enforces the constraints by working with a transformed version of the parameters. For example, one works with its logarithm instead of the length scale.

GPyTorch automatically does this for us. If you want to learn more about this, go over over this. GPyTorch calls the transformed parameters raw_<NAME>. For example, the raw length scale is model.covar_module.base_kernel.raw_lengthscale. The raw parameters are the ones that are optimized. Here is how you can see them:

for name, param in model.named_parameters():
    print(f'Parameter name: {name:42} value = {param.item():1.2f}')

Parameter name: likelihood.noise_covar.raw_noise           value = 0.00
Parameter name: mean_module.raw_constant                   value = 0.00
Parameter name: covar_module.raw_outputscale               value = 0.00
Parameter name: covar_module.base_kernel.raw_lengthscale   value = 0.00

Here is now how we can maximize the marginal log likelihood:

# THis is necessary to initialize the model
# but it is not all we need to do
model.train()

# This is the optimizer that we will use
optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn='strong_wolfe')

# This is the loss function
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# This functions runs at each iteration of the optimizer
# and it calculates the loss and its gradients
def closure():
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    return loss

# This is the training loop
n_iter = 10
for i in range(n_iter):
    loss = optimizer.step(closure)
    if (i + 1) % 1 == 0:
        print(f'Iter {i + 1:3d}/{n_iter} - Loss: {loss.item():.3f}')

# We are done training, so we can switch to eval mode
model.eval();

Iter   1/10 - Loss: 1.195
Iter   2/10 - Loss: 0.936
Iter   3/10 - Loss: 0.936
Iter   4/10 - Loss: 0.936
Iter   5/10 - Loss: 0.936
Iter   6/10 - Loss: 0.936
Iter   7/10 - Loss: 0.936
Iter   8/10 - Loss: 0.936
Iter   9/10 - Loss: 0.936
Iter  10/10 - Loss: 0.936

It is convenient to organize this into a function that we can reuse:

def train(model, train_x, train_y, n_iter=10, lr=0.1):
    """Train the model.

    Arguments
    model   --  The model to train.
    train_x --  The training inputs.
    train_y --  The training labels.
    n_iter  --  The number of iterations.
    """
    model.train()
    optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn='strong_wolfe')
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        return loss
    for i in range(n_iter):
        loss = optimizer.step(closure)
        if (i + 1) % 1 == 0:
            print(f'Iter {i + 1:3d}/{n_iter} - Loss: {loss.item():.3f}')
    model.eval()

Here are the trained parameters:

for name, param in model.named_parameters():
    print(f'Parameter name: {name:42} value = {param.item():1.2f}')

Parameter name: likelihood.noise_covar.raw_noise           value = -2.49
Parameter name: mean_module.raw_constant                   value = -0.04
Parameter name: covar_module.raw_outputscale               value = -0.63
Parameter name: covar_module.base_kernel.raw_lengthscale   value = -3.73

In terms of the original parameters:

print(f"mean constant: {model.mean_module.constant.item():.3f}")
print(f"output scale: {model.covar_module.outputscale.item():.3f}")
print(f"length scale: {model.covar_module.base_kernel.lengthscale.item():.3f}")
print(f"noise: {model.likelihood.noise.item():.3f}")

mean constant: -0.041
output scale: 0.428
length scale: 0.024
noise: 0.080

Let’s plot the predictions:

x_star = torch.linspace(0, 1, 100)
plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)

../_images/21661b7f265d7a5f34bb9d9d5d1c9f1180215cee06497430e59bc4b9f2623795.svg

Admittedly, this could look better. (We can tell this only because we know the truth). The assigned length scale is too small. Also, the likelihood variance seems smaller than it is. What do we do now? You have two choices:

You encode some prior knowledge and repeat.
You add some more data and repeat.

Let’s start with prior knowledge and leave the other item for the questions section. Let’s say that we know that the noise variance. How do we encode this? Here you go:

model.likelihood.noise_covar.noise = 0.1

# Verify that the change went through
print(f"noise: {model.likelihood.noise.item():.3f}")

# To make sure that the noise is not optimized, we need to do:
model.likelihood.raw_noise.requires_grad_(False);

noise: 0.100

This makes the optimizer ignore the noise parameter. Let’s retrain:

train(model, train_x, train_y)

Iter   1/10 - Loss: 0.939
Iter   2/10 - Loss: 0.939
Iter   3/10 - Loss: 0.939
Iter   4/10 - Loss: 0.939
Iter   5/10 - Loss: 0.939
Iter   6/10 - Loss: 0.939
Iter   7/10 - Loss: 0.939
Iter   8/10 - Loss: 0.939
Iter   9/10 - Loss: 0.939
Iter  10/10 - Loss: 0.939

plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)

../_images/2126418617089499c1f6d5c807c2b09cb6ec1806ec5fa5c9fdec9097cfd4dbd4.svg

This looks better. But, the automatically selected length scale seems smaller than the true one. Let’s assign a prior probability density on the length scale so that it is forced to be greater. Since we are dealing with a positive parameter and we don’t know much about it, let’s assign an exponential prior with a rate of 2 (which will yield an expectation of 0.5): $$ \ell \sim \operatorname{Log-N}(0.2, 1). $$

model.covar_module.register_prior(
    "lengthscale_prior",
    gpytorch.priors.LogNormalPrior(0.5, 1),
    lambda module: module.base_kernel.lengthscale
)

Let’s see what difference this makes:

train(model, train_x, train_y)

Iter   1/10 - Loss: 1.510
Iter   2/10 - Loss: 1.226
Iter   3/10 - Loss: 1.226
Iter   4/10 - Loss: 1.226
Iter   5/10 - Loss: 1.226
Iter   6/10 - Loss: 1.226
Iter   7/10 - Loss: 1.226
Iter   8/10 - Loss: 1.226
Iter   9/10 - Loss: 1.226
Iter  10/10 - Loss: 1.226

Now here is how you can set it:

print(f"mean constant: {model.mean_module.constant.item():.3f}")
print(f"output scale: {model.covar_module.outputscale.item():.3f}")
print(f"length scale: {model.covar_module.base_kernel.lengthscale.item():.3f}")
print(f"noise: {model.likelihood.noise.item():.3f}")

mean constant: 0.081
output scale: 0.589
length scale: 0.111
noise: 0.100

plot_1d_regression(x_star, model, f_true=f_true, num_samples=0)

../_images/1953f8c3d50090b238620b9f894db4d742a6c83c101c22586ef2d1c9a61d7895.svg

That’s better, but not perfect. But remember: You don’t know what the truth is. There is only so much you can get without additional data.

Questions#

Let’s investigate what happens to the previous examples as you increase the observations.

Rerun everything gradually increasing the number of samples from $n=10$ to $n=100$. Notice that as the number of samples increases, it doesn’t matter what your prior knowledge is. As a matter of fact, for the largest number of samples you try, pick a very wrong prior probability for $\ell$. See what happens.
Rerun everything with $\sigma=0.01$ (small noise) and gradually increasing the number of samples from $n=10$ to $n=100$. For small noise, the model is trying to interpolate. Is it capable of detecting that the noise is small when the number of observations is limited? When does the method realize the noise is indeed small?

Tuning the Hyperparameters

Contents

Tuning the Hyperparameters#

Example: Gaussian process regression with fitted hyperparameters#

Questions#