import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks");

Linear regression with a single variable#

We will now look at a simple linear regression example with a single variable.

An example where things work as expected#

Let’s create a synthetic dataset to introduce the basic concepts. It must be synthetic because we want to know the ground truth. Let’s start with pairs of \(x\) and \(y\), which have a linear relationship. We contaminate \(y\) with Gaussian noise. In particular, we generate the data from:

\[ y_i = -0.5 + 2 x_i + 0.1\epsilon_i, \]

where \(\epsilon_i \sim N(0,1)\) and where we sample \(x_i \sim U([0,1])\). Here is how to generate this synthetic dataset and what it looks like:

np.random.seed(12345)

num_obs = 10
x = np.random.rand(num_obs)
w0_true = -0.5
w1_true = 2.0
sigma_true = 0.1
y = (
    w0_true
    + w1_true * x
    + sigma_true * np.random.randn(num_obs)
)

Let’s plot the data:

fig, ax = plt.subplots()
ax.plot(x, y, 'x', label='Observed data')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.legend(loc='best', frameon=True)
sns.despine(trim=True);

We will now use least squares to fit the data to this linear model:

\[ y = w_0 + w_1 x. \]

Least squares minimizes the squared loss:

\[ L(\mathbf{w}) = \sum_{i=1}^n(y_i - w_0 - w_1 x_i)^2 = \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2, \]

where \(\mathbf{y} = (y_1,\dots,y_n)\) is the vector of observations, \(\mathbf{w} = (w_0, w_1)\) is the weight vector, and the \(n\times 2\) design matrix \(\mathbf{X}\) is:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}. \end{split}\]

We need to make the design matrix \(\mathbf{X}\):

# Put together a column of ones next to the observed x's
X = np.hstack(
    [np.ones((num_obs, 1)), x.reshape((num_obs, 1))]
)
X
array([[1.        , 0.92961609],
       [1.        , 0.31637555],
       [1.        , 0.18391881],
       [1.        , 0.20456028],
       [1.        , 0.56772503],
       [1.        , 0.5955447 ],
       [1.        , 0.96451452],
       [1.        , 0.6531771 ],
       [1.        , 0.74890664],
       [1.        , 0.65356987]])
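
Before solving for the weights, it may help to see the loss written out as code. The following is a minimal sketch (the helper name square_loss is just for illustration and is not used elsewhere in this notebook) that evaluates \(L(\mathbf{w})\) using the X and y defined above:

# Illustrative helper for the squared loss L(w) = ||y - X w||^2
def square_loss(w, X, y):
    residuals = y - X @ w
    return np.sum(residuals ** 2)

# The true weights leave only the noise, so the loss is small;
# a bad guess such as w = (0, 0) gives a much larger loss
print(square_loss(np.array([w0_true, w1_true]), X, y))
print(square_loss(np.zeros(2), X, y))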

Once we have this, we can use numpy.linalg.lstsq to solve the least squares problem. This function solves the linear system we derived in the previous section, i.e.,

\[ \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}. \]

It works as follows:

w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(f'w_0 = {w[0]:1.2f}')
print(f'w_1 = {w[1]:1.2f}')
w_0 = -0.48
w_1 = 1.99
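
For a small, well-conditioned problem like this one, you can check that solving the normal equations directly gives essentially the same answer (numpy.linalg.lstsq avoids forming \(\mathbf{X}^T\mathbf{X}\) explicitly, which is more robust in general). A minimal sketch, assuming the X and y from above:

# Solve X^T X w = X^T y directly and compare with the lstsq result
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
print(f'w_0 = {w_normal[0]:1.2f}')
print(f'w_1 = {w_normal[1]:1.2f}')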

So, the values we found for \(w_0\) and \(w_1\) are close to the true values, but the agreement is not perfect: there is noise in the data, and we have used only ten observations. The noisier the data, the more observations it takes to identify the regression coefficients accurately.

Let’s now plot the regression function against the data:

# Make predictions
# Some points on which to evaluate the regression function
xx = np.linspace(0, 1, 100)
# The true connection between x and y
yy_true = w0_true + w1_true * xx
# The model we just fitted
yy = w[0] + w[1] * xx

# Plot them
fig, ax = plt.subplots()
# plot the data again
ax.plot(x, y, 'x', label='Observed data')
# overlay the true response surface
ax.plot(xx, yy_true, label='True response surface')
# overlay our prediction
ax.plot(xx, yy, '--', label='Fitted model')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

Questions#

  • Try increasing num_obs to 100. Does the fit improve? Conclusion: when training with least squares, more data is better. (The helper sketched below makes it easy to rerun this experiment.)

  • Try decreasing num_obs to 2. What happens here? This is an example of fitting the noise.
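
To make experiments like these easier, here is a hypothetical helper (the name fit_line and its interface are not part of the original notebook) that regenerates the synthetic data and refits the model for a given number of observations:

# Hypothetical helper: regenerate the synthetic data and refit by least squares
def fit_line(num_obs, sigma=0.1, seed=12345):
    np.random.seed(seed)
    x = np.random.rand(num_obs)
    y = -0.5 + 2.0 * x + sigma * np.random.randn(num_obs)
    X = np.hstack([np.ones((num_obs, 1)), x.reshape((num_obs, 1))])
    w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return w

# The estimates should approach (-0.5, 2.0) as num_obs grows
for n in [2, 10, 100, 1000]:
    print(n, fit_line(n))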

An example where things do not work as expected: underfitting#

Let’s try to fit a linear regression model to data generated from:

\[ y_i = -0.5 + 2x_i + 2x_i^2 + 0.1\epsilon_i, \]

where \(\epsilon_i \sim N(0, 1)\) and where we sample \(x_i \sim U([-1,1])\):

np.random.seed(12345)

num_obs = 10
x = -1.0 + 2 * np.random.rand(num_obs)
w0_true = -0.5
w1_true = 2.0
w2_true = 2.0
sigma_true = 0.1
y = (
    w0_true
    + w1_true * x
    + w2_true * x ** 2
    + sigma_true * np.random.randn(num_obs)
)

fig, ax = plt.subplots()
ax.plot(x, y, 'x', label='Observed data')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

We will still fit a linear model to this dataset. We know it will not work well, but let’s try it anyway. First, create the design matrix just like before:

X = np.hstack(
    [np.ones((num_obs, 1)), x.reshape((num_obs, 1))]
)

w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print(f'w_0 = {w[0]:1.2f}')
print(f'w_1 = {w[1]:1.2f}')
w_0 = 0.03
w_1 = 2.46

# Make predictions
xx = np.linspace(-1, 1, 100)
yy_true = w0_true + w1_true * xx + w2_true * xx ** 2
yy = w[0] + w[1] * xx

# Plot them
fig, ax = plt.subplots()
ax.plot(x, y, 'x', label='Observed data')
ax.plot(xx, yy_true, label='True response surface')
ax.plot(xx, yy, '--', label='Fitted model')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

Questions#

  • Experiment with a very small num_obs. If you did not know the true response surface, could you tell whether or not the fit is good?

  • Experiment with a large num_obs. Does the fit improve? This is an example of underfitting: the model does not have enough expressivity to capture the data. (See the sketch after these questions for one way to check this numerically.)
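
One way to check this numerically is to look at the mean squared residual as the number of observations grows. The sketch below (the helper name underfit_error is hypothetical) follows the same recipe as the code above; for an underfit model the residual error plateaus well above the noise variance \(\sigma^2 = 0.01\) instead of shrinking toward it:

# Hypothetical check: for an underfit model, the mean squared residual does
# not shrink toward the noise variance (0.1 ** 2 = 0.01) as num_obs grows
def underfit_error(num_obs, seed=12345):
    np.random.seed(seed)
    x = -1.0 + 2 * np.random.rand(num_obs)
    y = -0.5 + 2.0 * x + 2.0 * x ** 2 + 0.1 * np.random.randn(num_obs)
    X = np.hstack([np.ones((num_obs, 1)), x.reshape((num_obs, 1))])
    w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ w) ** 2)

for n in [10, 100, 1000, 10000]:
    print(n, underfit_error(n))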