import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

import urllib.request
import os

def download(
    url : str,
    local_filename : str = None
):
    """Download a file from a url.
    
    Arguments
    url            -- The url we want to download.
    local_filename -- The filemame to write on. If not
                      specified 
    """
    if local_filename is None:
        local_filename = os.path.basename(url)
    urllib.request.urlretrieve(url, local_filename)

Regression with Deep Neural Networks#

We want to understand the basics of PyTorch and set up and train regression DNNs with it. Useful references are the official PyTorch documentation and tutorials, such as the Deep Learning with PyTorch: A 60-minute blitz mentioned below.

What is PyTorch, and why are we using it?#

  • PyTorch is an alternative to NumPy that can harness the power of GPUs.

  • PyTorch provides some core functionality for Neural Networks:

    • Some essential elements for building them up, like linear layers, activation functions, etc.

    • Automatic differentiation for getting the derivative of loss functions with respect to parameters (see the short sketch after this list).

    • Some stochastic optimization algorithms for minimizing loss functions.
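
To make the automatic-differentiation point concrete, here is a minimal sketch of my own (not part of the original notebook) of how PyTorch computes a derivative for you:

import torch

# A scalar parameter with respect to which we want derivatives
theta = torch.tensor(2.0, requires_grad=True)

# A simple "loss": L(theta) = theta ** 2 + 3 * theta
loss = theta ** 2 + 3.0 * theta

# Automatic differentiation: fills theta.grad with dL/dtheta
loss.backward()

print(theta.grad)   # tensor(7.) because dL/dtheta = 2 * theta + 3 = 7 at theta = 2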

I am not going to provide here a complete tutorial on PyTorch. You are advised to go over the first three topics of the Deep Learning with PyTorch: A 60-minute blitz before beginning this hands-on activity. Otherwise, it is unlikely that you will understand the code that follows.

PyTorch was developed by the Facebook AI Research Group (now Meta AI Research, I guess). Another powerful alternative, developed by Google Brain, is TensorFlow. However, TensorFlow is gradually losing popularity; instead of TensorFlow, Google Brain is now developing JAX. There are pros and cons to each of these frameworks. In research, I use either PyTorch or JAX, depending on what we are doing. It is worth learning both of them, but we will stick to PyTorch in this course.

Making neural networks in PyTorch#

PyTorch is pretty flexible in allowing you to make any neural network you like. You have absolute freedom in how your model looks. However, it does provide a super easy way to make dense neural networks with a fixed activation function. That’s what we are going to start with. First, import torch:

import torch

The submodule torch.nn is where the neural network building blocks reside:

import torch.nn as nn

First, let me show you how you can make a single linear layer:

\[ y = Wx + b. \]

The weights are selected randomly if not specified. Here you go:

layer = nn.Linear(1, 20)

This function now takes one-dimensional inputs and spits out 20-dimensional outputs. Here is how it works:

x = torch.rand(10, 1) # 10 randomly sampled one-dimensional inputs
print(x)
tensor([[0.4473],
        [0.4693],
        [0.6867],
        [0.4315],
        [0.1168],
        [0.1775],
        [0.1886],
        [0.8446],
        [0.0689],
        [0.2172]])
y = layer(x)
print(y)
tensor([[-0.2990,  0.0449, -0.5824, -0.8878, -1.0085,  0.2322,  0.6669,  0.2586,
          1.1834, -0.4081,  0.1665, -0.5788,  0.5780, -0.7827, -0.3398, -0.4342,
          0.4280,  0.5903,  0.0366,  0.8191],
        [-0.2787,  0.0371, -0.5976, -0.8999, -1.0157,  0.2318,  0.6516,  0.2463,
          1.2003, -0.3960,  0.1828, -0.5794,  0.5986, -0.7824, -0.3380, -0.4339,
          0.4212,  0.5941,  0.0284,  0.8406],
        [-0.0779, -0.0405, -0.7482, -1.0199, -1.0872,  0.2272,  0.4997,  0.1253,
          1.3670, -0.2764,  0.3447, -0.5856,  0.8022, -0.7801, -0.3200, -0.4315,
          0.3540,  0.6314, -0.0525,  1.0533],
        [-0.3136,  0.0505, -0.5715, -0.8791, -1.0033,  0.2326,  0.6780,  0.2673,
          1.1713, -0.4167,  0.1547, -0.5784,  0.5632, -0.7828, -0.3411, -0.4343,
          0.4328,  0.5876,  0.0425,  0.8037],
        [-0.6044,  0.1628, -0.3535, -0.7055, -0.8998,  0.2392,  0.8979,  0.4426,
          0.9300, -0.5899, -0.0797, -0.5695,  0.2684, -0.7862, -0.3671, -0.4378,
          0.5301,  0.5335,  0.1596,  0.4956],
        [-0.5483,  0.1411, -0.3956, -0.7390, -0.9198,  0.2379,  0.8554,  0.4088,
          0.9766, -0.5565, -0.0344, -0.5712,  0.3253, -0.7856, -0.3621, -0.4371,
          0.5113,  0.5440,  0.1370,  0.5551],
        [-0.5381,  0.1372, -0.4032, -0.7451, -0.9234,  0.2377,  0.8477,  0.4026,
          0.9850, -0.5504, -0.0262, -0.5715,  0.3357, -0.7854, -0.3612, -0.4370,
          0.5079,  0.5458,  0.1329,  0.5659],
        [ 0.0680, -0.0969, -0.8576, -1.1070, -1.1392,  0.2239,  0.3894,  0.0373,
          1.4881, -0.1896,  0.4623, -0.5900,  0.9501, -0.7784, -0.3070, -0.4298,
          0.3052,  0.6585, -0.1112,  1.2079],
        [-0.6487,  0.1799, -0.3203, -0.6790, -0.8840,  0.2402,  0.9313,  0.4693,
          0.8932, -0.6162, -0.1154, -0.5682,  0.2236, -0.7867, -0.3711, -0.4383,
          0.5449,  0.5253,  0.1774,  0.4488],
        [-0.5116,  0.1270, -0.4231, -0.7609, -0.9328,  0.2371,  0.8277,  0.3867,
          1.0070, -0.5346, -0.0049, -0.5723,  0.3625, -0.7851, -0.3588, -0.4367,
          0.4990,  0.5508,  0.1222,  0.5939]], grad_fn=<AddmmBackward0>)
print(y.shape)
torch.Size([10, 20])

So, this took our 10 one-dimensional inputs to 10 twenty-dimensional outputs. Looks good.

But where are the weights and the bias term? Here they are:

layer.weight
Parameter containing:
tensor([[ 0.9240],
        [-0.3568],
        [-0.6926],
        [-0.5517],
        [-0.3290],
        [-0.0210],
        [-0.6986],
        [-0.5568],
        [ 0.7669],
        [ 0.5500],
        [ 0.7447],
        [-0.0282],
        [ 0.9366],
        [ 0.0107],
        [ 0.0826],
        [ 0.0110],
        [-0.3089],
        [ 0.1718],
        [-0.3721],
        [ 0.9787]], requires_grad=True)
layer.bias
Parameter containing:
tensor([-0.7123,  0.2045, -0.2726, -0.6410, -0.8614,  0.2416,  0.9794,  0.5076,
         0.8404, -0.6541, -0.1667, -0.5662,  0.1591, -0.7875, -0.3768, -0.4391,
         0.5661,  0.5135,  0.2030,  0.3813], requires_grad=True)

You can directly change them if you wish. Notice the requires_grad=True flag. This is because PyTorch knows that these are parameters to be optimized.
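
For example, here is a small sketch of my own showing how you could overwrite the parameters in place. The assignments are wrapped in torch.no_grad() so that autograd does not track the modification, and I use a throwaway layer so as not to disturb the one defined above:

# A throwaway layer just for this demonstration
tmp_layer = nn.Linear(1, 20)
with torch.no_grad():
    tmp_layer.bias.zero_()        # set all the biases to zero
    tmp_layer.weight.fill_(0.1)   # set all the weights to 0.1
print(tmp_layer.bias[:5])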

There is a little bit of flexibility on nn.Linear. For example, you can altogether drop the bias if you wish. For the complete list of possibilities, you should always check the docs.
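
For instance, here is a minimal sketch (my own) of a layer without a bias term:

layer_no_bias = nn.Linear(1, 20, bias=False)
print(layer_no_bias.bias)   # prints None - there is no bias parameter at all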

Now, let’s get to the activation functions. There are a lot already in torch.nn. Here is the sigmoid:

h = nn.Sigmoid()

fig, ax = plt.subplots()
x = torch.linspace(-5, 5, 100)[:, None]
ax.plot(x, h(x))
ax.set_xlabel('$x$')
ax.set_ylabel('$z=h(x)$')
ax.set_title('Activation function: ' + str(h))
sns.despine(trim=True);

You could also implement the activation function by hand. The only restriction is that you should use PyTorch functions instead of NumPy functions. Here is how we would do it for the sigmoid:

# Here is how you could do this by hand:
h_by_hand = lambda x: torch.exp(x) / (1.0 + torch.exp(x))

fig, ax = plt.subplots(dpi=100)
ax.plot(x, h_by_hand(x))
ax.set_xlabel('$x$')
ax.set_ylabel('$z=h(x)$')
ax.set_title('Activation function: Sigmoid (hand version)')
sns.despine(trim=True);

Here are now some of the most commonly used activation functions in torch.nn:

fig, ax = plt.subplots(dpi=100)

for Func in [nn.Sigmoid, nn.ReLU, nn.Tanh]:
    h = Func()
    ax.plot(x, h(x), label=str(h))
    
ax.set_xlabel('$x$')
ax.set_ylabel('$z=h(x)$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

Now that we have a linear layer and an activation function, here is how we can combine them to make a function that takes us from the input to the internal neurons:

h = nn.Sigmoid()
z_func = lambda x: h(layer(x))

This is pretty much it. And that’s now a function:

z_func(x)
tensor([[0.0048, 0.8796, 0.9605,  ..., 0.4145, 0.8873, 0.0109],
        [0.0053, 0.8757, 0.9577,  ..., 0.4187, 0.8835, 0.0120],
        [0.0058, 0.8717, 0.9548,  ..., 0.4229, 0.8796, 0.0132],
        ...,
        [0.9764, 0.1813, 0.0267,  ..., 0.7921, 0.1705, 0.9938],
        [0.9784, 0.1760, 0.0249,  ..., 0.7950, 0.1652, 0.9944],
        [0.9803, 0.1709, 0.0233,  ..., 0.7978, 0.1601, 0.9949]],
       grad_fn=<SigmoidBackward0>)

Now, for regression, we would like to bring this back to a scalar output. To do this, we need to add one more linear layer that takes the 20 internal neurons back to one dimension.

final_layer = nn.Linear(20, 1)
f = lambda x: final_layer(z_func(x))
print(f(x).shape)
torch.Size([100, 1])

Instead of doing this manually, we can use the class nn.Sequential of PyTorch:

f = nn.Sequential(layer, nn.Sigmoid(), final_layer)

This is recommended because nn.Sequential adds some additional functionality, which I will show you in a while. You can evaluate this as a function, and you can also plot it. But to plot it, you have to turn the output into a proper numpy array, because matplotlib cannot handle PyTorch tensors that carry gradient information. Here is what you need to do:

y = f(x).detach().numpy() # detach() disconnects the result from the computational graph
                          # numpy() turns it into a proper numpy array
print(type(y))
print(y.shape)
<class 'numpy.ndarray'>
(100, 1)
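
An equivalent approach, shown here as an optional sketch of my own, is to evaluate the network inside a torch.no_grad() context so that no gradient information is recorded in the first place:

with torch.no_grad():
    y_alt = f(x).numpy()
print(y_alt.shape)   # (100, 1)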

And here is what it looks like (remember the weights are random):

fig, ax = plt.subplots(dpi=100)
ax.plot(x, f(x).detach().numpy())
ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')
sns.despine(trim=True);

The class nn.Sequential is convenient because it allows us to build deep networks quickly. Here is a network with five linear layers: it starts from a one-dimensional input, passes through four hidden layers of 20 neurons each, and ends at a single output:

f = nn.Sequential(nn.Linear(1, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 1))

Where are the parameters in an object created in this way? Here they are:

for theta in f.named_parameters():
    print(theta)
('0.weight', Parameter containing:
tensor([[ 0.0484],
        [ 0.3192],
        [ 0.4466],
        [ 0.1472],
        [ 0.4608],
        [-0.9621],
        [-0.1062],
        [-0.6968],
        [-0.8970],
        [ 0.5296],
        [ 0.8169],
        [-0.8385],
        [-0.3517],
        [ 0.0616],
        [ 0.6742],
        [-0.1458],
        [ 0.1257],
        [ 0.7280],
        [-0.2660],
        [-0.9855]], requires_grad=True))
('0.bias', Parameter containing:
tensor([ 0.2031,  0.9820, -0.0771,  0.3477,  0.9455, -0.4372,  0.8274, -0.3303,
         0.3185, -0.5828,  0.0613,  0.4423,  0.4911, -0.7005, -0.3472, -0.0119,
         0.8369,  0.8621,  0.5197,  0.3878], requires_grad=True))
('2.weight', Parameter containing:
tensor([[ 0.0729,  0.2085,  0.0879, -0.1473,  0.0428,  0.1368, -0.1934, -0.2158,
         -0.0307,  0.1823,  0.0022,  0.0410, -0.1349,  0.0262,  0.1258,  0.1117,
         -0.2200,  0.1948,  0.0226, -0.2147],
        [ 0.1955, -0.2236, -0.1189, -0.0993, -0.0362, -0.0485, -0.1715, -0.0989,
         -0.1529,  0.1377,  0.0143, -0.1529, -0.2085,  0.1617,  0.2103, -0.0337,
          0.1109,  0.1444, -0.0600,  0.0806],
        [-0.1332, -0.0025,  0.0540,  0.2167, -0.1783, -0.1116, -0.0446, -0.0643,
          0.0948, -0.1519, -0.1633,  0.2055, -0.0744,  0.0604, -0.0082,  0.1154,
         -0.2187, -0.1562, -0.1732, -0.1028],
        [-0.1966,  0.2118,  0.0567, -0.1658,  0.2168,  0.0982,  0.0548, -0.0075,
          0.2153, -0.1317, -0.2100, -0.1388, -0.0663,  0.1491,  0.1637, -0.1604,
         -0.1852, -0.1507,  0.1317,  0.0316],
        [ 0.1397, -0.2057,  0.2091, -0.0141,  0.1083,  0.1968, -0.0644,  0.1116,
         -0.0638, -0.0021, -0.1184, -0.0147,  0.0575, -0.1579, -0.0361, -0.1390,
          0.0822,  0.1717,  0.0395,  0.0793],
        [-0.0322,  0.1296,  0.0215, -0.2202,  0.0050,  0.0728,  0.0574, -0.0533,
         -0.0621,  0.1868,  0.1786,  0.1983, -0.0841, -0.1678, -0.2140, -0.0893,
         -0.1080,  0.0657, -0.0537,  0.1217],
        [ 0.2153, -0.1819, -0.0525, -0.0272, -0.1185, -0.1163,  0.1186, -0.0400,
         -0.0571, -0.2122, -0.0112, -0.2176,  0.0431, -0.0485,  0.0048, -0.0935,
          0.0837, -0.0564, -0.1152,  0.0085],
        [-0.0210,  0.0188, -0.0090,  0.2140, -0.1194, -0.1126, -0.1155, -0.0750,
          0.0052,  0.0242, -0.0416, -0.1218,  0.1350,  0.1844, -0.1991,  0.0915,
          0.1655, -0.1352,  0.0915, -0.2045],
        [ 0.0451, -0.1569,  0.2222,  0.0147,  0.1115,  0.1903, -0.1513, -0.1080,
         -0.1091,  0.1162,  0.2056,  0.1272,  0.0236, -0.1372, -0.1788,  0.0420,
          0.0085, -0.1029,  0.2101,  0.0296],
        [-0.1074, -0.0122,  0.1488, -0.0633,  0.1200,  0.0539, -0.0339,  0.0455,
         -0.1911,  0.0233,  0.0342, -0.1516, -0.0129, -0.0110, -0.0227,  0.0164,
         -0.1464, -0.1182,  0.0850, -0.1543],
        [ 0.1721,  0.0521, -0.0959,  0.2148,  0.1781, -0.1358, -0.0252,  0.0073,
          0.0310,  0.1185, -0.0191, -0.2151, -0.1669,  0.0756,  0.0041, -0.0874,
          0.1732, -0.0008, -0.0139,  0.1286],
        [-0.0139, -0.1741,  0.2200,  0.0417,  0.1606,  0.0622,  0.0801,  0.1016,
          0.1711, -0.0440,  0.1290, -0.2129,  0.0922,  0.0277, -0.1981,  0.1690,
         -0.2041,  0.1755, -0.0674, -0.0322],
        [ 0.1981,  0.0271, -0.1902,  0.2230,  0.1563,  0.1889, -0.1960,  0.1767,
         -0.0579,  0.1953,  0.0262, -0.0292, -0.1975,  0.1894,  0.1068,  0.0867,
         -0.0051,  0.2155,  0.1528, -0.1857],
        [-0.0648,  0.1687,  0.0356,  0.1224, -0.0177,  0.0905, -0.2100, -0.0044,
         -0.1496,  0.0525,  0.1794, -0.0327,  0.0216,  0.1228,  0.0061,  0.2168,
         -0.0341, -0.2083,  0.1157, -0.0930],
        [-0.2075,  0.1030, -0.0514,  0.0880,  0.1010, -0.1759, -0.1498, -0.1129,
          0.0180,  0.1378,  0.0105,  0.1078,  0.0681, -0.1760, -0.1415,  0.0484,
          0.0499, -0.1106,  0.1317, -0.1294],
        [ 0.1977, -0.1893, -0.0194, -0.2017,  0.0381,  0.0711,  0.0178,  0.1812,
          0.1455,  0.1372, -0.1571,  0.0849, -0.1718,  0.1578,  0.0996,  0.1309,
         -0.0346,  0.1829,  0.0568,  0.0604],
        [ 0.0339, -0.1328,  0.1094,  0.1203,  0.1570, -0.0141,  0.0910,  0.0785,
         -0.0202, -0.0008, -0.0008,  0.0346,  0.2155, -0.0255, -0.1858, -0.0307,
          0.0656,  0.1417,  0.1257,  0.1870],
        [-0.0530, -0.0323, -0.1037, -0.0745,  0.0317, -0.2032, -0.1215, -0.2081,
         -0.0385, -0.2078, -0.1680,  0.0727, -0.1011,  0.0335,  0.0181, -0.0643,
          0.0833,  0.2233,  0.0215, -0.0443],
        [-0.0093,  0.1277,  0.1048, -0.0297,  0.0009, -0.0070,  0.1418, -0.0787,
         -0.1937, -0.1139, -0.0751, -0.0561,  0.0186,  0.0934, -0.1859,  0.0072,
          0.1128, -0.1571, -0.1510,  0.0534],
        [ 0.1674, -0.1954,  0.0122,  0.2105,  0.1232, -0.1742,  0.1164,  0.0156,
          0.0462, -0.2175, -0.1824, -0.0017, -0.0020,  0.0513, -0.0300,  0.2232,
          0.1212,  0.1643, -0.2084, -0.1650]], requires_grad=True))
('2.bias', Parameter containing:
tensor([ 0.1011,  0.0534,  0.0188, -0.0629, -0.0563, -0.2198, -0.1080, -0.1618,
        -0.0479, -0.0827, -0.2200,  0.0076,  0.0861, -0.1813,  0.1124, -0.2168,
         0.1293,  0.0408,  0.1218, -0.1780], requires_grad=True))
('4.weight', Parameter containing:
tensor([[ 0.2168,  0.1058,  0.1113, -0.0856,  0.1531,  0.2131, -0.0488, -0.1595,
         -0.1884,  0.1635, -0.2038, -0.1754, -0.1815,  0.0340, -0.0703, -0.0052,
         -0.0549, -0.0532, -0.2097,  0.2085],
        [ 0.1783, -0.1267,  0.1820,  0.0882,  0.0840, -0.0558, -0.2171,  0.1387,
          0.1473,  0.0148, -0.1571,  0.0083, -0.1804, -0.0328,  0.1599, -0.1415,
         -0.1459, -0.0491, -0.0492,  0.2196],
        [ 0.1548,  0.1735, -0.1939, -0.1407,  0.1727, -0.1773,  0.0615,  0.0842,
         -0.1108,  0.1385, -0.2028,  0.1521,  0.1359,  0.0277,  0.0484,  0.1311,
          0.1418,  0.0914,  0.1552,  0.0385],
        [-0.1698, -0.1857, -0.0439,  0.0447, -0.1425, -0.0530, -0.2051, -0.1679,
         -0.0963,  0.1071,  0.1260, -0.0804,  0.0855, -0.2106, -0.0231,  0.0602,
         -0.0348, -0.0282, -0.1310, -0.0342],
        [-0.0323, -0.0734, -0.1683,  0.0056,  0.0954, -0.0249,  0.0844,  0.1857,
          0.0551, -0.2212,  0.0492, -0.1537, -0.1897, -0.1889, -0.1463, -0.1980,
         -0.1313,  0.0137,  0.2016, -0.0341],
        [ 0.1440, -0.1978,  0.2004, -0.1299,  0.0317,  0.1874,  0.0385,  0.0489,
          0.0032,  0.1130,  0.1603,  0.0961,  0.1065, -0.0143, -0.1452, -0.1543,
         -0.0260, -0.0387,  0.2146,  0.1821],
        [ 0.1149,  0.1424, -0.0709,  0.1860,  0.0498, -0.0034, -0.0324,  0.0295,
         -0.1335, -0.0724, -0.1426, -0.1719,  0.1704, -0.0705, -0.1682,  0.0889,
          0.2156, -0.0816,  0.1831, -0.1997],
        [-0.0247,  0.1312, -0.0484, -0.1041, -0.1385,  0.1052,  0.1389,  0.1051,
          0.0055,  0.0156, -0.0318,  0.1772, -0.1151, -0.1396, -0.0369, -0.1738,
         -0.1262,  0.2178, -0.1437, -0.0182],
        [-0.1485, -0.0709,  0.0480, -0.1818, -0.0766, -0.1218,  0.0302, -0.0546,
          0.0637, -0.1609,  0.1307,  0.2050, -0.2034,  0.1834, -0.0601,  0.1659,
          0.0116,  0.1889,  0.1620, -0.0616],
        [-0.0099, -0.1658,  0.0772,  0.1764, -0.1565,  0.1971, -0.1884, -0.0053,
         -0.0439, -0.2075, -0.0552, -0.1146,  0.0960,  0.0615,  0.1276,  0.1339,
          0.1594,  0.0489, -0.0537,  0.1266],
        [-0.1260,  0.1175,  0.0539,  0.0704,  0.2168, -0.0022,  0.0238, -0.1811,
         -0.0576, -0.1759,  0.1492,  0.0113, -0.2094,  0.0811, -0.1800, -0.1997,
         -0.2092,  0.1179,  0.0448, -0.0090],
        [-0.0108,  0.0882, -0.0931,  0.1236,  0.1835,  0.1212, -0.0867,  0.0414,
          0.1074, -0.0537, -0.2161, -0.1565, -0.0789,  0.0626, -0.2151,  0.0107,
          0.0231, -0.0494, -0.1288,  0.0318],
        [-0.0073, -0.0250, -0.0337,  0.0270, -0.0224, -0.0710, -0.0525,  0.0471,
          0.1870, -0.0783, -0.1960,  0.1439,  0.0387,  0.1150,  0.1488, -0.2132,
          0.2117,  0.2188, -0.1466,  0.1129],
        [ 0.1500, -0.2196, -0.0411,  0.0596,  0.1498, -0.0782,  0.0316,  0.0469,
          0.1482,  0.0086,  0.1905,  0.0737, -0.0414,  0.1391,  0.0831,  0.0878,
          0.1094, -0.0879, -0.0683,  0.0146],
        [-0.1217, -0.0711,  0.1981, -0.0609, -0.0682, -0.0268,  0.0975, -0.1520,
          0.1033, -0.1857, -0.0104,  0.1837,  0.0239,  0.1645, -0.0965,  0.1928,
         -0.1422,  0.1281, -0.0553,  0.0251],
        [ 0.2196,  0.1968, -0.1546, -0.0326, -0.0577, -0.0565,  0.1677,  0.0694,
         -0.1829,  0.0753,  0.1771,  0.2086,  0.0657, -0.0087, -0.0402,  0.0753,
         -0.2175,  0.0311, -0.1636,  0.1801],
        [ 0.1866, -0.0396, -0.1593, -0.1505,  0.0371, -0.0736, -0.1235,  0.0091,
          0.0699,  0.1767, -0.1768, -0.2235, -0.0905, -0.2015, -0.1572, -0.0402,
          0.0975, -0.0163,  0.1133, -0.2161],
        [-0.2100, -0.2066,  0.1936, -0.1083,  0.2157, -0.0087,  0.0289,  0.0638,
         -0.1521, -0.1291, -0.1761, -0.1556, -0.0517, -0.1929, -0.0508, -0.1660,
          0.0814,  0.1509,  0.0808, -0.0826],
        [-0.0953, -0.1946,  0.1081,  0.1085,  0.0282,  0.0600, -0.1314,  0.0681,
         -0.1188,  0.1199,  0.0491,  0.1090,  0.1750,  0.0262,  0.0222, -0.0550,
         -0.0097, -0.1157,  0.0360, -0.2159],
        [ 0.0519,  0.1116, -0.1475,  0.1119,  0.0098,  0.1143,  0.1333, -0.0576,
          0.0921, -0.0159,  0.1246, -0.0136, -0.0995, -0.0612,  0.0085, -0.0890,
          0.1807,  0.0507, -0.1723, -0.1441]], requires_grad=True))
('4.bias', Parameter containing:
tensor([-0.1404,  0.1178, -0.1875, -0.2141,  0.1402, -0.0507,  0.1560, -0.1140,
         0.1034,  0.1515, -0.1774,  0.0623,  0.1056,  0.0796, -0.2172, -0.2099,
        -0.0319, -0.1213,  0.0187, -0.0350], requires_grad=True))
('6.weight', Parameter containing:
tensor([[ 1.9458e-01,  1.6212e-01,  1.8001e-01,  7.7188e-02,  1.1978e-01,
         -6.7532e-02, -1.5480e-01, -1.5313e-01, -1.1726e-02, -8.1356e-02,
         -2.1323e-01, -6.4835e-02, -4.5629e-02, -9.1450e-02,  1.1076e-01,
          2.0042e-01,  7.3945e-02, -2.4665e-02, -6.9011e-03, -1.7488e-01],
        [-1.8378e-01, -1.3897e-01, -1.0096e-01, -1.1754e-01,  1.8978e-01,
         -5.4548e-02,  1.1480e-01, -2.0865e-01,  1.3137e-01, -1.2123e-01,
          2.1111e-01,  5.5403e-02,  7.1559e-02,  5.3804e-02, -1.2043e-01,
          1.3836e-01, -1.3695e-01, -1.2076e-01,  1.1332e-02, -1.4696e-04],
        [ 1.7710e-01, -1.8359e-01, -1.5497e-01,  1.0901e-01, -1.2192e-01,
          1.2498e-01,  5.2726e-02,  1.5260e-01, -1.6826e-01,  1.8414e-01,
          1.2520e-01, -5.3535e-02, -1.6200e-01, -2.1672e-01, -1.6300e-01,
         -2.1963e-01,  1.3816e-01,  3.2575e-02, -1.0761e-01, -8.6389e-02],
        [-4.3751e-02,  9.6279e-02, -1.5546e-01, -4.1122e-02, -1.4351e-01,
          2.0826e-01, -1.1365e-01,  4.7091e-02,  1.1147e-01, -1.2786e-01,
          1.8056e-01,  1.6265e-01, -1.5222e-01,  2.0480e-01,  2.2271e-01,
          1.0582e-02, -1.8720e-01,  4.1901e-02,  1.7985e-01,  2.8824e-02],
        [-1.7268e-01, -4.3553e-02, -5.5200e-02,  3.6380e-02,  4.6141e-02,
         -6.8284e-02, -3.6277e-02,  7.1213e-02,  1.0275e-01, -5.2510e-02,
          1.0958e-01,  1.5286e-01,  1.4354e-01,  1.9409e-01, -1.5418e-01,
         -1.8291e-02, -2.1578e-01,  7.1980e-02, -2.0929e-01, -1.6309e-01],
        [ 2.9013e-02,  1.1915e-01,  1.9710e-01, -1.3014e-01,  2.8487e-02,
          1.4981e-01,  1.5059e-01, -1.3037e-01, -1.7240e-01,  1.6099e-01,
          1.1506e-01,  3.2530e-02,  5.6785e-02,  6.9113e-02,  1.5913e-01,
          2.2005e-01,  1.0288e-01, -1.7125e-01,  1.2819e-01, -1.1275e-01],
        [-1.5410e-02,  5.2581e-02,  8.6954e-02, -2.0033e-02,  1.7265e-01,
         -1.1073e-01, -2.1493e-01,  1.3047e-01, -1.5797e-01, -1.0342e-01,
         -1.3255e-03,  1.5458e-01, -4.7594e-02,  1.9549e-01,  2.1283e-01,
         -2.8149e-02,  5.7429e-02, -1.0258e-01, -3.5394e-02,  2.1455e-01],
        [-1.7193e-01, -1.6215e-01, -2.0907e-01, -7.6150e-02,  1.6068e-01,
          1.9066e-01, -1.8116e-02,  1.4956e-01, -3.2695e-02, -6.8226e-02,
          1.2429e-01,  3.4847e-02,  2.1047e-01, -1.0193e-01,  5.9673e-03,
         -1.8810e-01, -4.8087e-02,  5.6088e-02, -1.4251e-01, -1.7854e-02],
        [ 9.8739e-02,  2.1327e-01,  1.9525e-01, -1.3591e-01,  1.5784e-01,
          1.9260e-01, -6.0648e-02, -2.5052e-02,  6.2662e-02, -5.6414e-02,
          4.2745e-02, -2.0984e-01, -1.2775e-01, -2.1451e-01, -2.0323e-01,
         -1.1796e-01, -2.0200e-01,  5.9941e-02,  1.3689e-01, -7.3884e-02],
        [ 2.2968e-02, -3.0852e-02, -1.9894e-01,  3.9449e-02,  1.9767e-01,
          2.0786e-01, -3.9130e-02,  1.7646e-01, -1.1571e-02, -2.7780e-02,
         -2.1493e-01, -1.4038e-01, -1.8909e-01, -5.8631e-02,  2.0655e-01,
         -2.7861e-02, -6.5135e-02, -1.1892e-01,  1.7054e-01, -4.9492e-02],
        [-5.5913e-02, -2.0918e-01,  1.8868e-01, -6.3896e-02,  7.5822e-02,
         -1.7996e-01,  2.0925e-01,  1.2294e-01, -1.2617e-01, -7.0029e-02,
          3.0836e-02, -2.1165e-01,  1.5326e-01, -7.7593e-02, -3.4627e-02,
          1.3702e-02, -1.7875e-03, -4.9951e-02, -2.1732e-01,  9.3251e-02],
        [-4.1415e-02,  1.8287e-02,  2.2358e-01,  1.0872e-01, -1.0613e-01,
          6.2122e-02, -1.1759e-01, -1.9282e-01, -1.2181e-01,  1.1249e-01,
         -4.2172e-02,  1.3532e-01, -2.0249e-01,  1.4732e-01,  1.1687e-01,
          1.5369e-01, -1.9344e-01, -2.0407e-01, -1.6566e-01,  1.5498e-01],
        [-1.7594e-01, -2.1857e-01, -1.6664e-01, -4.2173e-02,  1.6677e-01,
         -2.4980e-02,  6.4235e-02,  2.7717e-02,  9.6544e-02,  9.0803e-02,
         -5.8438e-02,  1.8706e-02, -1.7470e-01, -1.5674e-01, -1.0294e-01,
          1.0279e-01,  9.5184e-02,  4.9550e-02,  4.9399e-02,  1.0260e-01],
        [ 2.1961e-01,  6.5255e-02, -2.6975e-02,  1.5680e-01,  7.3349e-02,
          3.9918e-02, -1.8323e-01, -1.9901e-01,  4.4924e-02,  7.1232e-02,
          8.2780e-02,  1.3082e-01,  1.8852e-01, -2.1367e-01, -1.7505e-01,
         -4.7380e-02,  3.0144e-02, -1.9956e-01, -3.2121e-03, -1.1083e-02],
        [-1.3952e-01, -1.3027e-01,  2.4351e-02, -1.0156e-01,  1.8306e-01,
         -1.7549e-01,  7.5288e-02,  7.6271e-02,  1.5502e-02, -1.1069e-01,
         -1.6502e-01, -7.0494e-02,  1.5498e-01,  7.7666e-02,  2.1282e-01,
          2.0049e-01,  2.4677e-02, -1.0549e-01,  3.8400e-02,  1.4383e-01],
        [ 2.0162e-01,  5.4577e-02,  1.4672e-01, -9.7167e-02,  1.0296e-01,
         -2.4817e-02, -5.2470e-03, -4.4625e-02, -8.7582e-03,  5.3408e-02,
         -1.9141e-01, -1.2162e-01,  1.3655e-01,  6.7714e-02,  9.6572e-02,
          1.4780e-01,  2.0834e-01, -2.2165e-01,  7.0438e-02,  1.7594e-01],
        [-7.8864e-02, -8.7782e-02, -4.6983e-02, -1.2590e-01,  9.4118e-02,
         -5.9679e-02,  1.9276e-01,  7.8526e-03, -1.8832e-02, -1.7497e-01,
          8.0131e-02, -6.3065e-02,  1.4086e-01, -4.3603e-02,  2.7592e-03,
          2.1384e-01,  1.8953e-01, -6.8577e-03,  4.8073e-02, -1.6674e-01],
        [ 1.8316e-01, -6.1505e-03, -2.2325e-01, -1.3045e-01, -3.0024e-02,
          2.0294e-02, -1.5978e-02,  1.0980e-01, -1.3704e-01,  8.2995e-03,
         -3.5691e-02,  9.7198e-02,  6.1036e-02, -8.4533e-03,  1.7771e-01,
          2.1413e-01,  1.4128e-01,  1.7764e-01, -1.8133e-01, -2.1170e-02],
        [-1.8128e-01, -1.3267e-03, -4.1591e-02, -1.8853e-01,  2.0676e-02,
         -8.6311e-02,  1.4568e-01,  1.9686e-01,  6.3841e-02, -7.3819e-02,
         -5.4492e-02,  8.9654e-02, -6.2565e-02,  4.7180e-02,  1.5577e-01,
         -1.4091e-01,  1.5149e-01,  2.3407e-02, -3.5366e-02,  8.7989e-02],
        [ 2.1238e-01, -3.8303e-02, -1.0352e-01, -4.5026e-02, -2.1707e-01,
         -4.7338e-02,  7.9185e-02,  1.1304e-01, -1.8227e-01, -1.4078e-01,
         -3.5528e-02, -1.2305e-01, -1.1540e-01, -7.8125e-02, -3.8025e-02,
          1.7090e-01, -1.9958e-01, -2.2091e-01, -1.3671e-01, -1.9045e-01]],
       requires_grad=True))
('6.bias', Parameter containing:
tensor([-0.0518,  0.1596,  0.0840,  0.1006,  0.0885,  0.0200, -0.0730, -0.2037,
        -0.0053,  0.0729,  0.0403, -0.2097, -0.1454,  0.0346,  0.0460, -0.1772,
        -0.2021,  0.1281, -0.1245,  0.0110], requires_grad=True))
('8.weight', Parameter containing:
tensor([[-0.0849, -0.0673,  0.0967,  0.1513,  0.1241,  0.1818, -0.1271,  0.1880,
          0.2182, -0.0922,  0.1430,  0.0977,  0.1773,  0.1296, -0.0709,  0.0493,
          0.1249,  0.0607,  0.0597,  0.0976]], requires_grad=True))
('8.bias', Parameter containing:
tensor([-0.1803], requires_grad=True))

And that’s why we love PyTorch: it does all the dirty work for us. Imagine having to keep track of all these parameters by hand.
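
If you are curious how much bookkeeping that is, here is a quick sketch (not in the original notebook) that counts the parameters of the network above:

num_params = sum(p.numel() for p in f.parameters())
print(num_params)   # 1321 = (1*20 + 20) + 3*(20*20 + 20) + (20*1 + 1)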

For those who want to know what is happening inside nn.Linear, note that it is a special case of a PyTorch neural network module; see nn.Module. You would directly inherit from the latter when writing your own class for a non-standard neural network. We will not cover it in detail in this class, but you can find plenty of examples in the PyTorch documentation and tutorials.
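
As a taste of what that looks like, here is a minimal sketch of my own: a class that reproduces the same architecture by inheriting from nn.Module:

class MyDenseNet(nn.Module):
    """A hand-rolled version of the nn.Sequential network above."""

    def __init__(self, num_hidden=20):
        super().__init__()
        self.layer1 = nn.Linear(1, num_hidden)
        self.layer2 = nn.Linear(num_hidden, num_hidden)
        self.layer3 = nn.Linear(num_hidden, num_hidden)
        self.layer4 = nn.Linear(num_hidden, num_hidden)
        self.out = nn.Linear(num_hidden, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        z = self.act(self.layer1(x))
        z = self.act(self.layer2(z))
        z = self.act(self.layer3(z))
        z = self.act(self.layer4(z))
        return self.out(z)

f_custom = MyDenseNet()
print(f_custom(torch.rand(5, 1)).shape)   # torch.Size([5, 1])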

Making a loss function#

Let’s now make the loss function that we want to minimize. It needs to be a PyTorch function as well. For regression problems, we can think of the loss as a function of the model predictions and the observed data; it depends on the parameters only through the predictions. In this form, the mean square error (MSE) loss is:

\[ L_{\text{MSE}}(\theta) = L_{\text{MSE}}(y_{1:n}, f(x_{1:n};\theta)) = \frac{1}{n}\sum_{i=1}^n\left[y_i-f(x_i;\theta)\right]^2, \]

where \(x_{1:n}\) are the observed inputs (features), \(y_{1:n}\) are the observed outputs (targets), and \(f(x_{1:n};\theta)\) contains the model predictions on the observed inputs.

You can implement the MSE loss like this:

mse_loss_ours = lambda y, f: torch.mean((y - f) ** 2)

Or we can use built-in PyTorch functionality:

mse_loss = nn.MSELoss()

Let’s evaluate it for some random data:

# The number of fake observations
n = 20
# Some fake observed features
x_fake = torch.rand(n, 1)
# Some fake observed targets
y_fake = 4 * x_fake ** 2 - 5 * x_fake ** 3 + 0.1 * torch.rand(n, 1)
fig, ax = plt.subplots(dpi=100)
ax.plot(x_fake, y_fake, 'x')
sns.despine(trim=True);

Here is how to calculate the loss (for the random parameters that our net started with):

# Predict with the net:
y_pred = f(x_fake)
# Evaluate the loss
our_loss = mse_loss_ours(y_fake, y_pred)
built_in_loss = mse_loss(y_fake, y_pred)
print(our_loss)
print(built_in_loss)
tensor(0.1995, grad_fn=<MeanBackward0>)
tensor(0.1995, grad_fn=<MseLossBackward0>)

Let’s minimize the MSE loss for these fake data and see what fit we will get. Here is how you do this in PyTorch. Since I don’t have a lot of data, I will use gradient descent - no random subsampling of the data. I will show you how to use stochastic gradient descent in the following example.

# Reinitialize the net:
f = nn.Sequential(nn.Linear(1, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 1))

# Initialize the optimizer - Notice that it needs to know about the 
# parameters it is optimizing
optimizer = torch.optim.SGD(f.parameters(), lr=0.01) # lr is the learning rate
# Some place to hold the training loss for visualizing it later
training_loss = []
# Iterate the optimizer. Let's do 10,000 iterations.
for i in range(10000):
    # This is essential for the optimizer to keep
    # track of the gradients correctly
    # It is using some buffers internally that need to
    # be manually zeroed on each iteration.
    # This is because it doesn't know when you are done with the
    # calculation of the loss
    optimizer.zero_grad()
    # Make predictions
    y_pred = f(x_fake)
    # Evaluate the loss - That's what you are minimizing
    loss = mse_loss(y_fake, y_pred)
    # Evaluate the derivative of the loss with respect to
    # all parameters - It knows how to do it because of
    # PyTorch magick
    loss.backward()
    # And now you are ready to make a step
    optimizer.step()
    # Save the training loss for later visualization
    training_loss.append(loss.item())
    # Print the loss every one thousand iterations:
    if i % 1000 == 0:
        print('it = {0:d}: loss = {1:1.3f}'.format(i, loss.item()))
it = 0: loss = 0.172
it = 1000: loss = 0.120
it = 2000: loss = 0.068
it = 3000: loss = 0.017
it = 4000: loss = 0.008
it = 5000: loss = 0.006
it = 6000: loss = 0.004
it = 7000: loss = 0.003
it = 8000: loss = 0.002
it = 9000: loss = 0.002

Let’s plot the predictions of this model on the fake data:

fig, ax = plt.subplots(dpi=100)
ax.plot(x_fake, y_fake, 'x');
xx = torch.linspace(x_fake.min(), x_fake.max(), 100)[:, None]
yy = f(xx).detach().numpy()
ax.plot(xx, yy)
sns.despine(trim=True);

This may or may not work, depending on the random seed you start with. If you run it a few times, you will see that it can get stuck at a local minimum. The algorithm works much better if we do stochastic optimization, i.e., subsample the data at each iteration. Here is how the loss changes with each iteration:

fig, ax = plt.subplots(dpi=100)
ax.plot(training_loss)
ax.set_xlabel('Iteration')
ax.set_ylabel('Training loss')
sns.despine(trim=True);

The problem is the plateau we have at the beginning of the optimization.
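
By the way, since everything above depends on the random initialization and the random subsampling, you can make your experiments reproducible by fixing the seeds before building and training the network. This is a small optional sketch of my own; the original code does not do it:

torch.manual_seed(12345)   # makes the random weight initialization reproducible
np.random.seed(12345)      # makes the NumPy-based subsampling reproducible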

Let’s redo this with stochastic optimization. For stochastic optimization, we need to subsample the data during each iteration. We can either do this manually or use the built-in PyTorch functionality (see the DataLoader-based sketch after the comparison plots below). First, let’s do it manually.

# Pick a subsampling batch size
m = 5

# Reinitialize the net:
f = nn.Sequential(nn.Linear(1, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 1))

# Reinitialize the optimizer
optimizer = torch.optim.SGD(f.parameters(), lr=0.01)
# Keep track of the training loss
training_loss_sgd = []
# Iterate the optimizer. Let's do 10,000 iterations.
for i in range(10000):
    # Zero out the gradient buffers
    optimizer.zero_grad()
    # Sample m observation indices at random
    idx = np.random.randint(0, n, m)
    # Here is the subsample of the data
    x_batch = x_fake[idx]
    y_batch = y_fake[idx]
    # Make predictions
    y_pred = f(x_batch)
    # Evaluate the loss - That's what you are minimizing
    loss = mse_loss(y_batch, y_pred)
    # Evaluate the derivative of the loss with respect to
    # all parameters - It knows how to do it because of
    # PyTorch magick
    loss.backward()
    # And now you are ready to make a step
    optimizer.step()
    # Keep track of the training loss
    training_loss_sgd.append(loss.item())
    # Print the loss every one thousand iterations:
    if i % 1000 == 0:
        print('it = {0:d}: loss = {1:1.2e}'.format(i, loss.item()))
it = 0: loss = 3.62e-01
it = 1000: loss = 4.48e-02
it = 2000: loss = 4.90e-02
it = 3000: loss = 1.44e-02
it = 4000: loss = 4.04e-03
it = 5000: loss = 5.22e-03
it = 6000: loss = 7.22e-04
it = 7000: loss = 2.67e-03
it = 8000: loss = 2.64e-04
it = 9000: loss = 1.93e-04
fig, ax = plt.subplots(dpi=100)
ax.plot(x_fake, y_fake, 'x');
xx = torch.linspace(x_fake.min(), x_fake.max(), 100)[:, None]
yy = f(xx).detach().numpy()
ax.plot(xx, yy)
sns.despine(trim=True);

This fit does look a little bit better. Let’s now compare the training loss of stochastic gradient descent to the previous one:

fig, ax = plt.subplots(dpi=100)
ax.plot(training_loss, label='Gradient descent')
ax.plot(training_loss_sgd, label='Stochastic gradient descent')
ax.set_xlabel('Iteration')
ax.set_ylabel('Training loss')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

This wiggly nature of stochastic gradient descent allows it to escape bad local minima.
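
Above, we subsampled the data by hand with np.random.randint. PyTorch also has built-in functionality for this through torch.utils.data.TensorDataset and DataLoader. Here is a minimal sketch of my own of what the training loop would look like with them, reusing f, mse_loss, x_fake, and y_fake from above:

from torch.utils.data import TensorDataset, DataLoader

# Wrap the data in a dataset and let the DataLoader do the shuffling and batching
dataset = TensorDataset(x_fake, y_fake)
loader = DataLoader(dataset, batch_size=5, shuffle=True)

optimizer = torch.optim.SGD(f.parameters(), lr=0.01)
for epoch in range(100):
    # Each epoch visits every observation exactly once, in random order
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = mse_loss(y_batch, f(x_batch))
        loss.backward()
        optimizer.step()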

Questions#

  • Rerun the stochastic gradient descent with one sample per iteration (\(m=1\)). Does it converge? Do you need fewer or more iterations? Is it more or less wiggly?

  • Rerun the stochastic gradient descent with ten samples per iteration. How does it perform now?

Example - Motorcycle Data#

Let’s now use the motorcycle data to do regression with DNNs. This will help us demonstrate some best practices, specifically:

  • Standardizing the data

  • Splitting in training and test subsets

First, start by loading the dataset:

url = "https://github.com/PredictiveScienceLab/data-analytics-se/raw/master/lecturebook/data/motor.dat"
download(url)
# Load the data
data = np.loadtxt('motor.dat')
x = data[:, 0][:, None]
y = data[:, 1][:, None]
# Split into training and test datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# Visualize them
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x_train, y_train, 'x', markeredgewidth=2, label='Training data')
ax.plot(x_test, y_test, 'o', markeredgewidth=2, label='Test data')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);
# Turn the data into torch tensors:
x_train = torch.tensor(x_train, dtype=torch.float)
y_train = torch.tensor(y_train, dtype=torch.float)
x_test = torch.tensor(x_test, dtype=torch.float)
y_test = torch.tensor(y_test, dtype=torch.float)

Please note that the dtype=torch.float specification is needed here; without it, the code will not work. The problem is that x_train, etc., are NumPy arrays, and NumPy arrays use 64-bit floating point numbers by default, whereas PyTorch uses 32-bit floating point numbers by default. We need, at some point, to make the two compatible.
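
Here is a tiny sketch (my own) illustrating the mismatch:

a = np.zeros(3)
print(torch.tensor(a).dtype)                       # torch.float64 (inherited from NumPy)
print(torch.tensor(a, dtype=torch.float).dtype)    # torch.float32 (what the layers expect)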

Now, we are ready to train the network. Let’s give it a shot. We will use the same architecture as before. The only difference is that I will print the test loss instead of the training loss.

# The number of training samples
n = x_train.shape[0]

# Pick a subsampling batch size
m = 5

# Reinitialize the net:
f = nn.Sequential(nn.Linear(1, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 1))

# Reinitialize the optimizer
optimizer = torch.optim.SGD(f.parameters(), lr=0.01)
# Keep track of the training loss and the test loss
training_loss = []
test_loss = []
# Iterate the optimizer. Let's do 10,000 iterations.
for i in range(10000):
    # Zero out the gradient buffers
    optimizer.zero_grad()
    # Sample m observation indices at random
    idx = np.random.randint(0, n, m)
    # Here is the subsample of the data
    x_batch = x_train[idx]
    y_batch = y_train[idx]
    # Make predictions
    y_pred = f(x_batch)
    # Evaluate the loss - That's what you are minimizing
    loss = mse_loss(y_batch, y_pred)
    training_loss.append(loss.item())
    # Evaluate the derivative of the loss with respect to
    # all parameters - It knows how to do it because of
    # PyTorch magick
    loss.backward()
    # And now you are ready to make a step
    optimizer.step()
    # Evaluate the test loss
    y_pred_test = f(x_test)
    ts_loss = mse_loss(y_test, y_pred_test)
    test_loss.append(ts_loss.item())
    # Print the loss every one thousand iterations:
    if i % 1000 == 0:
        print('it = {0:d}: loss = {1:1.2e}'.format(i, ts_loss.item()))
it = 0: loss = 2.11e+03
it = 1000: loss = nan
it = 2000: loss = nan
it = 3000: loss = nan
it = 4000: loss = nan
it = 5000: loss = nan
it = 6000: loss = nan
it = 7000: loss = nan
it = 8000: loss = nan
it = 9000: loss = nan

The above code may not work at all, giving you nan’s. Or it may work and get you nowhere. The problem is the scale of the inputs and the outputs, and the assumptions made about them when we initialize the weights of the net and when we pick the learning rate of the optimization algorithm. The easiest way to overcome this problem is to standardize the data: subtract the empirical mean and divide by the empirical standard deviation, for both the inputs and the outputs. By standardizing the data, we make the default choices (for weight initialization and stochastic gradient descent) valid again. Standardization is such a common process that it is already implemented in sklearn.preprocessing.StandardScaler. Here is how it works:

from sklearn.preprocessing import StandardScaler

feature_scaler = StandardScaler().fit(x)
target_scaler = StandardScaler().fit(y)

The feature_scaler.transform() method implements the function:

\[ x \rightarrow \frac{x-\mu}{\sigma}, \]

where \(\mu\) and \(\sigma\) are the features’ empirical mean and standard deviation. Here they are:

# The mean:
feature_scaler.mean_
array([25.9])
# The standard deviation:
feature_scaler.scale_
array([13.88042399])

And here is how the scalers work:

x_scaled = feature_scaler.transform(x)
y_scaled = target_scaler.transform(y)
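
As a quick sanity check (a small sketch of my own), the transform is nothing more than the elementwise standardization formula above:

print(np.allclose(x_scaled, (x - feature_scaler.mean_) / feature_scaler.scale_))   # True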

The data are now scaled, as the following plot shows:

fig, ax = plt.subplots(dpi=100)
ax.plot(x_scaled, y_scaled, 'x')
ax.set_xlabel('$x$ (scaled)')
ax.set_ylabel('$y$ (scaled)')
sns.despine(trim=True);

We will train the net using x_scaled and y_scaled. We can always go back to the original scales at the end. Let’s see if it works.

# Split in training and test
x_s_train, x_s_test, y_s_train, y_s_test = train_test_split(x_scaled, y_scaled,
                                                            test_size=0.3)

# Turn the data into torch tensors:
x_s_train = torch.tensor(x_s_train, dtype=torch.float)
y_s_train = torch.tensor(y_s_train, dtype=torch.float)
x_s_test = torch.tensor(x_s_test, dtype=torch.float)
y_s_test = torch.tensor(y_s_test, dtype=torch.float)

# The number of training samples
n = x_s_train.shape[0]

# Pick a subsampling batch size
m = 5

# Reinitialize the net:
f = nn.Sequential(nn.Linear(1, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 20),
                  nn.ReLU(),
                  nn.Linear(20, 1))

# Reinitialize the optimizer
optimizer = torch.optim.SGD(f.parameters(), lr=0.01)
# Keep track of the training loss and the test loss
training_loss = []
test_loss = []
# Iterate the optimizer. Let's do 10,000 iterations.
for i in range(10000):
    # Zero out the gradient buffers
    optimizer.zero_grad()
    # Sample m observation indices at random
    idx = np.random.randint(0, n, m)
    # Here is the subsample of the data
    x_batch = x_s_train[idx]
    y_batch = y_s_train[idx]
    # Make predictions
    y_pred = f(x_batch)
    # Evaluate the loss - That's what you are minimizing
    loss = mse_loss(y_batch, y_pred)
    training_loss.append(loss.item())
    # Evaluate the derivative of the loss with respect to
    # all parameters - It knows how to do it because of
    # PyTorch magick
    loss.backward()
    # And now you are ready to make a step
    optimizer.step()
    # Evaluate the test loss
    y_pred_test = f(x_s_test)
    ts_loss = mse_loss(y_s_test, y_pred_test)
    test_loss.append(ts_loss.item())
    # Print the loss every one thousand iterations:
    if i % 1000 == 0:
        print('it = {0:d}: loss = {1:1.2e}'.format(i, ts_loss.item()))
it = 0: loss = 1.03e+00
it = 1000: loss = 1.65e-01
it = 2000: loss = 2.17e-01
it = 3000: loss = 1.60e-01
it = 4000: loss = 1.94e-01
it = 5000: loss = 1.56e-01
it = 6000: loss = 1.69e-01
it = 7000: loss = 1.64e-01
it = 8000: loss = 2.15e-01
it = 9000: loss = 1.85e-01

Let’s visualize the fit:

xx_scaled = torch.linspace(x_scaled.min(), x_scaled.max(), 100)[:, None]
yy_scaled = f(xx_scaled).detach().numpy()
fig, ax = plt.subplots(dpi=100)
ax.plot(x_s_train, y_s_train, 'x', label='Training data')
ax.plot(x_s_test, y_s_test, 'o', label='Test data')
ax.plot(xx_scaled, yy_scaled, label='DNN fit')
ax.set_xlabel('$x$ (scaled)')
ax.set_ylabel('$y$ (scaled)')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

Here is the predictions-vs-observations plot on the test data set:

y_pred_test = f(x_s_test).detach().numpy()
fig, ax = plt.subplots(dpi=100)
ax.plot(y_pred_test, y_s_test, '.')
yys = np.linspace(y_s_test.min(), y_s_test.max(), 10)
ax.plot(yys, yys, 'r')
ax.set_xlabel('Predictions')
ax.set_ylabel('Observations')
sns.despine(trim=True);

Also, if you wish, you can scale the predictions back to the original units:

xx = feature_scaler.inverse_transform(xx_scaled)
yy = target_scaler.inverse_transform(yy_scaled)
fig, ax = plt.subplots(dpi=100)
ax.plot(x_train, y_train, 'x', label='Training data')
ax.plot(x_test, y_test, 'o', label='Test data')
ax.plot(xx, yy, label='DNN fit')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

It is instructive to observe how the training and test losses evolve as a function of the optimization iteration:

fig, ax = plt.subplots(dpi=100)
ax.plot(training_loss, label='Training loss')
ax.plot(test_loss, label='Test loss')
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);

The wiggliness is, of course, due to the stochastic nature of the optimization. The training error converges to a minimum as you keep iterating; this is a direct consequence of the Robbins-Monro theorem, and you will eventually reach a local minimum of the training error. However, this is not the case for the test error. The test error will eventually reach a minimum and then start to increase! It will always do this when training networks by just minimizing a loss function: the algorithm starts overfitting the training data and is no longer able to generalize correctly to the test data. There are ways around this. We will learn about the basic ones in the next lecture (weight regularization and early stopping). There are also more advanced ways to avoid overfitting (e.g., dropout, Bayesian neural networks), which we will not cover in this class.
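
To see roughly where the overfitting starts, you can locate the iteration at which the recorded test loss is smallest; here is a small sketch of my own that uses the test_loss list from the training loop above:

best_it = int(np.argmin(test_loss))
print(best_it, test_loss[best_it])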

Questions#

  • Change the activation function from nn.ReLU to nn.Tanh. Are you getting a better fit or a worse fit?

  • Rerun the code above for 100,000 iterations. Does it start to overfit the training data? What happens to the test loss?

  • Rerun the code for 5,000 iterations. What does the prediction look like now? Early stopping would stop at about this point.