Neural Network from Scratch Pt. 2

Published on June 5, 2024 at 11:14 PM

What's up! This is part 2 of my series on creating a neural network from scratch in Python. In this post, I'll formalize how neural networks are used and trained mathematically, and then I'll use the derivations to write my own neural net. Let's dive in!

Pulling the curtain

I explained in my last post what a neural network is, what it looks like, and how data is fed through it. I explained the math behind an individual perceptron, but not the full network, so lets start from there.

To reiterate, the equation for a perceptron is the following:

y = f(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)

Each $w_n$ represents an incoming weight bringing data from a preceding input $x_n$ . Each layer can be represented by a column vector:

X_1 = \left(\begin{array}{cc} x_1\\x_2\end{array}\right), X_2 = \left(\begin{array}{cc} x_3\\x_4\\x_5\end{array}\right), X_3 = \left(\begin{array}{cc} x_6\\x_7\end{array}\right)

First off, its important to note that the equations above and below are referring to the network in the main image of this blog post (just if you wanted to know why I chose such arbitrary layer sizes). Next up, we'll create the weight matrices. To make things easier, lets name each weight based on the two nodes it connects. For example, the weight feeding data from $x_1$ to $x_3$ will be called $w_{13}$ . We need two weight matrices - one between the first and second layer, and one between the second and third. For now, let's focus on the first, $W_1$ . This matrix transforms a 2-dimensional column vector into a 3-dimensional one, so the dimensions will be 3 by 2. Each row of the matrix will contain all the weights from each previous layer node to a single output layer node. The same will apply to $W_2$ , except it will be 2 x 3 because it is transforming a 3-dimensional layer into a 2-dimensional one.

W_1 = \left(\begin{array}{cc} w_{13} & w_{23}\\w_{14} & w_{24}\\w_{15} & w_{25}\end{array}\right), W_2 = \left(\begin{array}{cc} w_{36} & w_{46} & w_{56}\\w_{37} & w_{47} & w_{57}\end{array}\right)

Since the biases are added to the result of the product of $W_n$ and $X_n$ , the dimensions are the same as the layer being outputted.

B_1 = \left(\begin{array}{cc} b_3\\b_4\\b_5\end{array}\right), B_2 = \left(\begin{array}{cc} b_6\\b_7\end{array}\right)

Finally, we take the vector of weighted sums and feed it into an activation function, which essentially squishes or truncates the output of a layer into a desired range. A popular activation function is the sigmoid function, $\sigma(x)$ .

\sigma(x) = \frac{1}{1 + e^{-x}}

Sigmoid curve Notice how high values become close to 1 wile low values are flattened to 0.

Now, put together, this is the vectorized equation for computing a layer in the network:

X_i = \sigma(W_i X_{i-1} + B_i)

I've computed the forward pass from the first layer to the second layer below:

\left(\begin{array}{cc} x_3\\x_4\\x_5\end{array}\right) = \sigma(\left(\begin{array}{cc} w_{13} & w_{23}\\w_{14} & w_{24}\\w_{15} & w_{25}\end{array}\right)\left(\begin{array}{cc} x_1\\x_2\end{array}\right) + \left(\begin{array}{cc} b_3\\b_4\\b_5\end{array}\right))

\left(\begin{array}{cc} x_3\\x_4\\x_5\end{array}\right) = \sigma(\left(\begin{array}{cc} w_{13} x_1 + w_{23} x_2\\w_{14} x_1 + w_{24} x_2\\w_{15} x_1 + w_{25} x_2\end{array}\right) + \left(\begin{array}{cc} b_3\\b_4\\b_5\end{array}\right))

\left(\begin{array}{cc} x_3\\x_4\\x_5\end{array}\right) = \sigma(\left(\begin{array}{cc} w_{13} x_1 + w_{23} x_2 + b_3\\w_{14} x_1 + w_{24} x_2 + b_4\\w_{15} x_1 + w_{25} x_2 + b_5\end{array}\right)

If you look closely, you'll notice that each row of the output matrix is just the scalar formula we computed earlier.

NOTE: One important activation function I didn't discuss above is the softmax function. I've discussed before how in classification tasks, where our network has to assign a data point to a particular class (like picture of a cat = "cat"), its helpful to output a probability distribution, where the peak of the distribution is what the network thinks a data point most likely is. This is achieved with the softmax activation function:

\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}

$z_i$ is the output of the previous layer before an activation function is applied. Each element in the output is exponentiated, and then divided by the sum of all the exponentiated elements. This ensures that all elements sum to 1, creating a probability distribution where the highest elements have the highest probabilities. Example:

\sigma(\left(\begin{array}{cc} 5\\2\\2\end{array}\right)) = \left(\begin{array}{cc} .909\\.045\\.045\end{array}\right)

The full decimals have been truncated for clarity, but the outputs roughly sum to 1. Please note that the softmax function, if used, is always used in the output layer of a network (the last layer).

Interlude

The next section will discuss gradient descent, which is how our network will learn the optimal weights and biases for its task. However, I think it'd be helpful to step back and understand why neural nets are the way they are. What is the purpose of weights? Biases? Activation functions?

Weights can be thought of us as small dials that selectively emphasize certain features in data. Biases allow us to shift outputs to better fit to the data. If we think about curve fitting in regression, we may have the correct function shape for a set of data, but it may be offset from the actual data points. The bias shifts the function output so it produces a better fit to the data. And finally, activation functions introduce nonlinearity into the network, allowing it to make more complex decisions.

Gradient Descent & Backpropagation

The process I outlined above is called forward propagation, and its how a network "reads" and processes information. If you've heard anything about machine learning models, you'll know that usually they're "trained" on the data they read. This is done through backpropagation, or a backward pass through the network.

First, we need an assessment of how "close" the network's guess was to the true answer. This is done using a cost function. With a softmax output, a popular cost function is the crossentropy loss (https://en.wikipedia.org/wiki/Cross-entropy).

L = -\sum_{x} p(x) \log q(x)

p(x) is the true distribution, or the data's label, while q(x) is the predicted distribution outputted by our network. This function assesses how different two probability distributions are, with higher scores indicating a higher difference. As an example, lets once again use the 3-layer network shown at the top of this blog post. Let's say this network is reading in two binary features about an animal and then ouputting whether the animal is a cat or a dog. An output of $\left(\begin{array}{cc} 1\\0\end{array}\right)$ represents a cat, while $\left(\begin{array}{cc} 0\\1\end{array}\right)$ is a dog. If our model reads in data for a cat and outputs $\left(\begin{array}{cc} 0.8\\0.2\end{array}\right)$ , then our loss would be:

-(1 \cdot log(0.8) + 0 \cdot log(0.2)) = 0.223

Now that we have a measure of our network's performance, we need to somehow use this measure to improve its parameters. We accomplish this using an algorithm called gradient descent. Thinking back to calculus, we know that the gradient of a function is the vector-valued derivative with respect to each of its variables. However, an interesting observation about the gradient is that it points in the direction of steepest ascent in a function. The vector field graph below is the gradient for the function $f(x,y) = x^2 + y^2$ . Notice how every arrow points outward in the direction the paraboloid increases the fastest.

Using this information, if we have a function that takes in weights and biases as parameters and outputs a value determining how poorly our network is performing, then finding the negative gradient of the loss with respect to each parameter should give us the direction to "shift" the parameter such that it decreases the loss. This is why the algorithm is called gradient descent - we are continously applying the negative gradient to every weight and bias to improve the model's performance.

However, there's one small problem. The cost function isn't just a function of 2 or 3 parameters - It's a function of usually thousands of weights and biases, each nested in a different layer of the network. To compute the gradients efficiently, we will utilize the chain rule by multiplying the jacobian matrix with respect to each layer and parameter as we travel back through the network.

Now, let's work out the math. Our loss function takes in a vector (the output layer), and outputs a scalar. Thus, we can find the gradient $\frac{\nabla_{L}}{\nabla_{X_3}}$ using

\frac{\nabla_{L}}{\nabla_{X_3}} = \left(\begin{array}{cc} \frac{\partial L}{\partial x_6} & \frac{\partial L}{\partial x_7}\end{array}\right)

The derivative of the crossentropy function with respect to an input $q(x)$ is $-\frac{p(x)}{q(x)}$ . So, if our label vector is $\left(\begin{array}{cc} y_1\\y_2\end{array}\right)$ , our expression becomes:

\frac{\nabla_{L}}{\nabla_{X_3}} = \left(\begin{array}{cc} -\frac{y_1}{x_6} & -\frac{y_2}{x_7}\end{array}\right)

Next, lets find the derivative of the softmax output, $X_3$ , with respect to its input, $Z_3$ . When $i = j$ , the derivative is $x_i(1 - x_i)$ , and when $i \neq j$ , the derivative is $-x_i x_j$ (https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1).

J_s = \left(\begin{array}{cc} \frac{\partial x_6}{\partial z_6} & \frac{\partial x_6}{\partial z_7}\\\frac{\partial x_7}{\partial z_6} & \frac{\partial x_7}{\partial z_7}\end{array}\right) = \left(\begin{array}{cc} x_6(1 - x_6) & -x_6 x_7\\-x_7 x_6 & x_7(1 - x_7)\end{array}\right)

To apply the chain rule, we matrix multiply the 1 by 2 gradient by the 2 by 2 jacobian and get the following product:

\left(\begin{array}{cc} \frac{\partial L}{\partial x_6} & \frac{\partial L}{\partial x_7}\end{array}\right)\left(\begin{array}{cc} \frac{\partial x_6}{\partial z_6} & \frac{\partial x_6}{\partial z_7}\\\frac{\partial x_7}{\partial z_6} & \frac{\partial x_7}{\partial z_7}\end{array}\right) = \left(\begin{array}{cc} \frac{\partial L}{\partial x_6}\frac{\partial x_6}{\partial z_6} + \frac{\partial L}{\partial x_7}\frac{\partial x_7}{\partial z_6} & \frac{\partial L}{\partial x_6}\frac{\partial x_6}{\partial z_7} + \frac{\partial L}{\partial x_7}\frac{\partial x_7}{\partial z_7}\end{array}\right)

For simplicty, I'll reduce this matrix down to $\left(\begin{array}{cc} \frac{\partial L}{\partial z_6} & \frac{\partial L}{\partial z_7}\end{array}\right)$ . Remember that $z_6$ and $z_7$ come from $W_2 X_2 + B_2$ (before the activation function is applied). Now that we have the derivative of the loss with respect to these values, we can finish the chain and find the derivative with respect to the weights.

Since the weight matrix is 2D, we can't necessarily find a "jacobian" for the weights. Instead, let's unwrap the chain rule for each weight and find a matrix multiplication that gives us the correct gradient. For starters, we know that the gradient must match the dimensions of the weight matrix (2 by 3)

\frac{\partial L}{\partial W_2} = \left(\begin{array}{cc} \frac{\partial L}{\partial w_{36}} & \frac{\partial L}{\partial w_{46}} & \frac{\partial L}{\partial w_{56}} \\ \frac{\partial L}{\partial w_{37}} & \frac{\partial L}{\partial w_{47}} & \frac{\partial L}{\partial w_{57}}\end{array}\right) = \left(\begin{array}{cc}\frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial w_{36}} & \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial w_{46}} & \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial w_{56}}\\\frac{\partial L}{\partial z_7}\frac{\partial z_7}{\partial w_{37}} & \frac{\partial L}{\partial z_7}\frac{\partial z_7}{\partial w_{47}} & \frac{\partial L}{\partial z_7}\frac{\partial z_7}{\partial w_{57}}\end{array}\right)

We already computed the partials of the loss with respect to the $z$ values above. Additionally, the derivative of the $z$ 's with respect to their weights is just the node being weighted from the previous layer.

\frac{\partial L}{\partial W_2} = \left(\begin{array}{cc} x_3\frac{\partial L}{\partial z_6} & x_4\frac{\partial L}{\partial z_6} & x_5\frac{\partial L}{\partial z_6}\\x_3\frac{\partial L}{\partial z_7} & x_4\frac{\partial L}{\partial z_7} & x_5\frac{\partial L}{\partial z_7}\end{array}\right)

Perfect! This expression is just a matrix multiplication between the transpose of our intermediate derivatives and the transpose of the previous layer.

\frac{\partial L}{\partial W_2} = \left(\begin{array}{cc} \frac{\partial L}{\partial z_6} & \frac{\partial L}{\partial z_7}\end{array}\right)^{\top}\left(\begin{array}{cc} x_3\\x_4\\x_5\end{array}\right)^{\top}

And this right here is the gradient update for $W_2$ . To apply it, we subtract this matrix multiplied by a very small constant from $W_2$ . This constant is called a learning rate, and its purpose is to keep the optimization process controlled so we don't overshoot the local maximum of the cost function.

Now you may be asking, "what about the bias?" Well, if we look at the formula for computing a layer, we see that the bias is a constant matrix that is summed with the product of $W$ and $X$ . Therefore, the derivative of the output with respect to the bias is just 1, and we can just use $\left(\begin{array}{cc} \frac{\partial L}{\partial z_6} & \frac{\partial L}{\partial z_7}\end{array}\right)^{\top}$ as our bias update. Neat!

This process is repeated for every other layer simply by extending the chain rule. To move to layer $X_2$ for example, we multiply $\frac{\partial L}{\partial Z}\frac{\partial Z}{\partial X_2}$ , where $\frac{\partial Z}{\partial X_2}$ is just $W_2$ . Each subsequent gradient depends on the gradients previously calculated. This is why the process is called backpropagation. We are propagating the gradients backward.

Coding a neural network

Now that we've discussed the necessary math behind constructing and training a neural network, let's code one! I'll be using Python due to its fast computation and dataset libraries like Numpy and Keras. To start, let's code up our loss and activation functions

import numpy as np

# activations
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def softmax(y):
  y = np.exp(y - np.max(y))
  return y / np.sum(y)

# losses
# -Σp(x)log(q(x)), p(x) = true distribution, q(x) = predicted distribution
# derivative = -p(x) / q(x)
def crossentropy(y_true, y_pred):
  y_pred = np.log(y_pred)
  loss = -np.sum(y_true * y_pred)
  return loss

# derivatives
def sigmoid_deriv(sig):
  return sig * (1 - sig)

def softmax_deriv(soft):
  return np.diagflat(soft) - (soft @ soft.T)

def crossentropy_deriv(y_true, y_pred):
  deriv = (-y_true / y_pred).T
  return deriv

Next, let's create a class for our neural network and a constructor that allows us to control the number of layers and nodes per layer.

class NN:
  """
  Initialize network with desired input size and hidden/output layers
  """
  def __init__(self, layers):
    # network structure
    self.layers = [np.zeros((layer, 1)) for layer in layers]

    # trainable parameters
    self.weights = [np.random.uniform(-1, 1, size=(layer, layers[i-1])) for i, layer in enumerate(layers[1:], 1)]
    self.biases = [np.random.uniform(-1, 1, size=(layer, 1)) for layer in layers[1:]]

    # accumulated gradients for weights and biases
    self.weight_grads = [np.zeros(weight.shape) for weight in self.weights]
    self.bias_grads = [np.zeros(bias.shape) for bias in self.biases]

The layers start out as column vectors of 0s, and the weights and biases are initialized from a uniform random distribution from -1 to 1. Now lets write a function to forward propagate data

  """
  Forward propagate input, x, through network
  """
  def forward(self, x):
    # reshape input and feed into input layer
    x = x.reshape(self.layers[0].shape)
    self.layers[0] = x

    # forward prop thru hidden layers
    for i in range(1, len(self.layers)-1):
      self.layers[i] = sigmoid(self.weights[i-1] @ self.layers[i-1] + self.biases[i-1])

    # deal with output layer
    self.layers[-1] = softmax(self.weights[-1] @ self.layers[-2] + self.biases[-1])

    # return output layer
    return self.layers[-1]

No surprises here. Notice how every hidden layer uses the sigmoid function, while only the output uses softmax. There are more activation functions out there I can use, but for simplicty, I've hard coded the two I want to use. Now let's write our backpropagation function.

  """
  Backward propagate using input, x, and label, y
  """
  def backward(self, x, y):
    # get the initial part of the backward prop through the loss function
    loss_deriv = crossentropy_deriv(y, self.layers[-1])

    # multiply loss deriv by derivative of output layer y wrt z
    soft_deriv = softmax_deriv(self.layers[-1])
    current_deriv = loss_deriv @ soft_deriv

    # now we iteratively get the gradient wrt the weights/biases
    for i in range(len(self.weights)-1, -1, -1):
      # transpose accumulated backprop gradient
      current_deriv_transpose = current_deriv.T

      # get gradient for weights and biases
      weight_grad = current_deriv_transpose @ self.layers[i].T
      bias_grad = current_deriv_transpose

      # add gradients to accumulated gradients
      self.weight_grads[i] += weight_grad
      self.bias_grads[i] += bias_grad

      current_deriv = current_deriv @ self.weights[i]

If you read through it, you'll find the exact formulas I computed above written in code. Notice how I'm accumulating a sum of the gradients rather than applying them straight away. This is standard practice, as applying the gradients after seeing just one training example can lead to unstable training. Now, to tie it all together, lets create a function for training the network on a full set of data.

  """
  Full training function to optimize network on labeled dataset
  """
  def train(self, full_x, full_y, epochs, batch_size, lr=0.01):
    for epoch in range(epochs):
      # compute loss
      total_loss = 0
      for i, x in enumerate(full_x):
        self.forward(x)
        single_loss = crossentropy(one_hot_encode(full_y[i], self.layers[-1].shape), self.layers[-1])
        total_loss += single_loss
      print("Epoch {}, current loss: {}".format(epoch + 1, total_loss))

      for i, x in enumerate(full_x):
        # extract appropriate label
        y = one_hot_encode(full_y[i], self.layers[-1].shape)

        # forward propagate data
        self.forward(x)

        # backpropagate data to get grads
        self.backward(x, y)

        # if this batch is completed, apply the gradients and reset the accumulators
        if (i + 1) % batch_size == 0:
          for k, grad in enumerate(self.weight_grads):
            # apply grads
            self.weights[k] -= lr * grad
            self.biases[k] -= lr * self.bias_grads[k]

            # reset accumulators
            self.weight_grads[k] -= grad
            self.bias_grads[k] -= self.bias_grads[k]

    # compute final loss
    total_loss = 0
    for i, x in enumerate(full_x):
      self.forward(x)
      single_loss = crossentropy(one_hot_encode(full_y[i], self.layers[-1].shape), self.layers[-1])
      total_loss += single_loss
    print("Final loss: {}".format(total_loss))

A few things to note here. One, I calculate the total loss over all the data during each iteration to assess whether the network is truly learning. Next, I uses a function called "one_hot_encode" to produce a label vector. This is because in many datasets, the labels are just scalar numbers representing the index in the output vector that is the correct class. I simply take this index and place a 1 in that place in the label array, while keeping the rest of the values as 0s. Finally, you'll notice I have a variable called "batch_size." Remember above how I mentioned that I accumulate gradients rather than immediately apply them to promote stable learning. The batch size represents how many data points the network looks through before applying its gradients. Larger batch sizes lead to more stable learning, but smaller batch sizes lead to faster learning.

Now that we've written the code for our network, let's get to training. For this project, I'll be training my network on the MNIST dataset, which is a dataset of over 60,000 handwritten digits from 0-9.

# some libraries for importing and displaying data
import tensorflow as tf
import matplotlib.pyplot as plt

# load in the mnist dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# normalize the data to make things easier for our network
x_train = x_train / 255
x_test = x_test / 255

This dataset has both a training set and testing set. The pictures are black-and-white images, where the brightness of a pixel is decided by a number from 0-255, with 255 being the brightest. To prevent numerical overflows in the network, I normalize the data to numbers between 0 and 1 by dividing by 255. An image from the training set is shown below: Next, lets instantiate our network and start training. The images are 28x28 pixels, so the input layer will have 784 nodes. There are 10 classes (0-9), so the output layer will have 10 nodes. I'll arbitrarily choose to have 2 hidden layers, one with 64 nodes and the other with 32. I'll set my learning rate to 0.001 and my batch size to 200.

nn = NN(layers=[784, 64, 32, 10])
nn.train(x_train, y_train, 4, 200, lr=0.001)

This is the console output we get from training:

Epoch 1, current loss: 289370.97397839575
Epoch 2, current loss: 30748.213928195135
Epoch 3, current loss: 27350.06052039953
<ipython-input-2-ce5a75947270>:3: RuntimeWarning: overflow encountered in exp
  return 1 / (1 + np.exp(-x))
Epoch 4, current loss: 26453.471591585298
Final loss: 26031.953678489383

As you can see, the loss went down significantly in just the first epoch due to batched training. Even afterward, it dropped lower and lower with every epoch. Now, lets see how the network evaluates the example image shown above:

[[9.91598970e-01]
 [4.34214118e-06]
 [9.61998034e-04]
 [6.54792590e-04]
 [3.76296853e-06]
 [5.28934581e-03]
 [1.13763488e-03]
 [2.53214196e-04]
 [2.54139168e-05]
 [7.05255797e-05]]

This is the output layer after feeding in the image of the 0. As you can see, the 0th index contains a probability of 0.992! The network is almost 100% confident that its looking at a 0, and it's correct.

Looking at just one image is great, but I'd like to see how well it performs over the full training and testing set. I do this with the code below:

# test on training data
correct = 0
for i, x in enumerate(x_train):
  output = nn.forward(x)
  correct += np.argmax(output) == y_train[i]

print("Score: {} / {} = {}%".format(correct, len(x_train), (correct / len(x_train)) * 100 ))

# test on testing data
correct = 0
for i, x in enumerate(x_test):
  output = nn.forward(x)
  correct += np.argmax(output) == y_test[i]

print("Score: {} / {} = {}%".format(correct, len(x_test), (correct / len(x_test)) * 100 ))

It outputs the following results:

Score: 52384 / 60000 = 87.30666666666667%
Score: 8746 / 10000 = 87.46000000000001%

An 87% on both testing sets. B+. Not bad.

While you or I could probably glance at every image and correctly identify them 100% of the time, its still amazing that a simple algorithm is able to read and interpret an image with almost 90% accuracy. It's important to remember that this was all done just with an extension of the chain rule from calculus.

Where do we go from here?

The neural network I implemented above is obviously a very minimal implementation. With AIs like Google Translate and ChatGPT, its obvious machine learning has come very, very far. Even for basic feedforward neural networks, there's a boatload of optimizations I can implement, like different activation functions, optimized weight initialization, the Adam optimizer for training, Dropout regularization, and so much more. However, to avoid reinventing the wheel, I'll end this project here and continue making more complicated models using well-established libraries like Tensorflow, which are heavily optimized and extensively used in ML industry/research today. This project was a great review of some of the core math concepts used in deep learning, and I can't wait to work on bigger projects.

All the code written can be found here: https://colab.research.google.com/drive/1MkeEGOgXQmySVpPUmIVRV_gaqnTP7RP5?usp=sharing

Until next time!🚀