Playing with TensorFlow. A quick literature review and example… | by Alexander Morton

A quick literature review and example MNIST fits

14 min read

Nov 26, 2020

I’ve wanted to learn more about neural nets and in particular TensorFlow for a while now. I recently had a bit more time to dedicate to it, so began to write about what I had learned and some of the basic examples I had ran through.

Even though this information has been covered before, I decided to post it since it could help other beginners like myself. I’ve focused on parts which I thought were essential; hopefully making the subject matter as clear as possible without losing substance.

The usual picture of a neural net as a bunch of nodes connect by lines can make the subject matter appear somewhat esoteric.

An example of a multiple layer fully connected neural network. Source: Cyberbotics, via wiki.

Then you realise that the picture you have in your head does not touch the surface of neural nets used in the wild; further adding to the mystery.

An example of a more elaborate neural network. Source: MingxianLin, via wiki.

This impression is compounded when you realise the subject is a mixture of computer science, mathematics, statistics and neurology; which are already difficult subjects in their own right. All this can seem overwhelming but when you break the concepts into parts it starts to make sense.

Objective and Approximation

To start breaking this down lets focus on two concepts which I thought cut through the noise:

A neural net, no matter the complexity, is nothing more than a very complex function.
We approximate this very complex function using multiple simple non linear functions in combination.

The objective of training a neural net is to find a very complex function which maps one set to another; the set and mapping completely depend on the problem you are trying to solve. Sets have a long history in mathematics which I could spend this whole article getting into. Rather than getting into that, you should see it as nothing more than mapping a series of numbers to another series of numbers. This might seem strange at first; however, it is not strange when you realise that everything can be represented by numbers: an image is nothing more than a series of numbers representing pixels; sentences can be represented by numbers, with each word being a number or a vector of numbers depending on how you want to display the information. All problems can be represented, in someway, by numbers.

We approximate this complex function by combining lots of non linear functions. The ability to approximate almost any function as a combination of non linear functions is known as the universal approximation theorem. For anyone familiar with Taylor or Fourier series this will make some sense; sometimes it is easier to break a function into parts with each part approximating the original function.

So we have the objective of finding a complex function which we will approximate using multiple non linear functions.

Activation Functions

These non linear functions are known as activation functions.

Structure of a single node. Source: Image by Author.

Above you can see three types of parameters

Inputs (X).
Weights (W).
Bias (B).

The inputs are from the data or from the previous layer. The weights and biases are what does the “learning”: the parameters we will change to produce our complex function which will solve our problem.

There are many different functions which can take these weights and biases as input.

Some activation functions. Source: Image by Author.

There are advantages and disadvantages to each function but before understanding that we need to understand how we will learn the correct weights and biases.

Loss Function

The term “learning” is a really abstract term which completely misses the point of what you are doing. When you “learn” you are setting up all the weights and biases with the correct values to map inputs to as close as possible to the known outputs. The accuracy between the calculated outputs and the known outputs, which we know in supervised learning, is measured using a loss function.

The best known loss function is quadratic loss which is used in least squares fitting. With this loss it is clear you are trying to get your fitting function as close to the data points as possible. However, more elaborate functions might be more difficult to visualise.

Example of two parameters and the loss for each value. Source: Image by Author.

Above you can see an example of a loss function for two parameters. This image is incredible useful; however, you need need to keep a few things in mind when your models get more complex.

Increase the number of dimensions for each weight and bias parameter.
Understand that the node types (activation functions) and how you connect them change the shape of the loss landscape.
The shape of the landscape is not known to you.

The last point is important. The diagram above makes you think that you could easily find the smallest value of the loss function; however, you need to find this iteratively.

Keeping this point in mind, we then need find a method for descending the unknown loss landscape. This method follows a set procedure.

Initialise weights.
Propagate inputs through weights to get value of output.
Find value of loss function.
Back propagate the weights trying to minimise the loss.
Repeat.

The first three parts are quite obvious to follow. The fourth part, back propagation, is the part you need to go through in a bit more detail. To get a feel for how this is done there is a good tutorial which I followed and then surmised below.

Back Propagation

Back propagation is using input data to iteratively update the weights. The process starts from the output layer and works backwards. To demonstrate this we will use an example activation and loss function.

First lets define a few terms

We are using a quadratic loss function and sigmoid function as the activation; this would change depending on the problem.

Now we have defined our terms we can write down the update function which will change the weights iteratively:

The first line is the update equation. The next line is a breakdown of the the partial differential. Source: Image by Author.

The first term of the update equation is the change of weights due to change in loss and the second is the momentum term. The only term we need to calculate is the change of weights due to change in loss. The momentum is simply the change due to the last iteration; this term is added to avoid local minimums.

Now we have the update rule, and have broken it into parts using chain rule, we need to work out the functions for each part. One of the terms is the partial derivative of the activation, the sigmoid function, relative to the inputs.

Differentiation by substitution is used to determine how the activation changes relative to its inputs. Source: Image by Author.

Other terms, which are easier to determine, are the change in inputs relative to weights and the previous layer’s activations:

Finally, we put all these partial derivative together and determine how the loss changes relative to the weights.

Now we need to find how the loss changes with activation.

The first equation is how the gradient of the quadratic loss changes with the data and activation for the output layer. The second equation specifies how the loss change for each layer using the result from the layer above. Source: Image by Author.

Using all of this we can move back, layer after layer, determining how the weights should change. Let’s now use what we have learned to understand some typical examples.

Using the concepts above we can begin our example fits starting with a fully connected network. Each of the examples uses the same docker image to create the required environment to run TensorFlow. I start this container with my code mounted from my local machine and allow TensorBoard to run from port 6006.

docker run -p 6006:6006 -v `pwd`:/mnt/ml-mnist-examples -it tensorflow/tensorflow  bash

Now we have the environment we need to consider the dataset we will be playing with. The dataset I used to learn was MNIST: 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.

The code I used was the same as the tutorial which you can find online.