In the context of Deep Learning, the function we are trying to optimize is our loss function J. We define our training losses as the average of the losses over all the samples in our training dataset:

TrainLoss(w) = (1 / |Dtrain|) · Σ(x,y)∈Dtrain J(x, y, w)

where Dtrain is our training dataset, |Dtrain| is the number of samples it contains, and J(x, y, w) is the loss for a single sample (x, y).
Thus, we can implement gradient descent with the following algorithm, in which we compute the gradient of the training losses with respect to the weights and use it to update the model weights, repeating for a certain number of epochs [2]:
```
w = [0, ..., 0]                     # Initialize the weights
for k = 1, ..., num_epochs:         # Repeat for the desired number of iterations
    grad = ∇w TrainLoss(w)          # Gradient of the training losses
    w[k] <-- w[k-1] - η * grad      # Update the model weights
```
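To make this concrete, here is a minimal NumPy sketch of the pseudo-code above, assuming a linear model with a squared-error loss; the dataset, model, and hyperparameters are made up purely for illustration:

```python
import numpy as np

def train_loss_grad(w, X, y):
    """Gradient of the average squared-error loss over all given samples."""
    preds = X @ w                          # predictions for every sample
    return 2 * X.T @ (preds - y) / len(y)  # average gradient over the samples

# Toy dataset: 1,000 samples with 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)      # initialize the weights
eta = 0.1            # step size (learning rate)
num_epochs = 100     # number of iterations

for k in range(num_epochs):
    grad = train_loss_grad(w, X, y)  # gradient of the training losses over ALL samples
    w = w - eta * grad               # one weight update per epoch
```

Note that every single update requires a full pass over X, which is exactly the cost discussed next.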
The problem with gradient descent as shown in the pseudo-code above is its low speed.
For a toy example with a few points and a simple function it might work well, but imagine we are developing an ANN and we have a million data points to train it.
If we want to update the weights of the model using gradient descent, we would need to look at all samples of the training data, just to do one update of the model weights. Then repeat it, over and over again, until we reach convergence. One weight update per epoch.
To overcome this problem, we can use the so-called Stochastic Gradient Descent algorithm.
Stochastic Gradient Descent
To overcome the slow convergence problem of “vanilla” gradient descent, we can perform an update of the model weights based on each sample of the training set. The advantage is that we don’t need to wait until we have looped through the whole set to perform just one update of the weights.
We can update the weights several times per epoch, since we use the Loss function for each individual sample, instead of the whole training losses.
This is what the Stochastic Gradient Descent (SGD) algorithm looks like:
```
w = [0, ..., 0]                     # Initialize the weights
for k = 1, ..., num_epochs:
    for (x, y) ∈ Dtrain:
        grad = ∇w J(x, y, w)            # Gradient of the loss for a single sample
        w[k] <-- w[k-1] - η(k) * grad   # Update the weights after every sample
```
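Reusing the toy linear-regression setup (X, y, rng, and num_epochs) from the earlier sketch, SGD could look something like this; the 1/t decay used here is just one possible choice of η(t):

```python
def sample_loss_grad(w, x_i, y_i):
    """Gradient of the squared-error loss for a single sample (x_i, y_i)."""
    return 2 * (x_i @ w - y_i) * x_i

w = np.zeros(3)    # initialize the weights
eta0 = 0.1         # initial step size
t = 0              # global update counter

for k in range(num_epochs):
    for i in rng.permutation(len(y)):        # visit the samples in random order
        t += 1
        eta = eta0 / t                       # decreasing step size η(t)
        grad = sample_loss_grad(w, X[i], y[i])
        w = w - eta * grad                   # one weight update per sample
```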
Notice that the step size is a function of the training iterations. That’s because, for the algorithm to converge, η must decrease as the number of iterations progresses. In the lecture notes of MIT 6.036 [1], the following theorem is mentioned:
Theorem 4.1. If J is convex, and η(t) is a sequence satisfying

Σₜ η(t) = ∞ and Σₜ η(t)² < ∞

(that is, the step sizes sum to infinity, but the sum of their squares is finite), then SGD converges with probability one to the optimal θ.
People take different approaches to reduce the value of η as the training progresses, and this is often called “Annealing”:
- Making the learning rate decrease with the training epoch (for instance, η(t) = 1/t), or dropping it to a smaller value once a certain epoch has been reached. This offers good results, but it does not adapt to how the model is actually performing. This is called “learning rate decay”, and it is the industry standard for handling this problem (a small sketch of two such schedules follows this list).
- Scaling the learning rate based on the gradient of the loss function: this is adaptive to the problem, but requires careful scaling. This idea is incorporated into the RMSprop and Adam variants of gradient descent.
- Scaling the learning rate based on the losses: this is also adaptive to the problem, but again requires careful scaling.
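As an illustration of the first bullet, two common decay schedules could be sketched like this; the constants are arbitrary choices for the example, not recommendations:

```python
def inverse_decay(eta0, t):
    """η(t) = η0 / t: the learning rate shrinks as the iteration count grows."""
    return eta0 / max(t, 1)

def step_decay(eta0, epoch, drop_every=30, factor=0.1):
    """Multiply the learning rate by `factor` every `drop_every` epochs."""
    return eta0 * factor ** (epoch // drop_every)
```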
SGD may perform well after visiting only some of the data. This behavior can be useful for relatively large datasets, because it reduces both the memory needed and the total runtime compared to the “vanilla” implementation of gradient descent.
We could say that the “vanilla” implementation is slower because it needs to run through the whole “batch” of samples to perform a single update of the weights. SGD performs an update at every sample, but the quality of each update is lower: we may have noisy data, or a really complex function we are trying to fit with our ANN. In short, using a batch of size |Dtrain| is slow but accurate, while using a batch of size 1 is fast but a bit less accurate.
There is a term in between, and it’s called “mini-batch” gradient descent.
If we split our data into smaller batches of equal size, we could do the “vanilla” gradient descent for each of those mini-batches.
Let’s say we divide the data into 100 smaller parts. We then go through our data in 100 steps. On each step, we look at the training losses just for the data contained in the current mini-batch, and improve our model parameters. Once we have gone through all the mini-batches, we start the cycle again. Each full pass is known as an epoch, although I have used the term more loosely before just to refer to the number of iterations during optimization.
The advantage of this is that we update our model parameters on each mini-batch, rather than after we have looked at the whole data set. The approach I have referred to as “vanilla” gradient descent, is also known as batch gradient descent, where the batch size is the total number of samples in the training dataset. For mini-batch gradient descent, the mini-batches are usually powers of two: 32 samples, 64, 128, 256, and so on.
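Continuing with the toy setup and the train_loss_grad helper from the earlier sketches, mini-batch gradient descent could be sketched as follows; the batch size of 64 is just an illustrative choice:

```python
batch_size = 64
w = np.zeros(3)   # initialize the weights
eta = 0.1         # step size (learning rate)

for k in range(num_epochs):
    indices = rng.permutation(len(y))                  # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]      # indices of the current mini-batch
        grad = train_loss_grad(w, X[batch], y[batch])  # gradient on the mini-batch only
        w = w - eta * grad                             # one weight update per mini-batch
```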
SGD would be the extreme case in which the mini-batch size is reduced to a single example from the training dataset.
The disadvantage of using mini-batch gradient descent in our optimization process is that we incorporate a level of variability: it is not guaranteed that every step will move us closer to the ideal parameter values, but the general direction is still towards the minimum.
This method is an industry standard because, by tuning the batch size, we can choose a compromise between speed and accuracy when dealing with very large datasets.