# Stochastic Gradient Descent – Mini-batch and more

In the neural network tutorial, I introduced the gradient descent algorithm which is used to train the weights in an artificial neural network.  In reality, for deep learning and big data tasks standard gradient descent is not often used.  Rather, a variant of gradient descent called stochastic gradient descent and in particular its cousin mini-batch gradient descent is used.  That is the focus of this post.

The gradient descent optimisation algorithm aims to minimise some cost/loss function based on that function’s gradient.  Successive iterations are employed to progressively approach either a local or global minimum of the cost function.  The figure below shows an example of gradient descent operating in a single dimension:

When training weights in a neural network, normal batch gradient descent usually takes the mean squared error of all the training samples when it is updating the weights of the network:

$$W = W – \alpha \nabla J(W,b)$$

where $W$ are the weights, $\alpha$ is the learning rate and $\nabla$ is the gradient of the cost function $J(W,b)$ with respect to changes in the weights.  More details can be found in the neural networks tutorial, but in that tutorial the cost function $J$ was defined as:

\begin{align}
J(W,b) &= \frac{1}{m} \sum_{z=0}^m J(W, b, x^{(z)}, y^{(z)})
\end{align}

As can be observed, the overall cost function (and therefore the gradient) depends on the mean cost function calculated on all of the m training samples ($x^{(z)}$ and $y^{(z)}$ refer to each training sample pair).  Is this the best way of doing things?  Batch gradient descent is good because the training progress is nice and smooth – if you plot the average value of the cost function over the number of iterations / epochs it will look something like this:

As you can see, the line is mostly smooth and predictable.  However, a problem with batch gradient descent in neural networks is that for every gradient descent update in the weights, you have to cycle through every training sample.  For big data sets i.e. > 50,000 training samples, this can be time prohibitive.  Batch gradient descent also has the following disadvantages:

• It requires the loading of the whole dataset into memory, which can be problematic for big data sets
• Batch gradient descent can’t be efficiently parallelised (compared to the techniques about to be presented) – this is because each update in the weight parameters requires a mean calculation of the cost function over all the training samples.
• The smooth nature of the reducing cost function tends to ensure that the neural network training will get stuck in local minimums, which makes it less likely that a global minimum of the cost function will be found.

Stochastic gradient descent is an algorithm that attempts to address some of these issues.

Stochastic gradient descent updates the weight parameters after evaluation the cost function after each sample.  That is, rather than summing up the cost function results for all the sample then taking the mean, stochastic gradient descent (or SGD) updates the weights after every training sample is analysed.  Therefore, the updates look like this:

$$W = W – \alpha \nabla J(W,b, x^{(z)}, y^{(z)})$$

Notice that an update to the weights (and bias) is performed after every sample $z$ in $m$.  This is easily implemented by a minor variation of the batch gradient descent code in Python, by simply shifting the update component into the sample loop (the original train_nn function can be found in the neural networks tutorial and here):

def train_nn_SGD(nn_structure, X, y, iter_num=3000, alpha=0.25, lamb=0.000):
W, b = setup_and_init_weights(nn_structure)
cnt = 0
m = len(y)
avg_cost_func = []
print('Starting gradient descent for {} iterations'.format(iter_num))
while cnt < iter_num:
if cnt%50 == 0:
print('Iteration {} of {}'.format(cnt, iter_num))
tri_W, tri_b = init_tri_values(nn_structure)
avg_cost = 0
for i in range(len(y)):
delta = {}
# perform the feed forward pass and return the stored h and z values,
# to be used in the gradient descent step
h, z = feed_forward(X[i, :], W, b)
# loop from nl-1 to 1 backpropagating the errors
for l in range(len(nn_structure), 0, -1):
if l == len(nn_structure):
delta[l] = calculate_out_layer_delta(y[i,:], h[l], z[l])
avg_cost += np.linalg.norm((y[i,:]-h[l]))
else:
if l > 1:
delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
# triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
tri_W[l] = np.dot(delta[l+1][:,np.newaxis],
np.transpose(h[l][:,np.newaxis]))
# trib^(l) = trib^(l) + delta^(l+1)
tri_b[l] = delta[l+1]
# perform the gradient descent step for the weights in each layer
for l in range(len(nn_structure) - 1, 0, -1):
W[l] += -alpha * (tri_W[l] + lamb * W[l])
b[l] += -alpha * (tri_b[l])
# complete the average cost calculation
avg_cost = 1.0/m * avg_cost
avg_cost_func.append(avg_cost)
cnt += 1
return W, b, avg_cost_func

In the above function, to implement stochastic gradient descent, the following code was simply indented into the sample loop “for i in range(len(y)):” (and the averaging over m samples removed):

for l in range(len(nn_structure) - 1, 0, -1):
W[l] += -alpha * (tri_W[l] + lamb * W[l])
b[l] += -alpha * (tri_b[l])

In other words, a very easy transition from batch to stochastic gradient descent.  Where does the “stochastic” part come in?  The stochastic component is in the selection of the random selection of training sample.  However, if we use the scikit-learn test_train_split function the random selection has already occurred, so we can simply iterate through each training sample, which has a randomised order.

So how does SGD perform?  Let’s take a look.  The plot below shows the average cost versus the number of training epochs / iterations for batch gradient descent and SGD on the scikit-learn MNIST dataset.  Note that both of these are operating off the same optimised learning parameters (i.e. learning rate, regularisation parameter) which were determined according to the methods described in this post.

Some interesting things can be noted from the above figure.  First, SGD converges much more rapidly than batch gradient descent.  In fact, SGD converges on a minimum J after < 20 iterations.  Secondly, despite what the average cost function plot says, batch gradient descent after 1000 iterations outperforms SGD.  On the MNIST test set, the SGD run has an accuracy of 94% compared to a BGD accuracy of 96%.  Why is that?  Let’s zoom into the SGD run to have a closer look:

As you can see in the figure above, SGD is noisy.  That is because it responds to the effects of each and every sample, and the samples themselves will no doubt contain an element of noisiness.  While this can be a benefit in that it can act to “kick” the gradient descent out of local minimum values of the cost function, it can also hinder it settling down into a good minimum.  This is why, eventually, batch gradient descent has outperformed SGD after 1000 iterations.  It might be argued that this is a worthwhile pay-off, as the running time of SGD versus BGD is greatly reduced.  However, you might ask – is there a middle road, a trade-off?

There is, and it is called mini-batch gradient descent.

Mini-batch gradient descent is a trade-off between stochastic gradient descent and batch gradient descent.  In mini-batch gradient descent, the cost function (and therefore gradient) is averaged over a small number of samples, from around 10-500.  This is opposed to the SGD batch size of 1 sample, and the BGD size of all the training samples.  It looks like this:

$$W = W – \alpha \nabla J(W,b, x^{(z:z+bs)}, y^{(z:z+bs)})$$

Where $bs$ is the mini-batch size and the cost function is:

$$J(W,b, x^{(z:z+bs)}, y^{(z:z+bs)}) = \frac{1}{bs} \sum_{z=0}^{bs} J(W, b, x^{(z)}, y^{(z)})$$

What’s the benefit of doing it this way?  First, it smooths out some of the noise in SGD, but not all of it, thereby still allowing the “kick” out of local minimums of the cost function.  Second, the mini-batch size is still small, thereby keeping the performance benefits of SGD.

To create the mini-batches, we can use the following function:

from numpy import random
def get_mini_batches(X, y, batch_size):
random_idxs = random.choice(len(y), len(y), replace=False)
X_shuffled = X[random_idxs,:]
y_shuffled = y[random_idxs]
mini_batches = [(X_shuffled[i:i+batch_size,:], y_shuffled[i:i+batch_size]) for
i in range(0, len(y), batch_size)]
return mini_batches

Then our new neural network training algorithm looks like this:

def train_nn_MBGD(nn_structure, X, y, bs=100, iter_num=3000, alpha=0.25, lamb=0.000):
W, b = setup_and_init_weights(nn_structure)
cnt = 0
m = len(y)
avg_cost_func = []
print('Starting gradient descent for {} iterations'.format(iter_num))
while cnt < iter_num:
if cnt%1000 == 0:
print('Iteration {} of {}'.format(cnt, iter_num))
tri_W, tri_b = init_tri_values(nn_structure)
avg_cost = 0
mini_batches = get_mini_batches(X, y, bs)
for mb in mini_batches:
X_mb = mb[0]
y_mb = mb[1]
# pdb.set_trace()
for i in range(len(y_mb)):
delta = {}
# perform the feed forward pass and return the stored h and z values,
# to be used in the gradient descent step
h, z = feed_forward(X_mb[i, :], W, b)
# loop from nl-1 to 1 backpropagating the errors
for l in range(len(nn_structure), 0, -1):
if l == len(nn_structure):
delta[l] = calculate_out_layer_delta(y_mb[i,:], h[l], z[l])
avg_cost += np.linalg.norm((y_mb[i,:]-h[l]))
else:
if l > 1:
delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
# triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
tri_W[l] += np.dot(delta[l+1][:,np.newaxis],
np.transpose(h[l][:,np.newaxis]))
# trib^(l) = trib^(l) + delta^(l+1)
tri_b[l] += delta[l+1]
# perform the gradient descent step for the weights in each layer
for l in range(len(nn_structure) - 1, 0, -1):
W[l] += -alpha * (1.0/bs * tri_W[l] + lamb * W[l])
b[l] += -alpha * (1.0/bs * tri_b[l])
# complete the average cost calculation
avg_cost = 1.0/m * avg_cost
avg_cost_func.append(avg_cost)
cnt += 1
return W, b, avg_cost_func

Let’s see how it performs with a min-batch size of 100 samples:

As can be observed in the figure above, mini-batch gradient descent appears be the superior method of gradient descent to be used in neural networks training.  The jagged decline in the average cost function is evidence that mini-batch gradient descent is “kicking” the cost function out of local minimum values to reach better, perhaps even the best, minimum.  However, it is still able to find a good minimum and stick to it.  This is confirmed in the test data – the mini-batch method achieves an accuracy of 98% compared to the next best, batch gradient descent, which has an accuracy of 96%.  The great thing is – it gets to these levels of accuracy after only 150 iterations or so.

One final benefit of mini-batch gradient descent is that it can be performed in a distributed manner.  That is, each mini-batch can be computed in parallel by “workers” across multiple servers, CPUs and GPUs to achieve significant improvements in training speeds.  There are multiple algorithms and architectures to perform this parallel operation, but that is a topic for another day.  In the mean-time, enjoy trying out mini-batch gradient descent in your neural networks.