Improve your neural networks – Part 1 [TIPS AND TRICKS]

In the last post, I presented a comprehensive tutorial on how to build and understand neural networks.  At the end of that tutorial, we developed a network to classify digits in the MNIST dataset.  However, the accuracy was well below the state-of-the-art results on the dataset.  This post will show some techniques for improving the accuracy of your neural networks, again using the scikit-learn MNIST dataset.

When we are thinking about “improving” the performance of a neural network, we are generally referring to two things:

  1. Improve the accuracy
  2. Speed up the training process (while still maintaining the accuracy)

(1) and (2) can play off against each other.  This is because, in order to improve the accuracy, we often need to train our network with more data and more iterations.  This slows down the training however, and makes it more expensive.  The “tips and tricks” in this post will address both of these issues.  All code will be in Python.

1 Regularisation and over-fitting in neural networks

1.1 Over-fitting

As was presented in the neural networks tutorial, we always split our available data into at least a training and a test set.  We do this because we want the neural network to generalise well.  This means that we want our network to perform well on data that it hasn’t “seen” before during training.  If we just throw all the data we have at the network during training, we will have no idea if it has over-fitted on the training data.  If it has, then it will perform badly on new data that it hasn’t been trained on.  We need another data set, the test set, to check and make sure our network is generalising well.

What happens when a machine learning model over-fits during training?  A simple way of thinking about it is that it becomes over-complicated given the data it has to train on.  This is often best illustrated using a linear regression example, see the image below from Wikipedia:

Overfitted Data

By Ghiles (Own work) [CC BY-SA 4.0], via Wikimedia Commons

The black dots are the training data.  As can be observed, the data points pretty obviously follow a linear trend with increasing x, but there is a bit of noise in the relationship i.e. not all the data points sit on the black linear trend line.  The blue line is a high-order polynomial model, which has perfectly fit the training data.  However, intuitively we would say that the straight black line is the better model.  Why is that?  It is because the blue model is unjustifiably complicated given the data – it is so complicated that it has “chased after” the random noise in the data.  We can see that small changes along the horizontal axis can lead to huge changes in the vertical axis.  This is not a good model and it will therefore perform badly on any test set that it is tried on.
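This intuition is easy to check numerically.  The short sketch below (my own illustration, not part of the original example) fits a straight line and a degree-9 polynomial to ten noisy points drawn from a linear trend.  The polynomial drives the training error to essentially zero by passing through every point, which is precisely the “chasing after the noise” behaviour described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# ten noisy points drawn from a simple linear trend, y = 2x + noise
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)

# fit a straight line and a degree-9 polynomial to the same points
linear = np.polynomial.Polynomial.fit(x, y, deg=1)
wiggly = np.polynomial.Polynomial.fit(x, y, deg=9)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# the degree-9 polynomial passes through every training point, so its
# training error is essentially zero - it has fit the noise itself
print(mse(linear, x, y))   # small but non-zero: the noise remains
print(mse(wiggly, x, y))   # effectively zero
```

The low training error of the degree-9 fit is the warning sign, not the achievement: it tells us nothing about how the model will behave on points it hasn’t seen.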

Over-fitting is something we also have to be wary of in neural networks.  A good way of avoiding this is to use something called regularisation. 

1.2 Regularisation

Regularisation involves making sure that the weights in our neural network do not grow too large during the training process.  During training, our neural networks will converge on local minimum values of the cost function.  There will be many of these local minima, and many of them will have roughly the same cost function value – in other words, there are many ways to skin the cat.  Some of these local minima will have large weights connecting the nodes and layers, others will have smaller values.  We want to force our neural network to pick weights which are smaller rather than larger.

This makes our network less complex – but why is that?  Consider the previous section, where we discussed that an over-fitted model has large changes in predictions compared to small changes in input.  In other words, if we have a little bit of noise in our data, an over-fitted model will react strongly to that noise.  The analogous situation in neural networks is when we have large weights – such a network is more likely to react strongly to noise.  This is because large weights will amplify small variations in the input which could be solely due to noise.  Therefore, we want to adjust the cost function to try to make the training drive the magnitude of the weights down, while still producing good predictions.
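As a toy illustration of this amplification (my own, not from the tutorial), compare how the same small input perturbation moves a node’s pre-activation under modest weights versus the same weights scaled up 100-fold:

```python
import numpy as np

x = np.array([1.0, 1.0])          # a clean input
noise = np.array([0.05, -0.05])   # a small perturbation of that input

w_small = np.array([0.5, 0.3])    # modest weights into a single node
w_large = np.array([50.0, 30.0])  # the same weights, scaled up 100x

# change in the node's pre-activation caused purely by the noise
delta_small = abs(w_small @ (x + noise) - w_small @ x)
delta_large = abs(w_large @ (x + noise) - w_large @ x)

print(delta_small)  # ~0.01
print(delta_large)  # ~1.0 - the same noise, amplified 100-fold
```

The direction of the weights is identical in both cases; only their magnitude differs, and the magnitude alone determines how violently the node reacts to noise.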

The old cost function was (see the neural networks tutorial for an explanation of the notation used):

$$J(w,b) = \frac{1}{m} \sum_{z=0}^m \frac{1}{2} \parallel y^z - h^{(n_l)}(x^z) \parallel ^2$$

In this cost function, we are trying to minimize the mean squared error (MSE) of the prediction compared to the training data.  Now we want to vary the cost function to:

$$J(w,b) = \frac{1}{m} \sum_{z=0}^m \frac{1}{2} \parallel y^z - h^{(n_l)}(x^z) \parallel ^2 + \frac {\lambda}{2}\sum_{all} \left(W_{ij}^{(l)}\right)^2$$

Notice the addition of the last term, which is a summation of the squares of all the weight values in each layer, multiplied by the $\lambda$ constant divided by 2 (the division by 2 is a little trick to clean things up when we take the derivative).  This $\lambda$ value is usually quite small.  With this term included, any increase in the weights will increase the cost unless it is balanced by an associated decrease in the mean squared error term.  In other words, large weights will be penalised in this new cost function if they don’t do much to improve the MSE.
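In code, the new cost is simply the old MSE term plus the weight penalty.  A minimal sketch (assuming, as in the tutorial’s training code, that the weights are kept in a collection of per-layer matrices; the function name and toy values here are my own):

```python
import numpy as np

def regularised_cost(mse_term, W, lamb):
    """Add the L2 penalty (lambda/2) * sum of squared weights to an MSE cost.

    W is assumed to be an iterable of per-layer weight matrices,
    as in the tutorial's training code.
    """
    penalty = sum(np.sum(np.square(w)) for w in W)
    return mse_term + (lamb / 2.0) * penalty

# toy example: two small weight matrices, squared weights sum to 1+4+9 = 14
W = [np.array([[1.0, 2.0]]), np.array([[3.0]])]
print(regularised_cost(0.5, W, lamb=0.001))  # 0.5 + (0.001/2)*14
```

Note that only the weights are penalised; the biases are conventionally left out of the penalty, since they don’t multiply the inputs and so don’t amplify noise.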

To incorporate this new component into the training of our neural network, we need to take the partial derivative.  Working this through gives a new gradient descent step equation.  The old equation:

$$W^{(l)} = W^{(l)} - \alpha \left[\frac{1}{m} \Delta W^{(l)} \right]$$

is now replaced by the new:

$$W^{(l)} = W^{(l)} - \alpha \left[\frac{1}{m} \Delta W^{(l)} + \lambda W^{(l)} \right]$$

1.3 Regularisation in the training code

The gradient descent weight updating line in the code of the neural network tutorial can simply be updated to the following, to incorporate regularisation into the Python code:

W[l] += -alpha * (1.0/m * tri_W[l] + lamb * W[l])

Where “lamb” is the regularisation parameter, which must be chosen.  If you recall from the tutorial, without regularisation the prediction accuracy on the scikit-learn sample MNIST data set was only 86%.  This was with a learning rate ($\alpha$) of 0.25 and 3,000 training iterations.  Using the same parameters, and a regularisation parameter ($\lambda$) equal to 0.001, we now get a prediction accuracy of 95%!  That is a 9 percentage point increase in prediction accuracy from altering a single line of code and adding a new parameter.

Therefore, it is safe to say that in our previous example without regularisation we were over-fitting the data, despite the mean squared error of both versions being practically the same after 3,000 iterations.

2 Selecting the right parameters

In our training code for neural networks, we have a number of free parameters.  These parameters are the learning rate $\alpha$, the number and size of the hidden layers and now the regularisation parameter $\lambda$.  We also have to make a choice about what activation function to use.  All of these selections will affect the performance of the neural network, and therefore must be selected carefully.  How do we do this?  Usually by some sort of brute force search method, where we vary the parameters and try to land on those parameters which give us the best predictive performance.

Do we still use the test set to determine the predictive accuracy by which we tune our parameters?  Is it really a test set in that case?  Aren’t we then using all our data to make the network better, rather than leaving some aside to ensure we aren’t over-fitting?  Yes, we are.  We need to introduce a new subset of the data, called the validation set.

To create a validation set, we can use the scikit-learn function train_test_split.  Usually, we want to keep the majority of the data for training, say 60%.  The remaining data we can split evenly into a test set and a validation set.  The code below shows how to do this:

from sklearn.model_selection import train_test_split
X_train, X_holdover, y_train, y_holdover = train_test_split(X, y, test_size=0.4)
X_valid, X_test, y_valid, y_test = train_test_split(X_holdover, y_holdover, test_size=0.5)

Now we have training, validation and test data sets, and we’re ready to perform parameter selections.

2.1 Brute-force search example

Often model parameter selection is performed using the brute-force search method.  This method involves cycling through likely values for the parameters in different combinations and assessing some measure of accuracy / fitness for each combination on the validation set.  We then select the best set of parameter values and see how they go on the test set.  The brute-force search method is easy to implement but can take a long time to run, given the combinatorial explosion of scenarios to test when there are many parameters.

In the example below, we will be using the brute-force search method to find the best parameters for a three-layer neural network to classify the scikit-learn MNIST dataset.  Note: this data set isn’t the “real” MNIST dataset that is often used as a benchmark (it’s a cut-down version), but it is good enough for our purposes.  This example will be using some of the same functions as in the neural network tutorial.  The parameters that we are going to test are:

  1. The size of the hidden layer (the number of nodes in it)
  2. The learning rate
  3. The regularisation constant

Let’s first setup some lists for the parameter cycling:

hidden_size = [10, 25, 50, 60]
alpha = [0.05, 0.1, 0.25, 0.5]
lamb = [0.0001, 0.0005, 0.001, 0.01]

It is now a simple matter of cycling through each parameter combination, training the neural network, and assessing the accuracy.  The code below shows how this can be done, assessing the accuracy of the trained neural network after 3,000 iterations.  Careful, this will take a while to run:

from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
digits = load_digits()
X_scale = StandardScaler()
X = X_scale.fit_transform(digits.data)
y = digits.target

# setup the parameter selection function
def select_parameters(hidden_size, alpha, lamb, X, y):
    X_train, X_holdover, y_train, y_holdover = train_test_split(X, y, test_size=0.4)
    X_valid, X_test, y_valid, y_test = train_test_split(X_holdover, y_holdover, test_size=0.5)
    # convert the targets (scalars) to vectors
    yv_train = convert_y_to_vect(y_train)
    yv_valid = convert_y_to_vect(y_valid)
    results = np.zeros((len(hidden_size)*len(alpha)*len(lamb), 4))
    cnt = 0
    for hs in hidden_size:
        for al in alpha:
            for l in lamb:
                nn_structure = [64, hs, 10]
                W, b, avg_cost = train_nn(nn_structure, X_train, yv_train, 
                                    iter_num=3000, alpha=al, lamb=l)
                y_pred = predict_y(W, b, X_valid, 3)
                accuracy = accuracy_score(y_valid, y_pred) * 100
                print("Accuracy is {}% for {}, {}, {}".format(accuracy, hs, al, l))
                # store the data
                results[cnt, 0] = accuracy
                results[cnt, 1] = hs
                results[cnt, 2] = al
                results[cnt, 3] = l
                cnt += 1
    # get the index of the best accuracy
    best_idx = np.argmax(results[:, 0])
    return results, results[best_idx, :]
select_parameters(hidden_size, alpha, lamb, X, y)

Note that the above code uses functions developed in the neural networks tutorial.  These functions can be found on this site’s GitHub repository.  After running this code, we find that the best accuracy (98.6%) is achieved on the validation set with a hidden layer of 50 nodes, a learning rate of 0.5 and a regularisation parameter of 0.001.

Using these parameters on the test set now gives us an accuracy of 96%.  Therefore, using the brute-force search method and a validation set, along with regularisation, improved our original naïve results in the neural networks tutorial from 86% to 96%!  A big improvement, clearly worth the extra time taken to improve our model.

In the next part of this series we’ll look at ways of speeding up the training.
