Weight initialization tutorial in TensorFlow


In the late 1980s and 1990s, neural network research stalled due to a lack of good performance. There were a number of reasons for this, outlined by the prominent AI researcher Geoffrey Hinton: poor computing speeds, lack of data, using the wrong type of non-linear activation functions and poor initialization of the weights in neural networks. My post on the vanishing gradient problem and ReLUs addresses the problem of the wrong kind of non-linear activation functions, and this post will deal with proper weight initialization. In particular, we’ll examine the problem with a naive normal distribution when initializing weights, and look at Xavier and He initialization as remedies. This will be studied empirically using TensorFlow and some associated TensorBoard visualizations. Note: to run the code in this tutorial, you’ll need a TensorFlow 1.x release, version 1.8 or later; the code relies on tf.contrib, which was removed in TensorFlow 2.


The problem with a naive initialization of weights

The random initialization of weights is critical to learning good mappings from input to output in neural networks. Because the search space involving many weights during training is very large, there are multiple local minima in which back-propagation may become trapped. Effective randomization of weights ensures that the search space is adequately explored during training, giving the best chance of a good minimum being found (for more on back-propagation, see my neural networks tutorial). However, the weight initialization function needs to be chosen carefully, otherwise there is a large risk that training will be slowed to the point of impracticality.

This is especially the case when using the historical “squashing” non-linear activation functions such as the sigmoid function and the tanh function, though it is still an issue with the ReLU function, as will be seen later. The reason for this problem is that, if the weights are such that the activation functions of nodes are pushed into the “flat” regions of their curves, the nodes are “saturated” and learning is impeded. Consider the plot below showing the tanh function and its first derivative:


Tanh function and its first derivative

It can be observed that when abs(x) > 2, the derivative of the tanh function approaches zero. Because the back-propagation method of updating the weight values in a neural network depends on the derivative of the activation functions, this means that when nodes are pushed into such “saturation” regions, slow or no learning will take place. Therefore, we don’t want to start with weight values that push some or all of the nodes into those saturation regions, as that network just won’t work very well. The sigmoid function behaves similarly, as can be observed in my vanishing gradient post.
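To put some numbers on the saturation effect, the short sketch below evaluates the tanh derivative, 1 - tanh(x)^2, at a few points. The helper name tanh_derivative is just for illustration.

```python
import math


def tanh_derivative(x):
    """Derivative of tanh: d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# The gradient is 1 at the origin and collapses quickly as |x| grows
for x in (0.0, 1.0, 2.0, 4.0):
    print("x = {:.1f}  tanh'(x) = {:.4f}".format(x, tanh_derivative(x)))
```

At x = 2 the derivative is already below 0.08, so any gradient passing back through such a node is scaled down by more than a factor of ten.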

A naive initialization of weights might be to simply use a normal distribution of mean zero and unit standard deviation (i.e. 1.0). Let’s consider how this might play out using a bit of simple statistical theory. Recall that the input to a neuron in the first layer of a neural network looks like:

$$in = X_1 W_1 + X_2 W_2 + X_3 W_3+ X_4 W_4 + \dots$$

The input, in other words, is the sum of each input multiplied by its respective weight. The variance (the square of the standard deviation) of each element in this sum is given by the law for the variance of a product of independent variables:

$$Var(X_i W_i) = [E(X_i)]^2 Var(W_i) + [E(W_i)]^2 Var(X_i) + Var(X_i)Var(W_i)$$

If we assume that the input has been appropriately scaled with a mean of 0 and a unit variance, and likewise we initialize the weights for a mean 0 and unit variance, then this results in:

$$Var(X_i W_i) = 0 \times 1 + 0 \times 1 + 1 \times 1 = 1$$

So each product within the total sum of in has a variance of 1. What is the total variance of the node input variable in? We can make the assumption that each product (i.e. each $X_i W_i$) is statistically independent (not quite correct for things like images, but close enough for our purposes) and then apply the sum of uncorrelated independent variables law:

$$Var(in) = \sum_{i=1}^n  Var(X_i W_i) = n \times 1 = n$$

Where n is the number of inputs. So here, we can observe that if there are, say, 784 inputs (equal to the input size of the MNIST problem), the variance will be large and the standard deviation will be $\sqrt{Var(in)} = \sqrt{784} = 28$. This will result in the vast majority of neurons in the first layer being saturated, as most values will have |in| >> 2 (i.e. they fall in the saturation regions of the functions).
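The result above can be checked with a quick Monte Carlo experiment. The sketch below is plain Python rather than TensorFlow, and assumes exactly what the derivation assumes: independent inputs and weights, each drawn from a normal distribution with zero mean and unit variance. The function name and sample count are illustrative choices.

```python
import math
import random
import statistics


def sample_node_input(n_inputs, weight_std, rng):
    """Draw one value of in = sum_i X_i * W_i with X_i ~ N(0, 1), W_i ~ N(0, weight_std^2)."""
    return sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, weight_std) for _ in range(n_inputs))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_inputs = 784
samples = [sample_node_input(n_inputs, 1.0, rng) for _ in range(1000)]

variance = statistics.pvariance(samples)
# Fraction of node inputs that land in the saturation region |in| > 2
saturated = sum(abs(s) > 2.0 for s in samples) / len(samples)

print("empirical variance: {:.1f} (theory: {})".format(variance, n_inputs))
print("empirical std dev:  {:.1f} (theory: {:.0f})".format(math.sqrt(variance), math.sqrt(n_inputs)))
print("fraction with |in| > 2: {:.3f}".format(saturated))
```

The empirical variance should land close to 784, with well over 90% of the sampled node inputs falling outside the range [-2, 2].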

Clearly this is not ideal, and so another way of initializing our weight variables is desirable.

Xavier or variance scaling for weight initialization

The Xavier method of weight initialization is a big improvement on the naive way of weight scaling shown in the section above, and it has helped accelerate the field of deep learning in a big way. It takes into account the problems shown above and bases the variance (and therefore the standard deviation) of the weight initialization on the number of connections involved, so it adjusts itself to the size of each layer. It works on the idea that if you can keep the variance constant from layer to layer in both the feed forward direction and the back-propagation direction, your network will learn more effectively. This makes sense: if the variance grows as you go through the layers, the inputs to your non-linear neurons will eventually be pushed into saturation in either the positive or negative direction, and if it shrinks, the signal fades towards zero.

So, how do we use this idea to work out what variance should be used to best initialize the weights? First, because the network will be learning effectively when it is operating in the linear regions of the tanh and sigmoid functions, the activation function can be approximated by a linear activation, i.e.:

$$ Y = W_{1} X_{1} + W_{2} X_{2} + W_{3} X_{3} + \dots $$

Therefore, with this linear activation function, we can use the same result that was arrived at above using the product of independent variables and sum of uncorrelated independent variables, namely:

$$ Var(Y) = n_{in} Var(W_i)Var(X_i)$$

Where $n_{in}$ is the number of inputs to each node. If we want the variance of the input ($Var(X_i)$) to be equal to the variance of the output ($Var(Y)$) this reduces to:

$$ Var(W_i) = \frac{1}{n_{in}} $$

Which is a preliminary result for a good initialization variance for the weights in your network. However, this is really just keeping the variance constant during the forward pass. What about trying to keep the variance constant also during back-propagation? It turns out that during back-propagation, to try to do this you need:

$$ n_{i+1} Var(W_i) = 1 $$

Here $n_{i+1}$ is the number of nodes in the next layer, i.e. the number of outputs $n_{out}$, so:

$$ Var(W_i) = \frac{1}{n_{out}} $$

Now there are two different ways of calculating the variance, one depending on the value of the number of inputs and the other on the number of outputs. The authors of the original paper on Xavier initialization take the average of the two:

$$ n_{avg} = \frac{n_{in}  + n_{out}}{2} $$

$$ Var(W_i) = \frac{1}{n_{avg}} = \frac {2}{n_{in} + n_{out}} $$
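To see what the averaging does in practice, the sketch below computes the Xavier variance for one layer and the resulting variance "gain" in each direction, using the relations derived above ($n_{in} Var(W_i)$ forward, $n_{out} Var(W_i)$ backward). The layer sizes are illustrative assumptions, not values taken from the model in this post.

```python
def xavier_variance(n_in, n_out):
    """Averaged Xavier variance: Var(W) = 2 / (n_in + n_out)."""
    return 2.0 / (n_in + n_out)


n_in, n_out = 784, 100  # illustrative layer sizes (assumed)
var_w = xavier_variance(n_in, n_out)

forward_gain = n_in * var_w    # Var(Y) / Var(X) in the forward pass
backward_gain = n_out * var_w  # the corresponding ratio for back-propagation

print("Var(W) = {:.5f}".format(var_w))
print("forward gain = {:.3f}, backward gain = {:.3f}".format(forward_gain, backward_gain))
```

When $n_{in} = n_{out}$ both gains are exactly 1. Otherwise one sits above 1 and the other below, and the two always sum to 2, which is the compromise the averaging makes.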

That is the final result in the Xavier initialization of weights for squashing activation functions i.e. tanh and sigmoid. However, it turns out this isn’t quite as optimal for ReLU functions.

ReLU activations and the He initialization

Consider the ReLU function: for all values less than zero, the output of the activation function is also zero. For values greater than zero, the ReLU function simply returns its input. In other words, half of the output is linear, like the assumption made in the analysis above, so that’s easy. However, for the other half of the inputs, for input values < 0, the output is zero. If we assume that the inputs to the ReLU neurons are approximately centered about 0, then, roughly speaking, half the variance will be in line with the Xavier initialization result, and the other half will be 0.

This is basically equivalent to halving the number of input nodes. So if we return to our Xavier calculations, but with half the number of input nodes, we have:

$$ Var(Y) = \frac{n_{in}}{2} Var(W_i)Var(X_i) $$

Again, if we want the variance of the input ($Var(X_i)$) to be equal to the variance of the output ($Var(Y)$) this reduces to:

$$ Var(W_i) = \frac{2}{n_{in}} $$

This is He initialization, and this initialization has been found to generally work better with ReLU activation functions.
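The factor of two can be checked numerically. The sketch below is a plain Python simulation under simplifying assumptions: the values fed into the ReLU are independent draws from a zero-mean, unit-variance normal distribution, and a single output node is sampled with fresh weights each time. It compares a weight variance of $1/n_{in}$ with the He value of $2/n_{in}$. The fan-in of 256 and the function names are illustrative.

```python
import math
import random
import statistics


def relu(x):
    return max(0.0, x)


def layer_output_variance(n_in, weight_variance, n_samples, rng):
    """Estimate Var(Y) for Y = sum_i W_i * relu(X_i), with X_i ~ N(0, 1)."""
    weight_std = math.sqrt(weight_variance)
    outputs = []
    for _ in range(n_samples):
        y = sum(rng.gauss(0.0, weight_std) * relu(rng.gauss(0.0, 1.0)) for _ in range(n_in))
        outputs.append(y)
    return statistics.pvariance(outputs)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_in = 256  # illustrative fan-in (assumed)

var_one_over_n = layer_output_variance(n_in, 1.0 / n_in, 1000, rng)
var_he = layer_output_variance(n_in, 2.0 / n_in, 1000, rng)

print("Var(W) = 1/n_in -> Var(Y) = {:.3f}".format(var_one_over_n))
print("Var(W) = 2/n_in -> Var(Y) = {:.3f}".format(var_he))
```

With $Var(W_i) = 1/n_{in}$ the output variance comes out near 0.5, i.e. the ReLU has halved it, whereas the He value brings it back to roughly 1.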

Now that we’ve reviewed the theory, let’s get to the code.

Weight initialization in TensorFlow

This section will show you how to initialize weights easily in TensorFlow. The full code can be found on this site’s GitHub page. Performing Xavier and He initialization in TensorFlow is straightforward using tf.contrib.layers.variance_scaling_initializer. By adjusting the available parameters, we can create either Xavier, He or other types of modern weight initializations. In this TensorFlow example, I’ll be creating a simple MNIST classifier using TensorFlow’s packaged MNIST dataset, with a simple three layer fully connected neural network architecture. I’ll also be logging various quantities so that we can visualize the variance, activations and so on in TensorBoard.

First, we define a Model class to hold the neural network model:

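import tensorflow as tf  # TensorFlow 1.x (1.8 or later)


class Model(object):
    def __init__(self, input_size, label_size, initialization, activation, num_layers=3,
                 hidden_size=100):  # default value assumed; the original signature is truncated
        self._input_size = input_size
        self._label_size = label_size
        # weight initializer and activation function used by the hidden layers
        self._init = initialization
        self._activation = activation
        # num layers does not include the input layer
        self._num_layers = num_layers
        self._hidden_size = hidden_size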

The above code is the class initialization function – notice that various initialization and activation functions can be passed to the model. Later on, we’ll cycle through different weight initialization and activation functions and see how they perform.

In the next section, I define the model creation function inside the Model class:

    def _model_def(self):
        # create placeholder variables
        self.input_images = tf.placeholder(tf.float32, shape=[None, self._input_size])
        self.labels = tf.placeholder(tf.float32, shape=[None, self._label_size])
        # create self._num_layers dense layers as the model
        input = self.input_images
        tf.summary.scalar("input_var", self._calculate_variance(input))
        for i in range(self._num_layers - 1):
            input = tf.layers.dense(input, self._hidden_size, kernel_initializer=self._init,
                                    activation=self._activation, name='layer{}'.format(i+1))
            # get the input to the nodes (sans bias)
            mat_mul_in = tf.get_default_graph().get_tensor_by_name("layer{}/MatMul:0".format(i + 1))
            # log pre and post activation function histograms
            tf.summary.histogram("mat_mul_hist_{}".format(i + 1), mat_mul_in)
            tf.summary.histogram("fc_out_{}".format(i + 1), input)
            # also log the variance of mat mul
            tf.summary.scalar("mat_mul_var_{}".format(i + 1), self._calculate_variance(mat_mul_in))
        # don't supply an activation for the final layer - the loss definition will
        # supply softmax activation. This defaults to a linear activation i.e. f(x) = x
        logits = tf.layers.dense(input, 10, name='layer{}'.format(self._num_layers))
        mat_mul_in = tf.get_default_graph().get_tensor_by_name("layer{}/MatMul:0".format(self._num_layers))
        tf.summary.histogram("mat_mul_hist_{}".format(self._num_layers), mat_mul_in)
        # log the output of the final layer (the logits), not the previous layer's output
        tf.summary.histogram("fc_out_{}".format(self._num_layers), logits)
        # use softmax cross entropy with logits - no need to apply softmax activation to
        # logits
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
                                                                              labels=self.labels))
        # add the loss to the summary
        tf.summary.scalar('loss', self.loss)
        self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
        self.accuracy = self._compute_accuracy(logits, self.labels)
        tf.summary.scalar('acc', self.accuracy)
        self.merged = tf.summary.merge_all()
        self.init_op = tf.global_variables_initializer()

I’ll step through the major points in this function. First, there are the usual placeholders to hold the training input and output data (if you’re unfamiliar with the basics of TensorFlow, check out my introductory tutorial here). Then, a scalar called “input_var” is logged, which records the variance of the input images as calculated by the _calculate_variance function; this function will be presented later (if TensorFlow logging and visualization is unfamiliar to you, check out my TensorFlow visualization tutorial). The next step is a loop through the layers, where I have used the TensorFlow layers API, which allows us to create densely connected layers easily. Notice that the kernel_initializer argument is what will initialize the weights of the layer, and activation is the activation function which the layer neurons will use.

Next, I access the values of the matrix multiplication between the weights and inputs for each layer, and log them. This way we can observe what the values of the inputs to each neuron are, and the variance of these inputs. We log these values as histograms. Finally, within the layer loop, the variance of the matrix multiplication input is also logged as a scalar.

The remainder of this model construction function is all the standard TensorFlow operations which define the loss, the optimizer and variable initialization, and also some additional logging of variables. The next function to take notice of within the Model class is the _calculate_variance function – it looks like:

    def _calculate_variance(self, x):
        mean = tf.reduce_mean(x)
        sqr = tf.square(x - mean)
        return tf.reduce_mean(sqr)

The function above is just a simple calculation of the variance of x.
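Because tf.reduce_mean is called without an axis argument, the mean is taken over every element of the tensor, so the result is a single population variance across the whole batch. A plain Python equivalent, checked against the standard library on some made-up values, might look like this:

```python
import statistics


def calculate_variance(values):
    """Population variance: the mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [0.0, 0.5, 1.0, 0.25, 0.0]  # made-up values for illustration

print(calculate_variance(data))
print(statistics.pvariance(data))  # should agree with the line above
```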

The main code block creates a list of various scenarios to run through, each with a different folder name in which to store the results, a different weight initialization function and finally a different activation function to supply to the neurons. The main training / analysis loop first runs a single batch of data through the network to examine initial variances. Thereafter it performs a full training run of the network so that performance indicators can be analysed.

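if __name__ == "__main__":
    # NOTE: parts of this listing were truncated; lines marked "assumed" are reconstructions
    sub_folders = ['first_pass_normal', 'first_pass_variance',
                   'full_train_normal', 'full_train_variance',
                   'full_train_normal_relu', 'full_train_variance_relu',
                   'full_train_he_relu']  # assumed name for the He / ReLU scenario
    # one initializer per scenario, in the same order as sub_folders
    # (the alternating normal / variance-scaling order is assumed from the folder names)
    initializers = [tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_IN', uniform=False)]
    activations = [tf.sigmoid, tf.sigmoid, tf.sigmoid, tf.sigmoid, tf.nn.relu, tf.nn.relu, tf.nn.relu]
    assert len(sub_folders) == len(initializers) == len(activations)
    for i in range(len(sub_folders)):
        # start from an empty graph so layer names do not clash between scenarios (assumed step)
        tf.reset_default_graph()
        model = Model(784, 10, initializers[i], activations[i])
        if "first_pass" in sub_folders[i]:
            # single batch only, to inspect the initial variances
            init_pass_through(model, sub_folders[i])
        else:
            # full training run
            train_model(model, sub_folders[i], 30, 1000)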

The most important thing to consider in the code above is the Xavier and He weight initialization definitions. The function used to create these is the tf.contrib.layers.variance_scaling_initializer which allows us to create weight initializers which are based on the number of input and output connections in order to execute the Xavier and He initialization discussed previously.

The three arguments used in this function are:

  • The factor argument, which is a multiplicative factor that is applied to the scaling. This is 1.0 for Xavier weight initialization, and 2.0 for He weight initialization
  • The mode argument: this defines which is on the denominator of the variance calculation. If ‘FAN_IN’, the variance scaling is based solely on the number of inputs to the node. If ‘FAN_OUT’ it is based solely on the number of outputs. If it is ‘FAN_AVG’, it is based on an averaging calculation, i.e. Xavier initialization. For He initialization, use ‘FAN_IN’
  • The uniform argument: this defines whether to use a uniform distribution or a normal distribution to sample the weights from during initialization. For both Xavier and He weight initialization, you can use a normal distribution, so set this argument to False
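The way factor and mode combine can be sketched in a few lines of plain Python. The helper below, variance_scaling_variance, is a hypothetical function written for this post: it mirrors the description in the list above (a target variance of factor / n), not the library's source code, and it does not model how TensorFlow actually samples the weights. The layer sizes are illustrative.

```python
def variance_scaling_variance(fan_in, fan_out, factor=1.0, mode="FAN_AVG"):
    """Target weight variance factor / n, where n depends on mode (hypothetical helper)."""
    if mode == "FAN_IN":
        n = fan_in
    elif mode == "FAN_OUT":
        n = fan_out
    elif mode == "FAN_AVG":
        n = (fan_in + fan_out) / 2.0
    else:
        raise ValueError("unknown mode: {}".format(mode))
    return factor / n


fan_in, fan_out = 784, 100  # illustrative layer sizes (assumed)

xavier = variance_scaling_variance(fan_in, fan_out, factor=1.0, mode="FAN_AVG")
he = variance_scaling_variance(fan_in, fan_out, factor=2.0, mode="FAN_IN")

print("Xavier: Var(W) = {:.5f}  (2 / (n_in + n_out))".format(xavier))
print("He:     Var(W) = {:.5f}  (2 / n_in)".format(he))
```

With factor=1.0 and 'FAN_AVG' this reproduces the Xavier variance $2 / (n_{in} + n_{out})$, and with factor=2.0 and 'FAN_IN' it reproduces the He variance $2 / n_{in}$.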

The other weight initialization function used in the scenarios is the tf.random_normal_initializer with default parameters. The default parameters for this initializer are a mean of zero, and a unit (i.e. 1.0) standard deviation / variance.

After running this code, a number of interesting results are obtained.

Visualizing the TensorFlow model variables

The first thing that we want to look at is the “first pass” model results, where only one batch is passed through the model. If we look at the distribution of inputs into the first layer in TensorBoard, with our naive normally distributed weight values with a unit variance, we can see the following (if TensorBoard visualization is unfamiliar to you, check out my TensorFlow visualization tutorial):


First pass distribution of inputs to the first layer

As can be observed, the matrix multiplication input into the first layer is approximately normally distributed, with a standard deviation of around 10. If you recall, the variance scalar of the matrix multiplication input was also logged, and it gives a value of approximately 88. Does this make sense? I mentioned earlier that with 784 inputs (i.e. the input size of the MNIST dataset), we should expect a variance of approximately 784. What explains this discrepancy? Well, remember I also logged the variance of the input data: it turns out that the MNIST TensorFlow dataset has a variance of 0.094, whereas we assumed a unit variance in the calculations previously shown. In this case, then, we should expect a variance of (remember that $Var(W_i)$, for the normal distribution initializer we are currently considering, is equal to 1.0):

$$Var(in) = \sum_{i=1}^n Var(X_i)Var(W_i) = n Var(X_i)Var(W_i) = 784 \times 0.094 \times 1 \approx 74$$

This is roughly in line with the observed variance – so we can be happy that we are on the right track. The distribution shown above is the distribution into the first layer neurons. In the first set of scenarios, we’re using a sigmoid activation function – so what does the first layer output distribution look like for this type of input distribution?
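One plausible reason for the remaining gap between 74 and 88 is that the calculation above drops the $[E(X_i)]^2 Var(W_i)$ term of the product law, which only vanishes when the inputs have zero mean. The sketch below applies the full product law. The mean pixel value of 0.13 is an assumption made for illustration (it was not logged in this experiment), so treat the second number as indicative only.

```python
def product_variance(mean_x, var_x, mean_w, var_w):
    """Var(X * W) for independent X and W (product of independent variables law)."""
    return mean_x ** 2 * var_w + mean_w ** 2 * var_x + var_x * var_w


def node_input_variance(n, mean_x, var_x, mean_w=0.0, var_w=1.0):
    """Variance of the sum of n independent products X_i * W_i."""
    return n * product_variance(mean_x, var_x, mean_w, var_w)


# Inputs treated as zero-mean, as in the calculation above
zero_mean = node_input_variance(784, mean_x=0.0, var_x=0.094)
# Same input variance, but with an assumed non-zero mean pixel value of 0.13
nonzero_mean = node_input_variance(784, mean_x=0.13, var_x=0.094)

print("zero-mean inputs:     {:.1f}".format(zero_mean))
print("non-zero-mean inputs: {:.1f}".format(nonzero_mean))
```

Under that assumption the predicted variance rises from about 74 to about 87, which is much closer to the logged value of approximately 88.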


Distribution of outputs from first layer – sigmoid activations and normal weight initialization

As can be observed, the input distribution with such a relatively large variance completely saturates the first layer – with the output distribution being squeezed to the saturated regions of the sigmoid curve i.e. outputs close to 0 and 1 (we’d observe the same thing with a tanh activation). This confirms our previous analysis of the problems with a naive normally distributed weight initialization.

What happens when we use the Xavier initialization configuration of the variance scaling initializer? The plot below shows the same distribution of outputs:


Distribution of outputs from first layer – sigmoid activations and Xavier weight initialization

As can be observed, this is a very satisfactory distribution, with the output values centered around the linear region of the sigmoid function (i.e. 0.5) and no saturation occurring. This better initialization also results in better training outcomes. The figure below shows the accuracy comparison between the normally initialized weight distribution and the Xavier initialized weight distribution, for the full training run scenario:


Accuracy comparison between normal (red) and Xavier initialization (light blue) – sigmoid activation

As can be observed, Xavier initialization results in better training performance, as we should expect.

The next thing to compare is the performance of normal weight initialization, Xavier initialization and He initialization for a ReLU activation function. The plot below shows the accuracy comparison during training between the three initialization techniques:


Accuracy comparison for ReLU activation functions and normal (red), Xavier (green) and He (grey) weight initialization

As can be observed, the model performance is significantly greater for Xavier and He weight initialization than for the normal initialization on a ReLU network. There is little clear difference between the Xavier and He initialization, but a better average performance should be expected from He initialization for more complicated networks and problems that use a ReLU activation function.

There you have it – you should now hopefully understand the drawbacks of naive, normally distributed weight initialization, and you should also understand the basics of how Xavier and He initialization work, and their performance benefits. You should also understand how to easily use such initialization methods in TensorFlow. I hope this helps you build better performing models for both sigmoid/tanh and ReLU networks.
