Bayes Theorem, maximum likelihood estimation and TensorFlow Probability

A growing trend in deep learning (and machine learning in general) is a probabilistic or Bayesian approach to the problem. Why is this? Simply put – a standard deep learning model produces a prediction, but with no statistically robust understanding of how confident the model is in the prediction. This is important in the understanding of the limitations of model predictions, and also if one wants to do probabilistic modeling of any kind. There are also other applications, such as probabilistic programming and being able to use domain knowledge, but more on that in another post. The TensorFlow developers have addressed this problem by creating TensorFlow Probability. This post will introduce some basic Bayesian concepts, specifically the likelihood function and maximum likelihood estimation, and how these can be used in TensorFlow Probability for the modeling of a simple function.

The code contained in this tutorial can be found on this site’s Github repository.



Bayes theorem and maximum likelihood estimation

Bayes theorem is one of the most important statistical concepts a machine learning practitioner or data scientist needs to know. In the machine learning context, it can be used to estimate the model parameters (e.g. the weights in a neural network) in a statistically robust way. It can also be used in model selection, e.g. choosing which machine learning model is best suited to a given problem. I won’t be going in-depth into all the possible uses of Bayes theorem here, but I will be introducing the main components of the theorem.

Bayes theorem can be shown in a fairly simple equation involving conditional probabilities as follows:

$$P(\theta \vert D) = \frac{P(D \vert \theta) P(\theta)}{P(D)}$$

In this representation, the variable $\theta$ corresponds to the model parameters (i.e. the values of the weights in a neural network), and the variable $D$ corresponds to the data that we are using to estimate the $\theta$ values. Before I talk about what conditional probabilities are, I’ll just quickly point out three terms in this formula which are very important to familiarise yourself with, as they come up in the literature all the time. It is worthwhile memorizing what these terms refer to:

$P(\theta \vert D)$ – this is called the posterior

$P(D \vert \theta)$ – this is called the likelihood

$P(\theta)$ – this is called the prior

I’m going to explain what all these terms refer to shortly, but first I’ll make a quick detour to discuss conditional probability for those who may not be familiar. If you are already familiar with conditional probability, feel free to skip this section.

Conditional probability

Conditional probability is an important statistical concept that is thankfully easy to understand, as it forms a part of our everyday reasoning. Let’s say we have a random variable called RT which represents whether it will rain today – it is a discrete variable and can take on the value of either 1 or 0, denoting whether it will rain today or not. Let’s say we are in a fairly dry environment, and by consulting some long-term rainfall records we know that RT=1 about 10% of the time, and therefore RT=0 about 90% of the time. This fully represents the probability function for RT which can be written as P(RT). Therefore, we have some prior knowledge of what P(RT) is in the absence of any other determining factors.

Ok, so what does P(RT) look like if we know it rained yesterday? Is it the same or is it different? Well, let’s say the region we are in gets most of its rainfall from big weather systems that can last for days or weeks. In this case, we have good reason to believe that P(RT) will be different given the fact that it rained yesterday. The probability P(RT) is therefore now conditioned on another random variable, RY, which represents whether it rained yesterday. Conditional probability is written using the vertical bar symbol $\vert$, so the conditional probability that it will rain today given that it rained yesterday looks like the following: $P(RT=1 \vert RY = 1)$. Given yesterday’s rain, the probability that it will rain today is no longer 10%, but may rise to, say, 30%, so $P(RT=1 \vert RY = 1) = 0.3$.

We could also look at other probabilities, such as $P(RT=1 \vert RY = 0)$ or $P(RT=0 \vert RY = 1)$ and so on. To generalize this relationship we would just write $P(RT \vert RY)$.
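The rain example can be checked with a few lines of arithmetic. This is only a sketch: the values P(RT=1) = 0.1 and P(RT=1 | RY=1) = 0.3 come from the text above, while P(RY=1) = 0.1 is my own assumption (yesterday is taken to be as likely to be rainy as any other day). From these, the law of total probability gives the remaining conditional probability P(RT=1 | RY=0).

```python
# Values taken from the text
p_rt1 = 0.10            # P(RT=1): long-run chance of rain today
p_rt1_given_ry1 = 0.30  # P(RT=1 | RY=1)

# Assumption (not in the text): yesterday was as likely to be rainy as any day
p_ry1 = 0.10
p_ry0 = 1.0 - p_ry1

# Law of total probability:
# P(RT=1) = P(RT=1|RY=1) P(RY=1) + P(RT=1|RY=0) P(RY=0)
p_rt1_given_ry0 = (p_rt1 - p_rt1_given_ry1 * p_ry1) / p_ry0

# Joint probability that it rains on both days
p_both = p_rt1_given_ry1 * p_ry1

print(f"P(RT=1 | RY=0) = {p_rt1_given_ry0:.4f}")
print(f"P(RT=1, RY=1)  = {p_both:.4f}")
```

Under that assumption, a dry yesterday pulls the chance of rain today below the unconditional 10%, which is what we would expect if rain tends to arrive in multi-day systems.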

Now that you have an understanding of conditional probabilities, let’s move on to explaining Bayes Theorem (which contains two conditional probability functions) in more detail.

Bayes theorem in more detail

The posterior

Ok, so as I stated above, it is time to delve into the meaning of the individual terms of Bayes theorem. Let’s first look at the posterior term – $P(\theta \vert D)$. This term can be read as: given we have a certain dataset $D$, what is the probability of our parameters $\theta$? This is the term we want to maximize when varying the parameters of a model according to a dataset – by doing so, we find those parameters $\theta$ which are most probable given the model we are using and the training data supplied. The posterior is on the left-hand side of the equation of Bayes Theorem, so if we want to maximize the posterior we can do this by maximizing the right-hand side of the equation.

Let’s have a look at the terms on the right-hand side.

The likelihood

The likelihood is expressed as $P(D \vert \theta)$ and can be read as: given this parameter $\theta$, which defines some process of generating data, what is the probability we would see this given set of data $D$? Let’s say we have a scattering of data-points. A good example might be the heights of all the members of a classroom full of kids. We can define a model that we assume is able to generate or represent this data, and in this case the Normal distribution is a good choice. The parameters that we are trying to determine in the Normal distribution are the tuple ($\mu$, $\sigma$): the mean and standard deviation of the Normal distribution.

So the likelihood $P(D \vert \theta)$ in this example is the probability of seeing this sample of measured heights given different values of the mean and standard deviation of the Normal distribution function. There is some more mathematical precision needed here (such as the difference between a probability distribution and a probability density function, discrete samples etc.) but this is ok for our purposes of coming to a conceptual understanding.

I’ll come back to the concept of the likelihood shortly when we discuss maximum likelihood estimation, but for now, let’s move onto the prior.

The prior

The prior probability $P(\theta)$, as can be observed, is not a conditioned probability distribution. It is simply a representation of the probability of the parameters prior to any other consideration of data or evidence. You may be puzzled as to what the point of this probability is. In the context of machine learning or probabilistic programming, its purpose is to enable us to specify some prior understanding of what the parameters should be, in the form of a probability distribution they are assumed to be drawn from.

Let’s return to the example of the heights of kids in a classroom. Let’s say the teacher is a pretty good judge of heights, and can therefore come to the problem with a rough prior estimate of the mean height, guessing that the average is around 130cm. The teacher can then put a prior on the mean parameter $\mu$ of, say, a Normal distribution with a mean of 130cm.

The presence of the prior in the Bayes theorem allows us to introduce expert knowledge or prior beliefs into the problem, which aids the finding of the optimal parameters $\theta$. These prior beliefs are then updated by the data collected $D$ – with the updating occurring through the action of the likelihood function.

The graph below is an example of the evolution of a prior distribution function exposed to some set of data:

The evolution of the prior distribution towards the evidence / data

Here we can see that, through the application of the Bayes Theorem, we can start out with a certain set of prior beliefs in the form of a prior distribution function, but by applying the evidence or data through the likelihood $P(D \vert \theta)$, the posterior estimate $P(\theta \vert D)$ moves closer to “reality”.

The data

The final term in Bayes Theorem is the unconditioned probability distribution of the process that generated the data $P(D)$. In machine learning applications, this distribution is often unknown – but thankfully, it doesn’t matter. This distribution acts as a normalization constant and has nothing to say about the parameters we are trying to estimate $\theta$. Therefore, because we are trying to simply maximize the right-hand side of the equation, it drops out of any derivative calculation that is made in order to find the maximum. So in the context of machine learning and estimating parameters, this term can be safely ignored. Given this understanding, the form of Bayes Theorem that we are mostly interested in for machine learning purposes is as follows:

$$P(\theta \vert D) \propto P(D \vert \theta) P(\theta)$$

Given this formulation, all we are concerned with is either maximizing the right-hand side of the equation or simulating sampling from the posterior itself (not covered in this post).
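To make the update from prior to posterior concrete, here is a minimal sketch in plain Python that evaluates the prior times the likelihood on a grid of candidate mean heights and then normalizes. Everything numeric in it is an assumption for illustration: the list of heights is made up, the spread of heights is treated as known (6cm), and the teacher’s prior is given a standard deviation of 5cm around the 130cm guess.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical classroom heights in cm (made up for illustration)
heights = [138.0, 142.5, 135.0, 140.0, 144.0, 137.5, 141.0, 139.0]
sigma_known = 6.0  # assumed known standard deviation of heights

# Teacher's prior belief about the mean height: Normal(130, 5) (assumed)
prior_mu, prior_sigma = 130.0, 5.0

# Candidate values for the mean: 110.0 cm to 170.0 cm in 0.1 cm steps
grid = [110.0 + 0.1 * i for i in range(601)]

def grid_posterior(data):
    """Return normalized posterior weights over `grid` for the given data."""
    log_post = []
    for mu in grid:
        log_prior = math.log(normal_pdf(mu, prior_mu, prior_sigma))
        log_lik = sum(math.log(normal_pdf(h, mu, sigma_known)) for h in data)
        log_post.append(log_prior + log_lik)
    top = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)  # plays the role of the normalization constant P(D)
    return [w / total for w in weights]

def grid_mean(weights):
    """Mean of the candidate values under the given weights."""
    return sum(mu * w for mu, w in zip(grid, weights))

for n in (0, 2, len(heights)):
    post = grid_posterior(heights[:n])
    print(f"after {n} observations: posterior mean = {grid_mean(post):.1f} cm")
```

With no observations the posterior is just the prior, and as more heights are fed in the posterior mean moves away from the 130cm guess towards the average of the data. The normalization step only rescales the weights, which is why it can be ignored when all we want is the location of the maximum.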

How to estimate the posterior

Now that we have reviewed conditional probability concepts and Bayes Theorem, it is time to consider how to apply Bayes Theorem in practice to estimate the best parameters in a machine learning problem. There are a number of ways of estimating the posterior of the parameters in a machine learning problem. These include maximum likelihood estimation, maximum a posteriori (MAP) estimation, simulating the sampling from the posterior using Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling, and so on. In this post, I will just be considering maximum likelihood estimation (MLE), with other methods being considered in future content on this site.

Maximum likelihood estimation (MLE)

What happens if we just throw our hands up in the air with regards to the prior $P(\theta)$ and say we don’t know anything about the best parameters to describe the data? In that case, the prior becomes a uniform or un-informative prior: $P(\theta)$ becomes a constant (the same probability no matter what the parameter values are), and Bayes Theorem reduces to:

$$P(\theta \vert D) \propto P(D \vert \theta)$$

If this is the case, all we have to do is maximize the likelihood $P(D \vert \theta)$ and by doing so we will also find the maximum of the posterior – i.e. the parameter with the highest probability given our model and data – or, in short, an estimate of the optimal parameters. If we have a way of calculating $P(D \vert \theta)$ while varying the parameters $\theta$, we can then feed this into some sort of optimizer to calculate:

$$\underset{\theta}{\operatorname{argmax}} P(D \vert \theta)$$
 
Nearly always, instead of maximizing $P(D \vert \theta)$ the log of $P(D \vert \theta)$ is maximized. Why? If we were doing the calculations by hand, we would need to calculate the derivative of the product of multiple exponential functions (as probability functions like the Normal distribution have exponentials in them) which is tricky. Because logs are monotonically increasing functions, they have maximums at the same point as the non-log function. So in other words, the maximum likelihood will occur at the same parameter value as the maximum of the log likelihood. By taking the log of the likelihood, products turn into sums and this makes derivative calculations a whole lot easier.
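
Written out, and assuming the $N$ data points $x_i$ that make up $D$ are independent of each other (an assumption I am adding here, as it is what allows the likelihood to be written as a product in the first place), the log turns the product of per-point likelihoods into a sum:

$$\ln P(D \vert \theta) = \ln \prod_{i=1}^{N} P(x_i \vert \theta) = \sum_{i=1}^{N} \ln P(x_i \vert \theta)$$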
 
Finally, some optimizers in machine learning packages such as TensorFlow only minimize loss functions, so to maximize the log likelihood we need to flip its sign and minimize that instead. In that case, for maximum likelihood estimation, we would minimize the negative log likelihood, or NLL, and get the same result.
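
Before bringing in TensorFlow Probability, the idea can be sketched in plain Python using the classroom heights example. The heights below are made up for illustration. For a Normal distribution the maximum likelihood estimates have a well-known closed form (the sample mean, and the square root of the average squared deviation from it), so the sketch computes those directly and then checks that nudging either parameter makes the NLL larger.

```python
import math

def normal_log_pdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

def neg_log_likelihood(data, mu, sigma):
    """NLL of the data, assuming independent Normal(mu, sigma) samples."""
    return -sum(normal_log_pdf(x, mu, sigma) for x in data)

# Hypothetical heights in cm (made up for illustration)
heights = [138.0, 142.5, 135.0, 140.0, 144.0, 137.5, 141.0, 139.0]

# Closed-form maximum likelihood estimates for a Normal distribution
mu_mle = sum(heights) / len(heights)
sigma_mle = math.sqrt(sum((h - mu_mle) ** 2 for h in heights) / len(heights))

best = neg_log_likelihood(heights, mu_mle, sigma_mle)
print(f"mu = {mu_mle:.3f}, sigma = {sigma_mle:.3f}, NLL = {best:.3f}")

# Moving either parameter away from the estimate increases the NLL
for d_mu, d_sigma in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    other = neg_log_likelihood(heights, mu_mle + d_mu, sigma_mle + d_sigma)
    print(f"shift mu by {d_mu:+.1f}, sigma by {d_sigma:+.1f}: NLL = {other:.3f}")
```

An optimizer does the same job without needing the closed form: it keeps adjusting the parameters in whichever direction lowers the NLL.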
 
Let’s look at a simple example of maximum likelihood estimation by using TensorFlow Probability.

TensorFlow Probability and maximum likelihood estimation

For the simple example of maximum likelihood estimation that is to follow, TensorFlow Probability is overkill – however, TensorFlow Probability is a great extension of TensorFlow into the statistical domain, so it is worthwhile introducing MLE by utilizing it. The Jupyter Notebook containing this example can be found at this site’s Github repository. Note this example is loosely based on the TensorFlow tutorial found here. In this example, we will be estimating linear regression parameters based on noisy data. These parameters can obviously be solved using analytical techniques, but that isn’t as interesting. First, we import some libraries and generate the noisy data:

import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pyplot as plt
tfd = tfp.distributions

x_range = np.arange(0, 10, 0.1)
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + np.random.normal(0, 3.0, len(x_range)) + intercept

Plotting our noisy regression line looks like the following:

Noisy regression line

Next, let’s set up our little model to predict the underlying regression function from the noisy data:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(1),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x, scale=1)),
])

So here we have a simple Keras sequential model (for more detail on Keras and TensorFlow, see this post). The first layer is a Dense layer with one node. Given each Dense layer has one bias input by default – this layer equates to generating a simple line with a gradient and intercept: $xW + b$ where x is the input data, W is the single input weight and b is the bias weight. So the first Dense layer produces a line with a trainable gradient and y-intercept value.

The next layer is where TensorFlow Probability comes in. This layer allows you to create a parameterized probability distribution, with the parameter being “fed in” from the output of the previous layers. Here, the lambda’s argument x, which is the output of the previous layer, defines the mean of a Normal distribution, while the scale (i.e. the standard deviation) is fixed to 1.0. So, using TensorFlow Probability, our model no longer predicts a single value for each input (as in a non-probabilistic neural network); instead, the output is a Normal distribution. To actually predict values, we need to call statistical functions on the output of the model. For instance:

  • model(np.array([[1.0]])).sample(10) will produce a random sample of 10 outputs from the Normal distribution, parameterized by the input value 1.0 fed through the first Dense layer
  • model(np.array([[1.0]])).mean() will produce the mean of the distribution, given the input
  • model(np.array([[1.0]])).stddev() will produce the standard deviation of the distribution, given the input

and so on. We can also calculate the log probability of the output distribution, as will be discussed shortly. Next, we need to set up our “loss” function – in this case, our “loss” function is actually just the negative log likelihood (NLL):

def neg_log_likelihood(y_actual, y_predict):
  return -y_predict.log_prob(y_actual)

In the above, the y_actual values are the actual noisy training samples. The y_predict value is not a tensor of plain numbers but a batch of parameterized Normal probability distributions, one for each training input. So, for instance, if one training input is 5.0, the corresponding y_predict entry will be a Normal distribution with a mean value of, say, 12. Another training input may have a value of 10.0, and the corresponding y_predict entry will be a Normal distribution with a mean value of, say, 20, and so on. Therefore, for each y_predict and y_actual pair, it is possible to calculate the log probability of that actual value occurring given the predicted Normal distribution.

To make this more concrete – let’s say for a training input value 5.0, the corresponding actual noisy regression value is 8.0. However, let’s say the predicted Normal distribution has a mean of 10.0 (and a fixed variance of 1.0). Using the formula for the log probability / log likelihood of a Normal distribution:

$$\ell_x(\mu,\sigma^2) = - \ln \sigma - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{x-\mu}{\sigma} \Big)^2$$

Substituting in the example values mentioned above:

$$\ell_x(10.0,1.0) = - \ln 1.0 - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{8.0-10.0}{1.0} \Big)^2$$

We can calculate the log likelihood from the y_predict distribution and the y_actual values. Of course, TensorFlow Probability does this for us by calling the log_prob method on the y_predict distribution. Taking the negative of this calculation, as I have done in the function above, gives us the negative log likelihood value that we need to minimize to perform MLE.
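
As a quick sanity check, the substitution above can be evaluated in plain Python without TensorFlow Probability. The helper name normal_log_prob is my own; it simply mirrors the three terms of the formula.

```python
import math

def normal_log_prob(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x, term by term as in the formula."""
    return (-math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * ((x - mu) / sigma) ** 2)

# Actual value 8.0, predicted mean 10.0, fixed standard deviation 1.0
log_lik = normal_log_prob(x=8.0, mu=10.0, sigma=1.0)
nll = -log_lik

print(f"log likelihood = {log_lik:.4f}")            # about -2.9189
print(f"negative log likelihood = {nll:.4f}")       # about 2.9189
```

The first term is zero because the standard deviation is 1, the second is a constant of roughly 0.919, and the third contributes 2.0 from the squared distance between the actual value and the predicted mean. Only that last term changes during training here, which is why a prediction closer to 8.0 would give a lower loss.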

After the loss function, it is now time to compile the model, train it, and make some predictions:

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.05), loss=neg_log_likelihood)
model.fit(x_range, lin_reg, epochs=500, verbose=False)

yhat = model(x_range)
mean = yhat.mean()

As can be observed, the model is compiled using our custom neg_log_likelihood function as the loss. Because this is just a toy example, I am using the full dataset as both the train and test set. The estimated regression line is simply the mean of all the predicted distributions, and plotting it produces the following:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

TensorFlow Probability based regression using maximum likelihood estimation
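
As mentioned earlier, this particular problem can also be solved analytically. With the scale fixed at 1, the negative log likelihood of the Normal distribution is, up to an additive constant, half the sum of squared residuals, so minimizing it amounts to ordinary least squares. The sketch below is a rough cross-check under my own assumptions: it regenerates similar noisy data with the standard library’s random module and a fixed seed (so the numbers will not match the NumPy data used above) and applies the closed-form least-squares formulas.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (my choice)

grad, intercept = 2.0, 3.0
xs = [0.1 * i for i in range(100)]  # same grid of inputs as x_range
ys = [grad * x + intercept + random.gauss(0.0, 3.0) for x in xs]

# Closed-form least-squares estimates of the gradient and intercept
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
grad_hat = s_xy / s_xx
intercept_hat = y_bar - grad_hat * x_bar

print(f"estimated gradient  = {grad_hat:.3f} (true value {grad})")
print(f"estimated intercept = {intercept_hat:.3f} (true value {intercept})")
```

The trained Dense layer’s weight and bias should land close to these closed-form values when run on the same data, since both are minimizing the same quantity.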

Another example with changing variance

Another, more interesting, example is to use the model to predict not only the mean but also the changing variance of a dataset. In this example, the dataset consists of the same trend but the noise variance increases along with the values:

def noise_sample(x, grad=0.5, const=2.0):
  # The standard deviation of the noise grows linearly with x
  return np.random.normal(0, grad * x + const)

x_range = np.arange(0, 10, 0.1)
noise = np.array([noise_sample(x) for x in x_range])
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + intercept + noise

plt.scatter(x_range, lin_reg)
plt.show()

Linear regression with increasing noise variance

The new model looks like the following:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(2),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x[:, 0], scale=1e-3 + tf.math.softplus(0.3 * x[:, 1]))),
])

In this case, we have two nodes in the first layer, so that the model can predict both the mean and the standard deviation of the Normal distribution, instead of just the mean as in the last example. The mean of the distribution is assigned to the output of the first node (x[:, 0]) and the standard deviation / scale is set to a softplus function of the output of the second node (x[:, 1]), which keeps it positive. After training this model on the new data, using the same loss as in the previous example, we can predict both the mean and standard deviation of the model like so:

# Predict with the newly trained model
yhat = model(x_range)
mean = yhat.mean()
upper = mean + 2 * yhat.stddev()
lower = mean - 2 * yhat.stddev()

In this case, the upper and lower variables are the 2-standard deviation upper and lower bounds of the predicted distributions. Plotting this produces:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.fill_between(x_range, lower, upper, alpha=0.1)
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

Regression prediction with increasing variance
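
The scale expression used in this model, 1e-3 + softplus(0.3 * x), is worth a quick look in isolation. The sketch below re-implements softplus, $\ln(1 + e^z)$, in plain Python rather than calling tf.math.softplus, and the helper names are my own. It shows that whatever raw value the second node produces, the resulting scale is strictly positive, which a standard deviation has to be.

```python
import math

def softplus(z):
    """Numerically stable softplus: ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def scale_from_output(raw):
    """Mirror of the scale expression in the model: 1e-3 + softplus(0.3 * raw)."""
    return 1e-3 + softplus(0.3 * raw)

for raw in (-50.0, -5.0, 0.0, 5.0, 50.0):
    print(f"raw output {raw:+6.1f} -> scale {scale_from_output(raw):.4f}")
```

For large positive inputs softplus is close to the identity, and for large negative inputs it approaches zero, with the small 1e-3 offset keeping the scale from collapsing all the way to zero.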

As can be observed, the model is successfully predicting the increasing variance of the dataset, along with the mean of the trend. This is a limited example of the power of TensorFlow Probability, but in future posts I plan to show how to develop more complicated applications like Bayesian Neural Networks. I hope this post has been useful for you in getting up to speed in topics such as conditional probability, Bayes Theorem, the prior, posterior and likelihood function, maximum likelihood estimation and a quick introduction to TensorFlow Probability. Look out for future posts expanding on the increasingly important probabilistic side of machine learning.

