Exploring Tensorflow Probability

An introduction

Paul Bruffett
Nov 3, 2021 · 4 min read

TensorFlow Probability (TFP) is a library we will be using to generate probability distributions and make predictions.

TFP uses the same underlying data constructs, Tensors, as standard TensorFlow, and as a result it also benefits from GPU acceleration. TFP has many features for generating data based on distributions and fitting statistical models.

Using TFP

This Introduction notebook contains the code below.

We’ll start by using TFP to generate some data, then demonstrate inferring the true probability that was used to generate the data.

TrueA and TrueB will be hidden from the TFP algorithm; we will use only the generated observations to attempt to infer them.
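A minimal sketch of what that generation step might look like (the true values, sample sizes, and variable names are illustrative assumptions, not necessarily the notebook's exact choices):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Illustrative hidden values; these are what inference will try to recover.
true_a, true_b = 0.55, 0.48

# Draw binary observations from each underlying Bernoulli distribution.
obs_a = tfd.Bernoulli(probs=true_a).sample(1500)
obs_b = tfd.Bernoulli(probs=true_b).sample(750)

# Cast to float so the observations can be scored against float probabilities later.
obs_a = tf.cast(obs_a, tf.float32)
obs_b = tf.cast(obs_b, tf.float32)
```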

Now we have some generated sample data, with observed means that probably already diverge slightly from the true means.

Let’s define the metric for success in order to evaluate our model or algorithm:

The double joint log probability function is used to assess how likely it is that the model generated the observed data. This function is how the algorithm evaluates the goodness of fit for its current parameters: input observations are evaluated against model outputs to determine the probability that the model generated the observed data. This function will look different for each use case.
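A sketch of such a function for the two hidden Bernoulli probabilities, assuming uniform priors (the priors and parameter names are assumptions):

```python
def double_joint_log_prob(obs_a, obs_b, prob_a, prob_b):
    """Log probability that candidate prob_a and prob_b generated the observations."""
    # Uniform prior over each unknown probability.
    rv_prob = tfd.Uniform(low=0., high=1.)

    # Likelihood of the observed 0/1 outcomes under the candidate probabilities.
    rv_obs_a = tfd.Bernoulli(probs=prob_a)
    rv_obs_b = tfd.Bernoulli(probs=prob_b)

    return (rv_prob.log_prob(prob_a)
            + rv_prob.log_prob(prob_b)
            + tf.reduce_sum(rv_obs_a.log_prob(obs_a))
            + tf.reduce_sum(rv_obs_b.log_prob(obs_b)))
```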

Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a family of algorithms for sampling from a probability distribution; we will use the Hamiltonian Monte Carlo (HMC) implementation.

MCMC will use our generated data, obsA and obsB, to tune the parameters of an algorithm that lets us sample from a distribution approximating our observations. The algorithm will explore the problem space, (hopefully) converging over the defined number of steps.

Burn-in discards the initial, unrefined guesses the algorithm produces while it is still converging.

Initial chain state is the starting point for the algorithm; we’ll start with the means of our pairs of observations.

Finally, we define the HMC kernel, which captures the target probability function (our double joint log probability function) and the step size controlling how large an adjustment HMC makes at each step.
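Putting those pieces together, the setup might look like this (the step counts, step size, leapfrog steps, and the Sigmoid bijectors that keep each probability inside (0, 1) are assumptions):

```python
number_of_steps = 30000   # how many posterior samples to keep
burnin = 5000             # early samples to discard

# Start the chain at the observed means of each group.
initial_chain_state = [tf.reduce_mean(obs_a), tf.reduce_mean(obs_b)]

# Close over the observations so the kernel only sees the parameters being sampled.
unnormalized_log_posterior = lambda prob_a, prob_b: double_joint_log_prob(
    obs_a, obs_b, prob_a, prob_b)

# Sigmoid bijectors keep each probability in (0, 1) while HMC explores freely.
hmc_kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=unnormalized_log_posterior,
        num_leapfrog_steps=2,
        step_size=0.5),
    bijector=[tfp.bijectors.Sigmoid(), tfp.bijectors.Sigmoid()])
```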

Running and Sampling

Now that we have defined the HMC kernel, we will set the parameters for running and sampling from the chain.

The output from running the chain will consist of 30,000 results for the distributions A and B in an array (samples).

kernel_results contains additional metadata about the run.
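A sketch of running the chain with tfp.mcmc.sample_chain (the tf.function wrapper and trace_fn choice are assumptions):

```python
@tf.function(autograph=False)
def run_chain():
    return tfp.mcmc.sample_chain(
        num_results=number_of_steps,
        num_burnin_steps=burnin,
        current_state=initial_chain_state,
        kernel=hmc_kernel,
        trace_fn=lambda _, pkr: pkr)  # keep the per-step kernel metadata

samples, kernel_results = run_chain()
posterior_a, posterior_b = samples   # 30,000 draws for each probability
```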

Let’s plot our results:
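A minimal plotting sketch, assuming matplotlib and the posterior sample names from the sketches above:

```python
import matplotlib.pyplot as plt

plt.hist(posterior_a.numpy(), bins=60, density=True, alpha=0.6, label='posterior of A')
plt.hist(posterior_b.numpy(), bins=60, density=True, alpha=0.6, label='posterior of B')
plt.xlabel('probability')
plt.legend()
plt.show()
```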

Note the narrower band for the alpha probability; more samples have allowed us to be more confident of its mean.

Applied to Titanic

Titanic Notebook with this implemented.

Let’s look at the Titanic dataset and examine how much passenger class influenced how likely a passenger was to survive the sinking.

We’re seeking two coefficients: beta, the slope, and alpha, our intercept.

Again, we’ll use MCMC to generate samples from our distribution; in this case we’ll sample to discern alpha and beta based on survival and passenger class.

pass_class and pass_survival will be used to infer alpha and beta.
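One way the data preparation might look, assuming a standard Titanic CSV with pclass and survived columns (the file name and column names are assumptions):

```python
import pandas as pd

titanic = pd.read_csv('titanic.csv')   # hypothetical path to the Titanic data

# Convert the columns the model needs into float tensors.
pass_class = tf.convert_to_tensor(titanic['pclass'].values, dtype=tf.float32)
pass_survival = tf.convert_to_tensor(titanic['survived'].values, dtype=tf.float32)
```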

Titanic Joint Log Probability

We take the inputs, including the estimates for alpha and beta, and calculate a probability, much like the previous example.

In this case, logistic_p includes a value for each of the survival readings; this is then used as the input to a Bernoulli distribution, which yields a discrete probability for each survival reading, producing a tensor that is X by 1309, as we have 1309 survival readings.

This then is summed into our log probability output:
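A sketch of what that function could look like, assuming wide normal priors on alpha and beta (the prior scales and names are assumptions):

```python
def titanic_joint_log_prob(pass_survival, pass_class, alpha, beta):
    """How plausible are these alpha and beta values given the observed survival outcomes?"""
    # Wide normal priors on the intercept and slope.
    rv_alpha = tfd.Normal(loc=0., scale=10.)
    rv_beta = tfd.Normal(loc=0., scale=10.)

    # Logistic regression: one survival probability per passenger.
    logistic_p = tf.sigmoid(alpha + beta * pass_class)
    rv_survival = tfd.Bernoulli(probs=logistic_p)

    return (rv_alpha.log_prob(alpha)
            + rv_beta.log_prob(beta)
            + tf.reduce_sum(rv_survival.log_prob(pass_survival)))
```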

We can see that as we modify the alpha and beta assumptions the log probability changes, becoming more negative for alpha and beta values that make less sense.
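For example, evaluating the sketched function at a plausible and an implausible pair of values (the numbers are purely illustrative):

```python
print(titanic_joint_log_prob(pass_survival, pass_class, alpha=1.0, beta=-0.5))
print(titanic_joint_log_prob(pass_survival, pass_class, alpha=5.0, beta=3.0))  # less plausible, much more negative
```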

Much as before, this is used in an MCMC chain:
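A sketch of that chain, with illustrative step counts and step size (alpha and beta are unconstrained, so no bijectors are needed here):

```python
number_of_steps = 30000
burnin = 5000

# Start both coefficients at zero.
initial_chain_state = [tf.constant(0.), tf.constant(0.)]

unnormalized_log_posterior = lambda alpha, beta: titanic_joint_log_prob(
    pass_survival, pass_class, alpha, beta)

titanic_kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=unnormalized_log_posterior,
    num_leapfrog_steps=3,
    step_size=0.01)

[alpha_samples, beta_samples], kernel_results = tfp.mcmc.sample_chain(
    num_results=number_of_steps,
    num_burnin_steps=burnin,
    current_state=initial_chain_state,
    kernel=titanic_kernel)
```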

Sampling from the chain shows that as a passenger’s class increases, their chances of survival decrease, which was expected.

Charting the mean alpha and beta from the posterior:
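A sketch of that chart, drawing the fitted logistic curve from the posterior means (variable names follow the sketches above):

```python
import numpy as np

mean_alpha = tf.reduce_mean(alpha_samples)
mean_beta = tf.reduce_mean(beta_samples)

classes = np.linspace(1., 3., 100, dtype=np.float32)
survival_prob = tf.sigmoid(mean_alpha + mean_beta * classes)

plt.plot(classes, survival_prob)
plt.xlabel('passenger class')
plt.ylabel('probability of survival')
plt.show()
```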

One Last View

Alternatively, we could treat each passenger class as a distinct probability and generate distributions for each:
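One way to set that up is a joint log probability with a separate survival probability per class, which can then be run through the same HMC machinery as before (the uniform priors and the tf.where routing are assumptions):

```python
def class_joint_log_prob(pass_survival, pass_class, p1, p2, p3):
    """Separate survival probability for first, second, and third class."""
    rv_p = tfd.Uniform(low=0., high=1.)

    # Route each passenger to the probability for their class (1, 2 or 3).
    probs = tf.where(pass_class == 1., p1,
                     tf.where(pass_class == 2., p2, p3))
    rv_survival = tfd.Bernoulli(probs=probs)

    return (rv_p.log_prob(p1) + rv_p.log_prob(p2) + rv_p.log_prob(p3)
            + tf.reduce_sum(rv_survival.log_prob(pass_survival)))
```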

Here we can see that the mean survival probability decreases with each class; the third-class passengers have the narrowest band and the lowest mean, with the difference in survival likelihood between a first- and a third-class passenger being between 30 and 40%.

