Bayesian AB Testing with Pyro. A primer in Bayesian thinking and AB… | by Fraser Brown | Nov, 2023


AB Testing Using Pyro

Consider a company that has designed a new website landing page and wants to understand the impact this will have on conversion, i.e. do visitors continue their web session on the website after landing on the page? In test group A, website visitors will be shown the current landing page. In test group B, website visitors will be shown the new landing page. In the rest of the article, I will refer to test group A as the control group, and group B as the treatment group. The business is sceptical about the change and has opted for an 80/20 split in session traffic. The total number of visitors and the total number of page conversions for each test group are summarised below.

Test Observations

The null hypothesis of the AB test is that there will be no change in page conversion for the two test groups. Under the frequentist framework, this would be expressed as the following for a two-sided test, where r_c and r_t are the page conversion rates in the control and treatment groups, respectively.

Null and Alternative Hypotheses

A significance test would then seek to either reject or fail to reject the null hypothesis. Under the Bayesian framework, we express the null hypothesis slightly differently by asserting the same prior for each of the test groups.

Let’s pause and outline exactly what is happening during our test. The variable we are interested in is the page conversion rate. This is simply calculated by taking the number of distinct converted visitors over the total number of visitors. The event that generates this rate is whether the visitor clicks through the page. There are only two possible outcomes here for each visitor, either the visitor clicks through the page and converts, or does not. Some of you might recognise that for each distinct visitor, this is an example of a Bernoulli trial; there is one trial and two possible outcomes. Now, when we collect a set of these Bernoulli trials, we have a binomial distribution. When the random variable X has a binomial distribution, we give it the following notation:

Binomial Distribution Notation

Where n is the number of visitors (or the number of Bernoulli trials), and p is the probability of the event on each trial. p is what we are interested in here, we want to understand what the probability of a visitor converting on the page is in each test group. We have observed some data, but as mentioned in the previous section, we first need to define our prior. As always in Bayesian statistics, we need to define this prior as a probability distribution. As mentioned before, this probability distribution is a characterisation of our uncertainty. Beta distributions are commonly used for modelling probabilities, as it is defined between the intervals of [0,1]. Furthermore, using a beta distribution as our prior for a binomial likelihood function gives us the helpful property of conjugacy, which means our posterior will be generated from the same distribution as our prior. We say that the beta distribution is a conjugate prior. A beta distribution is defined by two parameters, alpha, and confusingly, beta.

Beta Distribution Notation

With access to historical data, we can assert an informed prior. We do not necessarily need historical data, we could use our intuition to inform our understanding, but for now let’s assume we have neither (later in this tutorial we will use informed priors, but to demonstrate the impact, I will start with the uninformed). Let’s assume we have no understanding of the conversion rate on the company’s site, and therefore define our prior as Beta(1,1). This is called a flat prior. The probability distribution of this function looks like the graph below, the same as a uniform distribution defined between the intervals [0,1]. By asserting a Beta(1,1) prior, we say that all possible values of the page conversion rate are equally probable.

Credit: Author

We now have all the information we need, the priors, and the data. Let’s jump into the code. The code provided herein will provide a framework to get started with AB testing using Pyro; it therefore neglects some features of the package. To help optimise your code further and take full advantage of Pyro’s capabilities, I recommend referring to the official documentation.

First, we need to import our packages. The final line is good practice, particularly when working in notebooks, clearing the store of parameters we have built up.

import pyro
import pyro.distributions as dist
from pyro.infer import NUTS, MCMC
import torch
from torch import tensor
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial
import pandas as pd

pyro.clear_param_store()

Models in Pyro are defined as regular Python functions. This is helpful as it makes it intuitive to follow.

def model(beta_alpha, beta_beta):
def _model_(traffic: tensor, number_of_conversions: tensor):
# Define Stochastic Primatives
prior_c = pyro.sample('prior_c', dist.Beta(beta_alpha, beta_beta))
prior_t = pyro.sample('prior_t', dist.Beta(beta_alpha, beta_beta))
priors = torch.stack([prior_c, prior_t])
# Define the Observed Stochastic Primatives
with pyro.plate('data'):
observations = pyro.sample('obs', dist.Binomial(traffic, priors),\
obs = number_of_conversions)
return partial(_model_)

A few things to break down and explain here. First, we have a function wrapped inside an outer function, the outer function returns the partial function of the inner function. This allows us to change our priors, without needing to change the code. I have referred to the variables defined in the inner function as primitives, think of primitives as variables in the model. We have two types of primitives in the model, stochastic and observed stochastic. In Pyro, we do not have to explicitly define the difference, we simply add the obs argument to the sample method when it is an observed primitive and Pyro interprets it accordingly. Observed primitives are contained within the context manager pyro.plate(), which is best practice and makes our code look cleaner. Our stochastic primitives are our two priors, characterised by Beta distributions, governed by the alpha and beta parameters that we pass in from the outer function. As previously mentioned, we assert the null hypothesis by defining these as equal. We then stack these two primitives together using tensor.stack(), which performs an operation akin to concatenating a Numpy array. This will return a tensor, the data structure required for inference in Pyro. We have defined our model, now let’s move onto the inference stage.

As previously mentioned, this tutorial will use MCMC. The function below will take the model that we have defined above and the number of samples we wish to use to generate our posterior distribution as a parameter. We also pass our data into the function, as we did for the model.

def run_infernce(model, number_of_samples, traffic, number_of_conversions):
kernel = NUTS(model)

mcmc = MCMC(kernel, num_samples = number_of_samples, warmup_steps = 200)

mcmc.run(traffic, number_of_conversions)

return mcmc

The first line inside this function defines our kernel. We use the NUTS class to define our kernel, which stands for No-U-Turn Sampler, an autotuning version of Hamiltonian Monte Carlo. This tells Pyro how to sample from the posterior probability space. Again, it is beyond the scope of this article to dive deeper into this topic, but for now, it is sufficient to know that NUTS allows us to sample from the probability space intelligently. The kernel is then used to initialise the MCMC class on the second line, specifying it to use NUTS. We pass the number_of_samples argument in the MCMC class which is the number of samples used to generate the posterior distribution. We assign the initialised MCMC class to the mcmc variable and call the run() method, passing our data as parameters. The function returns the mcmc variable.

This is all we need; the following code defines our data and calls the functions we have just made using the Beta(1,1) prior.

traffic = torch.tensor([5523., 1379.])
conversions =torch.tensor([2926., 759.])
inference = run_infernce(model(1,1), number_of_samples = 1000, \
traffic = traffic, number_of_conversions = conversions)

The first element of the traffic and conversions tensors are the counts for the control group, and the second element in each tensor is the counts for the treatment group. We pass the model function, with the parameters to govern our prior distribution, alongside the tensors we have defined. Running this code will generate our posterior samples. We run the following code to extract the posterior samples and pass them to a Pandas dataframe.

posterior_samples = inference.get_samples()
posterior_samples_df = pd.DataFrame(posterior_samples)

Notice the column names of this dataframe are the strings we passed when we defined our primitives in the model function. Each row in our dataframe contains samples drawn from the posterior distribution, and each of these samples represents an estimate of the page conversion rate, the probability value p that governs our Binomial distribution. Now we have returned the samples, we can plot our posterior distributions.

Results

An insightful way to visualise the results of the AB test with two test groups is by a joint kernel density plot. It allows us to visualise the density of samples in the probability space across both distributions. The graph below can be produced from the dataframe we have just built.

Credit: Author

The probability space contained in the graph above can be divided across its diagonal, anything above the line would indicate regions where the estimation of the conversion rate is higher in the treatment group than the control and vice versa. As illustrated in the plot, the samples drawn from the posterior are densely populated in the region which would indicate the conversion rate is higher in the treatment group. It is important to highlight that the posterior distribution for the treatment group is wider than the control group, reflecting a higher degree of uncertainty. This is a result of observing less data in the treatment group. Nevertheless, the plot strongly indicates that the treatment group has outperformed the control group. By collecting an array of samples from the posterior and taking the element-wise difference, we can say that the probability that the treatment group outperforms the control group is 90.4%. This figure suggests that 90.4% of the samples drawn from the posterior will be populated above the diagonal in the joint density plot above.

These results were achieved by using a flat (uninformed) prior. The use of an informed prior may help improve the model, particularly when the availability of observed data is limited. A helpful exercise is to explore the effects of using different priors. The plot below shows the Beta(2,2) probability density function and the joint plot it produces when we rerun the model. We can see that using the Beta(2,2) prior produces a very similar posterior distribution for both test groups.

Credit: Author

The samples drawn from the posterior suggest there is a 91.5% probability that the treatment group performs better than the control. Therefore, we do believe with a higher degree of certainty that the treatment group is better than the control versus using a flat prior. However, in this example the difference is negligible.

There is one other thing I would like to highlight about these results. When we ran the inference, we told Pyro to generate 1000 samples from the posterior. This is an arbitrary number, selecting a different number of samples can change the results. To highlight the effect of increasing the number of samples, I ran an AB test where the observations from the control and treatment groups were the same, each with an overall conversion rate of 50%. Using a Beta(2,2) prior generates the following posterior distributions as we incrementally increase the number of samples.

Credit: Author

When we run our inference with just 10 samples, the posterior distribution for the control and treatment groups are relatively wide and adopt different shapes. As the number of samples that we draw increases, the distributions converge, eventually generating nearly identical distributions. Furthermore, we observe two properties of statistical distributions, the central limit theorem and the law of large numbers. The central limit theorem states that the distribution of sample means converges towards a normal distribution as the number of samples increases, and we can see that in the plot above. Additionally, the law of large numbers states that as the sample size grows, the sample mean converges towards the population mean. We can see that the mean of the distributions in the bottom right tile is approximately 0.5, the conversion rate observed in each of the test samples.



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*