Bayesian Inference in Python: A Comprehensive Guide with Examples


Data-driven decision-making has become essential across various fields, from finance and economics to medicine and engineering. Understanding probability and statistics is crucial for making informed choices today. Bayesian inference, a powerful tool in probabilistic reasoning, allows us to update our beliefs about an event based on new evidence.

Bayes’s theorem, a fundamental concept in probability theory, forms the foundation of Bayesian inference. This theorem provides a way to calculate the probability of an event occurring given prior knowledge and observed data. Combining our initial beliefs with the likelihood of the evidence, we can arrive at a more accurate posterior probability.

Bayesian inference is a statistical method based on Bayes’s theorem, which updates the probability of an event as new data becomes available. It is widely used in various fields, such as finance, medicine, and engineering, to make predictions and decisions based on prior knowledge and observed data. In Python, Bayesian inference can be implemented using libraries like NumPy and Matplotlib to generate and visualize posterior distributions.

This article will explore Bayesian inference and its implementation using Python, a popular programming language for data analysis and scientific computing. We will start by understanding the fundamentals of Bayes’s theorem and formula, then move on to a step-by-step guide on implementing Bayesian inference in Python. Along the way, we will discuss a real-world example of predicting website conversion rates to illustrate the practical application of this powerful technique.

Recommended: Python and Probability: Simulating Blackjack Card Counting with Python Code

Recommended: Understanding Joint Probability Distribution with Python

What is Bayesian Inference?

Bayesian inference is built on Bayes’s theorem: we start from a prior probability for an event and update that probability as new evidence arrives. Let us look at the formula of Bayes’s theorem.

Bayes Theorem Formula
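
For reference, the formula pictured above is Bayes’s theorem:

\[P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}\]

where \(P(A)\) is the prior probability of the event, \(P(B \mid A)\) is the likelihood of the evidence given the event, and \(P(A \mid B)\) is the posterior probability after observing the evidence.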

Implementing Bayesian Inference in Python

Let us implement Bayes’s theorem in Python with the code below.
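
The article’s original listing is not reproduced in this copy; the following is a minimal sketch of the same idea, using NumPy and Matplotlib to compute and plot a posterior over a grid of parameter values (the 6-successes-in-9-trials data is purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data: 6 successes observed in 9 trials
successes, trials = 6, 9

theta = np.linspace(0, 1, 500)                  # candidate values of the unknown probability
prior = np.ones_like(theta)                     # flat prior
likelihood = theta**successes * (1 - theta)**(trials - successes)

posterior = likelihood * prior                  # Bayes's theorem, up to a constant
posterior /= posterior.sum() * (theta[1] - theta[0])   # normalise so the posterior integrates to 1

plt.plot(theta, posterior)
plt.xlabel("theta")
plt.ylabel("posterior density")
plt.title("Posterior obtained via Bayes's theorem")
plt.show()
```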

Let us look at the output.

Bayes Theorem Plot

Real-World Example: Predicting Website Conversion Rates with Bayesian Inference

Let us now look at the case of a website. We are trying to predict the proportion of visitors who go on to buy the product, i.e. the conversion rate. We have also set up some parameters for the simulation.
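
The article’s code and parameter values are not shown in this copy; the sketch below uses an assumed Beta prior and assumed visitor counts, chosen to be consistent with the roughly 5% conversion rate reported further down:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

visitors = 1000                           # hypothetical number of visitors
purchases = 50                            # hypothetical number of buyers (about 5%)

# Beta(2, 38) prior: a weak prior belief that conversion is around 5%
alpha_prior, beta_prior = 2, 38
alpha_post = alpha_prior + purchases
beta_post = beta_prior + (visitors - purchases)

theta = np.linspace(0, 0.15, 500)
posterior = stats.beta.pdf(theta, alpha_post, beta_post)

print("posterior mean conversion rate:", alpha_post / (alpha_post + beta_post))

plt.plot(theta, posterior)
plt.xlabel("conversion rate")
plt.ylabel("posterior density")
plt.show()
```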

Let us look at the output of the above code and try to deduce some information from it.

Bayesian Inference Output

According to our predictions, the probability of conversion is around 5%.

Here you go! Now, you know a lot more about Bayesian inference. In this article, we learned what Bayesian inference is and also touched upon how to implement it in the Python programming language. We also worked through a simple case study where we estimated the conversion rate of visitors to a particular website. Bayesian inference, as mentioned, is also used heavily in the fields of finance, economics, and engineering.

Hope you enjoyed reading it!!

Recommended: Monte-Carlo Simulation to find the probability of Coin toss in python

Recommended: 5 Ways to Detect Fake Dollar Bills Using Python Machine Learning

An Introduction to Data Analysis

11 Bayesian hypothesis testing

This chapter introduces common Bayesian methods of testing what we could call statistical hypotheses . A statistical hypothesis is a hypothesis about a particular model parameter or a set of model parameters. Most often, such a hypothesis concerns one parameter, and the assumption in question is that this parameter takes on a specific value, or some value from a specific interval. Henceforth, we speak just of a “hypothesis” even though we mean a specific hypothesis about particular model parameters. For example, we might be interested in what we will call a point-valued hypothesis , stating that the value of parameter \(\theta\) is fixed to a specific value \(\theta = \theta^*\) . Section 11.1 introduces different kinds of statistical hypotheses in more detail.

Given a statistical hypothesis about parameter values, we are interested in “testing” it. Strictly speaking, the term “testing” should probably be reserved for statistical decision procedures which give clear categorical judgements, such as whether to reject a hypothesis, accept it as true, or withhold judgement because no decision can be made (yet/currently). While we will encounter such categorical decision routines in this chapter, Bayesian approaches to hypothesis “testing” are first and foremost concerned, not with categorical decisions, but with quantifying evidence in favor of or against the hypothesis in question. (In a second step, using Bayesian decision theory, which also weighs in the utility of different policy choices, we can of course use Bayesian inference for informed decision making.) But instead of speaking of “Bayesian inference to weigh evidence for/against a hypothesis” we will just speak of “Bayesian hypothesis testing” for ease of parlance.

We consider two conceptually distinct approaches within Bayesian hypothesis testing.

  • Estimation-based testing considers just one model. It uses the observed data \(D_\text{obs}\) to retrieve posterior beliefs \(P(\theta \mid D_{\text{obs}})\) and checks whether, a posteriori , our hypothesis is credible.
  • Comparison-based testing uses Bayesian model comparison, in the form of Bayes factors, to compare two models, namely one model that assumes that the hypothesis in question is true, and one model that assumes that the complement of the hypothesis is true.

The main difference between these two approaches is that estimation-based hypothesis testing is simpler (conceptually and computationally), but less informative than comparison-based hypothesis testing. In fact, comparison-based methods give a clearer picture of the quantitative evidence for/against a hypothesis because they explicitly take into account a second alternative to the hypothesis which is to be tested. As we will see in this chapter, the technical obstacles for comparison-based approaches can be overcome. For special but common use cases, like testing directional hypotheses, there are efficient methods of performing comparison-based hypothesis testing.
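
As a concrete illustration of the estimation-based route (not code from this chapter; the data and the ROPE bounds are hypothetical), one can inspect how much posterior mass falls near a point-valued hypothesis:

```python
# Estimation-based check for the point-valued hypothesis theta = 0.5:
# compute the posterior of theta under a Beta(1, 1) prior for binomial data,
# then look at a ROPE (region of practical equivalence) around 0.5.
from scipy import stats

k, n = 7, 24                                    # hypothetical successes and trials
posterior = stats.beta(1 + k, 1 + n - k)        # conjugate Beta-Binomial update

rope = (0.45, 0.55)
mass_in_rope = posterior.cdf(rope[1]) - posterior.cdf(rope[0])
central_ci = posterior.interval(0.95)           # central 95% credible interval (not an HDI)

print(f"P(theta in ROPE | data) = {mass_in_rope:.3f}, 95% CI = {central_ci}")
```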

The learning goals for this chapter are:

  • become familiar with point-valued, ROPE-d and directional hypotheses
  • understand the notion of a complement / alternative hypothesis
  • be able to apply Bayesian hypothesis testing to (simple) case studies
  • understand and be able to apply the Savage-Dickey method (and its extension to interval-based hypotheses in terms of encompassing models )
  • become familiar with a Bayesian \(t\) -test model for comparing the means of two groups of metric measurements

Bayesian Hypothesis Testing with PyMC

Austin Rochford

Lately I’ve been reading the excellent, open source book Probabilistic Programming and Bayesian Methods for Hackers . The book’s prologue opens with the following line.

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis.

As a mathematician with training in statistics as well, this comment rings quite true. Even with a firm foundation in the probability theory necessary for Bayesian inference, the calculations involved are often incredibly tedious. An introduction to Bayesian statistics often involves the hazing ritual of showing that, if \(X | \mu \sim N(\mu, \sigma^2)\) and \(\mu \sim N(\mu_0, \sigma_0^2)\) where \(\sigma^2\), \(\mu_0\), and \(\sigma_0^2\) are known, then

\[\mu | X \sim N \left(\frac{\frac{X}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}}, \left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}\right).\]

If you want to try it yourself, Bayes' Theorem and this polynomial identity will prove helpful.

Probabilistic Programming and Bayesian Methods for Hackers emphasizes the utility of PyMC both for eliminating tedious, error-prone calculations and for approximating posterior distributions in models that have no exact solution. I was recently reminded of PyMC’s ability to eliminate tedious, error-prone calculations during my statistics final, which contained a problem on Bayesian hypothesis testing . This post aims to present both the exact theoretical analysis of this problem and its approximate solution using PyMC.

The problem is as follows.

Given that \(X | \mu \sim N(\mu, \sigma^2)\) , where \(\sigma^2\) is known, we wish to test the hypothesis \(H_0: \mu = 0\) vs.  \(H_A: \mu \neq 0\) . To do so in a Bayesian framework, we need prior probabilities on each of the hypotheses and a distribution on the parameter space of the alternative hypothesis. We assign the hypotheses equal prior probabilities, \(\pi_0 = \frac{1}{2} = \pi_A\) , which indicates that prior to observing the value of \(X\) , we believe that each hypothesis is equally likely to be true. We also endow the alternative parameter space, \(\Theta_A = (-\infty, 0) \cup (0, \infty)\) , with a \(N(0, \sigma_0^2)\) distribution, where \(\sigma_0^2\) is known.

The Bayesian framework for hypothesis testing relies on the calculation of the posterior odds of the hypotheses,

\[\textrm{Odds}(H_A|x) = \frac{P(H_A | x)}{P(H_0 | x)} = BF(x) \cdot \frac{\pi_A}{\pi_0},\]

where \(BF(x)\) is the Bayes factor. In our situation, the Bayes factor is

\[BF(x) = \frac{\int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu}{f(x|0)}.\]

The Bayes factor is the Bayesian counterpart of the likelihood ratio , which is ubiquitous in frequentist hypothesis testing. The idea behind Bayesian hypothesis testing is that we should choose whichever hypothesis better explains the observation, so we reject \(H_0\) when \(\textrm{Odds}(H_A) > 1\) , and accept \(H_0\) otherwise. In our situation, \(\pi_0 = \frac{1}{2} = \pi_A\) , so \(\textrm{Odds}(H_A) = BF(x)\) . Therefore we base our decision on the value of the Bayes factor.

In the following sections, we calculate this Bayes factor exactly and approximate it with PyMC. If you’re only interested in the simulation and would like to skip the exact calculation (I can’t blame you) go straight to the section on PyMC approximation .

Exact Calculation

The calculation becomes (somewhat) simpler if we reparametrize the normal distribution using its precision instead of its variance. If \(X\) is normally distributed with variance \(\sigma^2\) , its precision is \(\tau = \frac{1}{\sigma^2}\) . When a normal distribution is parametrized by precision, its probability density function is

\[f(x|\mu, \tau) = \sqrt{\frac{\tau}{2 \pi}} \textrm{exp}\left(-\frac{\tau}{2} (x - \mu)^2\right).\]

Reparametrizing the problem in this way, we get \(\tau = \frac{1}{\sigma^2}\) and \(\tau_0 = \frac{1}{\sigma_0^2}\) , so

\[\begin{align} f(x|\mu) \rho_A(\mu) & = \left(\sqrt{\frac{\tau}{2 \pi}} \textrm{exp}\left(-\frac{\tau}{2} (x - \mu)^2\right)\right) \cdot \left(\sqrt{\frac{\tau_0}{2 \pi}} \textrm{exp}\left(-\frac{\tau_0}{2} \mu^2\right)\right) \\ & = \frac{\sqrt{\tau \cdot \tau_0}}{2 \pi} \cdot \textrm{exp} \left(-\frac{1}{2} \left(\tau (x - \mu)^2 + \tau_0 \mu^2\right)\right). \end{align}\]

Focusing momentarily on the sum of quadratics in the exponent, we rewrite it as \[\begin{align} \tau (x - \mu)^2 + \tau_0 \mu^2 & = \tau x^2 + (\tau + \tau_0) \left(\mu^2 - 2 \frac{\tau}{\tau + \tau_0} \mu x\right) \\ & = \tau x^2 + (\tau + \tau_0) \left(\left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2 - \left(\frac{\tau}{\tau + \tau_0}\right)^2 x^2\right) \\ & = \left(\tau - \frac{\tau^2}{\tau + \tau_0}\right) x^2 + (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2 \\ & = \frac{\tau \tau_0}{\tau + \tau_0} x^2 + (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2. \end{align}\]

Therefore \[\begin{align} \int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu & = \frac{\sqrt{\tau \tau_0}}{2 \pi} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right) \int_{-\infty}^\infty \textrm{exp}\left(-\frac{1}{2} (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2\right)\ d\mu \\ & = \frac{\sqrt{\tau \tau_0}}{2 \pi} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right) \cdot \sqrt{\frac{2 \pi}{\tau + \tau_0}} \\ & = \frac{1}{\sqrt{2 \pi}} \cdot \sqrt{\frac{\tau \tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right). \end{align}\]

The denominator of the Bayes factor is \[\begin{align} f(x|0) & = \sqrt{\frac{\tau}{2 \pi}} \cdot \textrm{exp}\left(-\frac{\tau}{2} x^2\right), \end{align}\] so the Bayes factor is \[\begin{align} BF(x) & = \frac{\frac{1}{\sqrt{2 \pi}} \cdot \sqrt{\frac{\tau \tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right)}{\sqrt{\frac{\tau}{2 \pi}} \cdot \textrm{exp}\left(-\frac{\tau}{2} x^2\right)} \\ & = \sqrt{\frac{\tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{\tau}{2} \left(\frac{\tau_0}{\tau + \tau_0} - 1\right) x^2\right) \\ & = \sqrt{\frac{\tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(\frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2\right). \end{align}\]

From above, we reject the null hypothesis whenever \(BF(x) > 1\) , which is equivalent to \[\begin{align} \textrm{exp}\left(\frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2\right) & > \sqrt{\frac{\tau + \tau_0}{\tau_0}}, \\ \frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2 & > \frac{1}{2} \log\left(\frac{\tau + \tau_0}{\tau_0}\right),\textrm{ and} \\ x^2 & > \left(\frac{\tau + \tau_0}{\tau^2}\right) \cdot \log\left(\frac{\tau + \tau_0}{\tau_0}\right). \end{align}\]

As you can see, this calculation is no fun. I’ve even left out a lot of details that are only really clear once you’ve done this sort of calculation many times. Let’s see how PyMC can help us avoid this tedium.

PyMC Approximation

PyMC is a Python module that uses Markov chain Monte Carlo methods (and others) to fit Bayesian statistical models. If you’re unfamiliar with Markov chains , Monte Carlo methods , or Markov chain Monte Carlo methods, each of which is an important topic in its own right, it suffices to know that PyMC provides a set of tools to approximate marginal and posterior distributions of Bayesian statistical models.

To solve this problem, we will use PyMC to approximate \(\int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu\) , the numerator of the Bayes factor. This quantity is the marginal distribution of the observation, \(X\) , under the alternative hypothesis.

We begin by importing the necessary Python packages.
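
The post’s code listings are not reproduced in this copy; the snippets below are a sketch in the current PyMC API (the post appears to predate it), together with NumPy, SciPy, and Matplotlib.

```python
import numpy as np
import pymc as pm
import scipy.stats as st
import scipy.optimize as opt
import matplotlib.pyplot as plt
```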

In order to use PyMC to approximate the Bayes factor, we must fix numeric values of \(\sigma^2\) and \(\sigma_0^2\) . We use the values \(\sigma^2 = 1\) and \(\sigma_0^2 = 9\) .
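
In the sketch, that amounts to:

```python
# variances of the likelihood and of the prior on mu, and the corresponding precisions
sigma2 = 1.0        # sigma^2
sigma02 = 9.0       # sigma_0^2
tau = 1.0 / sigma2
tau0 = 1.0 / sigma02
```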

We now initialize the random variable \(\mu\) .

Note that PyMC’s normal distribution is parametrized in terms of the precision and not the variance. When we initialize the variable x , we use the variable mu as its mean.
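
A sketch of both random variables in the current PyMC API (variable names follow the text):

```python
with pm.Model() as model:
    # prior on mu under the alternative hypothesis, parametrized by precision tau_0
    mu = pm.Normal("mu", mu=0.0, tau=tau0)
    # X | mu ~ N(mu, sigma^2); nothing is observed, so sampling gives the marginal of X
    x = pm.Normal("x", mu=mu, tau=tau)
```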

PyMC now knows that the distribution of x depends on the value of mu and will respect this relationship in its simulation.

We now instantiate a Markov chain Monte Carlo sampler, and use it to sample from the distributions of mu and x .
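
In the current API the sampler is instantiated implicitly by pm.sample ; a sketch that matches the step counts described below, with tune playing the role of burn-in:

```python
with model:
    trace = pm.sample(draws=40_000, tune=10_000, chains=1, random_seed=42)
```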

In the second line above, we tell PyMC to run the Markov chain Monte Carlo simulation for 50,000 steps, using the first 10,000 steps as burn-in, and then count each of the last 40,000 steps towards the sample. The burn-in period is the number of samples we discard from the beginning of the Markov chain Monte Carlo algorithm. A burn-in period is necessary to assure that the algorithm has converged to the desired distribution before we sample from it.

Finally, we may obtain samples from the distribution of \(X\) under the alternative hypothesis.
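
In the sketch, these samples can be pulled out of the trace as a flat NumPy array:

```python
x_samples = trace.posterior["x"].values.ravel()
```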

From our exact calculations, we expect that, given the alternative hypothesis, \(X \sim N\left(0, \frac{\tau \tau_0}{\tau + \tau_0}\right)\) (parametrized by precision). The following chart shows both the histogram derived from x_samples and the probability density function of \(X\) under the alternative hypothesis.

[Figure: histogram of x_samples overlaid with the exact density of \(X\) under the alternative hypothesis]

It’s always nice to see agreement between theory and simulation. The problem now is that we need to evaluate the probability distribution function of \(X|H_A\) at the observed point \(X = x_0\) , but we only know the cumulative distribution function of \(X\) (via its histogram, computed from x_samples ). Enter kernel density estimation , a nonparametric method for estimating the probability density function of a random variable from samples. Fortunately, SciPy provides an excellent module for kernel density estimation .
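
With SciPy this is a one-liner (continuing the sketch):

```python
# kernel density estimate of the marginal density of X under the alternative hypothesis
x_pdf = st.gaussian_kde(x_samples)
```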

We define two functions, the first of which gives the simulated Bayes factor, and the second of which gives the exact Bayes factor.
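
A sketch of the two functions, using the names referenced in the text:

```python
def bayes_factor_sim(x_obs):
    # simulated Bayes factor: KDE density of X under H_A over the N(0, sigma^2) density under H_0
    return x_pdf.evaluate(x_obs) / st.norm.pdf(x_obs, loc=0.0, scale=np.sqrt(sigma2))

def bayes_factor_exact(x_obs):
    # exact Bayes factor derived in the previous section
    return np.sqrt(tau0 / (tau + tau0)) * np.exp(0.5 * tau**2 / (tau + tau0) * np.asarray(x_obs)**2)
```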

The following figure shows the excellent agreement between the simulated and calculated Bayes factors.

[Figure: simulated and exact Bayes factors plotted on the interval \([0, 2]\)]

The only reason it’s possible to see the red graph of simulated Bayes factor behind the blue graph of the exact Bayes factor is that we’ve doubled the width of the red graph. In fact, on the interval \([0, 2]\) , the maximum relative error of the simulated Bayes factor is approximately 0.9%.

We now use the simulated Bayes factor to approximate the critical value for rejecting the null hypothesis.
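
A sketch of this step, again using the names referenced below (the bracketing interval \([0, 2]\) follows the figure above):

```python
def bayes_factor_crit_helper(x):
    return bayes_factor_sim(x).item() - 1.0

x_crit = opt.brentq(bayes_factor_crit_helper, 0.0, 2.0)
print(x_crit, bayes_factor_exact(x_crit))
```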

The SciPy function opt.brentq is used to find a solution of the equation bayes_factor_sim(x) = 1 , which is equivalent to finding a zero of bayes_factor_crit_helper . We plug x_crit into bayes_factor_exact in order to verify that we have, in fact, found the critical value.

This value is quite close to one, so we have in fact approximated the critical point well.

It’s interesting to note that we have used PyMC in a somewhat odd way here, to approximate the marginal distribution of \(X\) under the alternative hypothesis. A much more typical use of PyMC and its Markov chain Monte Carlo would be to fix an observed value of \(X\) and approximate the posterior distribution of \(\mu\) given this observation.


Introduction to Bayesian A/B Testing

This notebook demonstrates how to implement a Bayesian analysis of an A/B test. We implement the models discussed in VWO’s Bayesian A/B Testing Whitepaper [ Stucchio, 2015 ] , and discuss the effect of different prior choices for these models. This notebook does not discuss other related topics like how to choose a prior, early stopping, and power analysis.

What is A/B testing?

From https://vwo.com/ab-testing/:

A/B testing (also known as split testing) is a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.

Specifically, A/B tests are often used in the software industry to determine whether a new feature or changes to an existing feature should be released to users, and the impact of the change on core product metrics (“conversions”). Furthermore:

We can test more than two variants at the same time. We’ll be dealing with how to analyse these tests in this notebook as well.

Exactly what “conversions” means can vary between tests, but two classes of conversions we’ll focus on are:

Bernoulli conversions - a flag for whether the visitor did the target action or not (e.g. completed at least one purchase).

Value conversions - a real value per visitor (e.g. the dollar revenue, which could also be 0).

If you’ve studied controlled experiments in the context of biology, psychology, and other sciences before, A/B testing will sound a lot like a controlled experiment - and that’s because it is! The concept of a control group and treatment groups, and the principles of experimental design, are the building blocks of A/B testing. The main difference is the context in which the experiment is run: A/B tests are typically run by online software companies, where the subjects are visitors to the website / app, and the outcomes of interest are behaviours that can be tracked, like signing up, purchasing a product, and returning to the website.

A/B tests are typically analysed with traditional hypothesis tests (see t-test ), but another method is to use Bayesian statistics. This allows us to incorporate prior distributions and produce a range of outcomes to the questions “is there a winning variant?” and “by how much?”.

Bernoulli Conversions

Let’s first deal with a simple two-variant A/B test, where the metric of interest is the proportion of users performing an action (e.g. purchasing at least one item), a Bernoulli conversion. Our variants are called A and B, where A refers to the existing landing page and B refers to the new page we want to test. The outcome that we want to perform statistical inference on is whether B is “better” than A, which depends on the underlying “true” conversion rates for each variant. We can formulate this as follows:

Let \(\theta_A, \theta_B\) be the true conversion rates for variants A and B respectively. Then the outcome of whether a visitor converts in variant A is the random variable \(\mathrm{Bernoulli}(\theta_A)\) , and \(\mathrm{Bernoulli}(\theta_B)\) for variant B. If we assume that visitors’ behaviour on the landing page is independent of other visitors (a fair assumption), then the total conversions \(y\) for a variant follow a Binomial distribution: \[y_A \sim \mathrm{Binomial}(N_A, \theta_A), \qquad y_B \sim \mathrm{Binomial}(N_B, \theta_B),\] where \(N_A, N_B\) are the numbers of visitors assigned to each variant.

Under a Bayesian framework, we assume the true conversion rates \(\theta_A, \theta_B\) cannot be known, and instead they each follow a Beta distribution. The underlying rates are assumed to be independent (we would split traffic between each variant randomly, so one variant would not affect the other): \[\theta_A \sim \mathrm{Beta}(\alpha, \beta), \qquad \theta_B \sim \mathrm{Beta}(\alpha, \beta).\]

The observed data for the duration of the A/B test (the likelihood distribution) are the number of visitors landing on the page, N , and the number of visitors purchasing at least one item, y .

With this, we can sample from the joint posterior of \(\theta_A, \theta_B\) .

You may have noticed that the Beta distribution is the conjugate prior for the Binomial, so we don’t need MCMC sampling to estimate the posterior (the exact solution can be found in the VWO paper). We’ll still demonstrate how sampling can be done with PyMC though, and doing this makes it easier to extend the model with different priors, dependency assumptions, etc.

Finally, remember that our outcome of interest is whether B is better than A. A common measure in practice for whether B is better than A is the relative uplift in conversion rates , i.e. the percentage difference of \(\theta_B\) over \(\theta_A\) : \[\mathrm{reluplift}_B = \frac{\theta_B}{\theta_A} - 1.\]

We’ll implement this model setup in PyMC below.
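
The notebook wraps this setup in a reusable class; that code is not reproduced in this copy, so the following is a minimal functional sketch with assumed names (make_two_variant_model, reluplift_b):

```python
import pymc as pm

def make_two_variant_model(alpha, beta, visitors, conversions):
    """Beta-Binomial model for a two-variant test; variant order is [A, B]."""
    with pm.Model() as model:
        theta = pm.Beta("theta", alpha=alpha, beta=beta, shape=2)         # theta_A, theta_B
        pm.Binomial("y", n=visitors, p=theta, observed=conversions)       # observed conversions
        pm.Deterministic("reluplift_b", theta[1] / theta[0] - 1)          # uplift of B over A
    return model

# example usage with a weak Beta(1, 1) prior and illustrative data
model = make_two_variant_model(1, 1, visitors=[10_000, 10_000], conversions=[500, 540])
with model:
    idata = pm.sample(random_seed=0)
```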

Now that we’ve defined a class that can take a prior and our synthetic data as inputs, our first step is to choose an appropriate prior. There are a few things to consider when doing this in practice, but for the purpose of this notebook we’ll focus on the following:

We assume that the same Beta prior is set for each variant.

An uninformative or weakly informative prior occurs when we set low values for alpha and beta . For example, alpha = 1, beta = 1 leads to a uniform distribution as a prior. If we were considering one distribution in isolation, setting this prior is a statement that we don’t know anything about the value of the parameter, nor our confidence around it. In the context of A/B testing however, we’re interested in comparing the relative uplift of one variant over another. With a weakly informative Beta prior, this relative uplift distribution is very wide, so we’re implicitly saying that the variants could be very different to each other.

A strong prior occurs when we set high values for alpha and beta . Contrary to the above, a strong prior would imply that the relative uplift distribution is thin, i.e. our prior belief is that the variants are not very different from each other.

We illustrate these points with prior predictive checks.

Prior predictive checks

Note that we can pass in arbitrary values for the observed data in these prior predictive checks. PyMC will not use that data when sampling from the prior predictive distribution.

[Figure: prior predictive distributions of the relative uplift of B over A under the weak and strong priors]

With the weak prior our 94% HDI for the relative uplift for B over A is roughly [-20%, +20%], whereas it is roughly [-2%, +2%] with the strong prior. This is effectively the “starting point” for the relative uplift distribution, and will affect how the observed conversions translate to the posterior distribution.

How we choose these priors in practice depends on broader context of the company running the A/B tests. A strong prior can help guard against false discoveries, but may require more data to detect winning variants when they exist (and more data = more time required running the test). A weak prior gives more weight to the observed data, but could also lead to more false discoveries as a result of early stopping issues.

Below we’ll walk through the inference results from two different prior choices.

We generate two datasets: one where the “true” conversion rate of each variant is the same, and one where variant B has a higher true conversion rate.
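
A sketch of how such synthetic data could be generated (the notebook's exact code and seed are not shown here; the 21% and 23% true rates for the second scenario are inferred from the 9.5% true uplift quoted further below):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_binomial_data(visitors, true_rates):
    # simulate the number of converting visitors for each variant
    return rng.binomial(n=visitors, p=true_rates)

visitors = np.array([100_000, 100_000])
same_rates = generate_binomial_data(visitors, np.array([0.23, 0.23]))  # scenario 1: no true difference
diff_rates = generate_binomial_data(visitors, np.array([0.21, 0.23]))  # scenario 2: B truly better
```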

            A       B
trials      100000  100000
successes   22979   22970

We’ll also write a function to wrap the data generation, sampling, and posterior plots so that we can easily compare the results of both models (strong and weak prior) under both scenarios (same true rate vs. different true rate).

Scenario 1 - same underlying conversion rates

[Figure: posterior relative uplift of B over A for scenario 1, under the weak and strong priors]

In both cases, the true uplift of 0% lies within the 94% HDI.

We can then use this relative uplift distribution to make a decision about whether to apply the new landing page / features in Variant B as the default. For example, we can decide that if the 94% HDI is above 0, we would roll out Variant B. In this case, 0 is in the HDI, so the decision would be to not roll out Variant B.

Scenario 2 - different underlying rates

[Figure: posterior relative uplift of B over A for scenario 2, under the weak and strong priors]

In both cases, the posterior relative uplift distribution suggests that B has a higher conversion rate than A, as the 94% HDI is well above 0. The decision in this case would be to roll out Variant B to all users, and this outcome is a “true discovery”.

That said, in practice we are usually also interested in how much better Variant B is. For the model with the strong prior, the prior is effectively pulling the relative uplift distribution closer to 0, so our central estimate of the relative uplift is conservative (i.e. understated). We would need much more data for our inference to get closer to the true relative uplift of 9.5%.

The above examples demonstrate how to perform an A/B test analysis for a two-variant test with the simple Beta-Binomial model, and the benefits and disadvantages of choosing a weak vs. strong prior. In the next section we provide a guide for handling a multi-variant (“A/B/n”) test.

Generalising to multi-variant tests

We’ll continue using Bernoulli conversions and the Beta-Binomial model in this section for simplicity. The focus is on how to analyse tests with 3 or more variants - e.g. instead of just having one different landing page to test, we have multiple ideas we want to test at once. How can we tell if there’s a winner amongst all of them?

There are two main approaches we can take here:

Take A as the ‘control’. Compare the other variants (B, C, etc.) against A, one at a time.

For each variant, compare against the max() of the other variants.

Approach 1 is intuitive to most people, and is easily explained. But what if there are two variants that both beat the control, and we want to know which one is better? We can’t make that inference with the individual uplift distributions. Approach 2 does handle this case - it effectively tries to find whether there is a clear winner or clear loser(s) amongst all the variants.

We’ll implement the model setup for both approaches below, cleaning up our code from before so that it generalises to the n variant case. Note that we can also re-use this model for the 2-variant case.
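
A minimal sketch of such a generalisation with assumed names (the notebook's own class is not reproduced here); the comparison argument switches between the two approaches described above:

```python
import pymc as pm
import pytensor.tensor as pt

def make_n_variant_model(alpha, beta, visitors, conversions, comparison="control"):
    n = len(visitors)
    with pm.Model() as model:
        theta = pm.Beta("theta", alpha=alpha, beta=beta, shape=n)
        pm.Binomial("y", n=visitors, p=theta, observed=conversions)
        for i in range(n):
            if comparison == "control":
                # approach 1: uplift of each variant relative to variant 0 (the control, A)
                pm.Deterministic(f"reluplift_{i}", theta[i] / theta[0] - 1)
            else:
                # approach 2: uplift of each variant relative to the best of the other variants
                others = pt.concatenate([theta[:i], theta[i + 1:]])
                pm.Deterministic(f"reluplift_{i}", theta[i] / pt.max(others) - 1)
    return model
```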

We generate data where variants B and C are well above A, but quite close to each other:

[Figure: relative uplift posteriors of variants B and C against the control A]

The relative uplift posteriors for both B and C show that they are clearly better than A (94% HDI well above 0), by roughly 7-8% relative.

However, we can’t infer whether there is a winner between B and C.

[Figure: relative uplift posteriors of each variant against the best of the other variants]

The uplift plot for A tells us that it’s a clear loser compared to variants B and C (94% HDI for A’s relative uplift is well below 0).

Note that the relative uplift calculations for B and C are effectively ignoring variant A. This is because, say, when we are calculating reluplift for B, the maximum of the other variants will likely be variant C. Similarly when we are calculating reluplift for C, it is likely being compared to B.

The uplift plots for B and C tell us that we can’t yet call a clear winner between the two variants, as the 94% HDI still overlaps with 0. We’d need a larger sample size to detect the 23% vs 22.8% conversion rate difference.

One disadvantage of this approach is that we can’t directly say what the uplift of these variants is over variant A (the control). This number is often important in practice, as it allows us to estimate the overall impact if the A/B test changes were rolled out to all visitors. We can get this number approximately though, by reframing the question to be “how much worse is A compared to the other two variants” (which is shown in Variant A’s relative uplift distribution).

Value Conversions

Now what if we wanted to compare A/B test variants in terms of how much revenue they generate, and/or estimate how much additional revenue a winning variant brings? We can’t use a Beta-Binomial model for this, as the possible values for each visitor are now in the range [0, Inf) . The model proposed in the VWO paper is as follows:

The revenue generated by an individual visitor is revenue = (probability of paying at all) × (mean amount spent when paying).

We assume that the probability of paying at all is independent of the mean amount spent when paying. This is a typical assumption in practice, unless we have reason to believe that the two parameters have dependencies. With this, we can create separate models for the total number of visitors paying, and the total amount spent amongst the purchasing visitors (assuming independence between the behaviour of each visitor): \[K \sim \mathrm{Binomial}(N, \theta), \qquad (\text{total revenue} \mid K) \sim \mathrm{Gamma}(K, \lambda),\]

where \(N\) is the total number of visitors, \(K\) is the total number of visitors with at least one purchase.

We can re-use our Beta-Binomial model from before to model the Bernoulli conversions. For the mean purchase amount, we use a Gamma prior (which is also a conjugate prior to the Gamma likelihood). So in a two-variant test, the setup for each variant \(v \in \{A, B\}\) is: \[\theta_v \sim \mathrm{Beta}(\alpha, \beta), \qquad \lambda_v \sim \mathrm{Gamma}(\alpha_G, \beta_G), \qquad \mu_v = \frac{\theta_v}{\lambda_v}.\]

\(\mu\) here represents the average revenue per visitor, including those who don’t make a purchase. This is the best way to capture the overall revenue effect - some variants may increase the average sales value, but reduce the proportion of visitors that pay at all (e.g. if we promoted more expensive items on the landing page).

Below we put the model setup into code and perform prior predictive checks.
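
The notebook's implementation is not reproduced in this copy; below is a minimal functional sketch with assumed names, following the model above (the Gamma likelihood on the total spend uses the observed number of purchasers as its shape parameter, i.e. a sum of exponentially distributed individual purchases):

```python
import pymc as pm

def make_revenue_model(alpha, beta, alpha_g, beta_g, visitors, purchased, total_spend):
    """Two-variant revenue model; variant order is [A, B]."""
    with pm.Model() as model:
        theta = pm.Beta("theta", alpha=alpha, beta=beta, shape=2)        # purchase-anything rate
        lam = pm.Gamma("lam", alpha=alpha_g, beta=beta_g, shape=2)       # rate of individual spend
        pm.Binomial("converted", n=visitors, p=theta, observed=purchased)
        pm.Gamma("revenue", alpha=purchased, beta=lam, observed=total_spend)
        rpv = pm.Deterministic("rpv", theta / lam)                       # mean revenue per visitor
        pm.Deterministic("reluplift_b", rpv[1] / rpv[0] - 1)
    return model
```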

For the Beta prior, we can set a similar prior to before - centered around 0.5, with the magnitude of alpha and beta determining how “thin” the distribution is.

We need to be a bit more careful about the Gamma prior. The mean of the Gamma prior is \(\dfrac{\alpha_G}{\beta_G}\) , and needs to be set to a reasonable value given existing mean purchase values. For example, if alpha and beta were set such that the mean was 1 dollar, but the average revenue per visitor for a website is much higher at 100 dollars, this could affect our inference.

[Figure: prior predictive distribution of the relative uplift in revenue per visitor]

Similar to the model for Bernoulli conversions, the width of the prior predictive uplift distribution will depend on the strength of our priors. See the Bernoulli conversions section for a discussion of the benefits and disadvantages of using a weak vs. strong prior.

Next we generate synthetic data for the model. We’ll generate the following scenarios:

Same propensity to purchase and same mean purchase value.

Lower propensity to purchase and higher mean purchase value, but overall same revenue per visitor.

Higher propensity to purchase and higher mean purchase value, and overall higher revenue per visitor.

Scenario 1 - same underlying purchase rate and mean purchase value

[Figure: posterior relative uplift in revenue per visitor for scenario 1]

The 94% HDI contains 0 as expected.

Scenario 2 - lower purchase rate, higher mean purchase, same overall revenue per visitor

[Figure: posterior relative uplift in revenue per visitor for scenario 2]

The 94% HDI for the average revenue per visitor (RPV) contains 0 as expected.

In these cases, it’s also useful to plot the relative uplift distributions for theta (the purchase-anything rate) and 1 / lam (the mean purchase value) to understand how the A/B test has affected visitor behaviour. We show this below:

[Figure: relative uplift distributions for theta and 1 / lam in scenario 2]

Variant B’s conversion rate uplift has a HDI well below 0, while the revenue per converting visitor has a HDI well above 0. So the model is able to capture the reduction in purchasing visitors as well as the increase in mean purchase amount.

Scenario 3 - Higher propensity to purchase and mean purchase value

[Figure: posterior relative uplift in revenue per visitor for scenario 3]

The 94% HDI is above 0 for variant B as expected.

Note that one concern with using value conversions in practice (that doesn’t show up when we’re just simulating synthetic data) is the existence of outliers. For example, a visitor in one variant could spend thousands of dollars, and the observed revenue data no longer follows a ‘nice’ distribution like Gamma. It’s common to impute these outliers prior to running a statistical analysis (we have to be careful with removing them altogether, as this could bias the inference), or fall back to bernoulli conversions for decision making.

Further Reading

There are many other considerations to implementing a Bayesian framework to analyse A/B tests in practice. Some include:

How do we choose our prior distributions?

In practice, people look at A/B test results every day, not only once at the end of the test. How do we balance finding true differences faster vs. minimising false discoveries (the ‘early stopping’ problem)?

How do we plan the length and size of A/B tests using power analysis, if we’re using Bayesian models to analyse the results?

Outside of the conversion rates (bernoulli random variables for each visitor), many value distributions in online software cannot be fit with nice densities like Normal, Gamma, etc. How do we model these?

Various textbooks and online resources dive into these areas in more detail. Doing Bayesian Data Analysis [ Kruschke, 2014 ] by John Kruschke is a great resource, and has been translated to PyMC here .

We also plan to create more PyMC tutorials on these topics, so stay tuned!

Authored by Cuong Duong in May, 2021 ( pymc-examples#164 )

Re-executed by percevalve in May, 2022 ( pymc-examples#351 )

References

Chris Stucchio. Bayesian A/B Testing at VWO. 2015. URL: https://vwo.com/downloads/VWO_SmartStats_technical_whitepaper.pdf

John Kruschke. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan . Academic Press, 2014.


License notice

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in BibTeX:

which once rendered could look like:

Statistics: Data analysis and modelling

Chapter 16 Introduction to Bayesian hypothesis testing

In this chapter, we will introduce an alternative to the Frequentist null-hypothesis significance testing procedure employed up to now, namely a Bayesian hypothesis testing procedure. This also consists of comparing statistical models. What is new here is that Bayesian models contain a prior distribution over the values of the model parameters. In doing so for both the null and the alternative model, Bayesian model comparisons provide a more direct measure of the relative evidence of the null model compared to the alternative. We will introduce the Bayes Factor as the primary measure of evidence in Bayesian model comparison. We then go on to discuss “default priors”, which can be useful in a Bayesian testing procedure. We end with an overview of some objections to the traditional Frequentist method of hypothesis testing, and a comparison between the two approaches.

16.1 Hypothesis testing, relative evidence, and the Bayes factor

In the Frequentist null-hypothesis significance testing procedure, we defined a hypothesis test in terms of comparing two nested models, a general MODEL G and a restricted MODEL R which is a special case of MODEL G. Moreover, we defined the testing procedure in terms of determining the probability of a test result, or one more extreme, given that the simpler MODEL R is the true model. This was necessary because MODEL G is too vague to determine the sampling distribution of the test statistic.

By supplying a prior distribution to parameters, Bayesian models can be “vague” whilst not suffering from the problem that they effectively make no predictions. As we saw for the prior predictive distributions in Figure 15.3 , even MODEL 1, which assumes all possible values of the parameter \(\theta\) are equally likely, still provides a valid predicted distribution of the data. Because any Bayesian model with a valid prior distribution provides a valid prior predictive distribution, which then also provides a valid value for the marginal likelihood, we do not have to worry about complications that arise when comparing models in the Frequentist tradition, such as that the likelihood of one model will always be higher than the other because we need to estimate an additional parameter by maximum likelihood. The relative marginal likelihood of the data assigned by each model, which can be stated as a marginal likelihood ratio analogous to the likelihood ratio of Chapter 2 , provides a direct measure of the relative evidence for both models. The marginal likelihood ratio is also called the Bayes factor, and can be defined for two general Bayesian models as: \[\begin{equation} \text{BF}_{12} = \text{BayesFactor}(\text{MODEL 1}, \text{MODEL 2}) = \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \tag{16.1} \end{equation}\] where \(p(Y_1,\ldots,Y_n|\text{MODEL } j)\) denotes the marginal likelihood of observed data \(Y_1,\ldots,Y_n\) according to MODEL \(j\) .

The Bayes factor is a central statistic of interest in Bayesian hypothesis testing. It is a direct measure of the relative evidence for two models. Its importance can also be seen when we consider the ratio of the posterior probabilities for two models, which is also called the posterior odds . In a Bayesian framework, we can assign probabilities not just to data and parameters, but also to whole models. These probabilities reflect our belief that a model is “true” in the sense that it provides a better account of the data than other models. Before observing data, we can assign a prior probability \(p(\text{model } j)\) to a model, and we can update this to a posterior probability \(p(\text{model } j|Y_1,\ldots,Y_n)\) after observing data \(Y_1,\ldots,Y_n\) . If the ratio of posterior to prior probability, \(\frac{p(\text{model } j|Y_1,\ldots,Y_n)}{p(\text{model } j)}\) , is larger than 1, the posterior probability is higher than the prior probability, and hence our belief in the model would increase. If this ratio is smaller than 1, the posterior probability is lower than the prior probability, and hence our belief in the model would decrease. We can compare the relative change in our belief for two models by considering the posterior odds ratio, which is just the ratio of the posterior probabilities of two models, and is computed by multiplying the ratio of the prior probabilities of the models (the prior odds ratio) by the marginal likelihood ratio: \[\begin{aligned} \frac{p(\text{MODEL 1}|Y_1,\ldots,Y_n)}{p(\text{MODEL 2}|Y_1,\ldots,Y_n)} &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \times \frac{p(\text{MODEL 1})}{p(\text{MODEL 2})} \\ \text{posterior odds} &= \text{Bayes factor} \times \text{prior odds} \end{aligned}\]

In terms of the relative evidence that the data provides for the two models, the Bayes factor is all that matters, as the prior probabilities do not depend on the data. Moreover, if we assign an equal prior probability to each model, then the prior odds ratio would equal 1, and hence the posterior odds ratio is identical to the Bayes factor.

In a Frequentist framework, we would evaluate the magnitude of the likelihood ratio by considering its place within the sampling distribution under the assumption that one of the models is true. Although in principle we might be able to determine the sampling distribution of the Bayes factor in a similar manner, there is no need. A main reason for going through all this work in the Frequentist procedure was that the models are on unequal footing, with the likelihood ratio always favouring a model with additional parameters. The Bayes Factor does not inherently favour a more general model compared to a restricted one. Hence, we can interpret its value “as is”. The Bayes factor is a continuous measure of relative evidential support, and there is no real need for classifications such as “significant” and “non-significant”. Nevertheless, some guidance in interpreting the magnitude might be useful. One convention is the classification provided by Jeffreys ( 1939 ) in Table 16.1 . Because small values below 1, when the Bayes factor favours the second model, can be difficult to discern, the table also provides the corresponding values of the logarithm of the Bayes factor ( \(\log \text{BF}_{1,2}\) ). On a logarithmic scale, any value above 0 favours the first model, and any value below 0 the second one. Moreover, magnitudes above and below 0 can be assigned a similar meaning.

Table 16.1: Interpretation of the values of the Bayes factor (after Jeffreys, 1939).
\(\text{BF}_{1,2}\) \(\log \text{BF}_{1,2}\) Interpretation
> 100 > 4.61 Extreme evidence for MODEL 1
30 – 100 3.4 – 4.61 Very strong evidence for MODEL 1
10 – 30 2.3 – 3.4 Strong evidence for MODEL 1
3 – 10 1.1 – 2.3 Moderate evidence for MODEL 1
1 – 3 0 – 1.1 Anecdotal evidence for MODEL 1
1 0 No evidence
1/3 – 1 -1.1 – 0 Anecdotal evidence for MODEL 2
1/10 – 1/3 -2.3 – -1.1 Moderate evidence for MODEL 2
1/30 – 1/10 -3.4 – -2.3 Strong evidence for MODEL 2
1/100 – 1/30 -4.61 – -3.4 Very strong evidence for MODEL 2
< 1/100 < -4.61 Extreme evidence for MODEL 2

The Bayes factor is a general measure that can be used to compare any Bayesian models. We do not have to focus on nested models, as we did with null-hypothesis significance testing. But such nested model comparisons are often of interest. For instance, when considering Paul’s psychic abilities, fixing \(\theta = .5\) is a useful model of an octopus without psychic abilities, while a model that allows \(\theta\) to take other values is a useful model of a (somewhat) psychic octopus. For the first model, assigning prior probability to \(\theta\) is simple: the prior probability of \(\theta = .5\) is \(p(\theta = .5) = 1\) , and \(p(\theta \neq .5) = 0\) . For the second model, we need to consider how likely each possible value of \(\theta\) is. Figure 15.3 shows two choices for this prior distribution, which are both valid representations of the belief that \(\theta\) can be different from .5. These choices will give a different marginal likelihood, and hence a different value of the Bayes factor when comparing them to the restricted null-model

\[\text{MODEL 0}: \theta = .5\] The Bayes factor comparing MODEL 1 to MODEL 0 is

\[\text{BF}_{1,0} = 12.003\] which indicates that the data (12 out of 14 correct predictions) is roughly 12 times as likely under MODEL 1 compared to MODEL 0, which in the classification of Table 16.1 means strong evidence for MODEL 1. For MODEL 2, the Bayes factor is

\[\text{BF}_{2,0} = 36.409\] which indicates that the data is roughly 36 times as likely under MODEL 2 compared to MODEL 0, which would be classified as very strong evidence for MODEL 2. In both cases, the data favours the alternative model to the null model and may be taken as sufficient to reject MODEL 0. However, the strength of the evidence varies with the choice of prior distribution of the alternative model. This is as it should be. A model such as MODEL 2, which places stronger belief on higher values of \(\theta\) , is more consistent with Paul’s high number of correct predictions.
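
As a quick sanity check (not code from the book), the uniform-prior Bayes factor can be reproduced with a short numerical calculation in Python, using the 12-out-of-14 data from the text above:

```python
# Check of BF_{1,0} for 12 correct predictions out of 14:
# MODEL 1 puts a uniform Beta(1, 1) prior on theta, MODEL 0 fixes theta = 0.5.
from scipy import stats
from scipy.integrate import quad

k, n = 12, 14

# marginal likelihood under MODEL 1: average the binomial likelihood over the uniform prior
marginal_m1, _ = quad(lambda theta: stats.binom.pmf(k, n, theta), 0, 1)

# marginal likelihood under MODEL 0: theta fixed at 0.5
marginal_m0 = stats.binom.pmf(k, n, 0.5)

print(marginal_m1 / marginal_m0)   # approximately 12.0
```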

Bayesian hypothesis testing with Bayes factors is, at its heart, a model comparison procedure. Bayesian models consist of a likelihood function and a prior distribution. A different prior distribution means a different model, and therefore a different result of the model comparison. Because there are an infinite number of alternative prior distributions to the one of the null model, there really isn’t a single test of the null hypothesis \(H_0: \theta = .5\) . The prior distribution of MODEL 1, where each possible value of \(\theta\) is equally likely, is the Bayesian equivalent of the alternative hypothesis in null-hypothesis significance testing, and as such might seem a natural default against which to compare the null hypothesis. But there is nothing to force this choice, and other priors are in principle equally valid, as long as they reflect your a priori beliefs about likely values of the parameter. Notice the “a priori” specification in the last sentence: it is vital that the prior distribution is chosen before observing the data. If you choose a prior distribution to match the data after having looked at it, the procedure loses some of its meaning as a hypothesis test, even if the Bayes factor is still an accurate reflection of the evidential support of the models.

16.2 Parameter estimates and credible intervals

Maximum likelihood estimation provides a single point-estimate for each parameter. In a Bayesian framework, estimation involves updating prior beliefs to posterior beliefs. What we end up with is a posterior distribution over the parameter values. If you want to report a single estimate, you could choose one of the measures of location: the mean, median, or mode of the posterior distribution. Unless the posterior is symmetric, these will have different values (see Figure 16.1 ), and one is not necessarily better than the other. I would usually choose the posterior mean, but if the posterior is very skewed, a measure such as the mode or median might provide a better reflection of the location of the distribution.

In addition to reporting an estimate, it is generally also a good idea to consider the uncertainty in the posterior distribution. A Bayesian version of a confidence interval (with a more straightforward interpretation!) is a credible interval . A credible interval is an interval in the posterior distribution which contains a given proportion of the probability mass. A common interval is the Highest Density Interval (HDI), which is the narrowest interval which contains a given proportion of the probability mass. Figure 16.1 shows the 95% HDI of the posterior probability that Paul makes a correct prediction, where the prior distribution was the uniform distribution of MODEL 1 in Figure 15.2 .

Figure 16.1: Posterior distribution for the probability that Paul makes a correct prediction, for MODEL 1 in Figure 15.2 .

A slightly different way to compute a credible interval is as the central credible interval . For such an interval, the excluded left and right tail of the distribution each contain \(\tfrac{\alpha}{2}\) of the probability mass (where e.g.  \(\alpha = .05\) for a 95% credible interval). Unlike the HDI, the central credible interval is not generally the most narrow interval which contains a given proportion of the posterior probability. But it is generally more straightforward to compute. Nevertheless, the HDI is more often reported than the central credible interval.

A nice thing about credible intervals is that they have a straightforward interpretation: the (subjective) probability that the true value of a parameter lies within an \(x\) % credible interval is \(x\) %. Compare this to the correct interpretation of an \(x\) % Frequentist confidence interval, which is that for infinite samples from the DGP, and computing an infinite number of corresponding confidence intervals, \(x\) % of those intervals would contain the true value of the parameter.

16.3 A Bayesian t-test

As discussed above, Bayesian hypothesis testing concerns comparing models with different prior distributions for model parameters. If one model, the “null model”, restricts a parameter to take a specific value, such as \(\theta = .5\) , or \(\mu = 0\) , while another model allows the parameter to take different values, we compare a restricted model to a more general one, and hence we can think of the model comparison as a Bayesian equivalent to a null-hypothesis significance test. The prior distribution assigned to the parameter in the more general alternative model will determine the outcome of the test, and hence it is of the utmost importance to choose this sensibly. This, however, is not always easy. Therefore, much work has been conducted to derive sensible default priors to enable researchers to conduct Bayesian hypothesis tests without requiring them to define prior distributions which reflect their own subjective beliefs.

Rouder, Speckman, Sun, Morey, & Iverson ( 2009 ) developed a default prior distribution to test whether two groups have a different mean. The test is based on the two-group version of the General Linear Model (e.g. Section 7.2 ):

\[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i \quad \quad \epsilon_i \sim \textbf{Normal}(0, \sigma_\epsilon)\] where \(X_{1,i}\) is a contrast-coded predictor with the values \(X_{1i} = \pm \tfrac{1}{2}\) for the different groups. Remember that with this contrast code, the slope \(\beta_1\) reflects the difference between the group means, e.g.  \(\beta_1 = \mu_1 - \mu_2\) , and the intercept represents the grand mean \(\beta_0 = \frac{\mu_1 + \mu_2}{2}\) . Testing for group differences involves a test of the following hypotheses:

\[\begin{aligned} H_0\!: & \quad \beta_1 = 0 \\ H_1\!: & \quad \beta_1 \neq 0 \\ \end{aligned}\]

To do this in a Bayesian framework, we need prior distributions for all the model parameters ( \(\beta_0\) , \(\beta_1\) , and \(\sigma_\epsilon\) ). Rouder et al. ( 2009 ) propose to use so-called uninformative priors for \(\beta_0\) and \(\sigma_\epsilon\) (effectively meaning that for these parameters, “anything goes”). The main consideration is then the prior distribution for \(\beta_1\) . Rather than defining a prior distribution for \(\beta_1\) directly, they propose to define a prior distribution for \(\frac{\beta_1}{\sigma_\epsilon}\) , which is the difference between the group means divided by the standard deviation of the dependent variable within each group. This is a measure of effect-size and is also known as Cohen’s \(d\) :

\[\text{Cohen's } d = \frac{\mu_1 - \mu_2}{\sigma_\epsilon} \quad \left(= \frac{\beta_1}{\sigma_\epsilon}\right)\] Defining the prior distribution for the effect-size is more convenient than defining the prior distribution for the difference between the means, as the latter difference is dependent on the scale of the dependent variable, which makes it difficult to define a general prior distribution suitable for all two-group comparisons. The “default” prior distribution they propose is a so-called scaled Cauchy distribution:

\[\frac{\beta_1}{\sigma_\epsilon} \sim \mathbf{Cauchy}(r)\] The Cauchy distribution is identical to a \(t\) -distribution with one degree of freedom ( \(\text{df} = 1\) ). The scaling factor \(r\) can be used to change the width of the distribution, so that either smaller or larger effect sizes become more probable. Examples of the distribution, with three common values for the scaling factor \(r\) (“medium”: \(r = \frac{\sqrt{2}}{2}\) , “wide”: \(r = 1\) , and “ultrawide”: \(r = \sqrt{2}\) ), are depicted in Figure 16.2 .

Figure 16.2: Scaled Cauchy prior distributions on the effect size \(\frac{\beta_1}{\sigma_\epsilon}\)

Rouder et al. ( 2009 ) call the combination of the priors for the effect size and error variance the Jeffreys-Zellner-Siow prior (JZS prior). The “default” Bayesian t-test is to compare the model with these priors to one which assumes \(\beta_1 = 0\) , i.e. a model with a prior \(p(\beta_1 = 0) = 1\) and \(p(\beta_1 \neq 0) = 0\) , whilst using the same prior distributions for the other parameters ( \(\beta_0\) and \(\sigma_\epsilon\) ).

As an example, we can apply the Bayesian t-test to the data from the Tetris study analysed in Chapter 7 . Comparing the Tetris+Reactivation condition to the Reactivation-Only condition, and setting the scale of the prior distribution for the effects size in the alternative MODEL 1 to \(r=1\) , provides a Bayes factor comparing the alternative hypothesis \(H_1\) ( \(\beta \neq 0\) ) to the null-hypothesis \(H_0\) ( \(\beta_1 = 0\) ) of \(\text{BF}_{1,0} = 17.225\) , which can be interpreted as strong evidence against the null hypothesis.

As we indicated earlier, the value of the Bayes factor depends on the prior distribution for the tested parameter in the model representing the alternative hypothesis. This dependence is shown in Figure 16.3 by varying the scaling factor \(r\) .

Figure 16.3: Bayes factor \(\text{BF}_{1,0}\) testing equivalence of the means of the Tetris+Reactivation and Reactivation-Only conditions for different values of the scaling factor \(r\) of the scaled Cauchy distribution.

As this figure shows, the Bayes factor is small for values of \(r\) close to 0. The lower the value of \(r\) , the less wide the resulting Cauchy distribution becomes. In the limit, as \(r\) reaches 0, the prior distribution in the alternative model becomes the same as that of the null model (i.e., assigning only probability to the value \(\beta_1 = 0\) ). This makes the models indistinguishable, and the Bayes factor would be 1, regardless of the data. As \(r\) increases in value, we see that the Bayes factor quickly rises, showing support for the alternative model. For this data, the Bayes factor is largest for a scaling factor just below \(r=1\) . When the prior distribution becomes wider than this, the Bayes factor decreases again. This is because the prior distribution then effectively assigns too much probability to high values of the effect size, and as a result lower probability to small and medium values of the effect size. At some point, the probability assigned to the effect size in the data becomes so low, that the null model will provide a better account of the data than the alternative model. A plot like the one in Figure 16.3 is useful to inspect the robustness of a test result to the specification of the prior distribution. In this case, the Bayes factor shows strong evidence ( \(\text{BF}_{1,0} > 10\) ) for a wide range of sensible values of \(r\) , and hence one might consider the test result quite robust. You should not use a plot like this to determine the “optimal” choice of the prior distribution (i.e. the one with the highest Bayes factor). If you did this, then the prior distribution would depend on the data, which is sometimes referred to as “double-dipping”. You would then end up with similar issues as in Frequentist hypothesis testing, where substituting an unknown parameter with a maximum likelihood estimate biases the likelihood ratio to favour the alternative hypothesis, which we then needed to correct for by considering the sampling distribution of the likelihood ratio statistic under the assumption that the null hypothesis is true. A nice thing about Bayes factors is that we do not need to worry about such complications. But that changes if you try to “optimise” a prior distribution by looking at the data.

16.4 Bayes factors for General Linear Models

The suggested default prior distributions can be generalized straightforwardly to more complex versions of the General Linear Model, such as multiple regression ( Liang, Paulo, Molina, Clyde, & Berger, 2008 ) and ANOVA models ( Rouder, Morey, Speckman, & Province, 2012 ) , by specifying analogous JZS prior distributions over all parameters. This provides a means to test each parameter in a model individually, as well as to compute omnibus tests by comparing a general model to one where the prior distribution allows only a single value (i.e.  \(\beta_j = 0\) ) for multiple parameters.
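For readers who want a rough sense of such nested-model comparisons in code, the sketch below uses the BIC approximation \(\text{BF}_{1,0} \approx \exp((\text{BIC}_0 - \text{BIC}_1)/2)\) (Wagenmakers, 2007) with simulated stand-in data. This is a different (and cruder) choice than the JZS priors used for the tables in this chapter, and all variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (not the Speed Dating data analysed in Table 16.2)
rng = np.random.default_rng(1)
n = 200
dat = pd.DataFrame({"attr": rng.normal(size=n),
                    "intel": rng.normal(size=n),
                    "fun": rng.normal(size=n)})
dat["like"] = (0.35 * dat["attr"] + 0.25 * dat["intel"] + 0.4 * dat["fun"]
               - 0.04 * dat["attr"] * dat["intel"] + rng.normal(scale=1.0, size=n))

# Full model versus a model with one interaction term removed
full = smf.ols("like ~ attr * intel + fun * intel", data=dat).fit()
reduced = smf.ols("like ~ attr * intel + fun + intel", data=dat).fit()  # drops fun:intel

# BIC approximation to the Bayes factor favouring the full model
bf10_approx = np.exp((reduced.bic - full.bic) / 2)
print(round(bf10_approx, 3))
```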

Table 16.2 shows the results of a Bayesian equivalent to the moderated regression model discussed in Section 6.1.5 . The results generally agree with those of the frequentist tests employed there, although the evidence for the interaction between fun and intelligence can be classified as “anecdotal”.

Table 16.2: Results of a Bayesian regression analysis for the Speed Dating data (cf Table ) with a default JZS prior with ‘medium’ scaling factor \(r = \sqrt{2}/4\) (for regression models, default scaling factors are \(\sqrt{2}/4\), \(1/2\), and \(\sqrt{2}/2\) for medium, wide, and ultrawide, respectively). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{attr}\) > 1000
\(\texttt{intel}\) > 1000
\(\texttt{fun}\) > 1000
\(\texttt{attr} \times \texttt{intel}\) 37.46
\(\texttt{fun} \times \texttt{intel}\) 2.05

Table 16.3 shows the Bayesian equivalent of the factorial ANOVA reported in Section 8.2.1 . The results show “extreme” evidence for an effect of experimenter belief, and no evidence for an effect of power prime, nor for an interaction between power prime and experimenter belief. In the Frequentist null-hypothesis significance test, the absence of a significant test result cannot be taken as direct evidence for the null hypothesis. There is actually no straightforward way to quantify the evidence for the null hypothesis in a Frequentist framework. This is not so for Bayesian hypothesis tests. Indeed, the Bayes factor directly quantifies the relative evidence for either the alternative or null hypothesis. Hence, we find “moderate” evidence that the null hypothesis is true for power prime, and for the interaction between power prime and experimenter belief. This ability to quantify evidence both for and against the null hypothesis is one of the major benefits of a Bayesian hypothesis testing procedure.

Table 16.3: Results of a Bayesian factorial ANOVA analysis for the social priming data (cf Table ) with a default JZS prior with a ‘medium’ scaling factor of \(r = 1/2\) (for ANOVA models, default scaling factors are \(1/2\), \(\sqrt{2}/2\), and \(1\) for medium, wide, and ultrawide, respectively; this assumes standard effect coding for the contrast-coded predictors, which then matches the priors to those set for the linear regression model). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{P}\) 0.127
\(\texttt{B}\) 537.743
\(\texttt{P} \times \texttt{B}\) 0.216

16.5 Some objections to null-hypothesis significance testing

Above, we have presented a Bayesian alternative to the traditional Frequentist null-hypothesis significance testing (NHST) procedure. While still the dominant method of statistical inference in psychology, the appropriateness of NHST has been hotly debated almost since its inception ( Cohen, 1994 ; Nickerson, 2000 ; Wagenmakers, 2007 ) . One issue is that statistical significance is not the same as “theoretical” or “practical” significance. For a given true effect not equal to 0, the (expected) \(p\) -value becomes smaller and smaller as the sample size increases, because of the increased power in detecting that effect. As a result, even the smallest effect size will become significant for a sufficiently large sample size. For example, a medicine might result in a significant decrease of a symptom compared to a placebo, even if the effect is hardly noticeable to the patient. I should point out that this is more an issue with testing a “point” null hypothesis (e.g. the hypothesis that the effect is exactly equal to 0) than an issue with the Frequentist procedure per se. It is an important limitation of null hypothesis testing procedures in general. A related objection is that a point null hypothesis is unlikely to ever be exactly true. Thompson ( 1992 ) states the potential issues strongly as:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then, conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge. ( Thompson, 1992, p. 436 )

There are other objections, which I will go into in the following sections.

16.5.1 The \(p\) -value is not a proper measure of evidential support

It is common practice to interpret the magnitude of the \(p\) -value as an indication of the strength of the evidence against the null hypothesis. That is, a smaller \(p\) -value is taken to indicate stronger evidence against the null hypothesis than a larger \(p\) -value. Indeed, Fisher himself seems to have subscribed to this view ( Wagenmakers, 2007 ) . While it is true that the magnitude is often correlated with the strength of evidence, there are some tricky issues regarding this. If a \(p\) -value were a “proper” measure of evidential support, then if two experiments provide the same \(p\) -value, they should provide the same support against the null hypothesis. But what if the first experiment had a sample size of 10, and the second a sample size of 10,000? Would a \(p\) -value of say \(p=.04\) indicate the same evidence against the null-hypothesis? The general consensus is that sample size is an important consideration in the interpretation of the \(p\) -value, although not always for the same reason. On the one hand, many researchers argue that the \(p\) -value of the larger study provides stronger evidence, possibly because the significant result in the larger study might be less likely due to random sample variability (see e.g. Rosenthal & Gaito, 1963 ) . On the other hand, it can be argued that the smaller study actually provides stronger evidence, because to obtain the same \(p\) -value, the effect size must be larger in the smaller study. Bayesian analysis suggests the latter interpretation is the correct one ( Wagenmakers, 2007 ) . That the same \(p\) -value can indicate a different strength of evidence means that the \(p\) -value does not directly reflect evidential support (at least not without considering the sample size).

Another thing worth pointing out is that, if the null hypothesis is true, any \(p\) -value is equally likely. This is by definition. Remember that the \(p\) -value is defined as the probability of obtaining the observed value of the test statistic, or one more extreme, assuming the null-hypothesis is true. A \(p\) -value of say \(p=.04\) indicates that you would expect to find an equal or more extreme value of the test statistic in 4% of all possible replications of the experiment. Conversely, in 4% of all replications you would obtain a \(p\) -value of \(p \leq .04\) . For a \(p\) -value of \(p=.1\) , you would expect to find a similar or smaller \(p\) -value in 10% of all replications of the experiment. The only distribution for which the probability of obtaining a value equal to or smaller than \(p\) is itself equal to \(p\) , i.e.  \(P(p\text{-value} \leq p) = p\) , is the uniform distribution. So, when the null hypothesis is true, there is no reason to expect a large \(p\) -value, because every \(p\) -value is equally likely. When the null hypothesis is false, smaller \(p\) -values are more likely than larger \(p\) -values, especially as the sample size increases. This is shown by simulation for a one-sample t-test in Figure 16.4 . Under the null hypothesis (left plot), the distribution of the \(p\) -values is uniform.
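This behaviour is easy to verify by simulation. The sketch below collects one-sample t-test \(p\) -values under the null hypothesis and under a small true effect; the sample size, effect size, and number of simulations are illustrative choices and are not meant to reproduce Figure 16.4 exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_pvals(delta, n, n_sim=10_000):
    """p-values of a one-sample t-test of H0: mu = 0, for true effect size delta."""
    data = rng.normal(loc=delta, scale=1.0, size=(n_sim, n))
    return stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

p_null = simulate_pvals(delta=0.0, n=25)   # roughly uniform on [0, 1]
p_alt = simulate_pvals(delta=0.3, n=25)    # skewed towards small values

print((p_null < 0.05).mean())  # close to the nominal 0.05
print((p_alt < 0.05).mean())   # the power for this effect size and sample size
```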

Figure 16.4: Distribution of \(p\) -values for 10,000 simulations of a one-sample \(t\) -test. \(\delta = \frac{\mu - \mu_0}{\sigma}\) refers to the effect size. Under the null hypothesis (left plot; \(\delta = 0\) ) the distribution of the \(p\) -values is uniform. When the null-hypothesis is false ( \(\delta = .3\) ), the distribution is skewed, with smaller \(p\) -values being more probable, especially when the sample size is larger (compare the middle plot with \(n=10\) to the right-hand plot with \(n=50\) ).

16.5.2 The \(p\) -value depends on researcher intentions

The sampling distribution of a test statistic is the distribution of the values of the statistic calculated for an infinite number of datasets produced by the same Data Generating Process (DGP). The DGP includes all the relevant factors that affect the data, including not only characteristics of the population under study, but also characteristics of the study, such as whether participants were randomly sampled, how many participants were included, which measurement tools were used, etc. Choices such as when to stop collecting data are part of the study design. That means that the same data can have a different \(p\) -value, depending on whether the sample size was fixed a priori, or whether sampling continued until some criterion was reached. The following story, paraphrased from ( Berger & Wolpert, 1988, pp. 30–33 ) , may highlight the issue:

A scientist has obtained 100 independent observations that are assumed to be Normal-distributed with mean \(\mu\) and standard deviation \(\sigma\) . In order to test the null hypothesis that \(\mu=0\) , the scientist consults a Frequentist statistician. The mean of the observations is \(\overline{Y} = 0.2\) , and the sample standard deviation is \(S_Y=1\) , hence the \(p\) -value is \(p = .0482\) , which is a little lower than the adopted significance level of \(\alpha = .05\) . This leads to a rejection of the null hypothesis, and a happy scientist. However, the statistician decides to probe deeper and asks the scientist what she would have done had the experiment not yielded a significant result after 100 observations. The scientist replies that she would have collected another 100 observations. As such, the implicit sampling plan was not to collect \(n=100\) observations and stop, but rather to first take 100 observations and check whether \(p <.05\) , and to collect another 100 observations (resulting in \(n=200\) ) if not. This is a so-called sequential testing procedure, which requires a different treatment than a fixed-sampling procedure. In controlling the Type 1 error of the procedure as a whole, one would need to consider the possible results after \(n=100\) observations, but also after \(n=200\) observations, which is possible, but not straightforward, as the results after \(n=200\) observations depend on the results after the first \(n=100\) observations. But the clever statistician works it out and then convinces the scientist that the appropriate \(p\) -value for this sequential testing procedure is no longer significant. The puzzled and disappointed scientist leaves to collect another 100 observations. After lots of hard work, the scientist returns, and the statistician computes a \(p\) -value for the new data, which is now significant. Just to make sure the sampling plan is appropriately reflected in the calculation, the statistician asks what the scientist would have done if the result had not been significant at this point. The scientist answers: “This would depend on the status of my funding. If my grant is renewed, I would test another 100 observations. If my grant is not renewed, I would have had to stop the experiment. Not that this matters, of course, because the data were significant anyway.” The statistician then explains that the correct inference depends on the grant renewal; if the grant is not renewed, the sampling plan stands and no correction is necessary. But if the grant is renewed, the scientist could have collected more data, which calls for a further correction, similar to the first one. The annoyed scientist then leaves and resolves to never again share with the statistician her hypothetical research intentions.

What this story shows is that, in considering infinite possible repetitions of a study, everything about the study that might lead to variations in the results should be taken into account. This includes a scientist’s decisions made during each hypothetical replication of the study. As such, the interpretation of the data at hand (i.e., whether the hypothesis test is significant or not significant) depends on hypothetical decisions in situations that did not actually occur. If exactly the same data had been collected by a scientist who would not have collected more observations, regardless of the outcome of the first test, then the result would have been judged significant. So the same data can provide different evidence.

This does not mean the Frequentist NHST is inconsistent. The procedure “does what it says on the tin”, namely providing a bound on the rate of Type 1 errors in decisions when the null hypothesis is true. In considering the accuracy of the decision procedure, we need to consider all situations in which a decision might be made in the context of a given study. This means considering the full design of the study, including the sampling plan, as well as, to some extent, the analysis plan. For instance, if you were to “explore” the data, trying out different ways to analyse it, e.g. by including or excluding potential covariates and applying different criteria for excluding participants or their responses until you obtain a significant test result for an effect of interest, then the significance level \(\alpha\) for that test needs to be adjusted to account for such a fishing expedition. This fishing expedition is also called p-hacking ( Simmons, Nelson, & Simonsohn, 2011 ) and there really isn’t a suitable correction for it. Although corrections for multiple comparisons exist, which allow you to test all possible comparisons within a single model (e.g. the Scheffé correction), when you go on to consider different models, and different subsets of the data to apply them to, all bets are off. This, simply put, is just really bad scientific practice. And it renders the \(p\) -value meaningless.
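The consequence of the implicit sequential plan in the story above is easy to demonstrate by simulation. The sketch below applies the naive two-stage rule (test after 100 observations; if not significant, collect another 100 and test again) to data generated under a true null hypothesis; the rejection rate it reports exceeds the nominal 5%, which is exactly why the statistician insists on a correction. Sample sizes and the number of simulations are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sim, alpha = 20_000, 0.05
rejections = 0

for _ in range(n_sim):
    first = rng.normal(size=100)              # data generated under H0: mu = 0
    if stats.ttest_1samp(first, 0.0).pvalue < alpha:
        rejections += 1                       # stop early and "reject"
        continue
    extended = np.concatenate([first, rng.normal(size=100)])
    if stats.ttest_1samp(extended, 0.0).pvalue < alpha:
        rejections += 1                       # reject at the second look

print(rejections / n_sim)  # noticeably larger than the nominal 0.05
```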

16.5.3 Results of a NHST are often misinterpreted

I have said it before, and I will say it again: the \(p\) -value is the probability of observing a particular value of a test statistic, or one more extreme, given that the null-hypothesis is true. This is the proper, and only, interpretation of the \(p\) -value. It is a tricky one, to be sure, and the meaning of the \(p\) -value is often misunderstood. Some common misconceptions (see e.g., Nickerson, 2000 ) are:

  • The \(p\) -value is the probability that the null-hypothesis is true, given the data, i.e.  \(p = p(H_0|\text{data})\) . This posterior probability can be calculated in a Bayesian framework, but not in a Frequentist one.
  • One minus the \(p\) -value is the probability that the alternative hypothesis is true, given the data, i.e.  \(1-p = p(H_1|\text{data})\) . Again, the posterior probability of the alternative hypothesis can be obtained in a Bayesian framework, when the alternative hypothesis is properly defined by a suitable prior distribution. In the conventional Frequentist NHST, the alternative hypothesis is so poorly defined, that it can’t be assigned any probability (apart from perhaps \(p(H_1) = p(H_1|\text{data}) = 1\) , which does not depend on the data, and just reflects that e.g.  \(-\infty \leq \mu - \mu_0 \leq \infty\) will have some value).
  • The \(p\) -value is the probability that the results were due to random chance. If you take a statistical model seriously, then all results are, to some extent, due to random chance. Trying to work out the probability that results are due to random chance, when they always are to some extent, seems a rather pointless exercise (if you want to know the answer, it is 1. It would have been more fun if the answer was 42, but alas, the scale of probabilities does not allow this particular answer).

Misinterpretations of \(p\) -values are mistakes by practitioners, and do not indicate a problem with NHST itself. However, they do point to a mismatch between what the procedure provides and what the practitioner would like the procedure to provide. If one desires to know the probability that the null hypothesis is true, or the probability that the alternative hypothesis is true, then one has to use a Bayesian procedure. Unless you consider a wider context in which the truth of hypotheses can be sampled from a distribution, there is no “long-run frequency” for the truth of hypotheses, and hence no Frequentist definition of that probability.

16.6 To Bayes or not to Bayes? A pragmatic view

At this point, you might feel slightly annoyed. Perhaps even very annoyed. We have spent all the preceding chapters focusing on the Frequentist null hypothesis significance testing procedure, and after all that work I’m informing you of these issues. Why? Was all that work for nothing?

No, obviously not. Although much of the criticism regarding the NHST is appropriate, as long as you understand what it does and apply the procedure properly, there is no need to abandon it. The NHST is designed to limit the rate of Type 1 errors (rejecting the null hypothesis when it is true). It does this well. And, when using the appropriate test statistic, in the most powerful way possible. Limiting Type 1 errors is, whilst modest, a reasonable concern in scientific practice. The Bayesian alternative allows you to do more, such as evaluate the relative evidence for and against the null hypothesis, and even calculate the posterior probability of both (as long as you are willing to assign a prior probability to both as well).

An advantage of the NHST is its “objectivity”: once you have determined a suitable distribution of the data, and decided on a particular value for a parameter to test, there are no other decisions to make apart from setting the significance level of the test. In the Bayesian hypothesis testing procedure, you also need to specify a prior distribution for the parameter of interest in the alternative hypothesis. Although it is inherently important to consider what parameter values you would expect if the null hypothesis were false, this is often not straightforward when you start a research project, or when you rely on measures you have not used before in a particular context. Although much work has been devoted to deriving sensible “default priors”, I don’t believe there is a sensible objective prior applicable to all situations. Given the freedom to choose a prior distribution for the alternative hypothesis, this makes the Bayesian testing procedure inherently subjective. This is perfectly in keeping with the subjectivist interpretation of probability as the rational belief of an agent endowed with (subjective) prior beliefs. Moreover, as you accumulate more and more data, the effect of the prior beliefs “washes out” (as long as you don’t assign a probability of zero to the true parameter value).

My pragmatic answer to the question of whether you should use a Bayesian test or a Frequentist one is then the following: if you can define a suitable prior distribution to reflect what you expect to observe in a study, before you actually conduct that study, then use a Bayesian testing procedure. This will allow you to do what you most likely want to do, namely quantify the evidence for your hypotheses against alternative hypotheses. If you are unable to form any expectations regarding the effects within your study, you should probably consider a traditional NHST to assess whether there is an indication of any effect, and limit your Type 1 error rate in doing so. In some sense, this is a “last resort”, but in psychology, where quantitative predictions are inherently difficult, it is something I reluctantly have to rely on quite frequently. In that case, instead of a hypothesis test, you could also consider simply estimating the effect size with a suitable credible interval.

16.7 In practice

The steps involved in conducting a Bayesian hypothesis test are not too different from the steps involved in conducting a Frequentist hypothesis test, with the additional step of choosing prior distributions over the values of the model parameters.

Explore the data. Plot distributions of the data within the conditions of the experiment (if any), pairwise scatterplots between numerical predictors and the dependent variable, etc. Consider what model you might use to analyse the data, and assess the validity of the underlying assumptions.

Choose an appropriate general statistical model. In many cases, this will be a version of the GLM or an extension such as a linear mixed-effects model, perhaps using suitably transformed dependent and independent variables.

Choose appropriate prior distributions for the model parameters. This is generally the most difficult part. If you have prior data, then you could base the prior distributions on this. If not, then ideally, formulate prior distributions which reflect your beliefs about the data. You can check whether the prior distributions lead to sensible predictions by simulating data from the resulting model (i.e., computing prior predictive distributions). Otherwise, you can resort to “default” prior distributions.
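As a concrete illustration of such a prior predictive check, the sketch below simulates data sets for a two-group comparison from a Cauchy prior on the standardised effect size and a half-Cauchy prior on the error standard deviation. All of these distributions and their scales are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(7)
n_per_group, n_draws = 30, 1000
group_diffs = np.empty(n_draws)

for i in range(n_draws):
    # Draw parameters from the (assumed) priors
    sigma = abs(5.0 * rng.standard_cauchy())            # half-Cauchy(5) on sigma
    delta = (np.sqrt(2) / 2) * rng.standard_cauchy()    # Cauchy prior on effect size
    mu_diff = delta * sigma                             # raw difference in means
    # Simulate one data set and store the observed difference in group means
    a = rng.normal(0.0, sigma, size=n_per_group)
    b = rng.normal(mu_diff, sigma, size=n_per_group)
    group_diffs[i] = b.mean() - a.mean()

# Inspect whether the implied differences look plausible for the measure at hand
print(np.percentile(group_diffs, [2.5, 50, 97.5]))
```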

Conduct the analysis. To test null-hypotheses, compare the general model to a set of restricted models which fix a parameter to a particular value (e.g. 0), and compute the Bayes Factor for each of these comparisons. To help you interpret the magnitude of the Bayes Factor, you can consult Table 16.1 . Where possible, consider conducting a robustness analysis, by e.g. varying the scaling factor of the prior distributions. This will inform you about the extent to which the results hinge on a particular choice of prior, or whether they hold for a range of prior distributions.

Report the results. Make sure that you describe the statistical model, as well as the prior distributions chosen. The latter is crucial, as Bayes Factors are not interpretable without knowing the prior distributions. For example, the results of the analysis in Table 16.2 , with additional results from the posterior parameter distributions, may be reported as:

To analyse the effect of rated attractiveness, intelligence, and fun on the liking of dating partners, we used a Bayesian linear regression analysis ( Rouder & Morey, 2012 ) . In the model, we allowed the effect of attractiveness and fun to be moderated by intelligence. All predictors were mean-centered before entering the analysis. We used a default JZS-prior for all parameters, with a medium scaling factor of \(r = \sqrt{2}/4\) , as recommended by Richard D. Morey & Rouder ( 2018 ) . The analysis showed “extreme” evidence for effects of attractiveness, intelligence, and fun ( \(\text{BF}_{1,0} > 1000\) ; comparing the model to one with a point-prior at 0 for each effect). All effects were positive, with the posterior means of the slopes equalling \(\hat{\beta}_\text{attr} = 0.345\) , 95% HDI [0.309; 0.384], \(\hat{\beta}_\text{intel} = 0.257\) , 95% HDI [0.212; 0.304], and \(\hat{\beta}_\text{fun} = 0.382\) , 95% HDI [0.342; 0.423]. In addition, we found “very strong” evidence for a moderation of the effect of attractiveness by intelligence ( \(\text{BF}_{1,0} = 37.459\) ). For every one-unit increase in rated intelligence, the effect of attractiveness was reduced by 0.043 ( \(\hat{\beta}_{\text{attr} \times \text{intel}} = -0.043\) , 95% HDI [-0.066; -0.02]). There was only “anecdotal” evidence for a moderation of the effect of fun by intelligence ( \(\text{BF}_{1,0} = 2.052\) ). Although we don’t place too much confidence in this result, it indicates that for every one-unit increase in rated intelligence, the effect of fun increased by \(\hat{\beta}_{\text{fun} \times \text{intel}} = 0.032\) , 95% HDI [0.01; 0.055].

16.8 “Summary”

Figure 16.5: ‘Piled Higher and Deeper’ by Jorge Cham www.phdcomics.com. Source: https://phdcomics.com/comics/archive.php?comicid=905


bnpy: Bayesian nonparametric machine learning for Python

Project Website • Example Gallery • Installation • Team • Academic Papers • Report an Issue

This Python module provides code for training popular clustering models on large datasets. We focus on Bayesian nonparametric models based on the Dirichlet process, but also provide parametric counterparts.

bnpy supports the latest online learning algorithms as well as standard offline methods. Our aim is to provide an inference platform that makes it easy for researchers and practitioners to compare models and algorithms.

Supported probabilistic models (aka allocation models)

Mixture models

  • FiniteMixtureModel : fixed number of clusters
  • DPMixtureModel : infinite number of clusters, via the Dirichlet process

Topic models (aka admixtures models)

  • FiniteTopicModel : fixed number of topics. This is Latent Dirichlet allocation.
  • HDPTopicModel : infinite number of topics, via the hierarchical Dirichlet process

Hidden Markov models (HMMs)

  • FiniteHMM : Markov sequence model with a fixed number of states
  • HDPHMM : Markov sequence models with an infinite number of states

Supported data observation models (aka likelihoods)

  • Gauss : Full-covariance
  • DiagGauss : Diagonal-covariance
  • ZeroMeanGauss : Zero-mean, full-covariance
  • AutoRegGauss

Supported learning algorithms:

These are all variants of variational inference , a family of optimization algorithms.

Example Gallery

You can find many examples of bnpy in action in our curated Example Gallery .

These same demos are also directly available as Python scripts inside the examples/ folder of the project Github repository .

Quick Start

You can use bnpy from a command line/terminal, or from within Python. Both options require specifying a dataset, an allocation model, an observation model (likelihood), and an algorithm. Optional keyword arguments with reasonable defaults allow control of specific model hyperparameters, algorithm parameters, etc.

Below, we show how to call bnpy to train an 8-component Gaussian mixture model on a default toy dataset stored in a .csv file on disk. In both cases, log information is printed to stdout, and all learned model parameters are saved to disk.

Calling from the terminal/command-line

Calling directly from Python

Advanced examples

Train Dirichlet-process Gaussian mixture model (DP-GMM) via full-dataset variational algorithm (aka "VB" for variational Bayes).

Train DP-GMM via memoized variational, with birth and merge moves, with data divided into 10 batches.
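The fenced code originally shown under the Quick Start headings above did not survive extraction. The following is a hedged reconstruction based on the surrounding description and the project's documented bnpy.run interface; the dataset path, keyword arguments, and output location are assumptions and should be checked against the bnpy documentation.

```python
# Hedged reconstruction of the Quick Start example: train an 8-component
# Gaussian mixture on a toy CSV dataset with full-dataset variational Bayes.
# The equivalent terminal call is roughly:
#   python -m bnpy.Run /path/to/dataset.csv FiniteMixtureModel Gauss VB --K 8
import bnpy

trained_model, info_dict = bnpy.run(
    '/path/to/dataset.csv',   # assumed path to a toy dataset stored on disk
    'FiniteMixtureModel',     # allocation model: fixed number of clusters
    'Gauss',                  # observation model: full-covariance Gaussian
    'VB',                     # learning algorithm: full-dataset variational Bayes
    K=8,                      # number of mixture components
    output_path='/tmp/quickstart-gmm/')

# The advanced examples above would swap in 'DPMixtureModel' and 'memoVB',
# e.g. (assumed keywords): bnpy.run(..., 'DPMixtureModel', 'Gauss', 'memoVB',
#                                   moves='birth,merge', nBatch=10)
```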

Installation

To use bnpy for the first time, follow the documentation's Installation Instructions .

Primary investigators

Mike Hughes Assistant Professor (Aug. 2018 - present) Tufts University, Dept. of Computer Science Website: https://www.michaelchughes.com

Erik Sudderth Professor University of California, Irvine Website: https://www.ics.uci.edu/~sudderth/

Contributors

  • Soumya Ghosh
  • William Stephenson
  • Sonia Phene
  • Leah Weiner
  • Alexis Cook
  • Mert Terzihan
  • Jincheng Li
  • Xi Chen (Tufts)

Academic Papers

Conference publications based on bnpy

NIPS 2015 HDP-HMM paper

Our NIPS 2015 paper describes inference algorithms that can add or remove clusters for the sticky HDP-HMM.
  • "Scalable adaptation of state complexity for nonparametric hidden Markov models." Michael C. Hughes, William Stephenson, and Erik B. Sudderth. NIPS 2015. [paper] [supplement] [scripts to reproduce experiments]

AISTATS 2015 HDP topic model paper

Our AISTATS 2015 paper describes our algorithms for HDP topic models.
  • "Reliable and scalable variational inference for the hierarchical Dirichlet process." Michael C. Hughes, Dae Il Kim, and Erik B. Sudderth. AISTATS 2015. [paper] [supplement] [bibtex]

NIPS 2013 DP mixtures paper

Our NIPS 2013 paper introduced the memoized variational inference algorithm and applied it to Dirichlet process mixture models.
  • "Memoized online variational inference for Dirichlet process mixture models." Michael C. Hughes and Erik B. Sudderth. NIPS 2013. [paper] [supplement] [bibtex]

Workshop papers

Our short paper from a workshop at NIPS 2014 describes the vision for bnpy as a general purpose inference engine.
  • "bnpy: Reliable and scalable variational inference for Bayesian nonparametric models." Michael C. Hughes and Erik B. Sudderth. Probabilistic Programming Workshop at NIPS 2014. [paper]

Target Audience

Primarily, we intend bnpy to be a platform for researchers. By gathering many learning algorithms and popular models in one convenient, modular repository, we hope to make it easier to compare and contrast approaches. We also hope that the modular organization of bnpy enables researchers to try out new modeling ideas without reinventing the wheel.


Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers


1 Introduction

2 An Example

2.1 Is C Faster than Python?

2.2 The Initial Experiment

[Table: outcomes of the 28 individual trials]
# of Trials (\(N\)): 28
Frequency[C faster] (\(n\)): 17
Proportion[C faster] (\(r\)): 60.71%

2.3 Formulating the Hypotheses

3 The Frequentist Approach

3.1 The General Process

3.2 Choosing the Right Test for the Example

3.3 Calculating the Test Statistic

3.4 Interpreting the Test Statistic

\(r\) (observed) = 0.6071, \(N\) = 28
\(r_0\) (expected) = 0.5000
Diff = 0.1071, Conf. Level = 0.95
StdErr = 0.0945, \(z\)-critical = 1.6449
\(z\)-score = 1.1339, p-val = 0.1284
Lower C.L. = 0.4517, Reject \(H_0\)? No
Odds: \(H_0\) = 1.0000, \(H_1\) = 1.5455, O.R. = 1.5455

4 The Bayesian Approach

4.1 Prior Knowledge

We know from past work that C programs are generally faster than their Java counterparts. The past work shows C is faster than Java between 40% and 80% of the time. We have valid reasons for believing that the C results could be even better relative to Python than they were relative to Java.


4.2 Some Terminology

4.3 Bayes Factor

4.4 Interpretation Guidance for Bayes Factor


5 Calculating Bayes Factor

5.1 Marginal Likelihoods for Initial Experiment

\(r\) | 40% | 50% | 60% | 70% | 80%
\(P[r]\) | 0.2000 | 0.2000 | 0.2000 | 0.2000 | 0.2000
\(H_0\) or \(H_1\)? | \(H_0\) | \(H_0\) | \(H_1\) | \(H_1\) | \(H_1\)
\(P[r|H_0]\) | 0.5000 | 0.5000 | 0.0000 | 0.0000 | 0.0000
\(P[r|H_1]\) | 0.0000 | 0.0000 | 0.3333 | 0.3333 | 0.3333

\(P[H_0]\) = 0.4000
\(P[H_1]\) = 0.6000
\(\mathrm{Prior\ Odds}_{10}\) = 1.5000
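To make the mechanics concrete, the sketch below computes marginal likelihoods and a Bayes factor for the initial experiment (17 of 28 trials favouring C) using the discrete prior grid from the table above. The paper itself evaluates continuous marginal likelihood integrals (Section 5.3), so this discrete illustration should not be expected to reproduce the Bayes factors reported in the later tables.

```python
import numpy as np
from scipy import stats

n_trials, n_c_faster = 28, 17

# Discrete grid of candidate proportions r, each with prior mass 0.2
r_grid = np.array([0.4, 0.5, 0.6, 0.7, 0.8])
prior_r = np.full(5, 0.2)
under_h1 = r_grid > 0.5                    # H1: C is faster more than half the time

# Marginal likelihood of the data under each hypothesis: average the binomial
# likelihood over the prior on r, restricted to the values that hypothesis allows
lik = stats.binom.pmf(n_c_faster, n_trials, r_grid)
m0 = np.sum(lik * prior_r * ~under_h1) / prior_r[~under_h1].sum()
m1 = np.sum(lik * prior_r * under_h1) / prior_r[under_h1].sum()

bf10 = m1 / m0
prior_odds = prior_r[under_h1].sum() / prior_r[~under_h1].sum()   # 0.6 / 0.4 = 1.5
posterior_odds = bf10 * prior_odds
print(round(bf10, 3), round(posterior_odds, 3))
```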

5.2 Conditional Likelihoods of an Observation under a Specific Proportion

5.3 Evaluating Marginal Likelihood Integrals

5.4 Final Step: Bayes Factor


6 Posterior Model

6.1 Transforming Prior Distribution to Posterior Distribution


6.2 Answering Additional Questions Using the Posterior Distribution

6.3 Posterior Odds

7 Building Evidence through Replications

7.1 Research Trajectory: More Datasets, More Samples

Study | Acronym | # of trials (\(N\)) | Freq. C faster (\(n\))
Initial Experiment | Exp | 28 | 17
Replication 1 | Rep1 | 25 | 19
Replication 2 | Rep2 | 12 | 8
Replication 3 | Rep3 | 12 | 9
Replication 4 | Rep4 | 11 | 8
Replication 5 | Rep5 | 10 | 7
Replication 6 | Rep6 | 5 | 4

7.2 Analyzing Replications with the Frequentist Approach

Study | # of trials (\(N\)) | Freq. C faster (\(n\)) | Prop. C faster (\(r = n/N\)) | Sig. (p-val) | Effect Size (O.R.) | Rej. \(H_0\)? | L.C. Limit (L.C.L.) | Param. Uncert. (Max. \(-\) L.C.L.)
Exp1 | 28 | 17 | 60.71% | 0.1284 | 1.5455 | No | 45.17% | 34.83%
Rep1 | 25 | 19 | 76.00% | 0.0047 | 3.1667 | Yes | 59.55% | 20.45%
Rep2 | 12 | 8 | 66.67% | 0.1241 | 2.0000 | No | 42.93% | 37.07%
Rep3 | 12 | 9 | 75.00% | 0.0416 | 3.0000 | Yes | 51.26% | 28.74%
Rep4 | 11 | 8 | 72.73% | 0.0658 | 2.6667 | No | 47.93% | 32.07%
Rep5 | 10 | 7 | 70.00% | 0.1030 | 2.3333 | No | 43.99% | 36.01%
Rep6 | 10 | 8 | 80.00% | 0.0289 | 4.0000 | Yes | 53.99% | 26.01%
Rep7 | 5 | 4 | 80.00% | 0.1875* | 4.0000 | No | N/A | N/A


7.3 What Else Can We Do?

7.4 Analyzing Replications Incrementally with the Bayesian Approach


Study | # of trials (\(N\)) | Freq. C faster (\(n\)) | Prop. C faster (\(r = n/N\)) | BF\(_{10}\) | Odds\(_{10}\) (Posterior Odds) | Param. Uncert. (Stdev)
Prior | N/A | N/A | N/A | N/A | 1.5 | 61.25%
Exp1 | 28 | 17 | 60.71% | 1.7909 | 2.6863 | 38.99%
Rep1 | 25 | 19 | 76.00% | 18.7150 | 50.275 | 27.82%
Rep2 | 12 | 8 | 66.67% | 1.8118 | 91.089 | 25.10%
Rep3 | 12 | 9 | 75.00% | 3.906 | 355.861 | 22.13%
Rep4 | 11 | 8 | 72.73% | 2.9604 | 1053.49 | 19.76%
Rep5 | 10 | 7 | 70.00% | 2.1815 | 2298.21 | 17.97%
Rep6 | 10 | 8 | 80.00% | 5.0382 | 11578.9 | 15.69%
Rep7 | 5 | 4 | 80.00% | 2.2754 | 26346.1 | 14.96%


Study | # of trials (\(N\)) | Freq. C faster (\(n\)) | Prop. C faster (\(r = n/N\)) | BF\(_{10}\) | Odds\(_{10}\) (Posterior Odds) | Param. Uncert. (Stdev)
Prior | N/A | N/A | N/A | N/A | 4 | 37.42%
Exp1 | 28 | 17 | 60.71% | 2.0768 | 8.3073 | 22.53%
Rep1 | 25 | 19 | 76.00% | 11.5243 | 95.735 | 23.22%
Rep2 | 12 | 8 | 66.67% | 1.8039 | 172.695 | 22.59%
Rep3 | 12 | 9 | 75.00% | 3.2797 | 566.392 | 23.45%
Rep4 | 11 | 8 | 72.73% | 2.6662 | 1510.10 | 23.14%
Rep5 | 10 | 7 | 70.00% | 2.0792 | 3139.76 | 22.50%
Rep6 | 10 | 8 | 80.00% | 4.3488 | 13654.3 | 20.25%
Rep7 | 5 | 4 | 80.00% | 2.1534 | 29403.1 | 18.76%

8 Conclusions

8.1 Further Reading, Tools, and Software Engineering Applications

8.2 Takeaways






bayesian-testing 0.7.0

pip install bayesian-testing Copy PIP instructions

Released: Jun 29, 2024

Bayesian A/B testing with simple probabilities.


Project links.

  • License: MIT License (MIT)
  • Author: Matus Baniar
  • Tags ab testing, bayes, bayesian statistics
  • Requires: Python <4.0.0, >=3.7.1

Classifiers

  • OSI Approved :: MIT License
  • Python :: 3
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12

Project description


Bayesian A/B testing

bayesian_testing is a small package for quick evaluation of A/B (or A/B/C/...) tests using a Bayesian approach.

Implemented tests:

  • Input data - binary data ( [0, 1, 0, ...] )
  • Designed for conversion-like data A/B testing.
  • Input data - normal data with unknown variance
  • Designed for normal data A/B testing.
  • Input data - lognormal data with zeros
  • Designed for revenue-like data A/B testing.
  • Input data - normal data with zeros
  • Designed for profit-like data A/B testing.
  • Input data - categorical data with numerical categories
  • Designed for discrete data A/B testing (e.g. dice rolls, star ratings, 1-10 ratings, etc.).
  • Input data - non-negative integers ( [1, 0, 3, ...] )
  • Designed for poisson data A/B testing.
  • Input data - exponential data (non-negative real numbers)
  • Designed for exponential data A/B testing (e.g. session/waiting time, time between events, etc.).

Implemented evaluation metrics:

  • Expected value from the posterior distribution for a given variant.
  • Probability that a given variant is best among all variants.
  • By default, the best is equivalent to the greatest (from a data/metric point of view), however it is possible to change this by using min_is_best=True in the evaluation method (this can be useful if we try to find the variant while minimizing the tested measure).
  • "Risk" of choosing particular variant over other variants in the test.
  • Measured in the same units as a tested measure (e.g. positive rate or average value).

Probability of Being Best and Expected Loss are calculated using simulations from posterior distributions (considering given data).

Installation

bayesian_testing can be installed using pip:

Alternatively, you can clone the repository and use poetry manually:

Basic Usage

The primary features are classes:

  • BinaryDataTest
  • NormalDataTest
  • DeltaLognormalDataTest
  • DeltaNormalDataTest
  • DiscreteDataTest
  • PoissonDataTest
  • ExponentialDataTest

All test classes support two methods to insert the data:

  • add_variant_data - Adding raw data for a variant as a list of observations (or numpy 1-D array).
  • add_variant_data_agg - Adding aggregated variant data (this can be practical for large data sets, as the aggregation can already be done at the database level).

Both methods for adding data allow specification of prior distributions (see details in respective docstrings). Default prior setup should be sufficient for most of the cases (e.g. cases with unknown priors or large amounts of data).

To get the results of the test, simply call the method evaluate .

Probability of being best and expected loss are approximated using simulations, hence the evaluate method can return slightly different values for different runs. To stabilize it, you can set the sim_count parameter of the evaluate to a higher value (default value is 20K), or even use the seed parameter to fix it completely.

Class for a Bayesian A/B test for the binary-like data (e.g. conversions, successes, etc.).
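As a minimal usage sketch (the import path and keyword-argument names follow the package description above but are assumptions to be checked against the docstrings, and the counts are made up), a binary test could be run along these lines:

```python
# Minimal usage sketch; import path and argument names are assumptions
# based on the package description above.
from bayesian_testing.experiments import BinaryDataTest

test = BinaryDataTest()

# Aggregated data: total observations and number of successes per variant
# (the counts below are made up for illustration)
test.add_variant_data_agg("A", totals=1000, positives=100)
test.add_variant_data_agg("B", totals=1000, positives=120)

# Raw 0/1 observations could be added instead via add_variant_data("C", [0, 1, ...])

results = test.evaluate(sim_count=20000, seed=52)
print(results)
```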

Class for a Bayesian A/B test for the normal data.

Class for a Bayesian A/B test for the delta-lognormal data (log-normal with zeros). Delta-lognormal data is a typical case of revenue-per-session data, where many sessions have 0 revenue but the non-zero values are positive and approximately log-normally distributed. To handle this data, the calculation combines a binary Bayes model for zero vs. non-zero "conversions" with a log-normal model for the non-zero values.

Note : Alternatively, DeltaNormalDataTest can be used for a case when conversions are not necessarily positive values.

Class for a Bayesian A/B test for the discrete data with finite number of numerical categories (states), representing some value. This test can be used for instance for dice rolls data (when looking for the "best" of multiple dice) or rating data (e.g. 1-5 stars or 1-10 scale).

Class for a Bayesian A/B test for the poisson data.

note: Since we set min_is_best=True (because received goals are "bad"), probability and loss are in favor of variants with lower posterior means.

Class for a Bayesian A/B test for the exponential data.

Development

To set up a development environment, use Poetry and pre-commit :

To be implemented

Additional metrics:

  • Potential Value Remaining
  • Credible Interval

The bayesian_testing package itself depends only on the numpy package. Work on this package (including the selection of default priors) was inspired mainly by the Coursera course Bayesian Statistics: From Concept to Data Analysis .




A Review of Bayesian Hypothesis Testing and Its Practical Implementations

Associated Data

The sleep and the ToothGrowth datasets are built in R. The Poisson repeated-measures dataset is simulated according to Appendix A .

Hypothesis testing and the comparison of competing theories in light of observed or experimental data are fundamental endeavors in the sciences. Issues associated with the p -value approach and null hypothesis significance testing are reviewed, and the Bayesian alternative based on the Bayes factor is introduced, along with a review of computational methods and sensitivity related to prior distributions. We demonstrate how Bayesian testing can be practically implemented in several examples, such as the t -test, two-sample comparisons, linear mixed models, and Poisson mixed models, by using existing software. Caveats and potential problems associated with Bayesian testing are also discussed. We aim to inform researchers in the many fields where Bayesian testing is not in common use of a well-developed alternative to null hypothesis significance testing and to demonstrate its standard implementation.

1. Introduction

Hypothesis testing is an important tool in modern research. It is applied in a wide range of fields, from forensic analysis, business intelligence, and manufacturing quality control, to the theoretical framework of assessing the plausibility of theories in physics, psychology, and fundamental science [ 1 , 2 , 3 , 4 , 5 ]. The task of comparing competing theories based on data is essential to scientific activity, and therefore, the mechanism of conducting these comparisons requires thoughtful consideration [ 6 , 7 ].

The dominant approach for these comparisons is based on hypothesis testing using a p -value, which is the probability, under repeated sampling, of obtaining a test statistic at least as extreme as the observed under the null hypothesis [ 4 , 8 ]. Records of conceptualizing the p -value date back at least two hundred years before Ronald Fisher established the p -value terminology and technique [ 9 , 10 , 11 ]. These records are an indication of how compelling and popular the approach is, and the long history explains the widespread acceptance of a decision rule with a fixed type I error rate, which further resulted in the adoption of a 5% significance-level cutoff. Despite its prevalence, there has been an intense debate about the misuse of the p -value approach [ 7 , 12 ]. The major criticisms about the p -value are its inability to quantify evidence for the null hypothesis and its tendency to overestimate the evidence against the null hypothesis [ 4 ]. For example, a possible decision based on the p -value is the rejection of the null hypothesis but not its acceptance. Under the null hypothesis, the  p -value will have a uniform [0, 1] distribution regardless of the sample size. This is by construction. The Bayesian approach behaves rather differently under the null hypothesis, and increasing sample sizes will provide increasing degrees of evidence in favor of the null hypothesis [ 13 ].

Besides being misused, the hypothesis testing approach based on the p -value can be easily misinterpreted. A list of twenty-five examples of misinterpretations in classical hypothesis testing is provided in [ 14 ]. Eighteen of these items are directly related to the misunderstanding of the p -value, and the others are related to p -values in the context of confidence intervals and statistical power. Some of these points are also shared in [ 15 ], including the common misconceptions that a nonsignificant difference means that there is no difference between groups and that the p -value represents the chance of a type I error. The author also highlights an alternative approach, based on the Bayes factor as a measure of true evidential meaning about the hypotheses [ 16 , 17 ]. Alan Turing independently discovered this quantity, in private pages, around the same time as Jeffreys [ 16 , 18 , 19 ]. Other authors have also recommended the Bayes factor as a better solution to hypothesis testing compared with the practice of p -values and null hypothesis significance testing (NHST), specifically criticizing the p -value’s dependence on hypothetical data, which are likely to be affected by the researcher’s intentions [ 8 ].

While the majority of the issues with classical hypothesis testing are crucial and widely known, a less acknowledged but important misinterpretation happens when two or more results are compared by their degrees of statistical significance [ 20 ]. To illustrate this issue, consider the following example introduced in [ 14 ]. Suppose two independent studies have effect estimates and standard errors of 25 ± 10 and 10 ± 10. In that case, the first study has a mean that is 2.5 standard errors away from 0, being statistically significant at an alpha level of 1%. The second study has a mean that is 1 standard error away from 0 and is not statistically significant at the same alpha level. It is tempting to conclude that the results of the studies are very different. However, the estimated difference in treatment effects is 25 − 10 = 15, with a standard error of \( \sqrt{10^2 + 10^2} \approx 14 \). Thus, the mean of 15 units is only about 1 standard error away from 0, indicating that the difference between the studies is not statistically significant. If a third independent study with a much larger sample size had an effect estimate of 2.5 ± 1.0, then it would have a mean that is 2.5 standard errors away from 0 and indicate statistical significance at an alpha level of 1%, as in the first study. In this case, the difference between the results of the third and the first studies would be 22.5 with a standard error of \( \sqrt{10^2 + 1^2} \approx 10 \). Thus, the mean of 22.5 units would be more than 2 standard errors away from 0, indicating a statistically significant difference between the studies. Therefore, the researchers in [ 20 ] recommend that the statistical significance of the difference between means be considered, rather than the difference between the significance levels of the two hypotheses.
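The arithmetic in this example can be checked directly; the short snippet below just reproduces the standard errors and the distances, in standard-error units, quoted above.

```python
import math

# Study 1 vs Study 2: difference in effect estimates and its standard error
diff_12 = 25 - 10
se_12 = math.sqrt(10**2 + 10**2)          # about 14.1
print(diff_12 / se_12)                    # about 1.06 standard errors: not significant

# Study 3 vs Study 1
diff_31 = 25 - 2.5
se_31 = math.sqrt(10**2 + 1.0**2)         # about 10.05
print(diff_31 / se_31)                    # about 2.24 standard errors: significant
```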

To prevent the misuse and misinterpretation of p -values, the American Statistical Association (ASA) issued a statement clarifying six principles for the proper use and interpretation of classical significance testing [ 12 ]: (i) p -values can indicate how incompatible the data are with a specified statistical model; (ii) p -values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone; (iii) scientific conclusions and business or policy decisions should not be based only on whether a p -value passes a specific threshold; (iv) proper inference requires full reporting and transparency; (v) p -value, or statistical significance, does not measure the size of an effect or the importance of a result; and (vi) by itself, a  p -value does not provide a good measure of evidence regarding a model or hypothesis.

The profound criticism of the p -value approach has promoted the consideration and development of alternative methods for hypothesis testing [ 4 , 8 , 12 , 21 ]. The Bayes factor is one such instance [ 18 , 22 ], since it only depends on the observed data and allows an evaluation of the evidence in favor of the null hypothesis. The seminal paper by Kass and Raftery [ 17 ] discusses the Bayes factor along with technical and computational aspects and presents several applications in which the Bayes factor can solve problems that cannot be addressed by the p -value approach. Our review differs in that it is targeted towards researchers in fields where the p -value is still in dominant use, and there are many such fields where this is the case. Our emphasis is to provide these researchers with an understanding of the methodology and potential issues, and a review of the existing tools to implement the Bayes factor in statistical practice.

Two potential issues for the implementation of the Bayes factor are the computation of integrals related to the marginal probabilities that are required to evaluate them and the subjectivity regarding the choosing of the prior distributions [ 7 , 17 ]. We will review these issues in Section 2 and Section 3 , respectively. Despite these difficulties, there are many advantages to the use of the Bayes factor, including (i) the quantification of the evidence in favor of the null hypothesis [ 15 ], (ii) the ease of combining Bayes factors across experiments, (iii) the possibility of updating results when new data are available, (iv) interpretable model comparisons, and (v) the availability of open-source tools to compute Bayes factors in a variety of practical applications.

This paper aims to provide examples of practical implementations of the Bayes factor in different scenarios, highlighting the availability of tools for its computation for those with a basic understanding of statistics. In addition, we bring attention to the over-reliance on the classical p -value approach for hypothesis testing and its inherent pitfalls. The remainder of the article is structured as follows. In  Section 2 , we define the Bayes factor and discuss technical aspects, including its numerical computation. In  Section 3 , we discuss prior distributions and the sensitivity of the Bayes factor to prior distributions. Section 4 presents several applications of the Bayes factor using open-source code involving R software. We illustrate the computation of the Bayes factor using a variety of approximation techniques. Section 5 presents a discussion and summary.

2. Bayes Factor Definition and Technical Aspects

2.1. Definition

The Bayes factor is defined as the ratio of the probabilities of the observed data, conditional on two competing hypotheses or models. Given the same data \(D\) and two hypotheses \(H_0\) and \(H_1\), it is defined as

\[ \text{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)}. \tag{1} \]

If there is no previous knowledge in favor of one theory over the other, i.e., the hypotheses \(H_0\) and \(H_1\) are equally probable a priori ( \(p(H_1) = p(H_0)\) ), the Bayes factor represents the ratio of the data-updated knowledge about the hypotheses, i.e., the Bayes factor is equal to the posterior odds, where the posterior probability is defined as the conditional probability of the hypothesis given the data. Using the definition of conditional probability and under the assumption that the hypotheses are equally probable a priori,

\[ \text{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{p(H_1 \mid D)/p(H_1)}{p(H_0 \mid D)/p(H_0)} = \frac{p(H_1 \mid D)}{p(H_0 \mid D)}. \tag{2} \]

Based on Equation ( 2 ), we can interpret the Bayes factor as the extent to which the data update the prior odds, and therefore, quantify the support for one model over another. A Bayes factor value smaller than one indicates that the data are more likely under the denominator model than they are under the numerator model. A model with the highest Bayes factor shows the relatively highest amount of evidence in favor of the model compared to the other models. Similarly, by switching the indices in ( 1 ), \(\text{BF}_{01}\) is defined as

\[ \text{BF}_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{1}{\text{BF}_{10}}, \tag{3} \]

where larger values of \(\text{BF}_{01}\) represent higher evidence in favor of the null hypothesis.

The Bayes factor can be viewed as a summary of the evidence given by data in support of one hypothesis in contrast to another [ 7 , 17 ]. Reporting Bayes factors can be guided by setting customized thresholds according to particular applications. For example, Evett [ 1 ] argued that for forensic evidence alone to be conclusive in a criminal trial, it would require a Bayes factor of at least 1000 rather than the value of 100 suggested by the Jeffreys scale of interpretation [ 18 ]. A generally accepted table provided in [ 17 ] is replicated in Table 1 , and other similar tables are available in [ 21 ]. Thus, using the Bayes factor can result in reporting evidence in favor of the alternative hypothesis, evidence in favor of the null hypothesis, or reporting that the data are inconclusive.

General-purpose interpretation of Bayes factor values from [ 17 ].

\(\text{BF}_{10}\) | Interpretation of Evidence against \(H_0\)
1 to 3 | Not worth more than a bare mention
3 to 20 | Positive
20 to 150 | Strong
>150 | Very Strong

The Bayes factor can avoid the drawbacks associated with p -values and assess the strength of evidence in favor of the null model along with various additional advantages. First, Bayes factors inherently include a penalty for complex models to prevent overfitting. Such a penalty is implicit in the integration over parameters required to obtain marginal likelihoods. Second, the Bayes factor can be applied in statistical settings that do not satisfy common regularity conditions [ 17 ].

Despite its apparent advantages, there are a few disadvantages to the Bayes factor approach. First, the choice of a prior distribution is subjective [4,7,17] and might be a concern for some researchers. However, the authors in [7] challenge this criticism, arguing that there is nothing about the data, by themselves, that assures they count as evidence. The pathway from data to evidence is filled with subjective evaluations when combining the theoretical viewpoint with the research question. Therefore, the Bayesian approach makes these assumptions explicit through the prior specification. A way to avoid the explicit selection of prior densities is to use the Bayesian information criterion (BIC), which yields a rough approximation to the Bayes factor that can be interpreted using the scale in Table 1.

Another potential disadvantage is the computational difficulty of evaluating marginal likelihoods, which is discussed in Section 2.2. However, this issue is being mitigated by the growth of computational power and the availability of open-source statistical tools for the computation. Examples of these tools are the BayesFactor, brms, and BFpack R packages [23,24,25] and the JASP software [26]. In Section 4, we illustrate the required R scripting for a number of examples widely used in data analysis. As Python has become increasingly popular among quantitative practitioners [27,28], R packages for the computation of Bayes factors can also be called from Python using the rpy2 package [29]. Thanks to these advancements, Bayes factors are gradually gaining wider attention in research [30,31,32].
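As a rough illustration of that last point, the sketch below calls the BayesFactor package from Python through rpy2. It is not taken from the paper: it assumes that R, the BayesFactor package, and rpy2 are all installed, and it anticipates the paired t-test example of Section 4.1.

```python
# Minimal sketch: running an R Bayes factor analysis from Python via rpy2.
# Requires a working R installation with the BayesFactor package available.
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

importr("BayesFactor")  # load the R package into the embedded R session

bf = ro.r("""
    data(sleep)
    diff <- sleep$extra[sleep$group == 2] - sleep$extra[sleep$group == 1]
    extractBF(ttestBF(x = diff))$bf
""")
print(float(bf[0]))  # JZS Bayes factor in favour of a non-zero effect
```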

2.2. Computation of the Bayes Factor

To calculate the Bayes factor, both the numerator and the denominator in the Bayes factor definition (1) (the marginal likelihood of the data under a given model) involve integrals over the parameter space:
\begin{align} p(D \mid H_k) = \int p(D \mid \theta_k, H_k)\, p(\theta_k \mid H_k)\, d\theta_k, \tag{4} \end{align}

where $\theta_k$ is the parameter vector under the hypothesis $H_k$, and $p(\theta_k \mid H_k)$ is the prior probability density function of the parameter vector for the hypothesis $H_k$. It is typical for (4) to be an integral over many dimensions, so that the computational problem can be difficult.
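For a one-parameter model, the integral in (4) can simply be evaluated numerically. The sketch below does this for an illustrative Beta-Binomial setup (the data and prior values are hypothetical, not taken from the paper) and checks the result against the closed form.

```python
# Minimal sketch: evaluate the marginal likelihood (4) for a one-parameter
# model by 1-D numerical integration and compare with the exact value.
import numpy as np
from scipy import integrate, special, stats

y, n = 7, 20      # observed successes out of n Bernoulli trials (hypothetical)
a, b = 2.0, 2.0   # Beta(a, b) prior on the success probability (hypothetical)

def integrand(theta):
    # likelihood p(D | theta) times prior p(theta | H)
    return stats.binom.pmf(y, n, theta) * stats.beta.pdf(theta, a, b)

marg_numeric, _ = integrate.quad(integrand, 0.0, 1.0)

# Closed-form Beta-Binomial marginal likelihood for comparison.
marg_exact = special.comb(n, y) * special.beta(a + y, b + n - y) / special.beta(a, b)

print(marg_numeric, marg_exact)  # the two values should agree closely
```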

If we assume the data are a random sample from an exponential family distribution and assume conjugate priors, it is possible to solve the integral in (4) analytically. Without conjugacy, these integrals are often intractable, and numerical methods are needed. Many available numerical integration techniques are inefficient for such integrals because it is difficult to find the regions where the probability mass accumulates in higher dimensions. For regular problems in the large-sample setting, the probability mass will accumulate and tend to peak around the maximum likelihood estimator (MLE) [17,33]. This notion underlies the Laplace approximation and its variations, which can be used to obtain an approximation to the Bayes factor. These methods rely on a quadratic approximation to the logarithm of the integrand, obtained using a Taylor expansion about the MLE and matching a normal distribution. Laplace methods are usually fast but not very accurate. An alternative, the Savage-Dickey density ratio [34], can give a better approximation for nested models: when testing a constrained model against an unrestricted alternative, the Bayes factor is approximated by dividing the posterior density of the unrestricted model evaluated at the hypothesized value by the prior density of the same model evaluated at the same point [35].
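To make the Laplace idea concrete, here is a minimal one-dimensional sketch, reusing the hypothetical Beta-Binomial setup from the previous snippet. It expands the log of the integrand around its mode and matches a normal curve, which is the essence of the approximation described above.

```python
# Minimal sketch of a 1-D Laplace approximation to the marginal likelihood.
import numpy as np
from scipy import optimize, special, stats

y, n = 7, 20
a, b = 2.0, 2.0

def log_joint(theta):
    # log[ p(D | theta) p(theta) ], the log of the integrand in (4)
    return stats.binom.logpmf(y, n, theta) + stats.beta.logpdf(theta, a, b)

# Find the mode of the integrand.
res = optimize.minimize_scalar(lambda t: -log_joint(t),
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
mode = res.x

# Second derivative of the log joint at the mode via central differences.
eps = 1e-5
d2 = (log_joint(mode + eps) - 2 * log_joint(mode) + log_joint(mode - eps)) / eps**2

# Laplace approximation: exp(h(mode)) * sqrt(2*pi / -h''(mode)).
marg_laplace = np.exp(log_joint(mode)) * np.sqrt(2 * np.pi / -d2)

marg_exact = special.comb(n, y) * special.beta(a + y, b + n - y) / special.beta(a, b)
print(marg_laplace, marg_exact)  # close, but not exact, as noted in the text
```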

For the general case of Bayes factor computations, it is common to resort to sampling-based numerical procedures adjusted to the context of marginal likelihood computation as in (4). Evans and Swartz [36] reviewed several numerical strategies for assessing the integral related to the Bayes factor and later published a book on the topic [37]. Among the methods for estimating the marginal likelihood integral, the bridge sampling technique has gained prominence [38]. The method encompasses three special cases, namely the "naïve" [33] or "simple" [17] Monte Carlo estimator, importance sampling, and the generalized harmonic mean estimator. The bridge sampling estimate stands out for not being dominated by samples from the tails of the distribution [33]. An R package entitled bridgesampling, which estimates such integrals with the bridge sampling algorithm for Bayesian models implemented in Stan [39] or JAGS [40], is available [41]. In Section 4, we provide examples of using the bridgesampling package and the BayesFactor R package [23] to enable the computation of Bayes factors for several important experimental designs.
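The "naïve" or "simple" Monte Carlo estimator mentioned above is easy to write down: average the likelihood over draws from the prior. A sketch, again using the hypothetical Beta-Binomial setup from the earlier snippets:

```python
# Minimal sketch of the simple Monte Carlo estimator of the marginal
# likelihood: E_prior[ p(D | theta) ] estimated from prior draws.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y, n = 7, 20
a, b = 2.0, 2.0

theta_draws = rng.beta(a, b, size=100_000)           # samples from the prior
marg_mc = stats.binom.pmf(y, n, theta_draws).mean()  # Monte Carlo average

print(marg_mc)  # compare with the exact value from the earlier snippets
```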

3. Prior Elicitation and Sensitivity Analysis

Based on its definition in ( 1 ), the Bayes factor is a ratio of the marginal likelihood of two competing models. The marginal likelihood for a model class is a weighted average of the likelihood over all the parameter values represented by the prior distribution [ 42 ]. Therefore, carefully choosing priors and conducting a prior sensitivity analysis play an essential role when using Bayes factors as a model selection tool. This section briefly discusses the prior distributions, prior elicitation, and prior sensitivity analysis.

3.1. Prior Distributions

In Bayesian statistical inference, a prior probability distribution (or simply the prior) expresses one's beliefs or prior knowledge about an uncertain quantity before the data are collected. The unknown quantity may be a parameter of the model or a latent variable. In Bayesian hierarchical models, we have more than one level of prior distribution corresponding to the hierarchical model structure. The parameters of a prior distribution are called hyperparameters. We can either assume values for the hyperparameters or assign them a probability distribution, which is referred to as a hyperprior.

It is common to categorize priors into four types: informative priors, weakly informative priors, uninformative priors, and improper priors [43]. The Bayes factor computation requires proper priors, i.e., prior distributions that integrate to 1. Various software packages provide default priors, but it is the researcher's responsibility to perform a sensitivity analysis to check the impact of applying different priors.

3.2. Prior Elicitation

The prior distribution is an important ingredient of the Bayesian paradigm and must be designed coherently to make Bayesian inference operational [44]. Priors can be elicited using multiple methods, e.g., from past information, such as previous experiments, or purely from experts' subjective assessments. When no prior information is available, an uninformative prior can be assumed, and most of the model information carried by the posterior will come from the likelihood function itself. Priors can also be chosen according to some principle, such as symmetry or maximum entropy given constraints. Examples are the Jeffreys prior [18] and Bernardo's reference prior [45]. When a family of conjugate priors exists, choosing a prior from that family simplifies the calculation of the posterior distribution.

With the advancement of computational power, ad hoc searching for priors can be done more systematically. Hartmann et al. [46] utilized the prior predictive distribution implied by the model to automatically transform experts' judgments about plausible outcome values into suitable priors on the parameters. They also provided computational strategies to perform inference and guidelines to facilitate practical use. Their methodology can be summarized as follows: (i) define the parametric model for observable data conditional on the parameters θ and a prior distribution with hyperparameters λ for the parameters θ, (ii) obtain experts' beliefs or probabilities for each mutually exclusive data category partitioned from the overall data space, (iii) model the elicited probabilities from step (ii) as a function of the hyperparameters λ, (iv) perform iterative optimization of the model from step (iii) to obtain an estimate for λ best describing the expert opinion within the chosen parametric family of prior distributions, and (v) evaluate how well the predictions obtained from the optimal prior distribution describe the elicited expert opinion. Prior predictive tools relying on machine learning methods can be useful when dealing with hierarchical modeling where a grid search is not possible [47].
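A toy version of this elicitation-by-optimization idea is sketched below. It is not the method of Hartmann et al., only a much simplified illustration under stated assumptions: the expert assigns probabilities to three outcome categories for ten future Bernoulli trials (all numbers hypothetical), and we search for the Beta hyperparameters whose prior predictive distribution best matches those probabilities.

```python
# Simplified sketch: choose Beta(a, b) so that the Beta-Binomial prior
# predictive matches expert-assigned category probabilities.
import numpy as np
from scipy import optimize, special

n = 10  # number of future Bernoulli trials the expert is asked about
expert_probs = np.array([0.2, 0.5, 0.3])          # for 0-3, 4-6, 7-10 successes
categories = [range(0, 4), range(4, 7), range(7, 11)]

def beta_binom_pmf(k, a, b):
    return special.comb(n, k) * special.beta(a + k, b + n - k) / special.beta(a, b)

def loss(log_ab):
    a, b = np.exp(log_ab)  # optimize on the log scale to keep a, b positive
    pred = np.array([sum(beta_binom_pmf(k, a, b) for k in cat) for cat in categories])
    return np.sum((pred - expert_probs) ** 2)

res = optimize.minimize(loss, x0=np.log([2.0, 2.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(a_hat, b_hat)  # hyperparameters best matching the elicited probabilities
```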

3.3. Sensitivity Analysis

In the Bayesian approach, it is important to evaluate the impact of prior assumptions. This is performed through a sensitivity analysis, where the prior is perturbed and the change in the results is examined. Various authors have demonstrated how priors affect Bayes factors and provided ways to address the issue. When comparing two nested models in a low-dimensional parameter space, the authors in [48] propose a point mass prior Bayes factor approach. The Bayes factor under a point mass prior is computed for a grid of values of the extra parameter(s) introduced by a generalized alternative model. The resulting Bayes factor is then obtained by averaging these point mass prior Bayes factors over the prior distribution of the extra parameter(s).

For binomial data, Ref. [ 42 ] shows the impact of different priors on the probability of success. The authors used four different priors: (i) a uniform distribution, (ii) the Jeffreys prior, which is a proper Beta(0.5,0.5) distribution, (iii) the Haldane prior by assuming a Beta(0,0) distribution (an improper prior), and (iv) an informative prior. The uniform, Jeffreys, and Haldane priors are noninformative in some sense. Although the resulting parameter estimation is similar in all four scenarios, the resulting Bayes factor and posterior probability of H 1 vary. Using the four different priors produces very different Bayes factors with values of 0.09 for the Haldane, 0.6 for the Jeffreys, 0.91 for the uniform, and  1.55 for the informative prior. The corresponding posterior probabilities of H 1 are 0.08 (Haldane), 0.38  (Jeffreys), 0.48 (uniform), and  0.61 (informative). In this example, the sensitivity analysis reveals that the effect of the priors on the posterior distribution is different from their effect on the Bayes factor. The authors emphasize that Bayes factors should be calculated, ideally, for a wide range of plausible priors whenever used as a model selection tool. Besides using the Bayes factor based on prior predictive distribution, they also suggest seeking agreement with the other model selection criterion designed to assess local model generalizability (i.e., based on posterior predictive distribution).
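The mechanics of such a check are easy to reproduce for binomial data. The sketch below tests a point null, H0: θ = 0.5, against H1: θ ~ Beta(a, b) for several proper Beta priors. The data are hypothetical, so the numbers will not match those quoted from [42]; the point is only how the Bayes factor shifts as the prior changes.

```python
# Prior sensitivity sketch for a binomial Bayes factor with a point null.
from scipy import special

y, n = 14, 20  # hypothetical successes out of n trials

def bf10(a, b):
    # Marginal likelihood under H1 divided by the likelihood under the point
    # null; the binomial coefficient cancels in the ratio.
    m1 = special.beta(a + y, b + n - y) / special.beta(a, b)
    m0 = 0.5 ** n
    return m1 / m0

for label, (a, b) in {"uniform": (1, 1),
                      "Jeffreys": (0.5, 0.5),
                      "informative": (7, 3)}.items():
    print(f"{label:12s} BF10 = {bf10(a, b):.3f}")
```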

The author in [49] describes several interesting points with regard to prior sensitivity, viewing prior sensitivity analysis in theory testing as an opportunity rather than a burden. They argue that it is an attractive feature of a model evaluation measure when psychological models containing quantitatively instantiated theories are sensitive to priors. Ref. [49] believes that using an informative prior expressing a psychological theory and evaluating models using prior sensitivity measures can serve to advance knowledge. Finally, sensitivity analysis is accessible through an interactive Shiny application developed by the authors in [50]. The software is designed to help users understand how to assess the substantive impact of prior selection in an interactive way.

4. Applications of the Bayes Factor Using R Packages

In this section, we illustrate how to calculate Bayes factors using various techniques available in R, including the R package BayesFactor [ 23 ]. Various authors have used this package to compute Bayes factors in different settings such as linear correlations, Bayesian t -tests, analysis of variance (ANOVA), linear regression, single proportions, and contingency tables [ 51 , 52 , 53 , 54 ]. Comparisons between Bayesian and frequentist approaches are provided in the vignettes of [ 23 ]. We provide the R code to compute the Bayes factor for a one-sample t -test, a multiway ANOVA, a repeated-measures design, and a Poisson generalized linear mixed model (GLM).

4.1. One-Sample t-Test

The authors in [52] derived the Jeffreys-Zellner-Siow (JZS) Bayes factor as a function of the t-score and the sample size. To illustrate how the ttestBF function of the BayesFactor package performs a Bayesian paired t-test, they analyzed the sleep dataset [55], which records the increase in sleep (in hours) after taking each of two drugs compared to regular nights when no drug was administered. The Bayesian paired t-test can evaluate whether the levels of effectiveness of the two drugs are significantly different (the null hypothesis is that the standardized effect size is zero) [7,52].

Let $y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} N(\sigma\delta, \sigma^2)$, where the standardized effect size is given by $\delta = \mu/\sigma$, $\mu$ is a grand mean, and $\sigma^2$ is the error variance. We test the following hypotheses:
\begin{align} H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta \neq 0. \end{align}

The following script of R code implements the Bayesian paired t -test and presents the p -value of the classical approach for comparison.

[R code screenshot from the original article: entropy-24-00161-i001.jpg]

The value $r = 0.707$ ($\sqrt{2}/2$) denotes the scale of the Cauchy prior distribution of $\delta$. The Bayes factor value of 17.259 favors the alternative hypothesis, indicating that the effect size is significant in this case. Using the interpretation in Table 1, the evidence against the null hypothesis is "positive". The classical p-value of around 0.3% is also in favor of the alternative and would usually be considered strong evidence against the null hypothesis.
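For readers working in Python rather than R, the same JZS Bayes factor can be approximated directly from the t-statistic by numerically evaluating the integral representation in Rouder et al. (2009). The sketch below assumes the paired t-statistic of the sleep data (t ≈ 4.06 with n = 10 pairs) and the default Cauchy scale r = √2/2; it should give a value close to the 17.26 quoted above, but it is an illustrative approximation, not the ttestBF implementation itself.

```python
# Sketch of the JZS Bayes factor for a one-sample / paired t-test, computed
# by 1-D numerical integration over the g-prior mixture that induces a
# Cauchy(0, r) prior on the standardized effect size.
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    nu = n - 1
    # Marginal likelihood under H0 (up to a constant shared with H1).
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        return ((1 + n * g) ** (-0.5)
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * r / np.sqrt(2 * np.pi) * g ** (-1.5)
                * np.exp(-r**2 / (2 * g)))

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

# Paired t-statistic for R's sleep data (t ≈ 4.0621, n = 10 pairs);
# this should be close to the 17.26 reported by ttestBF.
print(jzs_bf10(4.0621, 10))
```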

For this example, the Bayes factor can also be computed with a bridge sampling estimate. The R packages bridgesampling and R2jags use concepts of object-oriented programming and were developed with methods to interact with customizable Markov chain Monte Carlo routines [41,56]. That is to say, a model coded directly in JAGS can be passed to bridgesampling's bridge_sampler function to obtain the log marginal likelihood for the model. The source code (assuming the same priors as in [23]) is available at https://osf.io/3yc8q/ (accessed on 28 December 2021). The Bayes factor value in [41] for the sleep data is 17.260, which is almost identical to the BayesFactor package result, 17.259. Both the BayesFactor and bridgesampling packages suit these analysis needs. On the one hand, no additional programming knowledge is required to call the functions in the BayesFactor package, thanks to its user-friendly default prior settings. On the other hand, bridgesampling along with JAGS allows more sophisticated customization and flexibility in model specification, which makes it more feasible to conduct a sensitivity analysis.

4.2. Multiway ANOVA

Consider a two-way ANOVA model $M_1$: $y_{ijk} = \mu + \sigma(\tau_i + \beta_j + \gamma_{ij}) + \epsilon_{ijk}$, for $i = 1, \ldots, a$, $j = 1, \ldots, b$, and $k = 1, \ldots, n$, where $y_{ijk}$ is the response for the $k$th subject at the $i$th level of Factor 1 and the $j$th level of Factor 2, $\mu$ is the overall mean effect, $\tau_i$ is the standardized effect size of the $i$th level of Factor 1, $\beta_j$ is the standardized effect size of the $j$th level of Factor 2, $\gamma_{ij}$ is the standardized effect size of the interaction between the two factors, and $\epsilon_{ijk}$ is white noise with mean zero and variance $\sigma^2$. We consider comparing the full top-level model $M_1$ against $M_0$: $y_{ijk} = \mu + \epsilon_{ijk}$.

Equivalently, the competing models can be expressed in matrix-vector form as in [53], i.e.,
\begin{align} M_1: \ \mathbf{y} = \mu \mathbf{1} + \sigma (X_\tau \boldsymbol{\tau} + X_\beta \boldsymbol{\beta} + X_\gamma \boldsymbol{\gamma}) + \boldsymbol{\epsilon} \qquad \text{versus} \qquad M_0: \ \mathbf{y} = \mu \mathbf{1} + \boldsymbol{\epsilon}, \end{align}

where $\mathbf{y}$ is a column vector of $N$ observations, $\mathbf{1}$ is a column vector of $N$ ones, $\boldsymbol{\tau}$, $\boldsymbol{\beta}$, and $\boldsymbol{\gamma}$ are column vectors of standardized effect parameters of length $a$, $b$, and $ab$, respectively, the $X$'s are design matrices, and $\boldsymbol{\epsilon} \mid \sigma^2 \sim N(\mathbf{0}, \sigma^2 I)$.

The anovaBF function of the BayesFactor package compares these linear models (including the reduced models). The  ToothGrowth dataset [ 57 ] is used to study the effects of vitamin C dosage and supplement type on tooth growth in guinea pigs. The  anovaBF function allows the model comparison (single-factor models, additive model, and full model) against the null model (intercept only). The following script of R code implements the multiway ANOVA.

[R code screenshot from the original article: entropy-24-00161-i002.jpg]

The percentage, e.g., ±2.73%, is the proportional Monte Carlo error estimate of the Bayes factor. The Bayes factor value of $7.94 \times 10^{14}$ suggests, according to Table 1, very strong evidence in favor of the full model.

It is worth noting that the one-way ANOVA with two levels is consistent with the two-sample t -test, when using the default priors. For example, considering the sleep data example, one can check that:

[R code screenshot from the original article: entropy-24-00161-i003.jpg]

return the same Bayes factor value (although the sleep dataset is paired and not meant for an independent-samples test).

4.3. Repeated-Measures Design

Linear mixed-effects models extend simple linear models to allow both fixed effects (parameters that do not vary across subjects) and random effects (parameters that are themselves random variables); they are particularly useful when the data are dependent, multilevel, hierarchical, longitudinal, or correlated. In relation to the previous model in Section 4.2, a linear mixed-effects model $M_1$ adds a standardized subject-specific random effect $b_k$. We now consider comparing the model containing both the condition effect and the subject random effect against a null model containing only the grand mean and the subject random effect.

We take the sleep dataset as an example and specify the argument whichRandom in the anovaBF function of the BayesFactor package, so that it computes the Bayes factor for such a repeated-measures design (or called a within-subjects design). The following script of R code implements the one-way repeated-measures design, where the dataset needs to be in the long format: one column is for the continuous response variable, one column is for the subject indicator, and another column is for the categorical variable indicating the levels.

[R code screenshot from the original article: entropy-24-00161-i004.jpg]

This code generates a Bayes factor of about 11.468 in favor of the alternative hypothesis. The conclusion inferred from the repeated-measures design is consistent with the earlier paired t-test result. One limitation of calling the anovaBF function is that it only constructs the Bayes factor for the homoscedastic case.

4.4. Poisson Mixed-Effects Model

A Poisson generalized linear mixed-effects model aims to describe a discrete count response that is repeatedly measured under several conditions for each subject, e.g., in longitudinal studies [58]. The model assumes that the response variable follows a Poisson distribution at the first level. Unlike the case of normally distributed repeated-measures data, software for calculating Bayes factors has not been extensively discussed and developed in the context of Bayesian Poisson models. Thus, we illustrate code for sampling the posterior using JAGS, after which the Savage-Dickey density ratio is used to approximate the Bayes factor.

When testing a nested model against an unrestricted alternative, the Bayes factor simplifies, computationally and graphically, to the ratio obtained by dividing the posterior distribution over the parameters of the alternative model evaluated at the hypothesized value by the prior for the same model evaluated at the same point [35]; this is the Savage-Dickey density ratio [34]. We demonstrate the use of the Savage-Dickey density ratio described in [59]. We consider fitting a Poisson mixed-effects model to a simulated dataset obtained from Appendix A. We note that the Poisson first level of this example can be changed to many other specifications from the exponential family (e.g., binomial or exponential) with only minor alterations to the code below. With data in the repeated-measures setting, the counts obtained from a given subject can be correlated. Thus, the standard independence assumption is violated, which is a defining feature of repeated-measures data.

We utilize the JAGS software and rjags R package [ 60 ] to fit the model and the polspline R package to approximate the log posterior distribution [ 61 ] required to evaluate the Savage–Dickey density ratio.

[R code screenshot from the original article: entropy-24-00161-i005a.jpg]

The data are simulated from 48 subjects, and a count is simulated for each of two conditions on each subject. On the log scale, conditional on the random effects, the mean in condition one is set to $\alpha_1 = 2$ when the data are simulated, and the corresponding value is $\alpha_2 = 2.2$ for the second condition. Thus, the data are simulated under the alternative hypothesis. After fitting the model to the simulated data, the Bayes factor in favor of the alternative is $BF_{10} = 25.247$, indicating strong evidence in favor of the alternative.
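The Savage-Dickey idea itself is easy to see in a setting where no MCMC is needed at all. The sketch below is a self-contained illustration, not the paper's Poisson GLMM: it uses a conjugate normal-mean problem with known variance, where both the prior and the posterior for the tested parameter are available in closed form, so the density ratio can be computed exactly.

```python
# Savage-Dickey density ratio for H0: mu = 0 versus H1: mu ~ N(0, tau^2),
# with normal data of known variance (hypothetical numbers throughout).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, tau, n = 1.0, 1.0, 30                   # known sd, prior sd, sample size
y = rng.normal(loc=0.3, scale=sigma, size=n)   # hypothetical data

post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2) # conjugate posterior for mu
post_mean = post_var * y.sum() / sigma**2

# BF01 = posterior density at mu = 0 divided by prior density at mu = 0.
bf01 = stats.norm.pdf(0.0, loc=post_mean, scale=np.sqrt(post_var)) / \
       stats.norm.pdf(0.0, loc=0.0, scale=tau)
print(bf01, 1.0 / bf01)  # BF01 and the corresponding BF10
```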

A sensitivity analysis is convenient to carry out with JAGS or Stan by passing different parameter values or changing the families of prior distributions. We specified five different prior distributions for the nuisance parameters (the intercept and the precision of the random effects) in the model statement and then examined the Bayes factors computed via the Savage-Dickey density ratio approximation. The four additional combinations of priors are shown in Table 2. Some variation in the value of the Bayes factor is observed, though the conclusion remains stable across these prior specifications.

Table 2. Prior sensitivity analysis for the Poisson repeated-measures data.

Report   beta prior       tau_b prior           BF_01    BF_10
1        dnorm(0, 0.01)   dgamma(0.01, 0.01)    0.040    25.247
2        dnorm(0, 0.1)    dgamma(0.01, 0.01)    0.054    18.377
3        dnorm(0, 0.01)   dgamma(2, 2)          0.042    24.059
4        dnorm(0, 0.1)    dgamma(2, 2)          0.032    30.859
5        dnorm(0, 0.5)    dgamma(1, 4)          0.023    42.816

5. Discussion and Summary

We have addressed the activity of hypothesis testing in light of empirical data. Several issues with classical p-values and the NHST approach were reviewed, aimed at researchers who rarely use Bayesian testing and for whom NHST is still the dominant vehicle for hypothesis testing. We noted that the debate about the overuse of the p-value has been long-lasting, and there are many discussions of its misuse and misinterpretation in the literature.

Following the third principle of the ASA's statement on p-values, namely that research practice, business, or policy decisions should not rely solely on a p-value passing an arbitrary threshold, a Bayesian alternative based on the Bayes factor was introduced, and the advantages and disadvantages of this approach were discussed. One possible caveat of the Bayes factor is its numerical computation, which has been mitigated by advances in computational resources. We reviewed computational methods employed to approximate the marginal likelihoods, such as the bridge sampling estimator, which has an open-source R package implementation.

Issues related to prior distributions were discussed, and we recommended a careful choice of priors via elicitation, combined with a prior sensitivity analysis, when using Bayes factors as a model selection tool. Bayesian analysis and hypothesis testing are appealing, but going directly from NHST to Bayesian hypothesis testing may require a challenging leap. Thus, we showed how, using existing software, one can practically implement statistical techniques related to the discussed Bayesian approach, and we provided examples of packages intended to compute the Bayes factor, namely in applications of the one-sample t-test, multiway ANOVA, repeated-measures designs, and the Poisson mixed-effects model.

The Bayes factor is only one of many aspects of Bayesian analysis, and it serves as a bridge to Bayesian inference for researchers interested in testing. The Bayes factor can provide evidence in favor of the null hypothesis and is a relatively intuitive approach for communicating statistical evidence with a meaningful interpretation. The relationships between the Bayes factor and other aspects of the posterior distribution, for example, the overlap of Bayesian highest posterior density intervals, form a topic of interest, and we will report on this issue in another manuscript.

Appendix A. Poisson Repeated-Measures Data Simulation

The sim_Poisson R function returns a repeated-measures data frame in the long format with 2n rows and three columns: the subject ID, the count response variable, and the condition level.

[R code screenshot from the original article: entropy-24-00161-i006a.jpg]



9.1.8 Bayesian Hypothesis Testing

To be more specific, according to the MAP test, we choose $H_0$ if and only if \begin{align} P(H_0|y) \geq P(H_1|y). \end{align} In the example considered here, the hypotheses are

$\quad$ $H_0$: $X=1$, $\quad$ $H_1$: $X=-1$.

  • in Example 9.10 , we arrived at the following decision rule: We choose $H_0$ if and only if \begin{align} y \geq c, \end{align} where \begin{align} c=\frac{\sigma^2}{2} \ln \left(\frac{1-p}{p}\right). \end{align} Since $Y|H_0 \; \sim \; N(1, \sigma^2)$, \begin{align} P( \textrm{choose }H_1 | H_0)&=P(Y \lt c|H_0)\\ &=\Phi\left(\frac{c-1}{\sigma} \right)\\ &=\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)-\frac{1}{\sigma}\right). \end{align} Since $Y|H_1 \; \sim \; N(-1, \sigma^2)$, \begin{align} P( \textrm{choose }H_0 | H_1)&=P(Y \geq c|H_1)\\ &=1-\Phi\left(\frac{c+1}{\sigma} \right)\\ &=1-\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)+\frac{1}{\sigma}\right). \end{align} Figure 9.4 shows the two error probabilities for this example. Therefore, the average error probability is given by \begin{align} P_e &=P( \textrm{choose }H_1 | H_0) P(H_0)+ P( \textrm{choose }H_0 | H_1) P(H_1)\\ &=p \cdot \Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)-\frac{1}{\sigma}\right)+(1-p) \cdot \left[ 1-\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)+\frac{1}{\sigma}\right)\right]. \end{align}

[Figure 9.4: The two error probabilities for the Bayesian hypothesis test.]
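A quick numeric check of the error-probability formulas above is straightforward; the sketch below uses hypothetical values p = 0.6 and σ = 1, which are not taken from the text.

```python
# Numeric check of the MAP-test error probabilities, hypothetical p and sigma.
import numpy as np
from scipy.stats import norm

p, sigma = 0.6, 1.0                        # prior P(H0) and noise sd (hypothetical)
c = (sigma**2 / 2) * np.log((1 - p) / p)   # decision threshold

p_choose_h1_given_h0 = norm.cdf((c - 1) / sigma)
p_choose_h0_given_h1 = 1 - norm.cdf((c + 1) / sigma)

p_error = p * p_choose_h1_given_h0 + (1 - p) * p_choose_h0_given_h1
print(c, p_error)
```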

Minimum Cost Hypothesis Test:

$\quad$ $H_0$: There is no fire, $\quad$ $H_1$: There is a fire.

$\quad$ $C_{10}$: The cost of choosing $H_1$, given that $H_0$ is true. $\quad$ $C_{01}$: The cost of choosing $H_0$, given that $H_1$ is true.

$\quad$ $H_0$: No intruder is present. $\quad$ $H_1$: There is an intruder.

  • First note that \begin{align} P(H_0|y)=1-P(H_1|y)=0.95 \end{align} The posterior risk of accepting $H_1$ is \begin{align} P(H_0|y) C_{10} =0.95 C_{10}. \end{align} We have $C_{01}=10 C_{10}$, so the posterior risk of accepting $H_0$ is \begin{align} P(H_1|y) C_{01} &=(0.05) (10 C_{10})\\ &=0.5 C_{10}. \end{align} Since $P(H_0|y) C_{10} \geq P(H_1|y) C_{01}$, we accept $H_0$, so no alarm message needs to be sent.



Count Bayesie

Probably a Probability Blog

Bayesian A/B Testing: A Hypothesis Test that Makes Sense

This post is part of our Guide to Bayesian Statistics and received an update as a chapter in Bayesian Statistics the Fun Way!

We've covered the basics of Parameter Estimation pretty well at this point. We've seen how to use the PDF, CDF and Quantile function to learn the likelihood of certain values, and we've seen how we can add a Bayesian prior to our estimate . Now we want to use our estimates to compare two unknown parameters.

Keeping with our email example, we are going to set up an A/B Test. We want to send out a new email and see if adding an image to the email helps or hurts the conversion rate. Normally when the weekly email is sent out it includes some image; for our test we're going to send one variant with the image like we always do and another without the image. The test is called an A/B Test because we are comparing Variant A (with image) and Variant B (without).

We'll assume at this point we have 600 subscribers. Because we want to exploit the knowledge gained during our experiment, we're only going to be running our test on 300 of these subscribers; that way we can give the remaining 300 what we believe to be the best variant. The 300 people we're going to test will be split up into two groups, A and B. Group A will receive an email like we always send, with a big picture at the top, and group B's email will not have the picture.

Next we need to figure out what prior probability we are going to use. We've run an email campaign every week so we have a reasonable expectation that the probability of the recipient clicking the link to the blog on any given email should be around 30%. To make things simple we'll use the same prior for A and B. We'll also choose a pretty weak version of our prior because we don't really know how well we expect B to do, and this is a new email campaign so maybe other factors would cause a better or worse conversion anyway. We'll settle on Beta(3,7):

Different Beta distributions can represent varying strengths in belief in known priors

Next we need our actual data. We send out our emails and get these responses:

Our observed evidence

Given what we already know about parameter estimation, we can look at each of these variants as a different parameter we're trying to estimate. Variant A is going to be represented by Beta(36+3, 114+7) and Variant B by Beta(50+3, 100+7) (if you're confused by the +3 and +7, they are our Prior, which you can refresh on in the post on Han Solo). Here we can see the estimates for each parameter side by side:

The overlap between the distributions is what we care about.

Clearly our data suggests that Variant B is the superior variant. However, from our earlier discussion on Parameter Estimation we know that the true conversion rate can be a range of possible values. We can also clearly see here that there is an overlap between the possible true conversion rates for A and B. What if we got unlucky in our A responses and A's true conversion rate is in fact much higher? What if we were also really lucky with B and its conversion rate is in fact much lower? If both of these conditions held, it is easy to see a possible world in which A is the better variant even though it did worse in our test. The real question we have is: how likely is it that B is actually the better variant?

Monte Carlo to the Rescue!

I've mentioned before that I'm a huge fan of Monte Carlo Simulations, and so we're going to tackle this question using a Monte Carlo Simulation. R has an rbeta function that allows us to sample from a Beta distribution. We can now literally ask, by simulation, "What is the probability that B is actually superior to A?" We'll simply sample 100,000 times from each distribution we have modeled here and see what it tells us:

We end up with:

p.b_superior = 0.96
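For readers following along in Python, here is a NumPy sketch of the same simulation (the post's original code uses R's rbeta), using the Beta(39, 121) and Beta(53, 107) posteriors described above.

```python
# Monte Carlo comparison of the two variants' posterior conversion rates.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

a_samples = rng.beta(36 + 3, 114 + 7, size=n_trials)  # Variant A posterior
b_samples = rng.beta(50 + 3, 100 + 7, size=n_trials)  # Variant B posterior

p_b_superior = (b_samples > a_samples).mean()
print(p_b_superior)  # roughly 0.96, matching the figure quoted above
```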

This is equivalent to getting a p-value of 0.04 from a single-tailed T-test. In terms of classical statistics, we would be able to call this result "Statistically Significant"! So why didn't we just use a T-test then? For starters I'm willing to bet these few lines of code are dramatically more intuitive to understand than Student's T-Distribution. But there's actually a much better reason.

Magnitude is more important than Significance

The focus of a classic Null-Hypothesis Significance Test (NHST) is to establish whether two different distributions are likely to be the result of sampling from the same distribution or not. Statistical Significance can at most tell us "these two things are not likely the same" (this is what rejecting the Null Hypothesis is saying). That's not really a great answer for an A/B Test. We're running this test because we want to improve conversions. Results that say "Variant B will probably do better" are okay, but don't you really want to know how much better? Classical statistics tells us Significance, but what we're really after is Magnitude!

This is the real power of our Monte Carlo Simulation. We can take the exact results from our last simulation and now look at how much of an improvement Variant B is likely to be. We'll simply plot the ratio of \(\frac{\text{B Samples}}{\text{A Samples}}\); this will give us a distribution of the relative improvements we've seen in our simulations.

This histogram describes all the possible differences between A and B

From this histogram we can see that our most likely case is about a 40% improvement over A, but we can see an entire range of values. As we discussed in our first post on Parameter Estimation, the Cumulative Distribution Function (CDF) is much more useful for reasoning about our results.

The line here represents the median improvement seen in the simulation

Now we can see that there is really just a small, small chance that A is better, but even if it is better it's not going to be better by much. We can also see that there's about a 25% chance that Variant B is a 50% or more improvement over A, and even a reasonable chance it could be more than double the conversion rate! Now in choosing B over A we can actually reason about our risk: "The chance that B is 20% worse is roughly the same that it's 100% better." Sounds like a good bet to me, and a much better statement of our knowledge than "There is a Statistically Significant chance that B is better than A."
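A NumPy sketch of these summaries of the improvement ratio follows, using the same hypothetical posteriors as the earlier snippet; the exact percentages will wobble a little from run to run, since this is simulation.

```python
# Summarize the relative improvement B / A rather than just asking which
# variant is larger; same posteriors as in the sketch above.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000
a_samples = rng.beta(36 + 3, 114 + 7, size=n_trials)
b_samples = rng.beta(50 + 3, 100 + 7, size=n_trials)

ratio = b_samples / a_samples

print(np.median(ratio))      # median relative improvement (around 1.4x)
print((ratio > 1.5).mean())  # chance B is at least a 50% improvement
print((ratio < 0.8).mean())  # chance B is at least 20% worse than A
print((ratio > 2.0).mean())  # chance B more than doubles the conversion rate
```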

There are many discussions of A/B Testing out there that would give dramatically different methodology than what we have done here. Orthodox Null Hypothesis Significance Testing differs in more ways than simply using a T-Test, and will likely be the topic of a future post. The key insight here is that we have shown how the ideas of Hypothesis Testing and Parameter Estimation can be viewed, from a Bayesian perspective, as the same problem. Additionally, I have found that there is no mystery in the approach outlined here. Every conclusion we draw is based on data (including our prior) and the basics of Probability. Through this and the other two posts we have built up a Hypothesis Testing framework entirely from first principles. I'll leave deriving Student's T-distribution as an exercise for the reader.


Learn more about this topic in the book Bayesian Statistics the Fun Way!



Bayesian Inference: An Introduction to Hypothesis Testing Using Bayes Factors


Sabeeh A Baig, Bayesian Inference: An Introduction to Hypothesis Testing Using Bayes Factors, Nicotine & Tobacco Research , Volume 22, Issue 7, July 2020, Pages 1244–1246, https://doi.org/10.1093/ntr/ntz207


Monumental advances in computing power in recent decades have contributed to the rising popularity of Bayesian methods among applied researchers. This series of commentaries seeks to raise awareness among nicotine and tobacco researchers of Bayesian methods for analyzing experimental data. The current commentary introduces statistical inference via Bayes factors and demonstrates how they can be used to present evidence in favor of both alternative and null hypotheses.

Bayesian inference is a fully probabilistic framework for drawing scientific conclusions that resembles how we naturally think about the world. Often, we hold an a priori position on a given issue. On a daily basis, we are confronted with facts about that issue. We regularly update our position in light of those facts. Bayesian inference follows this exact updating process. Formally stated, given a research question, at least one unknown parameter of interest, and some relevant data, Bayesian inference follows three basic steps. The process begins by specifying a prior probability distribution on the unknown parameter that often reflects accumulated knowledge about the research question. Next, the observed data, summarized using a likelihood function, are conditioned on the prior distribution. Finally, the resulting posterior distribution represents an updated state of knowledge about the unknown parameter and, by extension, the research question. Simulating data many times from the posterior distribution will ideally yield representative samples of the unknown parameter that we can interpret to answer the research question.

In an experimental context, we are often interested in evaluating two competing positions or hypotheses in light of data and making a determination about which to accept. In the context of Bayesian inference, hypothesis testing can be framed as a special case of model comparison where a model refers to a likelihood function and a prior distribution. Given two competing hypotheses and some relevant data, Bayesian hypothesis testing begins by specifying separate prior distributions to quantitatively describe each hypothesis. The combination of the likelihood function for the observed data with each of the prior distributions yields hypothesis-specific models. For each of the hypothesis-specific models, averaging (ie, integrating) the likelihood with respect to the prior distribution across the entire parameter space yields the probability of the data under the model and, therefore, the corresponding hypothesis. This quantity is more commonly referred to as the marginal likelihood and represents the average fit of the model to the data. The ratio of the marginal likelihoods for both hypothesis-specific models is known as the Bayes factor.

The Bayes factor is a central quantity of interest in Bayesian hypothesis testing. A Bayes factor has a range of near 0 to infinity and quantifies the extent to which data support one hypothesis over another. Bayes factors can be interpreted continuously so that a Bayes factor of 30 indicates that there is 30 times more support in the data for a given hypothesis than the alternative. They can also be interpreted discretely so that a Bayes factor of 3 or higher supports accepting a given hypothesis, 0.33 or lower supports accepting its alternative, and values in between are inconclusive. 1 , 2 Intuitively, the Bayes factor is the ratio of the odds of observing two competing hypotheses after examining relevant data compared to the odds of observing those hypotheses before examining the data. Therefore, the Bayes factor represents how we should update our knowledge about the hypotheses after examining data. We present two empirical examples with simulated data to demonstrate the computation and use of Bayes factors to test hypotheses.

Deciding whether a coin is fair or tail-biased is a simple, but useful example to illustrate hypothesis testing via Bayes factors. Let the null hypothesis be that the coin is fair, and let the alternative hypothesis be that the coin is tail-biased. We further intuit that coins, fair or not, can exhibit a considerable degree of variation in their head-tail biases depending on quality control issues during the minting process. Therefore, we use a Beta(5, 5) prior distribution to describe the null hypothesis. This distribution places the bulk of the probability density at or around 0.5 (ie, equal probability of heads or tails). Similarly, we use a Beta(3.8, 6.2) prior distribution to describe the alternative hypothesis. This skewed distribution places the bulk of the density at or around 0.35 (ie, lower probability of heads) and places less density on values greater than 0.4. The Beta prior is appropriate to describe hypotheses about a coin (and other binary variables) because it is continuously defined on the interval between 0 and 1 that the bias of a coin is also defined on; has hyperparameters that can be interpreted as the number of heads and tails; and provides flexibility in describing hypotheses because it does not have to be symmetric.

To test these hypotheses, we conduct a simple experiment by flipping the coin 20 times, recording 5 heads and 15 tails. We summarize this data using a Bernoulli(5, 15) likelihood function. After computing the marginal likelihoods of the models for both hypotheses, we find that the Bayes factor comparing the alternative hypothesis to the null is 2.65. This indicates that the data supports the alternative hypothesis that the coin is tail-biased over the null hypothesis that it is fair only by a factor of 2 or so. We further note that the Bayes factor falls into the range of inconclusive values. Therefore, we conclude that we need more experimental data to determine whether the coin is fair or tail-biased with greater certainty.
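A Bayes factor of this Beta-Binomial form can be computed in closed form, since each hypothesis is a Beta prior on the probability of heads and the marginal likelihood of the flips is available analytically. The sketch below shows the mechanics with the priors and counts described above; depending on exactly how the commentary's authors set up their computation, the number produced here need not match their reported 2.65.

```python
# Sketch of a Beta-Binomial Bayes factor comparing two Beta priors on the
# probability of heads, given 5 heads and 15 tails.
from scipy import special

def marginal_likelihood(heads, tails, a, b):
    # Integral of Binomial(heads; heads+tails, theta) * Beta(theta; a, b).
    n = heads + tails
    return special.comb(n, heads) * special.beta(a + heads, b + tails) / special.beta(a, b)

heads, tails = 5, 15
m_alt = marginal_likelihood(heads, tails, 3.8, 6.2)   # tail-biased hypothesis
m_null = marginal_likelihood(heads, tails, 5.0, 5.0)  # fair-coin hypothesis

print(m_alt / m_null)  # Bayes factor for the alternative over the null
```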

A more pertinent illustrative example of hypothesis testing via Bayes factors is deciding whether health warnings for e-cigarettes increase worry about one’s health. Let the null hypothesis be that health warnings have exactly no effect on worry. Let the first alternative hypothesis be one-sided that health warnings increase worry, and let the second alternative hypothesis also be one-sided that health warnings decrease worry. Bayes factors with the Jeffreys-Zellner-Siow (JZS) default prior can be used to evaluate these hypotheses. 3 In comparison to other priors, default priors have mathematical properties that simplify the computation of Bayes factors. The JZS default prior describes hypotheses in terms of possible effect sizes (ie, Cohen’s d ). As such, under the null hypothesis that health warnings have exactly no effect on worry, the prior distribution places the entire density on an effect size of 0 ( Figure 1 ). Given that effect sizes in behavioral research in tobacco control are usually small, 4–6 the prior distributions for the alternative hypotheses use a scale parameter of 1/2 to distribute the density mostly over small positive or negative effect sizes.

Prior distributions quantitatively describing competing hypotheses about the effect of e-cigarette health warnings on worry about one’s own health due to tobacco product use.

To test these hypotheses, we conduct a simple online experiment with 200 adults who vape every day or some days. The experiment randomizes participants to receive a stimulus depicting 1 of 5 e-cigarette devices (eg, vape pen) with or without a corresponding health warning. After viewing the stimulus for 10 seconds, participants complete a survey that includes an item on worry, “How worried are you about your health because of your e-cigarette use?”,7 with a response scale of 1 (“not at all”) to 5 (“extremely”). Participants who receive a health warning report mean worry of 2.38 (SD = 0.87), and those who do not report mean worry of 2.33 (SD = 0.84). The Bayes factors comparing the first and second alternative hypotheses to the null hypothesis are 0.16 and 0.30, respectively. These Bayes factors indicate that there is more support in the data for the null hypothesis than for either alternative hypothesis. Taking the reciprocal of these Bayes factors indicates that there is approximately 3 to 6 times more support in the data for the null hypothesis that health warnings have no effect than for either alternative. Therefore, we conclude that health warnings for e-cigarettes do not appear to affect worry, based on the experimental data.

The hallmark of Bayesian model comparison (and other Bayesian approaches) is the incorporation of uncertainty at all stages of inference, particularly through the use of properly specified prior distributions. As a result, Bayesian model comparison has three practical advantages over conventional methods. First, Bayesian model comparison is not limited to tests of point null hypotheses.8,9 In fact, the first empirical example essentially conceptualized the possibility of the coin being fair as an interval null hypothesis by permitting a range of head-tail biases. Indeed, a great deal has already been written on how the use of point null hypotheses can lead to overstatements about the evidence for alternative hypotheses.10 Second, Bayesian model comparison is flexible enough to permit tests of any meaningful hypotheses.11 As a result, the second empirical example demonstrated tests of two one-sided hypotheses against the same null hypothesis. Third, Bayesian model comparison uses the marginal likelihood, which is a measure of the average fit of a model across the parameter space.12 Doing so leads to more accurate characterizations of the evidence for competing hypotheses, because it accounts for uncertainty in parameter values even after observing the data instead of focusing only on the most likely values of those parameters.

Bayes factors specifically have three advantages over other inferential statistics. First, Bayes factors can provide direct evidence for the common null hypothesis of no difference. 13 Second, they can reveal when experimental data is insensitive to the null and alternative hypotheses, clearly suggesting that the researcher should withhold judgment. 13 Third, they can be interpreted continuously and thus provide an indication of the strength of the evidence for the null or alternative hypothesis. While Bayesian model comparison via Bayes factors leads to robust tests of competing hypotheses, this advantage is only realized when all hypotheses are quantitatively described using carefully chosen priors that are calibrated in light of accumulated knowledge. Furthermore, two analysts may choose different priors to describe the same hypothesis. This subjectivity in the choice of prior has promoted the development of a large class of Bayes factors for common analyses (eg, difference of means as illustrated in the second empirical example) that use default priors. 14–16 Thus, the analyst only needs to choose values for important parameters, as in the second empirical example, without having to select the functional form of the prior (eg, a Beta prior) as in the first empirical example. Published Bayesian analyses will often list priors and justify why they were chosen for full transparency (see Baig et al. 17 for one succinct example). The next commentary will focus on informative hypotheses, prior specification when computing corresponding Bayes factors, and some Bayesian solutions for multiple testing. For the curious reader, the JASP package provides access to Bayes factors that use default priors for common analyses through a point-and-click interface similar to SPSS. 18


References

1. Rouder JN, Morey RD, Verhagen J, Swagman AR, Wagenmakers EJ. Bayesian analysis of factorial designs. Psychol Methods. 2017;22(2):304-321.
2. Jeon M, De Boeck P. Decision qualities of Bayes factor and p value-based hypothesis testing. Psychol Methods. 2017;22(2):340-360.
3. Hoijtink H, van Kooten P, Hulsker K. Why Bayesian psychologists should change the way they use the Bayes factor. Multivariate Behav Res. 2016;51(1):2-10. doi:10.1080/00273171.2014.969364
4. Baig SA, Byron MJ, Boynton MH, Brewer NT, Ribisl KM. Communicating about cigarette smoke constituents: an experimental comparison of two messaging strategies. J Behav Med. 2017;40(2):352-359.
5. Brewer NT, Morgan JC, Baig SA, et al. Public understanding of cigarette smoke constituents: three US surveys. Tob Control. 2016;26(5):592-599.
6. Morgan JC, Byron MJ, Baig SA, Stepanov I, Brewer NT. How people think about the chemicals in cigarette smoke: a systematic review. J Behav Med. 2017;40(4):553-564. doi:10.1007/s10865-017-9823-5
7. Mendel JR, Hall MG, Baig SA, Jeong M, Brewer NT. Placing health warnings on e-cigarettes: a standardized protocol. Int J Environ Res Public Health. 2018;15(8):1578. doi:10.3390/ijerph15081578
8. Morey RD, Rouder JN. Bayes factor approaches for testing interval null hypotheses. Psychol Methods. 2011;16(4):406-419.
9. West R. Using Bayesian analysis for hypothesis testing in addiction science. Addiction. 2016;111(1):3-4. doi:10.1111/add.13053
10. Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of p-values and evidence. J Am Stat Assoc. 1987;82(397):112-122. doi:10.1080/01621459.1987.10478397
11. Etz A, Haaf JM, Rouder JN, Vandekerckhove J. Bayesian inference and testing any hypothesis you can specify. Adv Methods Pract Psychol Sci. 2018;1(2):281-295. doi:10.1177/2515245918773087
12. Etz A. Introduction to the concept of likelihood and its applications. Adv Methods Pract Psychol Sci. 2018;1(1):60-69. doi:10.1177/2515245917744314
13. Dienes Z, Coulton S, Heather N. Using Bayes factors to evaluate evidence for no effect: examples from the SIPS project. Addiction. 2018;113(2):240-246.
14. Nuijten MB, Wetzels R, Matzke D, Dolan CV, Wagenmakers E-J. A default Bayesian hypothesis test for mediation. Behav Res Methods. 2014;47(1):85-97. doi:10.3758/s13428-014-0470-2
15. Ly A, Verhagen J, Wagenmakers E-J. Harold Jeffreys's default Bayes factor hypothesis tests: explanation, extension, and application in psychology. J Math Psychol. 2016;72:19-32. doi:10.1016/j.jmp.2015.06.004
16. Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon Bull Rev. 2009;16(2):225-237.
17. Baig SA, Byron MJ, Lazard AJ, Brewer NT. "Organic," "natural," and "additive-free" cigarettes: comparing the effects of advertising claims and disclaimers on perceptions of harm. Nicotine Tob Res. 2019;21(7):933-939.
18. Wagenmakers E-J, Love J, Marsman M, et al. Bayesian inference for psychology. Part II: example applications with JASP. Psychon Bull Rev. 2018;25(1):58-76.




  17. Bayesian Hypothesis Testing Illustrated: An Introduction for Software

    The odd Python program that runs faster than its C version is not rare enough. If we wanted to use the result to decide whether to rewrite all of our existing Python programs in C or not, then we would probably not deem such an effort worthwhile at this level of evidence. ... If Bayesian hypothesis testing ended with the calculation of the ...

  18. bayesian-testing · PyPI

    Bayesian A/B testing. bayesian_testing is a small package for a quick evaluation of A/B (or A/B/C/...) tests using Bayesian approach.. Implemented tests: BinaryDataTest. Input data - binary data ([0, 1, 0, ...]; Designed for conversion-like data A/B testing. NormalDataTest. Input data - normal data with unknown variance; Designed for normal data A/B testing.

  19. A Review of Bayesian Hypothesis Testing and Its Practical

    A way to avoid the explicit selection of prior densities is through the use of the Bayesian information criterion (BIC), which can give a rough interpretation of evidence in Table 1. Another potential disadvantage is the computational difficulty of evaluating marginal likelihoods, and this is discussed in Section 2.2.

  20. Bayesian Hypothesis Testing

    9.1.8 Bayesian Hypothesis Testing. Suppose that we need to decide between two hypotheses H0 H 0 and H1 H 1. In the Bayesian setting, we assume that we know prior probabilities of H0 H 0 and H1 H 1. That is, we know P(H0) = p0 P ( H 0) = p 0 and P(H1) = p1 P ( H 1) = p 1, where p0 + p1 = 1 p 0 + p 1 = 1. We observe the random variable (or the ...

  21. Hypothesis Testing with Python: Step by step hands-on tutorial with

    It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). Suppose the resulting p-value of Levene's test is less than the significance level (typically 0.05).In that case, the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances.

  22. Bayesian A/B Testing: A Hypothesis Test that Makes Sense

    Orthodox Null Hypothesis Significance Testing differs in more ways than simply using a T-Test, and will likely be the topic of a future post. The key insight here is that we have shown how the ideas of Hypothesis Testing and Parameter Estimation can be viewed, from a Bayesian perspective, as the same problem.

  23. Bayesian Inference: An Introduction to Hypothesis Testing Using Bayes

    In the context of Bayesian inference, hypothesis testing can be framed as a special case of model comparison where a model refers to a likelihood function and a prior distribution. Given two competing hypotheses and some relevant data, Bayesian hypothesis testing begins by specifying separate prior distributions to quantitatively describe each ...