There’s a philosophical statistics debate in the optimization world: Bayesian vs Frequentist.
This is not a new debate; Thomas Bayes wrote “An Essay towards solving a Problem in the Doctrine of Chances” in 1763, and it’s been an academic argument ever since.
Recently, the issue has become relevant in the CRO world – especially with the announcement that VWO will be using Bayesian decisions (Google Experiments also uses Thompson sampling, which is informed by a Bayesian perspective).
So what the hell does Bayesian statistics mean for a/b testing? First, let’s summarize Bayesian and Frequentist approaches, and what the difference between them is.
Note: I’m not going to go too in-depth in the philosophical debate between the two approaches or the intricacies of Bayes’ Theorem. I’ve listed some further reading at the bottom of each section if you’re interested in learning more.
The Quick and Dirty Difference Between Frequentist and Bayesian Statistics
The Frequentist Approach
You’re probably familiar with the Frequentist approach to testing. It’s the model of statistics taught in most core-requirement college classes, and it’s the approach most often used by a/b testing software.
Basically, using a Frequentist method means making predictions on underlying truths of the experiment using only data from the current experiment.
As Leonid Pekelis wrote in an Optimizely article, “Frequentist arguments are more counter-factual in nature, and resemble the type of logic that lawyers use in court. Most of us learn frequentist statistics in entry-level statistics courses. A t-test, where we ask, “Is this variation different from the control?” is a basic building block of this approach.”
The Bayesian Approach
According to the Leonid Pekelis, “Bayesian statistics take a more bottom-up approach to data analysis. This means that past knowledge of similar experiments is encoded into a statistical device known as a prior, and this prior is combined with current experiment data to make a conclusion on the test at hand.”
So, the biggest distinction is that Bayesian probability specifies that there is some prior probability.
The Bayesian approach goes something like this (summarized from this discussion):
- Define the prior distribution that incorporates your subjective beliefs about a parameter. The prior can be uninformative or informative.
- Gather data.
- Update your prior distribution with the data using Bayes’ theorem (though you can have Bayesian methods without explicit use of Bayes’ rule – see non-parametric Bayesian) to obtain a posterior distribution. The posterior distribution is a probability distribution that represents your updated beliefs about the parameter after having seen the data.
- Analyze the posterior distribution and summarize it (mean, median, sd, quantiles,…).
To explain Bayes reasoning in relation to conversion rates, Chris Stucchio gives the example of a hypothetical startup, BeerBnB. His initial marketing efforts (ads in bar bathrooms) drew 794 unique visitors, 12 of whom created an account giving the effort a 1.5% conversion rate. Suppose the company can reach 10,000 visitors via toilet ads around the city, how many people can we expect to sign up? About 150.
Another example was something I found when reading Lean Analytics. There’s a case study about a restaurant, Solare. They know that if by 5 p.m. there are 50 reservations, then they can predict there will be around 250 covers for the night. This is a prior and can be updated with new sets of data.
Or as Boundless Rationality wrote, “A fundamental aspect of Bayesian inference is updating your beliefs in light of new evidence. Essentially, you start out with a prior belief and then update it in light of new evidence. An important aspect of this prior belief is your degree of confidence in it.”
So, why the controversy?
He says it’s much easier to debate minute tasks and equations than it is to discuss the testing discipline and the whole role of optimization in an organization.
Dr. Rob Balon, CEO of The Benchmark Company, agrees:
Though, Andrew Gelman was cited in a New York Times article defending Bayesian methods as a sort of double-check on spurious results. As an example, he re-evaluated a study using Bayesian statistics. The study had concluded that women who were ovulating were 20 percent more likely to vote for President Obama in 2012 than those who were not. Andrew added in data showing that people rarely change their voting preference during an election cycle – even during a menstrual cycle. Adding this info, the study’s statistical significance disappeared.
In my research, it’s clear there’s a large divide based on the philosophy of each approach. In essence, they tackle the same problems in slightly difference ways.
- An Intuitive Explanation of Bayes’ Theorem (amazing resource!)
- An Intuitive (and Short) Explanation of Bayes’ Theorem (abridged version of the above)
- A Technical Explanation of Technical Explanation (a more advanced reading)
- The Odds, Continually Updated (NY Times article)
- Bayesian or Frequentist, Which Are You? (video lecture)
- A List of Data Science and Machine Learning Resources
- Tutorials on Bayesian Nonparametrics
What Does This Have to Do With A/B Testing?
“A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”
Though Rob Balon referred to the debate as mostly ‘esoteric tail-wagging’ (Matt Gershoff used the term ‘statistical theater’), there are business implications when it comes to a/b testing. Everyone wants faster and more accurate results that are easier to understand and communicate, and that’s what both methods attempt to do.
Since VWO announced their change to Bayesian decisions, there has been interest in the mathematical methods behind the tools. There has also been a lot of misunderstanding about this, but mostly stems from a few common questions. Which is more accurate? Does it matter which one I use?
Though, as Matt Gershoff explains: “Often, and I think this is a massive hole in the CRO thinking, is that we are trying to ESTIMATE the parameters for a given model (think targeting) in some rational way.”
Does it Matter Which You Use?
Some say yes, and some say no. Like almost everything, the answer is complicated and has proponents on both sides. Let’s start with the pro-bayesian argument…
In Defense of Bayesian Decisions
Lyst actually wrote an article last year about using Bayesian decisions. According to them, ”We think this helps us avoid some common pitfalls of statistical testing and makes our analysis easier to understand and communicate to non-technical audiences.”
They say they prefer Bayesian methods for two reasons:
- Their end result is a probability distribution, rather than a point estimate. “Instead of having to think in terms of p-values, we can think directly in terms of the distribution of possible effects of our treatment…This makes it much easier to understand and communicate the results of the analysis.”
- Using an informative prior allows them to alleviate many of the issues that plague classical significance testing (they cite repeated testing and low base rate problem – though Evan Miller disputed the latter argument on this Hacker News thread)
The also offered the following visuals, where they drew two samples from a Bernoulli distribution (yes/no, tails/heads), computed the p parameter (probability of heads) estimates for each sample, and then took their difference:
The article is a solid argument in favor of using a Bayesian method (they have a calculator you can use too), but they do leave with a caveat:
“It is worth noting that there is nothing magical about Bayesian methods. The advantages described above are entirely due to using an informative prior. If instead we used a flat (or uninformative) prior—where every possible value of our parameters is equally likely—all the problems would come back.”
Chris Stucchio explains some of the reasons for VWO using Bayesian decisions:
But some disagree…
So there’s a good amount of support for Bayesian methods. While there aren’t many anti-Bayesians, there are a few Frequentists and people that just generally think there are more important things to worry about.
Andrew Anderson, for instance, says that for 99% of users, it doesn’t really matter.
Dr. Rob Balon agrees, saying the Bayesian vs Frequentist argument is really not that relevant to A/B testing:
- Bayesian A/B Testing by Evan Miller
- Hacker News discussion on Bayesian A/B Testing
- Probabilistic Programming & Bayesian Methods for Hackers
- Easy Evaluation of Decision Rules in Bayesian A/B testing
Tools and Methods
Most tools use Frequentist methods – though as mentioned above, VWO will be using Bayesian decisions. Google Experiments also uses Thompson sampling, which is informed by a Bayesian perspective.
Optimizely’s Stats Engine is based on Wald’s sequential test. This is the sequential version of Pearson-Neyman hypothesis testing approach, so this is a frequentist approach, but uses flavors of Bayes as well.
Conductrics’ next release will blend together ideas from empirical Bayes, with targeting, to improve the efficiency of its Reinforcement Learning engine.
Though, Andrew Anderson doesn’t think we should spend much time worrying about the methods behind it. As he said about tools advertising different methods as features:
Though you could dig forever and find strong arguments for and against each side, what it comes down to is that we’re solving the same problem in two different ways.
I like the analogy that Optimizely gave using bridges: “Just like a suspension and arch bridges both successfully get cars across a gap, both Bayesian and Frequentist statistical methods provide to an answer to the question: which variation performed best in an A/B test?”
Andrew Anderson also had a fun way of looking at it: “In many cases this debate is the same as arguing the style of the screen door on a submarine. It’s a fun argument that will change how things look, but the very act of having it means that you are drowning.”