There’s a lot of controversy around one-tailed vs two-tailed testing.
Some articles lambast the shortcomings of one-tailed testing, saying that “unsophisticated users love them.” On the flip side, other articles and discussions take a more balanced approach and say there’s a time and a place for both.
In fact, many people don’t realize that there are two ways to determine whether an experiment’s results are statistically valid. There’s still a lot of confusion and misunderstanding about one-tailed and two-tailed testing.
The commotion comes from a justifiable worry: are my lifts imaginary? As a SumAll article points out, A/A tests sometimes produce quirky results, which makes you question the reliability of your tools and your A/B testing plan.
So when we’re talking about one-tailed vs. two-tailed tests, we’re really talking about whether we can trust the results of our A/B tests and take action based on them.
So what’s the difference? Does it matter? When should you use one-tailed tests? Two-tailed tests?
One-tailed vs two-tailed: What’s the difference?
If you’re just learning about testing, Khan Academy offers a clearly laid out illustration of the difference between one-tailed and two-tailed tests:
In essence, a one-tailed test looks for an effect in just one direction, whereas a two-tailed test looks for an effect in both directions, positive and negative.
Chris Stucchio does a great job explaining the difference between the two tests in context:
Put simply, a two-tailed test can show evidence that the control and variation are different, while a one-tailed test is used to show evidence that the variation is better than the control.
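To make the difference concrete, here is a minimal sketch in Python. It assumes a pooled two-proportion z-test with a normal approximation, which is only one common way to analyze conversion data and not necessarily what any particular tool does. The function names (`two_proportion_z`, `p_values`) and the visitor and conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference in conversion rate (variation B minus control A).

    Assumption: pooled standard error, normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic."""
    norm = NormalDist()
    # One-tailed: is the variation better than the control?
    one_tailed = 1 - norm.cdf(z)
    # Two-tailed: is the variation different, in either direction?
    two_tailed = 2 * (1 - norm.cdf(abs(z)))
    return one_tailed, two_tailed


# Hypothetical data: control converts 200 of 10,000, variation 235 of 10,000.
z = two_proportion_z(200, 10_000, 235, 10_000)
one_tailed, two_tailed = p_values(z)
print(f"z = {z:.3f}, one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

When the variation beats the control, the two-tailed p-value is exactly double the one-tailed one. With these invented numbers, the same data clears a 0.05 threshold under the one-tailed test but not under the two-tailed test, which is why the choice affects whether you call a winner.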
Does it matter which method you use?
Okay, so now that we’ve gone over what the tests actually are, we can ask the important question: does it even matter which you use? Turns out, that’s a complicated question. It’s where a lot of the derision arises.
Pros and Cons of Each Method
Other factors in validity
So there are other factors when it comes to testing for statistical validity. Still, there are strong opinions around one-tailed and two-tailed testing.
The case for two-tailed testing
Two-tailed tests mitigate type I errors (false positives) and cognitive bias errors. Furthermore, as Kyle Rush said, “unless you have a superb understanding of statistics, you should use a two-tailed test.”
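A quick sketch of why that holds, assuming a z-test at a significance level of 0.05 (the variable names are just for illustration). At the same alpha, the two-tailed test demands a larger z score before a variation counts as a winner, so a variation with no real effect is wrongly declared "better" about half as often.

```python
from statistics import NormalDist

norm = NormalDist()
alpha = 0.05  # assumed significance level

# Critical z scores the variation must exceed to be called a winner.
z_one_tailed = norm.inv_cdf(1 - alpha)      # all of alpha sits in the upper tail
z_two_tailed = norm.inv_cdf(1 - alpha / 2)  # alpha is split across both tails

# Chance that a variation with no real effect is declared "better".
false_win_one_tailed = 1 - norm.cdf(z_one_tailed)
false_win_two_tailed = 1 - norm.cdf(z_two_tailed)

print(f"one-tailed: z > {z_one_tailed:.3f}, false-win rate {false_win_one_tailed:.3f}")
print(f"two-tailed: z > {z_two_tailed:.3f}, false-win rate {false_win_two_tailed:.3f}")
```

The critical values come out at roughly 1.645 and 1.960. The flip side is that the two-tailed test also spends half of its alpha on detecting a variation that is worse, which a one-tailed test ignores.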
Here’s what Andrew Anderson had to say:
Neal Cole, Conversion Specialist at a leading online gaming company, agrees:
When can I use one-tailed tests?
According to some, there is a time and a place for one-tailed tests. It is often contextual and depends on how you intend to act on the data. As Luke Stokebrand said, “One-tailed tests are not always bad, it is just important to understand their downside. In fact, there are many times when it makes sense to use an one-tailed test to validate your data.”
Andy Hunt from UpliftROI, though acknowledging the faults of one-tailed tests, takes a realistic approach:
Similarly, Jeff Sauro from MeasuringU reiterates that while you should normally use the 2-sided p-value, “you should only use the 1-sided p-value when you have a very strong reason to suspect that one version is really superior to the other.”
Kyle Rush echoes this:
Which tools use which method?
When you ask which A/B testing software uses which method, you enter a world of murky answers and ambiguity. That is to say, not many tools state it explicitly. So here’s what I got from research and from asking testing experts (correct me if I’m wrong or need to add something):
Tools that use one-tailed tests
- Google Content Experiments (uses Bandit, but one-tail if you disable that feature)
- Conductrics (plus option to run 2-tail via API along with Bandit options)
Tools that use two-tailed tests
Of course, certain tools have custom frameworks as well (like Google’s Multi-armed Bandit). Kyle Rush explains Optimizely’s Stats Engine:
The issue of using 1-tailed vs 2-tailed testing is important, though the decision can’t be made with statistics alone. As Chris Stucchio said, “it needs to be decided from within the context of a decision procedure.”
He continues: “When running an A/B test, the goal [is] almost always to increase conversions rather than simply answer idle curiosity. To decide whether one tailed or two tailed is right for you, you need to understand your entire decision procedure rather than simply the statistics.”
So if you’d like to learn more about one-tailed and two-tailed testing, there are many resources. Here are a few that are easy to understand:
Otherwise, I’ll close with something Peep said about the subject: “One vs two-tailed issue is minor (it’s perfectly fine to use a one tailed test in a lot of cases) compared to test sample sizes and test duration. Ending tests too soon is by far #1 testing sin there is.”