Reality check: the level of statistical literacy is pretty poor in the CRO world. Many of your test results are probably invalid.
While testing tools are getting more sophisticated, blogs are brimming with ‘inspiring’ case studies, and experimentation is becoming more and more common for marketers – statistics know-how is still severely lacking.
Stop being one of “those people”, and get your act together. It’s actually not that complicated. If you don’t know basic statistics, you won’t be able to tell whether your split tests suck.
Why Do I Need to Know A/B Testing Statistics?
I know statistics isn’t necessarily a fun thing to learn. It’s probably a lot more fun to put up a test between a red and green button and wait until your testing tool tells you one of them has beaten the other.
If this is your strategy, you’re ripe for disappointment.
This approach isn’t much better than guessing. Often you’ll test things for a year, only to end up with the exact same conversion rate you started with.
Statistics let you draw inferences from your results and make practical business decisions. Not understanding them leads to errors and unreliable outcomes.
Here’s an analogy from Matt:
So it is with conversion rates. Conversion optimization is also a balancing act between exploration and exploitation. It’s about balancing risk, which is a fundamental problem solved by statistics. As Ton Wesseling from Testing.Agency put it:
Building Blocks: Mean, Variance, and Sampling
There are three terms you should know before we dive into the nitty-gritty of A/B testing statistics:
The mean is the average. For a conversion metric, the expected number of conversions is the number of trials multiplied by the probability of success (n*p); the mean conversion rate is simply conversions divided by total visitors.
In our coffee example, this would be the process of measuring the temperature of each cup of coffee that we sample and dividing by the total number of cups, reaching an average temperature that is hopefully representative of the true average.
In online experimentation, since we can’t know the “true” conversion rate, we’re measuring the mean conversion rate of each variation.
Variance measures the variability of our data – the average squared distance from the mean. The higher the variability, the less precise the mean will be as a predictor of any individual data point.
It’s basically, on average, how far off each of the individual cups of coffee in each collection is from the collection’s average temperature. In other words, how close will the mean be to each cup’s actual temperature? The smaller the variance, the better the mean will be as a guess for each cup’s temperature. Many things can cause variance (e.g. how long ago the coffee was brewed, who made it, how hot the water was, etc.).
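To make both ideas concrete, here’s a minimal Python sketch (the temperatures are made up): two collections of cups can have an identical mean while one has a much larger variance, making its mean a far worse guess for any single cup.

```python
from statistics import mean, pvariance

# Hypothetical temperatures (°F) for the cups we sampled from each place
mcdonalds = [151, 149, 153, 148, 152, 150, 147, 154]
starbucks = [150, 138, 162, 145, 158, 141, 160, 150]

for name, temps in [("McDonald's", mcdonalds), ("Starbucks", starbucks)]:
    # Same mean – but the higher the variance, the worse the mean
    # is as a prediction for any individual cup
    print(f"{name}: mean = {mean(temps):.1f}, variance = {pvariance(temps):.1f}")
```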
In terms of conversion optimization, Marketing Experiments gave a great example of variance in this blog post:
The two images above are exactly the same – yet the treatment produced 15% more conversions. This is an A/A test.
A/A tests, which are often used to check whether your testing software is working, are also used to detect the natural variability of a website. An A/A test splits traffic between two identical pages; if you discover a statistically significant lift on one of the variations, you need to investigate the cause.
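You can simulate this natural variability yourself. Here’s a rough sketch (the conversion rate, sample size, and run count are all invented): it runs many A/A tests through a standard two-proportion z-test and counts how often chance alone produces a ‘significant’ winner – about 5% of the time at the 0.05 level, even though the pages are identical.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
TRUE_RATE, N, RUNS = 0.05, 5_000, 500  # identical 5% rate on both pages
false_positives = 0
for _ in range(RUNS):
    a = sum(random.random() < TRUE_RATE for _ in range(N))
    b = sum(random.random() < TRUE_RATE for _ in range(N))
    if two_proportion_p_value(a, N, b, N) < 0.05:
        false_positives += 1

print(f"A/A tests that found a 'significant' winner: {false_positives / RUNS:.1%}")
```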
Since we can’t measure ‘true conversion rate,’ we have to select a sample that is statistically representative of the whole.
In terms of our coffee measuring example, we don’t know the mean temperature of coffee from each restaurant, so we collect data on the temperature in order to estimate the average. Unlike comparing individual cups of coffee, we don’t measure ALL possible cups of coffee from McDonald’s and Starbucks; we collect some of them and use inference to estimate the total.
The more cups we measure, the more likely it is that the sample is representative of the actual temperature. The variance shrinks with a larger sample size, and it’s more likely that our mean will be accurate.
Similarly, in conversion optimization, the larger the sample size, the more accurate your test will generally be.
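The reason is the standard error of a proportion, which shrinks with the square root of the sample size. A quick sketch (the 5% rate is just an assumption for illustration):

```python
from math import sqrt

p = 0.05  # assumed true conversion rate
for n in [100, 1_000, 10_000, 100_000]:
    # Standard error of a sample proportion: sqrt(p * (1 - p) / n)
    se = sqrt(p * (1 - p) / n)
    print(f"n = {n:>7,}: standard error = {se:.4f}")
```

Note the diminishing returns: each 10x increase in traffic only cuts the error by a factor of about 3.2 (the square root of 10).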
Statistical Significance Is Not A Stopping Rule
Let’s start with the obvious question: what is statistical significance?
Evan Miller wrote a well-known blog post on the topic, which you should definitely read. As he explained:
“When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance?”
Statistical significance is a major quantifier in null-hypothesis statistical testing. Simply put, if your result doesn’t reach significance, there’s a good chance your ‘winner’ is not a real winner – insignificant results carry a larger risk of false positives (type I errors).
Problem is, if you don’t predetermine a sample size for your test and a point at which your test will end, you’re likely getting subpar or completely inaccurate results.
That’s because most A/B testing tools do not wait for a fixed horizon (a set point in time) to call statistical significance. The test can and will oscillate between significant and insignificant at many points throughout the experiment:
That’s one of the big reasons we say that statistical significance is not a stopping rule. The biggest mistake beginning optimizers make is calling their tests early.
Here’s an example we’ve given before. Two days after the test started, here were the results:
Variation clearly lost, right? 0% chance to beat original seems pretty unambiguous? Not so fast. Statistically significant? Yes, but check out the results 10 days later:
That’s why you shouldn’t peek at results. The more you peek at the results, the more you risk alpha error inflation (read about it here). So set a sample size and a fixed horizon, and don’t stop the test until then.
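Here’s a rough simulation of why peeking is so dangerous (the daily traffic, rates, and duration are all invented). It runs A/A tests – where no real difference exists – checks significance after every ‘day’ of data, and stops at the first significant reading. Far more than 5% of these tests get called as winners:

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(a, na, b, nb):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p = (a + b) / (na + nb)
    se = sqrt(p * (1 - p) * (1 / na + 1 / nb))
    z = (b / nb - a / na) / se if se else 0.0
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(2)
RATE, DAILY_N, DAYS, RUNS = 0.05, 500, 20, 500
stopped_early = 0
for _ in range(RUNS):
    a = b = na = nb = 0
    for _ in range(DAYS):
        a += sum(random.random() < RATE for _ in range(DAILY_N))
        b += sum(random.random() < RATE for _ in range(DAILY_N))
        na += DAILY_N
        nb += DAILY_N
        if p_value(a, na, b, nb) < 0.05:  # peek, and stop at 'significance'
            stopped_early += 1
            break

# With 20 peeks, the false positive rate balloons well past the nominal 5%
print(f"A/A tests called early as 'winners': {stopped_early / RUNS:.0%}")
```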
Also, be wary of case studies that claim statistical significance yet don’t publish full numbers. Many of them may be statistically significant, yet have only a handful of conversions and a sample size of like 100.
What the Heck is a P-Value?
If you do some follow up reading on statistical significance, you’ll likely come across the term ‘P-value.’ The P-Value is basically a measure of evidence against the null hypothesis (the control in A/B Testing parlance). Matt Gershoff gave a great example and explanation in a previous article:
Formally, the p-value is the probability of seeing a particular result (or greater) from zero, assuming that the null hypothesis is true. If ‘null hypothesis is true’ is tripping you up, just think instead, ‘assuming we had really run an A/A test.’
If our test statistic is in the surprise region, we reject the Null (reject that it was really an A/A test). If the result is within the Not Surprising area, then we Fail to Reject the null. That’s it.
What you actually need to know about P-Values
Remember this: P-value does not tell us the probability that B is better than A.
Similarly, it doesn’t tell us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false.
Remember: the p-value is just the probability of seeing a result as extreme or more extreme, given that the null hypothesis is true. Or, “How surprising is this result?”
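One way to internalize the ‘assume it was really an A/A test’ framing is a permutation test: pool all the outcomes, shuffle them between the two groups, and count how often chance alone produces a difference as extreme as the one you observed. The counts below are invented, and most testing tools use a z- or t-test rather than this method, but the underlying question is the same:

```python
import random

random.seed(3)
# Hypothetical observed results
conv_a, n_a = 100, 2_000  # control: 5.0%
conv_b, n_b = 130, 2_000  # treatment: 6.5%
observed_diff = conv_b / n_b - conv_a / n_a

# Pretend it was really an A/A test: pool every visitor's outcome,
# then repeatedly reshuffle them into two arbitrary groups
outcomes = [1] * (conv_a + conv_b) + [0] * (n_a + n_b - conv_a - conv_b)
RUNS = 2_000
more_extreme = 0
for _ in range(RUNS):
    random.shuffle(outcomes)
    diff = sum(outcomes[:n_b]) / n_b - sum(outcomes[n_b:]) / n_a
    if abs(diff) >= abs(observed_diff):
        more_extreme += 1

# "How surprising is this result, if there were really no difference?"
print(f"Approximate two-sided p-value: {more_extreme / RUNS:.3f}")
```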
Small note: there’s a large debate in the scientific community about P-Values. Much of this comes from the controversial practice of ‘P-Hacking’ – manipulating an experiment until it reaches significance so the author can get published. To learn everything you need to know about P-Values, read this post by Matt Gershoff.
Statistical Power: Detecting an Effect That Is Actually There
While statistical significance is the term you’ll hear most often, many people forget about statistical power. Where the significance level is the probability of seeing an effect when none exists, power is the probability of detecting an effect that actually exists.
So when you have low power levels, there is a big chance that a real winner goes unrecognized. Evan Miller put together a great chart explaining the differences:
Effect Size FAQs summarizes it really well in plain English:
“Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down.”
So how do you calculate statistical power? You can read this post, which explains how to do so. Know that the four main factors that affect the power of any test of statistical significance are:
- the effect size
- the sample size (N)
- the alpha significance criterion (α)
- the chosen or implied beta (β) – statistical power is 1 − β
However, for practical purposes, all you really need to know is that 80% power is the standard for testing tools. To reach such a level, you need a large sample size, a large effect size, or a longer test.
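If you’re curious where sample size calculators get their numbers, here’s a sketch of the standard two-proportion approximation (the baseline and target rates are just assumptions; use a dedicated calculator for real planning):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect p1 -> p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(variance * (z_alpha + z_power) ** 2 / (p1 - p2) ** 2)

# Detecting a lift from 5% to 6% (a 20% relative lift) at 80% power:
print(sample_size_per_variation(0.05, 0.06))  # ~8,200 visitors per variation
```

Notice how the required sample size explodes as the effect size shrinks – detecting 5% vs. 5.5% needs roughly four times the traffic of 5% vs. 6%.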
Small note: if your test lasts too long you risk sample pollution. Read this post to learn more.
Confidence Intervals and Margins of Error
Next on our list of statistical jargon you should be aware of is confidence intervals. What are they? A confidence interval is the amount of error allowed in A/B testing – a measure of the reliability of an estimate. Example from PRWD:
Of course, we can’t measure the true conversion rate, which is why we run experiments. Since statistics is inferential, we use confidence intervals to mitigate the risk of sampling errors. In that sense, we’re managing the risk associated with implementing a new variation. So if your tool says something like, “We are 95% confident that the conversion rate is X% +/- Y%,” you need to account for the +/- Y% as the margin of error.
One practical implication here is that you should watch whether confidence intervals overlap. Here’s how Michael Aagaard put it:
“So, the conversion range can be described as the margin of error you’re willing to accept. The smaller the conversion range – the more accurate your results will be. As a rule of thumb – if the 2 conversion ranges overlap, you’ll need to keep testing in order to get a valid result.”
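Here’s a minimal sketch of that rule of thumb, using the normal approximation for each variation’s interval (the counts are invented; your tool may use a slightly different interval formula):

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation interval for a conversion rate."""
    p = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / visitors)  # the '+/- Y%' part
    return p - margin, p + margin

low_a, high_a = confidence_interval(210, 4_000)  # control: 5.25%
low_b, high_b = confidence_interval(260, 4_000)  # treatment: 6.50%
print(f"A: {low_a:.2%} to {high_a:.2%}")
print(f"B: {low_b:.2%} to {high_b:.2%}")
if high_a > low_b and high_b > low_a:  # the two ranges overlap
    print("Conversion ranges overlap – keep testing.")
```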
John Quarto-vonTivadar has a great visual explaining confidence intervals:
Confidence intervals shrink as you collect more data, but at a certain point they are subject to the law of diminishing returns.
Reading right to left, as we increase the size of our sample, our sampling error falls. However, it falls at a decreasing rate – which means that we get less and less information from each addition to our sample.
Now if you were to do further research on the subject, you might be confused by the interchangeability of the terms confidence interval and margin of error. For all practical purposes, here’s the difference: the confidence interval is what you see on your testing tool as ‘20% +/- 2%,’ and the margin of error is the ‘+/- 2%.’
Matt Gershoff gave an illustrative example:
Regression To The Mean
A common question when you first start testing is, “what is the reason for the wild fluctuations at the beginning of the test?” Here’s what I mean:
What’s happening here is regression to the mean, defined as “the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.”
A great example comes from Wikipedia:
Imagine you give a class of students a 100-item true/false test on a subject. Suppose all the students choose all their answers randomly. Then each student’s score would be a realization of independent and identically distributed random variables, with an expected mean of 50. Of course, some students would score much above 50 and some much below.
So say you take only the top 10% of students and give them a second test on which they again guess randomly on all questions. Since the expected mean is still near 50, their scores would regress to the mean – going down, back toward 50.
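You can reproduce the thought experiment in a few lines. In this sketch, the ‘top students’ score well above 50 on the first test purely by luck, then fall right back toward 50 on the retest:

```python
import random
from statistics import mean

random.seed(4)
STUDENTS, QUESTIONS = 1_000, 100

# Every student guesses randomly on a 100-item true/false test
first = [sum(random.random() < 0.5 for _ in range(QUESTIONS))
         for _ in range(STUDENTS)]

# Keep only the top 10% of scorers and have them guess again
top = sorted(range(STUDENTS), key=lambda i: first[i], reverse=True)[:100]
second = [sum(random.random() < 0.5 for _ in range(QUESTIONS)) for _ in top]

print(f"Top scorers, first test: {mean(first[i] for i in top):.1f}")  # ~58-59
print(f"Same students, retested: {mean(second):.1f}")                # ~50
```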
Essentially, if you’re calling a test early, based only on reaching significance, it’s possible you’re seeing a false positive. And it’s likely your ‘winner’ will regress to the mean.
A related phenomenon the internet constantly gets confused about is the novelty effect. That’s when the novelty of your changes (bigger blue button!) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.
Adobe outlined a method to distinguish the difference between a novelty effect and actual inferiority:
To determine if the new offer underperforms because of a novelty effect or because it’s truly inferior, you can segment your visitors into new and returning visitors and compare the conversion rates. If it’s just the novelty effect, the new offer will win with new visitors. Eventually, as returning visitors get accustomed to the new changes, the offer will win with them, too.
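As a sketch of that check (the segment names and counts below are hypothetical), you’d compare conversion rates per segment instead of overall:

```python
# Hypothetical per-segment results for a treatment that loses overall
segments = {
    "new visitors":       {"control": (120, 2_400), "treatment": (150, 2_400)},
    "returning visitors": {"control": (200, 2_600), "treatment": (165, 2_600)},
}

for segment, variations in segments.items():
    rates = ", ".join(f"{name}: {conv / n:.2%}"
                      for name, (conv, n) in variations.items())
    print(f"{segment} -> {rates}")

# Treatment wins with new visitors but loses with returning ones:
# the gap is likely familiarity (novelty effect), not true inferiority.
```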
What You Need to Know About Segmenting
The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall results, B might beat A in certain segments (organic, Facebook, mobile, etc). For segments, the same stopping rules apply.
Make sure you have enough sample size within the segment itself, too (calculate it in advance, and be wary if it’s less than 250–350 conversions PER variation within the segment you’re looking at).
As Andre Morys from Web Arts said in a previous article, searching for lifts within segments that have no statistical validity is a big mistake:
You can learn a lot from segmenting your test data, but make sure you’re applying the same statistical rules to the smaller data sets.
Confounding Variables and External Factors
There’s a challenge with running A/B tests: the data is non-stationary.
A stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc.) are constant over time. For many reasons, website data is non-stationary, which means we can’t make the same assumptions as with stationary data. Here are a few reasons data might fluctuate:
- Day of the week
- Press (positive or negative)
There are many more, but here’s a practical example for you, and why it’s essential to test for full weeks.
Test for Full Weeks
Run a conversions-per-day-of-the-week report on your site and see how much fluctuation there is:
You can see that Saturday’s conversion rate is much lower than Thursday’s. So if you started the test on a Friday and ended on a Sunday, you’d be skewing your results.
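If your analytics tool doesn’t offer that report, it’s easy to compute from a raw visit log. A minimal sketch (the log format and numbers here are invented):

```python
from collections import defaultdict
from datetime import date

# Hypothetical visit log: (visit date, did the visitor convert?)
visits = [
    (date(2024, 3, 4), True),   # a Monday
    (date(2024, 3, 4), False),
    (date(2024, 3, 9), False),  # a Saturday
    (date(2024, 3, 9), False),
    # ...thousands more rows in a real log
]

by_weekday = defaultdict(lambda: [0, 0])  # weekday -> [conversions, visits]
for day, converted in visits:
    by_weekday[day.strftime("%A")][0] += converted
    by_weekday[day.strftime("%A")][1] += 1

for weekday, (conversions, total) in by_weekday.items():
    print(f"{weekday}: {conversions / total:.1%} of {total} visits converted")
```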
Holidays and Promotions
If you’re running a test during Christmas, your winning test might not be a winner by the time February comes. Again, this is another product of web data being non-stationary. The fix? If you have tests that win over the holidays, run repeat tests on them once the shopping season is over. Same thing with promotions.
Fact is, you’ve got to be aware of all the external factors that could affect your test. They definitely affect your test results, so when in doubt, run a follow-up test (or look into bandit tests for short promotions).
Learning the underlying A/B testing statistics allows you to avoid stupid mistakes. It’s worth learning the pertinent, practical information to inform your decisions.
As for the practical implications of the above, here are some testing heuristics:
- Test for full weeks.
- Test for two business cycles.
- Make sure your sample size is large enough (use a calculator before you start the test).
- Keep in mind confounding variables and external factors (holidays, etc.).
- Set a fixed horizon and sample size for your test before you run it.
- You can’t ‘see a trend’ – regression to the mean will occur. Wait until the test ends to call it.