Reality check: **the level of statistical literacy is pretty poor in CRO world.** A major portion of your test results are probably invalid.

While testing tools are getting more sophisticated, blogs are brimming with ‘inspiring’ case studies, and experimentation is becoming more and more common for marketers – statistics know-how is still severely lacking.

Stop being one of “those people”, and get your act together. It’s actually not that complicated. If you don’t know basic statistics, you won’t be able to tell whether your split tests suck.

## Why Do I Need to Know A/B Testing Statistics?

I know statistics isn’t necessarily a fun thing to learn. It’s probably a lot more fun to put up a test between a red and green button and wait until your testing tool tells you one of them has beat the other.

If this is your strategy, you’re ripe for disappointment.

This approach isn’t much better than guessing. Often it will end up with you testing things for a year yet ending up with the exact same conversion rate as when you started.

Statistics provide inference on your results, and they help you make practical business decisions. Lack of understanding of statistics can lead to errors and unreliable outcomes.

As Matt Gershoff from Conductrics said, quoting his college math professor, “*how can you make cheese if you don’t know where milk comes from?!*”

Here’s an analogy from Matt:

“So, for example, say I have two cups of coffee, and I want to know which one is hotter and by how much. First, I would measure the temperature of each coffee. Next, I would see which one has the highest temp. Finally, I would subtract the lower temp coffee from the higher to get the difference in temperature. Obvious and super simple.

Now, let’s say you want to ask, “which place in my town has the hotter coffee, McDonald’s or Starbucks?” Well, each place makes lots of cups of coffee, so I am going to have to compare a collection of cups of coffee. Any time we have to measure and compare collections of things, we need to use statistics.”

So it is with conversion rates. Conversion optimization is also a balancing act between exploration and exploitation. It’s about balancing risk, which is a fundamental problem solved by statistics. As Ton Wesseling from Testing.Agency put it:

“Within experimenting the trade off is always between exploration and exploitation. Do you want to know the real answer, or do you just want to make a profit?

If you are testing medicines that should stop someone from dying – and the A variant is not getting those medicines, when do you stop feeding Variant A with nothing when you see Variation B is working? During the test? After a couple of positive tests? How many people need to die before you decide to give everyone Variation B.

Of course running a business is less dramatic than the example, but still – you want to outgrow your competitors – so you want to learn, but not be fully 100% sure, because that will make you too slow – not adjusting rapid enough to user wishes.”

Knowing statistics will make you a better marketer. Clearing the air on some A/B testing statistics topics should hopefully help you increase your actual conversion rates and revenue.

## Building Blocks: Mean, Variance, and Sampling

There are three terms you should know before we dive into the nitty-gritty of A/B testing statistics:

- Mean
- Variance
- Sampling

### Mean

The mean is the average. For conversion rates, it is the number of events multiplied by the probability of success (n*p).

In our coffee example, this would be the process of measuring the temperature of each cup of coffee that we sample and dividing by the total number of cups, reaching an average temperate that is hopefully representative of the actual average.

In online experimentation, since we can’t know the “true” conversion rate, we’re measuring the mean conversion rate of each variation.

### Variance

Variance is the average variability of our data. The higher the variability, the less precise the mean will be as a predictor of any individual data point.

It’s basically, on average, how far off each of the individual cups of coffee in each collection is from the collection’s average temperature. In other words, how close will the mean be to each cup’s actual temperature? The smaller the variance, the better the mean will be as a guess for each cup’s temperature. Many things can cause variance (e.g. how long ago the coffee cup was made, who made it, how hot the water was, etc).

In terms of conversion optimization, Marketing Experiments gave a great example of variance in this blog post:

The two images above are the exact same – except the treatment came up with 15% more conversions. This is an A/A test.

A/A tests, which are often used to detect whether your testing software is working, are also used to detect the natural variability of a website. It splits traffic between two identical pages, and if you discover that there is a statistically significant lift on one of the variations, you need to investigate the cause.

### Sampling

Since we can’t measure ‘true conversion rate,’ we have to select a sample that is statistically representative of the whole.

In terms of our coffee measuring example, we don’t know the mean temperature of coffee from each restaurant. Therefore, we need to collect data on the temperature in order to estimate the average temperature. So, unlike comparing individual cups of coffee, we don’t measure ALL possible cups of coffee from McDonalds and Starbucks, we collect some of them and use inference to estimate the total.

The more cups we measure, the more likely it is that the sample is representative of the actual temperature. The variance shrinks with a larger sample size, and it’s more likely that our mean will be accurate.

Similarly, in conversion optimization, the larger the sample size, in general, the more accurate your test will be.

## Statistical Significance Is Not A Stopping Rule

Let’s start with the obvious question: what is statistical significance?

Evan Miller wrote a well-known blog post on the topic, which you should definitely read. As he explained:

“When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance?”

Statistical significance is a major quantifier in null-hypothesis statistical testing. Simply put, a low significance level means that there is a big chance that your ‘winner’ is not a real winner. Insignificant results carry a larger risk of false positives (type I errors).

Problem is, if you don’t predetermine a sample size for your test and a point at which your test will end, you’re likely getting subpar or completely inaccurate results.

That’s because most A/B testing tools do not wait for a fixed horizon (a set point in time) to call statistical significance. The test can and will oscillate between significant and insignificant at many points throughout the experiment:

That’s one of the big reasons we say that statistical significance is not a stopping rule. The biggest mistake beginning optimizers make is calling their tests early.

Here’s an example we’ve given before. Two days after the test started, here were the results:

Variation clearly lost, right? 0% chance to beat original seems pretty unambiguous? Not so fast. Statistically significant? Yes, but check out the results 10 days later:

That’s why you shouldn’t peek at results. The more you peek at the results, the more you risk alpha error inflation (read about it here). So set a sample size and a fixed horizon, and don’t stop the test until then.

Also know that when you read case studies that claim statistical significance yet don’t publish full numbers, you should be wary. Many of them may be statistically significant, yet only have a handful of conversions and a sample size of like 100.

## What the Heck is a P-Value?

If you do some follow up reading on statistical significance, you’ll likely come across the term ‘P-value.’ The P-Value is basically a measure of evidence against the null hypothesis (the control in A/B Testing parlance). Matt Gershoff gave a great example and explanation in a previous article:

“If we go out and offer 50 prospects our ‘Analytics200’ promotion, we would expect, on average, to have 5 conference signups. However, we wouldn’t really be that surprised if we saw a few less or a few more.

But what is a few? Would we be surprised if we saw 4? What about 10, or 25, or zero? It turns out that the P-Value answers the question, How surprising is this result?”

Formally, the p-value is the probability of seeing a particular result (or greater) from zero, assuming that the null hypothesis is true. If ‘null hypothesis is true’ is tricking you up, just think instead, ‘assuming we had really run an A/A’ Test.

If our test statistic is in the surprise region, we reject the Null (reject that it was really an A/A test). If the result is within the Not Surprising area, then we Fail to Reject the null. That’s it.

### What you actually need to know about P-Values

Remember this: P-value does not tell us the probability that B is better than A.

Similarly, it doesn’t tell us the probability that we will make a mistake in selective B over A. These are both extraordinarily commons misconceptions, but they are false.

Remember the p-value is just the probability of seeing a result or more extreme given that the null hypothesis is true. Or, “How surprising is that result?”

Small note: there’s a large debate in the scientific community about P-Values. Much of this comes from the controversial practice of ‘P-Hacking’ to manipulate the results of an experiment into reaching significance so the author can get published. To learn everything you need to know about P-Values, read this post my Matt Gershoff.

## Statistical Power: Detecting an Effect That Is Actually There

While statistical significance is the term you’ll hear most often, many people forget about statistical power. Where significance is the probability of seeing an effect when none exists, power is the probability of seeing an effect where it does actually exist.

So when you have low power levels, there is a big change that a real winner is not recognized. Evan Miller put together a great chart explaining the differences:

Effect Size FAQ’s summarizes it really well in plain English:

“Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down.”

So how do you calculate statistical power? You can read this post, which explains how to do so. Know that the four main factors that affect the power of any test of statistical significance are:

- the effect size
- the sample size (N)
- the alpha significance criterion (α)
- statistical power, or the chosen or implied beta (β)

However, for practical purposes, all you really need to know is that 80% power is the standard for testing tools. To reach such a level, you need either a large sample size, a large effect size, or a longer duration test.

Listen to the advice of Ton Wesseling of Testing.Agency here:

“You want to test as long as possible – at least 1 purchase cycle – the more data, the higher the Statistical Power of your test! More traffic means you have a higher chance of recognizing your winner on the significance level your testing on!

Because…small changes can make a big impact, but big impacts don’t happen too often – most of the times, your variation is slightly better – so you need much data to be able to notice a significant winner.”

Small note: if your test lasts too long you risk sample pollution. Read this post to learn more.

## Confidence Intervals and Margin of Errors

Next on our list of statistical jargon you should be aware of is confidence intervals. What are they? Confidence intervals are the amount of error allowed in A/B testing – the measure of the reliability of an estimate. Example from PRWD:

Of course, we can’t measure true conversion rate which is why we do experimentation. Since statistics is inferential, we use confidence intervals to mitigate the risk of sampling errors. In that sense, we’re managing the risk associated with implementing a new variation. So if your tool says something like, “We are 95% confident that the conversion rate is X% +/- Y%,” then you need to account for the +/- Y% as the margin of error.

One practical implication here is that you should watch if confidence intervals overlap. Here’s how Michael Aagaard put it:

“So, the conversion range can be described as the margin of error you’re willing to accept. The smaller the conversion range – the more accurate your results will be. As a rule of thumb – if the 2 conversion ranges overlap, you’ll need to keep testing in order to get a valid result.”

John Quarto-vonTivadar has a great visual explaining confidence intervals:

Confidence intervals shrink as you collect more data, but at a certain point they are subject to the law of diminishing returns.

Reading right to left, as we increase the size of our sample, our sampling error falls. However, it falls at a decreasing rate – which means that we get less and less information from each addition to our sample.

Now if you were to do further research on the subject, you might be confused by the interchangeability of the terms confidence interval and margin of error. For all practical purposes, here’s the difference: the confidence interval is what you see on your testing tool as ‘20% +/- 2%,’ and the margin of error is the ‘+/- 2%.’

Matt Gershoff gave an illustrative example:

“Say your buddy is coming to visit you from Round Rock and is taking TX-1 at 5pm. She wants to know how long it should take her. You say I have a 95% confidence that it will take you about 60 minutes plus or minus 20 minutes. So your margin of error is 20 minutes or 33%.

If she is coming at 11am you might say “it will take you 40 min, plus or minus 10 min,” so the margin of error is 10 minutes, or 25%. So while both are at the 95% confidence level, the margin of error is different.”

## Regression To The Mean

A common question one might have when first starting testing is, “what is the reason for the wild fluctuations at the beginning of the test?” Here’s what I mean:

What’s happening here is a regression to the mean. Essentially, a regression to the mean is defined as “the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.”

A great example comes from Wikipedia:

Imagine you give a class of students a 100-item true/false test on a subject. Suppose all the students choose their all their answers randomly. Then, each student’s score would be a realization of independent and identically distributed random variables, with an expected mean of 50. Of course, some students would score much above 50 and some much below.

So say you take only the top 10% of students and give them a second test where they, again, guess randomly on all questions. Since the mean would still be expected to be near 50, it’s expected that the students’ scores would regress to the mean – their scores would go down and be closer to the mean.

In A/B testing, it can happen for a variety of reasons. Here’s a great example from Andre Morys, excerpt from his talk at Conversion Jam earlier this year:

Essentially, if you’re calling a test early, based only on reaching significance, it’s possible you’re seeing a false positive. And it’s likely your ‘winner’ will regress to the mean.

Something related, that the internet always gets confused on, is called the novelty effect. That’s when the novelty of your changes (bigger blue button) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.

Adobe outlined a method to distinguish the difference between a novelty effect and actual inferiority:

To determine if the new offer underperforms because of a novelty effect or because it’s truly inferior, you can segment your visitors into new and returning visitors and compare the conversion rates. If it’s just the novelty effect, the new offer will win with new visitors. Eventually, as returning visitors get accustomed to the new changes, the offer will win with them, too.

## What You Need to Know About Segmenting

The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall results, B might beat A in certain segments (organic, Facebook, mobile, etc). For segments, the same stopping rules apply.

Make sure that you have enough sample size within the segment itself too (calculate it in advance, be wary if it’s less than 250-350 conversions PER variation within that one segment you’re looking at).

As Andre Morys from Web Arts said in a previous article, searching for lifts within segments that have no statistical validity is a big mistake:

“We just ran a test for a big fashion retailer in the middle of the summer sale phase. It was very interesting to see how the results dramatically dropped during the “hard sale phase” with 70% and more – but it recovered 1 week after the phase ended. We would never have learned this, if the test hadn’t run for nearly 4 weeks.

Our “rule of thumb” is this: 3000-4000 conversions per variation and 3-4 week test duration. That is enough traffic so we can even talk about valid data if we drill down into segments.

→ Testing “sin” no 1: searching for uplifts within segments although you have no statistical validity – e.g. 85 vs 97 conversions – that’s bullshit.”

You can learn a lot from segmenting your test data, but make sure you’re applying the same statistical rules to the smaller data sets.

**Confounding Variables and External Factors**

There’s a challenge with running A/B tests: the data is non-stationary.

In other words, a stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc) are constant over time. For many reasons, website data is non-stationary, which means we can’t make the same assumptions as with stationary data. Here are a few reasons data might fluctuate:

- Season
- Day of the week
- Holidays
- Press (positive or negative)
- PPC/SEM
- SEO
- Word-of-Mouth

There are many more, but here’s a practical example for you, and why it’s essential to test for full weeks.

**Test for Full Weeks**

Run a conversions per day of the week report on your site, see how much fluctuation there is:

You can see that Saturday’s conversion rate is much lower than Thursday’s. So if you started the test on a Friday and ended on a Sunday, you’d be skewing your results.

### Holidays and Promotions

If you’re running a test during Christmas, your winning test might not be a winner by the time February comes. Again, this is another product of web data being nonstationary. The fix? If you have tests that win over the holidays, run repeat tests on them once the shopping season is over. Same thing with promotions.

Fact is, you’ve got to be aware of all the the external factors that could affect your test. They definitely affect your test results, so when in doubt, run a follow-up test. (or look into bandit tests for short promotions.)

## Conclusion

Learning the underlying A/B testing statistics allows you to avoid stupid mistakes. It’s worth learning the pertinent, practical information to inform your decisions.

As for the practical implications of the above, here are some testing heuristics:

- Test for full weeks.
- Test for two business cycles.
- Make sure your sample size is large enough (use a calculator before you start the test).
- Keep in mind confounding variables and external factors (holidays, etc)
- Set a fixed horizon and sample size for your test before you run it.
- You can’t ‘see a trend,’ regression to mean will occur. Wait until the test ends to call it.

Hello Alex,

This reminds me of basic statistics in college. I can now see the applicability of the terms used back then ;)

Now, back to the topic at hand, statistical knowledge is important for predicting and making accurate decision online. If the right data are extracted and imposed into the right data then it becomes simpler to carryout proper A/B testing.

Every marketer should understand the basics of Testing so as to build on progressive changes in online marketing!

I left the above comment after reading this post in kingged.com

Great article Alex – a colleague linked it to me after I shared some sampling sizes for my A/B tests. I’d run an A/A test and with the first 200 leads, the “variation” was losing by -20%. After letting it run longer, at 2,000+ leads, it had evened out to a 0.59% difference. This helped clue me in that I might have been stopping some of my earlier tests too early and gave me a good mark to shoot for without going to the extreme of 4 weeks for a single test.

Love this, and so glad you guys wrote a clear write-up on these concepts.

It’s funny that confidence intervals and p-values have been set as the standard for reporting legit test results when so many research experts now openly state that p-values are like, the least interesting/useful thing to report in a study (the most interesting/useful thing being a properly calculated effect size).

I’ve often wondered: Do popular testing tools like Optimizely/VWO actually calculate effect size indices or are they just reporting the absolute difference between test means? Like, when a blog post about a recent test says something like “We got 120% increase in conversions with 99% significance”: is the 120% actually a legit calculation of effect size or just raw difference (and thus, in the latter case, bullshit)?

If anyone’s interested, another great jargon-free and comprehensive post on the statistical dos and don’ts of testing can be found here: http://www.optimizesmart.com/understanding-ab-testing-statistics-to-get-real-lift-in-conversions/

I won’t speak about Optimizely, but at VWO we calculate the effect size. (Disclaimer: I’m the statistician who helped build VWO SmartStats.)

In fact, we heavily de-emphasize the midpoint estimate in the UI in order to accurately communicate the uncertainty. Our test summary says something like: “V1 Blue Button has shown to have between a 0.4% and 6.7% better conversion rate than all other variants. Recommended action: Discard all other variants and go with V1 Blue Button.”

These numbers are calculated using Bayesian credible intervals at the 99% level.

Great post as always Alex!

Check out VWO’s SmartStats here – https://vwo.com/bayesian-ab-testing

Results and reports are much easier to understand. P value always used to confuse me too :)

Excellent post! I’ve been looking for something like this for a long time now.

Thanks, Jocelyn, glad you liked the article!