How to Run A/B Tests by Peep Laja

The last lesson was about what to test. Now we need to validate our hypotheses and learn. Pick a testing tool, and create treatments / alternative variations to test against the current page (control).

There’s no shortage of testing tools; one is even built into Google Analytics and is completely free. I use Optimizely and VWO the most, but there’s also Qubit, Adobe Target, Convert.com, and many others.

A thing to keep in mind is that you want to take testing seriously. You either need the help of a developer or you need to learn some HTML, CSS, and JavaScript/jQuery.

You can only use the visual editor if you’re making small changes, like tweaking the copy. For anything else, you’re risking your test failing due to cross-browser and cross-device compatibility issues.

Testing is no joke – you have to test right. Bad testing is even worse than no testing at all because you might be confident that solutions A, B and C work well when in reality, they hurt your business.

Poor A/B testing methodologies are costing online retailers up to $13 billion a year in lost revenue, according to research from Qubit. 
Don’t take this lightly!

I often hear of businesses that run 100 tests over a year, yet their conversion rate is where it was when they began. Why? Because they did it wrong. Most of their tests were false positives or false negatives. Massive waste of time, money and human potential.

There are 3 things you need to pay attention to when deciding when your test is done…

1. You need to make sure your sample size is big enough.

In order to be confident that the results of your test are actually valid, you need to know how big of a sample size you need.

There are several free sample size calculators available online for this.

You need a minimum number of observations for the right statistical power. Using the number you get from the sample size calculators as a ballpark is perfectly valid, but the test may not be as powerful as you had originally planned.

The only real danger is in stopping the test early after looking at preliminary results. There’s no penalty to having a larger sample size (except that it takes more time).

As a very rough ballpark, I typically recommend ignoring your test results until you have at least 350 conversions per variation (definitely more if you want to look at the results across segments).

But don’t make the mistake of thinking 350 is a magic number; it’s not. This is science, not magic. Always calculate the needed sample size ahead of time!
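If you want to see what those calculators are actually doing, the standard two-proportion power calculation is short enough to sketch yourself. A minimal Python version (the 3% baseline conversion rate and 20% relative lift below are made-up example numbers, not recommendations):

```python
from statistics import NormalDist

def sample_size_per_variation(p1, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed per variation to detect a given relative lift
    over baseline conversion rate p1, using a two-sided z-test."""
    p2 = p1 * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

# 3% baseline conversion rate, hoping to detect a 20% relative lift:
print(sample_size_per_variation(0.03, 0.20))  # roughly 14,000 per variation
```

Notice how quickly the required sample grows as the lift you want to detect shrinks — this is why "just run it for a few days" rarely works.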

Related Reading: Stopping A/B Tests: How Many Conversions Do I Need?


2. You need to test for multiple business cycles.

For some high-traffic sites, you would get the needed sample size in a day or two. But that is not a representative sample. It does not include a full business cycle, all weekdays, weekends, phases of the moon, traffic sources, your blog publishing and email newsletter schedule, and all other possible variables.

So for a valid test both conditions – an adequate sample size and a long enough period to account for all factors (a full business cycle or, better yet, two) – should be met. For most businesses, this is 2-4 weeks. Always run tests full weeks at a time (stop tests at the 7, 14, 21 or 28 day mark).

3. You need statistical significance.

When an A/B testing dashboard (e.g. Optimizely or a similar frequentist statistics tool) says there is a “95% chance of beating original”, it’s answering the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like the one in our data just by chance?

The answer to that question is called the significance level, and “statistically significant results” mean that the significance level is low (e.g. 5% or 1%). Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.

If the results are not statistically significant, the difference you observed might be caused by random factors, and there may be no real relationship between the changes you made and the test results (the assumption that no relationship exists is called the null hypothesis).

But don’t confuse statistical significance with validity. Once your testing tool says you’ve achieved 95% statistical significance (or higher), that doesn’t mean anything if you don’t have a big enough sample size. Achieving significance is not a test stopping rule. Read this blog post to learn why. It’s very, very important.
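For reference, the “chance of beating original” number in a frequentist dashboard boils down to a p-value from a two-proportion test, which you can compute yourself. A rough sketch using the pooled z-test (the conversion counts below are invented):

```python
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion
    rates, using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 400 conversions out of 10,000 (control) vs 460 out of 10,000 (variation):
p = p_value(400, 10000, 460, 10000)
print(f"p = {p:.3f}")  # "95% significance" means p < 0.05
```

The dashboard's “95% chance of beating original” is roughly the complement of this p-value — which is exactly why it says nothing about whether your sample is big enough.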

Consider this: One thousand A/A tests (two identical pages tested against each other) were run.

  • 771 experiments out of 1,000 reached 90% significance at some point.
  • 531 experiments out of 1,000 reached 95% significance at some point.

Quote from the experimenter:

“This means if you’ve run 1,000 experiments and didn’t control for repeat testing error in any way, a rate of successful positive experiments up to 25% might be explained by a false positive rate. But you’ll see a temporary significant effect in around half of your experiments!”

So if you stop your test as soon as you see significance, there’s a 50% chance it’s a complete fluke. A coin toss. Totally kills the idea of testing in the first place.
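You can reproduce this peeking problem yourself by simulating A/A tests and checking significance at interim peeks. A sketch of the idea (the traffic volume, conversion rate, and peek interval are all arbitrary):

```python
import random
from statistics import NormalDist

def aa_test_peeked_significant(n_visitors=10_000, rate=0.05,
                               check_every=500, alpha=0.05):
    """Simulate one A/A test (two identical variants) and return True
    if a running two-proportion z-test dips below alpha at any peek."""
    conv, n = [0, 0], [0, 0]
    for i in range(n_visitors):
        arm = i % 2                         # alternate visitors between arms
        n[arm] += 1
        conv[arm] += random.random() < rate
        if i > 0 and i % check_every == 0 and min(conv) > 0:
            p_pool = sum(conv) / sum(n)
            se = (p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1])) ** 0.5
            z = (conv[0] / n[0] - conv[1] / n[1]) / se
            if 2 * (1 - NormalDist().cdf(abs(z))) < alpha:
                return True                 # a "winner" that is pure noise
    return False

random.seed(1)
flukes = sum(aa_test_peeked_significant() for _ in range(200))
print(f"{flukes}/200 identical-page tests hit 95% significance at some peek")
```

Even though both pages are identical, far more than 5% of runs look “significant” at some point — the more often you peek, the more chances noise gets to cross the line.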

Always make sure that when you end your test, you have:

  • a big enough sample size (pre-calculated).
  • a long enough test duration (~2 business cycles).
  • statistical significance (95% or higher).

Until the first two criteria are met, statistical significance means little.

Run Separate Tests for Your Desktop and Mobile Segments

While running A/B tests on all your traffic at once might seem like a good idea (to get a bigger sample size faster), in reality, it’s not. You need to target mobile and desktop audiences separately. (Note: You can combine tablet with desktop.)

Here are 5 reasons why:

  1. Different things work. What works for mobile might not work for desktop (and vice versa).
  2. Your desktop and mobile traffic volumes are different. So while your desktop segment might have a big enough sample size, you can’t stop the test because the mobile segment needs more samples.
  3. Not all mobile traffic is equal. People on different devices / mobile operating systems behave differently.
  4. You might want to optimize for different outcomes (e.g. purchases for desktop, but email captures for mobile).
  5. You can create more tests faster. If you create tests targeting only a single device category, it will take less development and quality assurance time per test, hence you’re able to launch tests faster.

Read more about this here.

What If I Have a Low Traffic Website?

Many sites have low traffic and a low total monthly transaction count. So in order to call a test within 4 weeks (you shouldn’t run tests longer than that, or you get sample pollution), you need a big lift.

If you have bigger wins (e.g. +50%), you can definitely get by with smaller sample sizes. But it would be naive to think that smaller sites can somehow get bigger wins more easily than large sites can. Everyone wants big wins. So saying “I’m going to swing big” is quite meaningless.

The only true tidbit here is that in order to get a more radical lift, you also need to test a more radical change. You can’t expect a large win when you just change the call to action. Conduct conversion research, identify problems and issues with your website, and test all those changes at once. Your chances of a higher lift go up.
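To see why low-traffic sites need radical changes, you can invert the power calculation and ask what lift is even detectable with the traffic you have. A rough normal-approximation sketch (the 2% baseline and 2,500 visitors per arm are hypothetical):

```python
from statistics import NormalDist

def detectable_relative_lift(p1, n_per_variation, alpha=0.05, power=0.8):
    """Smallest relative lift detectable with n_per_variation visitors
    per arm, approximating both arms' variance at the baseline rate."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    abs_diff = z * (2 * p1 * (1 - p1) / n_per_variation) ** 0.5
    return abs_diff / p1

# 2% baseline conversion rate, 2,500 visitors per variation in 4 weeks:
print(f"~{detectable_relative_lift(0.02, 2500):.0%} relative lift needed")
```

With only 2,500 visitors per variation, anything much below a ~50% relative lift is statistically invisible — which is exactly why small tweaks are a waste of time on low-traffic sites, and why you test bundles of research-backed changes instead.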

Also, keep in mind: Testing is not a must-have, mandatory component of optimization. You can also improve without testing.

No Substitute for Experience

Start running tests now.

There’s quite a bit to know about all this, but the content above will make you smarter than most about running tests.

Takeaways

  1. Calculate your sample size before you begin testing. Don’t stop your test until that sample size is reached and it’s been at least one full business cycle, but preferably two.
  2. You want to achieve at least 95% significance, but once you see significance has been reached, you can’t just stop your test. Leave it running until the conditions above are met.
  3. Segment your traffic. Test desktop / tablet traffic separately from mobile traffic to account for variations in volume, intent, compatibility, etc.