How to Run A/B Tests by Peep Laja

The last lesson was about what to test. Now we need to validate our hypotheses and learn. Pick a testing tool, and create treatments / alternative variations to test against the current page (control).

There’s no shortage of testing tools; one is even built into Google Analytics and is completely free. I use Optimizely and VWO the most, but there’s also Qubit, Adobe Target, Convert.com, and many others.

A thing to keep in mind is that you want to take testing seriously. You either need the help of a developer or you need to learn some HTML, CSS, and JavaScript/jQuery.

You can only use the visual editor if you’re making small changes, like tweaking the copy. For anything else, you’re risking your test failing due to cross-browser and cross-device compatibility issues.

Testing is no joke – you have to test right. Bad testing is even worse than no testing at all because you might be confident that solutions A, B and C work well when in reality, they hurt your business.

Poor A/B testing methodologies are costing online retailers up to $13 billion a year in lost revenue, according to research from Qubit. 
Don’t take this lightly!

I often hear of businesses that run 100 tests over a year, yet their conversion rate is where it was when they began. Why? Because they did it wrong. Most of their tests were false positives or false negatives. Massive waste of time, money and human potential.

There are 3 things you need to pay attention to when deciding when your test is done…

1. You need to make sure your sample size is big enough.

In order to be confident that the results of your test are actually valid, you need to know how big of a sample size you need.

There are several free sample size calculators available online for this.

You need a minimum number of observations for the right statistical power. Using the number you get from the sample size calculators as a ballpark is perfectly valid, but the test may not be as powerful as you had originally planned.

The only real danger is in stopping the test early after looking at preliminary results. There’s no penalty to having a larger sample size (except that it takes more time).

As a very rough ballpark, I typically recommend ignoring your test results until you have at least 350 conversions per variation (definitely more if you want to look at the results across segments).

But don’t make the mistake of thinking 350 is a magic number; it’s not. This is science, not magic. Always calculate the needed sample size ahead of time!
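If you want to see what those calculators are actually doing, the standard two-proportion power calculation is short enough to sketch yourself. A minimal Python version (the 3% baseline conversion rate and 20% relative lift below are made-up example numbers, not recommendations):

```python
from statistics import NormalDist

def sample_size_per_variation(p1, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed per variation to detect a given relative lift
    over baseline conversion rate p1, using a two-sided z-test."""
    p2 = p1 * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

# 3% baseline conversion rate, hoping to detect a 20% relative lift:
print(sample_size_per_variation(0.03, 0.20))  # roughly 14,000 per variation
```

Notice how quickly the required sample grows as the lift you want to detect shrinks — this is why "just run it for a few days" rarely works.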

Related Reading: Stopping A/B Tests: How Many Conversions Do I Need?


2. You need to test for multiple business cycles.

For some high-traffic sites, you would get the needed sample size in a day or two. But that is not a representative sample. It does not include a full business cycle, all weekdays, weekends, phases of the moon, traffic sources, your blog publishing and email newsletter schedule, and all other possible variables.

So for a valid test both conditions – an adequate sample size and a long enough period to account for all factors (a full business cycle or, better yet, two) – should be met. For most businesses, this is 2-4 weeks. Always run tests full weeks at a time (stop tests at the 7, 14, 21 or 28 day mark).

3. You need statistical significance.

When an A/B testing dashboard (e.g. Optimizely or a similar frequentist statistics tool) says there is a “95% chance of beating original”, it’s answering the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like the one in our data just by chance?

The answer to that question is called the significance level, and “statistically significant results” mean that the significance level is low (e.g. 5% or 1%). Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.

If the results are not statistically significant, the difference you observed might be caused by random factors, and there may be no real relationship between the changes you made and the test results (the assumption that no relationship exists is called the null hypothesis).

But don’t confuse statistical significance with validity. Once your testing tool says you’ve achieved 95% statistical significance (or higher), that doesn’t mean anything if you don’t have a big enough sample size. Achieving significance is not a test stopping rule. Read this blog post to learn why. It’s very, very important.
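For reference, the “chance of beating original” number in a frequentist dashboard boils down to a p-value from a two-proportion test, which you can compute yourself. A rough sketch using the pooled z-test (the conversion counts below are invented):

```python
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion
    rates, using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 400 conversions out of 10,000 (control) vs 460 out of 10,000 (variation):
p = p_value(400, 10000, 460, 10000)
print(f"p = {p:.3f}")  # "95% significance" means p < 0.05
```

The dashboard's “95% chance of beating original” is roughly the complement of this p-value — which is exactly why it says nothing about whether your sample is big enough.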

Consider this: One thousand A/A tests (two identical pages tested against each other) were run.

  • 771 experiments out of 1,000 reached 90% significance at some point.
  • 531 experiments out of 1,000 reached 95% significance at some point.

Quote from the experimenter:

“This means if you’ve run 1,000 experiments and didn’t control for repeat testing error in any way, a rate of successful positive experiments up to 25% might be explained by a false positive rate. But you’ll see a temporary significant effect in around half of your experiments!”

So if you stop your test as soon as you see significance, there’s a 50% chance it’s a complete fluke. A coin toss. Totally kills the idea of testing in the first place.
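You can reproduce this peeking problem yourself by simulating A/A tests and checking significance at interim peeks. A sketch of the idea (the traffic volume, conversion rate, and peek interval are all arbitrary):

```python
import random
from statistics import NormalDist

def aa_test_peeked_significant(n_visitors=10_000, rate=0.05,
                               check_every=500, alpha=0.05):
    """Simulate one A/A test (two identical variants) and return True
    if a running two-proportion z-test dips below alpha at any peek."""
    conv, n = [0, 0], [0, 0]
    for i in range(n_visitors):
        arm = i % 2                         # alternate visitors between arms
        n[arm] += 1
        conv[arm] += random.random() < rate
        if i > 0 and i % check_every == 0 and min(conv) > 0:
            p_pool = sum(conv) / sum(n)
            se = (p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1])) ** 0.5
            z = (conv[0] / n[0] - conv[1] / n[1]) / se
            if 2 * (1 - NormalDist().cdf(abs(z))) < alpha:
                return True                 # a "winner" that is pure noise
    return False

random.seed(1)
flukes = sum(aa_test_peeked_significant() for _ in range(200))
print(f"{flukes}/200 identical-page tests hit 95% significance at some peek")
```

Even though both pages are identical, far more than 5% of runs look “significant” at some point — the more often you peek, the more chances noise gets to cross the line.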

Always make sure that when you end your test, you have:

  • a big enough sample size (pre-calculated).
  • a long enough test duration (~2 business cycles).
  • statistical significance (95% or higher).

Until the first two criteria are met, statistical significance means little.

Run Separate Tests for Your Desktop and Mobile Segments

While running A/B tests on all your traffic at once might seem like a good idea (to get a bigger sample size faster), in reality, it’s not. You need to target mobile and desktop audiences separately. (Note: You can combine tablet with desktop.)

Here are 5 reasons why:

  1. Different things work. What works for mobile might not work for desktop (and vice versa).
  2. Your desktop and mobile traffic volumes are different. So while your desktop segment might have a big enough sample size, you can’t stop the test because the mobile segment needs more samples.
  3. Not all mobile traffic is equal. People on different devices / mobile operating systems behave differently.
  4. You might want to optimize for different outcomes (e.g. purchases for desktop, but email captures for mobile).
  5. You can create more tests faster. If you create tests targeting only a single device category, it will take less development and quality assurance time per test, hence you’re able to launch tests faster.

Read more about this here.

What If I Have a Low Traffic Website?

Many sites have low traffic and a low total monthly transaction count. So in order to call a test within 4 weeks (you shouldn’t run tests longer than that, or you get sample pollution), you need a big lift.

If you have bigger wins (e.g. +50%), you can definitely get by with smaller sample sizes. But it would be naive to think that smaller sites can somehow get bigger wins more easily than large sites can. Everyone wants big wins. So saying “I’m going to swing big” is quite meaningless.

The only true tidbit here is that in order to get a more radical lift, you also need to test a more radical change. You can’t expect a large win when you just change the call to action. Conduct conversion research, identify problems and issues with your website, and test all those changes at once. Your chances of a higher lift go up.
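To see why low-traffic sites need radical changes, you can invert the power calculation and ask what lift is even detectable with the traffic you have. A rough normal-approximation sketch (the 2% baseline and 2,500 visitors per arm are hypothetical):

```python
from statistics import NormalDist

def detectable_relative_lift(p1, n_per_variation, alpha=0.05, power=0.8):
    """Smallest relative lift detectable with n_per_variation visitors
    per arm, approximating both arms' variance at the baseline rate."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    abs_diff = z * (2 * p1 * (1 - p1) / n_per_variation) ** 0.5
    return abs_diff / p1

# 2% baseline conversion rate, 2,500 visitors per variation in 4 weeks:
print(f"~{detectable_relative_lift(0.02, 2500):.0%} relative lift needed")
```

With only 2,500 visitors per variation, anything much below a ~50% relative lift is statistically invisible — which is exactly why small tweaks are a waste of time on low-traffic sites, and why you test bundles of research-backed changes instead.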

Also, keep in mind: Testing is not a must-have, mandatory component of optimization. You can also improve without testing.

No Substitute for Experience

Start running tests now.

There’s quite a bit to know about all this, but the content above will make you smarter than most about running tests.

Takeaways

  1. Calculate your sample size before you begin testing. Don’t stop your test until that sample size is reached and it’s been at least one full business cycle, but preferably two.
  2. You want to achieve at least 95% significance, but once you see significance has been reached, you can’t just stop your test. Leave it running until the conditions above are met.
  3. Segment your traffic. Test desktop / tablet traffic separately from mobile traffic to account for variations in volume, intent, compatibility, etc.