12 A/B Split Testing Mistakes I See Businesses Make All The Time

A/B testing is fun. With so many easy-to-use tools around, anyone can (and should) do it. However, there’s actually more to it than just setting up a test. Tons of companies are wasting their time and money by making these 12 mistakes.

Here are the top mistakes I see again and again. Are you guilty of making these mistakes? Read and find out.

#1: A/B tests are called early

Statistical significance is what tells you whether version A is actually better than version B – provided the sample size is large enough. 50% statistical significance is a coin toss. If you’re calling tests at 50%, you should change your profession. And no, 75% statistical confidence is not good enough either.

Any seasoned tester has had plenty of experiences where a “winning” variation at 80% confidence ends up losing badly once you give it a chance (read: more traffic).

What about 90%? Come on, that’s pretty good!

Nope. Not good enough. You’re performing a science experiment here. Yes, you want it to be true. You want that 90% to win, but more important than having a “declared winner” is getting to the truth.




As an optimizer, your job is to figure out the truth. You have to put your ego aside. It’s very human to get attached to your hypothesis or design treatment, and it can hurt when your best hypotheses end up not being significantly different. Been there, done that. Truth above all, or it all loses meaning.

A very common scenario, even for companies that test a lot: they run one test after another for 12 months and have many tests they declare as winners and roll out. A year later the conversion rate of their site is the same as it was when they started. Happens all the damn time.

Why? Because tests are called too early and/or sample sizes are too small. You should not call tests before you’ve reached 95% confidence or higher. 95% means there’s only a 5% chance that the results are a complete fluke. A/B split testing tools like Optimizely and VWO both tend to call tests too early: their minimum sample sizes are way too small.

Here’s what Optimizely tells you: a sample size of 100 visitors per variation is not enough. Optimizely leads many people to call tests early and doesn’t have a setting where you can change the minimum sample size needed before declaring a winner.

VWO has a sample size feature, but their default is incredibly low. You can configure it in the test settings:

Conspiracy theorists say VWO and Optimizely do it on purpose to generate excitement about testing so users keep on paying them. I’m not sure that’s true, but they really should stop calling tests early. Here’s an example I’ve used before. Two days after starting a test, these were the results:
The variation I built was losing badly – by more than 89% (and no overlap in the margin of error). Some tools would already call it and say statistical significance was 100%. The software I used said Variation 1 had a 0% chance to beat the Control. My client was ready to call it quits.

However, since the sample size was still too small (only a little over 100 visits per variation), I persisted, and this is what it looked like 10 days later: that’s right, the variation that had a 0% chance of beating the control was now winning with 95% confidence.

Watch out for A/B testing tools “calling it early” and always double check the numbers. The worst thing you can do is have confidence in data that’s actually inaccurate. That’s going to lose you money and quite possibly waste months of work.
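
If you want to double-check what a tool is telling you, the underlying math is not complicated. Below is a minimal sketch (Python, made-up numbers) of the standard two-proportion z-test run on your raw visitor and conversion counts. It’s a sanity check, not a replacement for your tool’s statistics engine.

```python
# A minimal sketch of a two-tailed two-proportion z-test for sanity-checking
# the confidence level a testing tool reports. All numbers below are made up.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return z-score and two-tailed p-value for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # from the standard normal CDF
    return z, p_value

# 100 visitors per variation -- far too few, even with a big observed difference:
z, p = two_proportion_z_test(conv_a=10, n_a=100, conv_b=16, n_b=100)
print(f"z = {z:.2f}, p = {p:.3f}")   # p is well above 0.05, so nowhere near 95% confidence
```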

How big of a sample size do I need?

You don’t want to make conclusions based on a small sample size. A good ballpark is to aim for at least 350-400 conversions per variation (can be less in certain circumstances – like when the discrepancy between control and treatment is very large). BUT – magic numbers don’t exist. Don’t get stuck with a number – this is science, not magic.

You NEED TO calculate the required sample size ahead of time, using a sample size calculator like this one or something similar. This is a pretty useful tool for understanding the relationship between uplift percentages and needed sample sizes: http://www.testsignificance.com.
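
If you’d rather see the math than treat the calculator as a black box, this is roughly the standard formula for comparing two proportions (most calculators use something close to it). A minimal sketch with illustrative assumptions: a 3% baseline conversion rate, a hoped-for 20% relative lift, 95% confidence, 80% power.

```python
# A minimal sketch of the standard sample-size formula for a two-proportion test.
# The z-values below correspond to a two-tailed alpha of 0.05 and 80% power.
from math import ceil

def sample_size_per_variation(baseline_cr, relative_lift):
    """Visitors needed per variation to detect the given relative lift."""
    z_alpha = 1.96                                  # 95% confidence, two-tailed
    z_beta = 0.84                                   # 80% power
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(sample_size_per_variation(0.03, 0.20))        # roughly 14,000 visitors per variation
```

Note how quickly the number climbs as the expected lift shrinks: halve the lift you hope to detect and the required sample size roughly quadruples.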

What if I have 350 conversions per variation, and confidence is still not 95% (or higher)?

If the needed sample size has been reached, it means there is no significant difference between the variations. Check the test results across segments to see if significance was achieved in one segment or another (great insights always lie in the segments – but you also need a big enough sample size for each segment). In any case, you need to improve your hypothesis and run a new test.

#2: Tests are not run for full weeks

Let’s say you have a high traffic site. You achieve 98% confidence and 250 conversions per variation in 3 days. Is the test done? Nope.

We need to rule out seasonality and test for full weeks. Did you start the test on Monday? Then you need to end it on a Monday as well. Why? Because your conversion rate can vary greatly depending on the day of the week.

So if you don’t test a full week at a time, you’re again skewing your results. Run a conversions-per-day-of-the-week report on your site and see how much fluctuation there is. Here’s an example. What do you see? Thursdays make 2x more money than Saturdays and Sundays, and the conversion rate on Thursdays is almost 2x better than on a Saturday.
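
If your analytics tool doesn’t give you that report out of the box, it’s easy to build from a daily export. A minimal sketch, assuming a hypothetical CSV with date, sessions, transactions, and revenue columns:

```python
# A minimal sketch of a day-of-week fluctuation report from a daily metrics export.
# The file name and column names are hypothetical -- adapt them to your own export.
import pandas as pd

df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])
df["day_of_week"] = df["date"].dt.day_name()

report = df.groupby("day_of_week").agg(
    sessions=("sessions", "sum"),
    transactions=("transactions", "sum"),
    revenue=("revenue", "sum"),
)
report["conversion_rate"] = report["transactions"] / report["sessions"]
print(report.sort_values("conversion_rate", ascending=False))
```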

If we didn’t test for full weeks, the results would be inaccurate. So this is what you must always do: run tests for 7 days at a time. If confidence is not achieved within the first 7 days, run it another 7 days. If it’s not achieved within 14 days, run it another 7 days.

Of course, you first need to run your tests for a minimum of 2 weeks anyway (my personal minimum is 4 weeks, since 2 weeks is often inaccurate), and then apply the 7-day rule.

The only time when you can break this rule is when your historical data says with confidence that every single day the conversion rate is the same. But it’s better to test 1 week at a time even then.

Always pay attention to external factors

Is it Christmas? Your winning test during the holidays might not be a winner in January. If you have tests that win during shopping seasons like Christmas, you definitely want to run repeat tests on them once the shopping season is over. Are you doing a lot of TV advertising or running other massive campaigns? That may also skew your results. You need to be aware of what your company is doing.

External factors definitely affect your test results. When in doubt, run a follow-up test.

#3: A/B split testing is done without enough traffic (or conversions)

If you make 1 or 2 sales per month and run a test where B converts 15% better than A, how would you know? Nothing changes!

I love A/B split testing as much as the next guy, but it’s not something you should use for conversion optimization when you have very little traffic. The reason is that even if version B is much better, it might take many months to achieve statistical significance.

So if your test took 5 months to run, you wasted a lot of money. Instead, you should go for massive, radical changes – and just switch to B. No testing, just switch – and watch your bank account. The idea here is that you’re going for massive lifts – like 50% or 100%. And you should notice that kind of an impact on your bank account (or in the number of incoming leads) right away. Time is money. Don’t waste time waiting for a test result that takes many months.
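
A quick back-of-the-envelope calculation makes the point. The numbers below are illustrative assumptions, not a benchmark:

```python
# A rough sketch of how long one A/B test would take on a low-traffic site.
# All numbers here are illustrative assumptions.
monthly_visitors = 2000
needed_per_variation = 14000          # from a sample size calculator (see mistake #1)

months = (needed_per_variation * 2) / monthly_visitors
print(f"~{months:.0f} months for a single A/B test")   # ~14 months -- just ship the radical change instead
```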

#4: Tests are not based on a hypothesis

I like spaghetti. But spaghetti testing (throw it against the wall, see if it sticks) not so much. It’s when you test random ideas just to see what works. Testing random ideas comes at a huge expense—you’re wasting precious time and traffic. Never do that. You need to have a hypothesis. What’s a hypothesis?

A hypothesis is a proposed statement made on the basis of limited evidence that can be proved or disproved and is used as a starting point for further investigation.

And this shouldn’t be a spaghetti hypothesis either (a random statement dressed up as one). You need to do proper conversion research to discover where the problems lie, analyze why they’re happening, and ultimately come up with a hypothesis for overcoming them.

If you test A vs. B without a clear hypothesis and B wins by 15%, that’s nice, but what have you learned? Nothing. The lift is not the point – what matters even more is what you learned about your audience. That’s what helps you improve your customer theory and come up with even better tests.

#5: Test data is not sent to Google Analytics

Averages lie, always remember that. If A beats B by 10%, that’s not the full picture. You need to segment the test data, that’s where the insights lie.

While Optimizely has some built-in segmentation of results, it’s still no match for what you can do within Google Analytics. You need to send your test data to Google Analytics and segment it there. If you use Visual Website Optimizer, they have a nice global setting for tests, so the integration is automatically turned on for each test you run.

Set it and forget it. Optimizely, for whatever stupid reason, makes you suffer: you have to switch on the integration for each test separately.

They should know that people are not robots and sometimes forget. Guys, please make a global setting for it. So what happens here is that they send the test info into Google Analytics as custom variables. You can run advanced segments and custom reports on it. It’s super useful, and it’s how you can actually learn from A/B tests (including losing and no-difference tests).


But Monetate – which should be a class above the other two services, since it costs way more – is not even able to send custom reports. Ridiculous, I know. They can only send test data as events. So in order to get more useful data, create an advanced segment for each variation based on the event label. Then you can check whatever metrics you want in GA with the segment for each variation applied. Bottom line: always send your test data to Google Analytics. And segment the crap out of the results.
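
However the data gets into your reports, the segmentation step itself is simple. A minimal sketch, assuming a hypothetical per-session export with a variation column, a converted flag, and a few segment columns:

```python
# A minimal sketch of segmenting exported test data. The CSV layout is hypothetical:
# one row per session, with columns: variation, device, user_type, source, converted (0/1).
import pandas as pd

df = pd.read_csv("experiment_sessions.csv")

# The headline numbers -- the averages that "lie":
print(df.groupby("variation")["converted"].mean())

# The insights live one level deeper: conversion rate per variation, per segment.
for segment in ["device", "user_type", "source"]:
    print(df.groupby([segment, "variation"])["converted"].mean().unstack())
```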

#6: Precious time and traffic are wasted on stupid tests

So you’re testing colors, huh? Stop.

There is no best color; it’s always about visual hierarchy. Sure, you can find tests online where somebody found gains by testing colors, but they’re all no-brainers. Don’t waste time testing no-brainers – just implement. You don’t have enough traffic; nobody does. Use your traffic on high-impact stuff. Test data-driven hypotheses.

#7: They give up after the first test fails

You set up a test, and it failed to produce a lift. Oh well. Let’s try running tests on another page?

Not so fast! Most first tests fail. It’s true. I know you’re impatient – so am I – but the truth is that iterative testing is where it’s at. You run a test, learn from it, and improve your customer theory and hypotheses. Run a follow-up test, learn from it, and improve your hypotheses. Run a follow-up test, and so on.

Here’s a case study where it took 6 tests (testing the same page) to achieve the kind of lift we were happy with. That’s what real testing life is like. People who approve testing budgets—your bosses, your clients—need to know this.

If the expectation is that the first test will knock it out of the park, money will get wasted and people will get fired. It doesn’t have to be that way. It can be lots of money for everyone instead. Just run iterative tests. That’s where the money is.

#8: They don’t understand false positives

Statistical significance is not the only thing to pay attention to. You need to understand false positives too. Impatient testers will want to skip A/B testing, and move on to A/B/C/D/E/F/G/H testing. Yeah, now we’re talking!

Or why stop there – Google tested 41 shades of blue! But that’s not a good idea. The more variations you test against each other, the higher the chance of a false positive. In the case of 41 shades of blue, even at a 95% confidence level the chance of a false positive is 88%.
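
That 88% figure is just error compounding across comparisons. A minimal sketch of the math, under the simplifying assumption that the comparisons are independent:

```python
# Probability of at least one false positive across many independent comparisons,
# each run at the same significance level (alpha = 0.05 means 95% confidence).
def family_wise_false_positive_rate(num_comparisons, alpha=0.05):
    return 1 - (1 - alpha) ** num_comparisons

for k in (1, 5, 10, 41):
    print(f"{k:>2} comparisons: {family_wise_false_positive_rate(k):.0%} chance of a false positive")
# 41 comparisons at 95% confidence -> roughly 88%, which is where the figure above comes from
```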

Watch this video, you’ll learn a thing or three:

Main takeaway: don’t test too many variations at once. It’s better to do simple A/B testing anyway – you’ll get results faster and learn faster, improving your hypotheses sooner.

#9: They’re running multiple tests at the same time with overlapping traffic

You found a way to cut corners by running multiple tests at the same time. One on the product page, one on the cart page, one on the home page (while measuring the same goal). Saving time, right?

This may skew the results if you’re not careful. It’s actually likely to be fine unless you suspect strong interactions between tests and there’s a large overlap of traffic between them. Things get trickier when interactions and traffic overlap are both likely.

If you want to test a new version of several layouts in the same flow at once—for instance running tests on all 3 steps of your checkout—you might be better off using multi-page experiments or MVT to measure interactions, and do attribution properly.

If you decide to run A/B tests with overlapping traffic, keep in mind even distribution. Traffic should be split evenly, always. If you test product page A vs B, and checkout page C vs D, you need to make sure that traffic from B is split 50/50 between C and D (e.g. as opposed to 25/75).
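
One simple way to keep overlapping tests evenly and independently split is to hash the visitor ID together with the test name, so each test effectively gets its own 50/50 coin flip per visitor. A minimal sketch – the function and IDs here are hypothetical, not any particular tool’s API:

```python
# A minimal sketch of deterministic, per-test bucketing. Because each test hashes
# the visitor ID with its own test name, assignments are independent across tests:
# visitors who see product-page variation B still split evenly between C and D.
import hashlib

def assign(visitor_id: str, test_name: str, variations=("A", "B")) -> str:
    digest = hashlib.md5(f"{visitor_id}:{test_name}".encode()).hexdigest()
    return variations[int(digest, 16) % len(variations)]

visitor = "visitor-123"
print(assign(visitor, "product_page", ("A", "B")))
print(assign(visitor, "checkout_page", ("C", "D")))   # independent of the product-page bucket
```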

#10: They’re ignoring small gains

Your treatment beat the control by 4%. “Bah, that’s way too small of a gain! I won’t even bother to implement it,” I’ve heard people say.

Here’s the thing. If your site is pretty good, you’re not going to get massive lifts all the time. In fact, massive lifts are very rare. If your site is crap, it’s easy to run tests that get a 50% lift all the time. But even that will run out.

Most winning tests are going to give small gains – 1%, 5%, 8%. Sometimes, a 1% lift can result in millions of dollars in revenue. It all depends on the absolute numbers we’re dealing with. But the main point is this: you need to look at it from a 12-month perspective.

One test is just one test. You’re going to do many, many tests. If you increase your conversion rate 5% each month, that’s going to be an 80% lift over 12 months. That’s compounding interest. That’s just how the math works. 80% is a lot.
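
Here’s the arithmetic behind that claim, if you want to check it:

```python
# Compounding monthly lifts over a year: 5% per month is ~80% per year.
monthly_lift = 0.05
annual_lift = (1 + monthly_lift) ** 12 - 1
print(f"{annual_lift:.0%}")   # -> 80%
```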

So keep getting those small wins. It will all add up in the end.

#11: They’re not running tests at all times

Every single day without a test is a wasted day. Testing is learning – learning about your audience, learning what works and why. All the insight you get can be used across your marketing, like PPC ads and whatnot.

You don’t know what works until you test it. Tests need time and traffic (lots of it).

Having one test up and running at all times doesn’t mean you should put up garbage tests. Absolutely not. You still need to do proper research, have a proper hypothesis and so on.

Have a test going all the time. Learn how to create winning A/B testing plans. Never stop optimizing.

#12: Not being aware of validity threats

Just because you have a decent sample size, confidence level, and test duration doesn’t mean that your test results are actually valid. There are several threats to the validity of your test.

Instrumentation effect

This is the most common issue. It’s when something happens with the testing tools (or instruments) that causes flawed data in the test.

It’s often due to wrong code implementation on the website, and will skew all of the results. You’ve got to really watch for this. When you set up a test, watch it like a hawk. Observe that every single goal and metric that you track is being recorded. If some metric is not sending data (e.g. add to cart click data), stop the test, find and fix the problem, and start over by resetting the data.

History effect

Something happens in the outside world that causes flawed data in the test. This could be a scandal about your business or an executive working there, a special holiday season (Christmas, Mother’s Day, etc.), or a media story that biases people against a variation in your test. Pay attention to what is happening in the external world.

Selection effect

This occurs when we wrongly assume some portion of the traffic represents the totality of the traffic. Example: you send promotional traffic from your email list to a page that you’re running a test on. People who subscribe to your list like you way more than your average visitor. So now you optimize the page (e.g. landing page, product page etc) to work with your loyal traffic, thinking they represent the total traffic. But that’s rarely the case!

Broken code effect

One of the variations has a bug that causes flawed data in the test. You create a treatment and make it live! However, it doesn’t win, or there’s no difference. What you don’t know is that your treatment displayed poorly on some browsers and/or devices. Whenever you create a new treatment or two, make sure you conduct quality assurance testing on them so they display properly in all browsers and on all devices.


Today there are so many great tools available that make testing easy, but they don’t do the thinking for you. I understand statistics was not your favorite subject in college, but it’s time to brush up. Learn from these 12 mistakes so you can avoid them, and start making real progress with testing.


Join the Conversation

  1. Hi, great post as always.

    I have a question… have you had any experience with unbounce?

    1. Peep Laja

      I do, but very little. I haven’t used it for testing.

      The reason is that Unbounce is for technically challenged people – people who don’t know or can’t touch “code”. People who want to bypass IT departments. I don’t have that problem.

    2. Hey Mario, I’ve been optimizing sites since 2005 and I’ve had great experiences with Unbounce for split testing (both for my own businesses and my clients’). Of course Unbounce is a no-IT-team-needed, hosted landing page solution as Peep points out (I’m not affiliated in any way), but their split testing capabilities are pretty robust — and built specifically for single landing pages. Check ’em out!

  2. What an article! Learned tons. Thank you.

  3. A very solid list of ‘gotchas’, Peep — thanks.

    To your Mistake #2 (not letting tests run for weeks)… sometimes it’s important to let a test run for a month or longer… particularly in the case where you have seasonality in your business. When I led the optimization program for Intuit (Turbotax, Quickbooks), our tax software business had massive seasonality for the US market in April.

    We saw our conversion rate go from ~8% in November to ~45% in April, and we found that tests which produced 10-20% lift in November ended up seeing NO lift at all in April – even though the page content (and test variations) hadn’t changed by a single pixel.

    Why? Because tax filers’ motivation is drastically different between November and April… and the more motivated the visitor, the less impact our persuasion techniques had on them. In fact, by April 10th or so, tax filers couldn’t care less what kind of messaging we provided on the website… they just wanted to get their taxes done to avoid penalties. :-)

    1. Peep Laja

      Thanks for chiming in, Lance.

      You’re absolutely right. The longer the test runs, the more accurate the outcome is going to be. Seasonality and other external factors impact tests a lot.

      Huge blunders can happen when massive ad campaigns run at the same time as A/B tests, yet testers are not made aware of those campaigns – and often the test results are very different once the campaigns are over.

  4. Don’t know if this counts as a mistake, but with split testing it’s easy to mistake novelty for improvement.

    If you’re A/B testing a new opt-in box or offer vs. one that’s been on your site for a while, then the people who’ve already taken up your existing offer (or have seen it so many times they just ignore it) won’t respond to the control, but they may well respond to the new variant just because it’s different.

    Now that’s probably all right if your target audience is your existing visitors. But if you’re trying to predict the impact on new visitors (i.e. the long-run impact of the change), then you’ll get different results from people who are seeing both variants for the first time.


    1. Peep Laja

      Thanks Ian. You’re right. Another reason to let tests run longer, and to run follow-up tests. Typically new variations do better at first due to the novelty factor. That’s also why one should send the test data to Google Analytics, so you can segment the results by new/returning visitors and by traffic source.

    2. Yeah – that was new for me, using Google Analytics. Very clever. Previously I’ve just reset the stats after the first week to let the novelty factor wear off.


  5. Very astute observations on the common pitfalls of A/B testing here – thanks! At Splitforce (http://splitforce.com) – we see some of our customers making the same mistakes when testing their native mobile apps and games.

    Another common issue that we’ve noticed is assigning variations too early in the funnel. Let’s say that you want to test two variations of an image on a product page – be careful not to assign a variation to users and include them in test results until they are actually exposed to the image in question. Too often, we see the opposite – leading to skewed results and inflated sample sizes which may falsely indicate statistical significance.

    Whether for web or mobile, make sure that the A/B testing tool you’re using only tracks the behavior of users who are actually exposed to a test subject.

  6. Hi Peep –

    Very dense and useful post, as always, thanks for continuing to publish the best posts on CRO!

    You use VWO and Optimizely as examples in your posts, but I was wondering if there was a specific reason why you never mention convert.com? It seems in the same group as the other 2 in my opinion, and I’m using all 3 depending on the project.

    Any specific reason to avoid it in your opinion or it’s just coincidence it’s never mentioned?

    Many thanks!

    – Julien

    1. Peep Laja

      No other reason than lack of personal experience. I just haven’t used it to run any tests. There are actually many more around, like AB Tasty (https://en.abtasty.com/) and others.

  7. Peep – this is a great article I’ll definitely be pointing clients to in the future :) Keep up the excellent work!

  8. Holy FUCK, I don’t know how you manage to spit out golden posts like this again and again…

  9. Great, well-argued information, easily understandable. Short and effective.

  10. Wow. This is really a huge mass of information. I’ve been saving the URLs to read through your posts again and again. I recommend your website to anyone I meet.

  11. I’m glad you highlighted #8. I have a very hard time explaining #8 to people, even some so-called CRO testing professionals. Not enough is being said about this and none of the testing tools address this in their interfaces (not that I’ve seen anyway).

    Thank you.

  12. Always love your posts, Peep! But this is an absolute gem. Enjoying your GA insights these days. :) Would love to understand segmentation of results to find actionable insights (with examples).

  13. Great article Peep, thanks!

    I’ve seen big digital marketing agencies making the mistakes you pointed out over and over again, as if it were part of their methodology.
    I’ll definitely send this article to some friends in the business.

    I think you should do one article just about the #5. Something like: “Understanding your A/B Test Data on GA”.


  14. Thanks for this. I do wonder about your advice regarding test confidence level and power. I think there are many occasions when confidence levels below 90% are warranted. When you do not have sufficient evidence for higher certainty, and when tests are repeated often – such as champion/challenger testing in a continuous media stream – it often makes better business sense to use a lower confidence level. Being correct in 75% of your business decisions is a good bet. Of course, each circumstance needs to be understood in its context with its own risk/reward. One must also be careful to measure the opportunity cost of not making the correct decision, using and understanding the test’s statistical power.

    1. Peep Laja

      If the absolute sample size is decent (250+ conversions per variation), I agree that there might be cases where you *could* call it early at 90% – for instance where the discrepancy between the 2 is large enough. But 75% confidence level is terrible, there is no way you could call a test at 75% and be happy with it. It’s just slightly better than flipping a coin.

      Your analogy about business decisions is not a good fit here. Every test can end at 95% confidence if you just give it more time, whereas a business decision doesn’t get more accurate just because you weigh the pros and cons for a long, long time.

      And if you call the wrong version, you have just wasted tons and tons of effort and time + act with false confidence, thinking your winner is a winner while in fact it’s a loser. It’s never just about winning treatments, but improving customer theory.

  15. The reason why Optimizely and Convert.com removed the global GA option and moved it to the test level is that when you have two tests running at the same time it will be hard for clients to separate the data since it’s in the same slot. Convert had this feature and when we added Universal Analytics we also moved the global setting to the test level and that solved a lot of support tickets with confusion.

    I think the test tools should have best practices built in but allow users to override them. We default to 97% significance, a minimum 7-day runtime, and we don’t call winners with fewer than 10 conversions. But there is no absolute truth yet, and people need to read all your suggestions here before calling a winner.

    1. Peep Laja

      Thanks for chiming in, Dennis.

      I get that, but each test has a different name… so it’d be easy to tell them apart? I understand if you want to switch it off for most users, but there should be an option to turn it on if I want to.

  16. Brilliant article Peep. The first point, “tests are called early,” resonates with us the most because it affects every other element of A/B testing.

    We had this issue with a client recently who were so happy with a test (after just over 200 unique visits) that they wanted to declare a winner and make the version live.

    We explained in detail why this was premature but for some reason in this instance they wouldn’t listen.

    Long story short, and unbeknownst to them, we kept the test running (naughty). Some time later, after 2,000 uniques, their winner was actually the clear loser! In fact, if we’d done as they asked, they’d have lost nearly 60k (GBP) in revenue over that month!

  17. Hi Peep,
    Thanks for including us in this insightful post. As far as calling tests too early: at Optimizely we encourage people to use their own sample size calculators. But sample size calculators only really work if you have a projected improvement in mind. We have some safeguards against calling a test too early, including a minimum number of conversions and visitors, but it’s really up to the user to determine the expected outcome of the test and to figure out whether or not it’s a success after it’s been running for a defined number of visitors. We are not able to determine an ideal sample size for every experiment.

    In practice, running experiments can be quite complex. Statistical significance depends on many factors and we have no intention of sounding misleading in our product communications.

    Thanks again for including us. If you’re ever up for a conversation about how we’re working through these issues with our customers, we’d love to chat. You can contact me directly helen@optimizely.com.

    1. Peep Laja

      Thanks for chiming in Helen. I understand where you’re coming from. Most users are not savvy enough, so they end up calling tests early. I think improved safeguards would help many people, and there could be many clever ways of guiding the customer without proclaiming absolute truth. Would love to chat at some point. I’ll be in touch.

  18. Hi Peep,

    Many thanks for this great list of “how not to..” Excellent stuff to reference.

    If it’s ok with you, I would like to elaborate a bit on your #4 statement about the absolutely vital Hypothesis.

    As I tell our customers: “Have the hypothesis founded in clear and preferably quantified business goals.” Define a target like: the result of the test will be a 5% growth in upsell on car insurance, or an 8% uplift in sales from customers coming in through AdWords campaign XYZ.

    This brings focus on the creation of the variation(s) and clearly identifies the segments and conversion points to monitor throughout the funnel in the reporting.

    Yes – multiple conversion points. The days of measuring your test success on visitors going from page A to page B are over (and with them, the questionable quotes like “We have 156% uplift!”).

    As you state in #10, small numbers at the end of the funnel can lead to respectable turnover per annum. So measure all conversion points/CTAs per page. With that, you have a clear view of what your visitors are doing (they do tend to behave unpredictably from time to time ;).

    If possible in your tooling, also report on the actual $$ value (like average shopping basket, average insured amount) so you can call a winner based on $$ instead of just clicks.

    All in all, a clear hypothesis definition and thorough reporting make your life easier in defining the test, monitoring the conversions, and finally calling a winner!

  19. Great article. Pitfalls all-around for those doing testing halfheartedly!

    For number 9 on running multiple tests . . . I get asked this all the time.

    I’m trying to wrap my head around the approach you mention:

    “That way people either see the new version for each page, or they see only the old ones.”

    Does that also include visitors who are new to the site only seeing the new content? Or could new visitors also be in the control group?

  20. Peep, I saw that there is no mention of test burnout or test fatigue. I use VWO. I have seen many times that a test with 95-99% confidence levels, after running for a longer duration (more than what the duration calculator tells you), starts giving diminishing results if you choose to run it further. I have also seen multiple times that tests showing a clear winner at a 99% confidence level change course if you continue to run them longer.

    When I encountered these results for the first time, I was totally confused. I then checked on the net and found that I’m not the only one facing this issue – a ton of research has gone into figuring out this phenomenon, and it was found to be universal. So I think it is EXTREMELY important to test and re-test the same hypothesis after a certain duration to see if it is still valid or has lost its “charm”.

    How do you handle this?

    Also, I am currently running short of VWO test implementation staff and I am looking for someone who can assist me in implementing the tests. Can you recommend any person who has some experience on VWO implementation?

    Thanks in advance.

    1. Peep Laja

      If the win fades out over time, it means it was imaginary. There was no lift to begin with. The bigger the sample size, the more accurate the testing will be.

  21. Aloha Peep!

    Awesome post! I’m currently digging deep into split testing my landing pages and have some interesting data I’d love to get a second opinion on. Would you be OK with me sending you a quick screenshot of the data with a few notes for you to give your opinion on? Either way, thanks for the great post!


  22. For #9, what about running multiple tests that are independent from each other (e.g., a button change on the homepage and a form change during the checkout flow)?

    1. Peep Laja

      If the traffic is split evenly between variations (e.g. 50/50), then it’s fine.

  23. My site has about 5000 conversions per week. I ran a test, and at some point in time A had 1173 conversions (CR 21.1%) and B had 1246 conversions (CR 22.7%).
    Optimizely declared B the winner with a ‘chance to beat baseline’ of 97.3% and a CR improvement of +7.2%.
    So I should be happy and implement variation B, right? But I let the test run, because I’ve seen this before – and what happens? The improvement goes down over time.
    After twice as many conversions, the improvement is only 3.4% and Optimizely doesn’t declare a winner anymore.

    I also have another site with only 100 conversions per week. It would take half a year to reach these numbers of conversions, and at that point I would be glad to declare a winner and not test for another 6 months to see what happens.
    If I only had this site, and not also the one with lots of conversions, I wouldn’t know what I know now and would be implementing winners that produce no results.

  24. I use VWO and ran a test on a few products. The variation won over the control by a large amount. But when I looked in Analytics, a few winners were actually losers in terms of revenue. How do you decide between revenue and conversions?

    1. Peep Laja

      Revenue matters, conversions don’t. If you want a higher conversion rate, just cut your prices in half!

  25. I have been searching the article on AB Testing and finally got this outstanding article here. Great Job…

  26. Great article with lots of good info for people getting into multivariate testing. There’s one aspect that I didn’t see mentioned that’s really important, and that’s primacy or “newness”.

    With tests that involve more drastic changes, like rebranding or significant layout changes, you may need to run a test longer (far beyond statistical significance and confidence reaching certain levels) in order to determine a winner.

    In these tests, users that are accustomed to the look and feel of your site may react differently the first time they encounter the change, and that will skew your initial metrics. It can often be good to run tests for extended periods, and then compare early performance to later performance to see how user behavior normalizes over time.

  27. Good article.

    Taking these suggestions from this article…

    1. Run the experiment for a full week (at least).
    2. Never run more than one experiment at a time.

    Does this mean that, at most, a site should only have 52 experiments per year, possibly fewer if we are shooting for 95% confidence?

    1. Peep Laja


      1. One week is not going to be enough in most cases. Plan 2-4 weeks per test for better validity.
      2. If the split on your different tests is 50/50, then you can run more than one test at a time, provided that they’re on different URLs.

  28. Hi,

    Really nice article, thanks! I was wondering why your personal minimum is 4 weeks. I agree that you need to test for at least 2 weeks, because conversion can differ from week to week. But if you have enough transactions in 2 weeks, are you still waiting 2 more weeks? Is there a special reason for this?

    And do you agree with this?

    To avoid this type of blunder, always be patient and run your tests for a minimum of 2 weeks, with a recommended maximum of 6 weeks, and a confidence level no lower than 95%. Also, once your testing tool declares a winning variation, don’t stop your test immediately. Run it for another week to see if the result is solid.
    “A solid winning variation should, during this ‘control’ week, hold its winning status. If it doesn’t, then you haven’t found your winning version.” – See more at: http://www.proimpact7.com/ecommerce-blog/consequences-of-ending-your-test-too-soon/

    Thanks for your reaction!

    1. Peep Laja

      I see it all the time. A variation that is winning after 2 weeks is losing after 4 weeks. My personal minimum is 4 weeks (unless the discrepancy between variations is HUGE).

    2. Thanks for your quick reaction. Good to know. And what is your opinion about the number of tests on one website at the same time? Even if they are not really close to each other (homepage and cart, for example), they can still influence each other, right?

  29. Fantastic post, but…

    I am not sure I follow the logic of full-week testing. Let’s put aside the arguments about power calcs and users being excited by change for a moment. If we have loads of traffic, why can’t we test for a day or even less? The A in your A/B test controls for variation in conversion rate, and you are measuring the effect of B relative to A rather than absolute numbers. So let’s say I run a well-powered test on a Monday when my conversion rates are 10% – how will that test differ from running the test on a Sunday when my rates are 2%? Unless the day of the week affects how the user interprets my changes, the change in conversion rate will have no effect on the relative effect of B vs. A.

    My question is not related to seasonal variation – that’s a different issue, and I agree with you on that.

    1. Peep Laja

      If only it were so easy. I see it day in, day out in my work.

      After 2 weeks B is winning (1000’s of conversions per variation), but after 4 weeks the lift disappears – it was imaginary. There was no lift to begin with.

      Time is a critical component. If the test runs only a couple of days, it’s definitely not valid – and no result can be believed / trusted.

  30. The more I think of it, the biggest problem of website testing today is multiple devices. What I mean is that many people discover a website on one device but then order or perform actions on a different device (for example, they find a website on a smartphone but place the order later at home on a desktop or tablet).
    This situation makes testing impossible (with the exception of maybe Facebook and Google, who are able to identify cross-device users)…

    What do you think?

  31. I’d also like to add three more–
    1) People don’t filter out their own internal traffic. This might not affect larger websites with tons of traffic, but if you don’t get much traffic in general, your internal traffic – who are obviously not potential clients – can significantly sway your data. Filter out the IPs of everyone who works on your website, including yourself!
    2) People don’t take into account their traffic sources. Try to get a wide array of traffic sources, since people coming from different places will be in different phases of the buying cycle, and some will therefore convert better than others.
    3) People test small things that will only get them to the “local maximum”. The goal should be finding the absolute maximum, and to do that you need to test BIG changes. Here are 9 goodies– 9 A/B Split Tests to Boost Your Ecommerce Conversion Rate

    1. > 1) People don’t filter out their own internal traffic. This might not affect larger websites with tons of traffic, but if you don’t get much traffic in general, your internal traffic – who are obviously not potential clients – can significantly sway your data. Filter out the IPs of everyone who works on your website, including yourself!

      Filtering IPs is quite fickle. Everybody is on the move these days. Not everybody will be VPN-ing, nor is it realistic to keep filtering new cafe or bookstore IPs on the go.

      Use GA Opt Out browser plugin instead, in all browser profiles you do work in: https://tools.google.com/dlpage/gaoptout

  32. I never did A/B testing, but I think I would do A1-A2/B1-B2 testing, A1 being identical to A2, and B1 being identical to B2. That way I would be quite confident about the test sample: A1 and A2 results should be close, and B1 and B2 results too. If they’re too different, the sample size is too small to produce reliable results.
    Of course, you could just split A into A1/A2 and serve B twice as much. Or split into thirds.

    But I’m not a statistician. Maybe that’s a wrong “good idea”.

