What Can You Do With Other People’s A/B Test Results? [Rant]

Lots of people on the internet are running tests. Can I just copy their winning tests? Let other people do the failing, I’ll just test (or implement) the winning stuff. Good idea, right?

Using other people’s A/B test results has been popularized by sites like WhichTestWon, and by blog posts boasting about ridiculous uplifts (most of those tests were actually NOT run correctly and were stopped too soon – be skeptical of any case study that doesn’t publish absolute numbers). It’s an attractive dream, I get it – I can copy other people’s wins without doing the hard work. But.

The thing is – other people’s A/B test results are useless. They’re as useful as other people’s hot dog eating competition results. Perhaps interesting to know, but they offer no value.

Why? Because websites are highly contextual.

It’s almost never apples to apples

You sell Ferraris and I sell used clothing. There’s no overlap in target audience, mindset, cost etc.

Or maybe we both sell the same type of stuff, like food. But I am Whole Foods and you are Walmart. Again, there’s very little overlap here in terms of context.

Or what if we both sell the exact same thing at the same price, like Samsung TVs? In most cases it’s still not exactly apples to apples as the brand, traffic sources, relationship with the audience and many other things will be different.

Other people’s A/B tests are other people’s solutions to other people’s problems. 

Your website has specific problems, not generic problems. Your job is to uncover the specific problems your site has, and build your tests accordingly. You shouldn’t copy other people’s sites (also a terrible idea), and you shouldn’t copy their tests either.

But I’ve tested X across multiple sites with great success!

Yes, there are things that work more often than not – but they work mostly because you’re testing them against stupid alternatives. Automatic image carousels, lack of value propositions and super long forms are just stupid ideas. (Yes, there are rare exceptions when they can work, as with everything in life).

So testing something okay against a stupid idea is likely to work. But it’s not the same thing as using other people’s AB test results.

What is interesting about other people’s A/B tests

I’m not saying you can’t learn from A/B tests other people have run. You can, a lot. It’s just not from the results.

What IS interesting about other people’s tests is the process they used, the analysis they did, the insights they pulled out of data.

  • Tell me how you identified the problem you’re addressing
  • What kind of supporting data did you have / collect?
  • How did you pull the insights out of the data you had?
  • Show me how you came up with all the variations to test against Control, and what the thinking was behind each one
  • What went on behind the scenes to get all of them implemented?

Running A/B tests is just the tip of the iceberg. All the hard work that goes into it is what actually matters.

I agree with Andrew Anderson on this:

If you really wanted to see a site like WhichTestWon matter, then show the variants that didn’t win. Show multiple options for each outcome and show what the best option was? Give us a measure of the cost and give us the internal roadblocks that you had to overcome. Let us know if that outcome was greater or worse than others for that group and what they are doing with the results to get a better more efficient result next time. If you are interested in anything more than self-promotion, post the things that don’t work. Tell us how often something wins, not the one time it did win.

Yes!

Conclusion

You can’t assume that what worked for other people will work for you. It doesn’t work that way. You have a highly contextual website with specific problems, so go and test highly contextual and specific solutions instead.

So next time you hear someone say “I got a 53% lift”, ignore that and ask them questions about stuff that really matters. Stop focusing on the ‘what’, and start exploring the ‘why’ and ‘how’ instead.
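To make the point about absolute numbers concrete, here is a minimal sketch, not a full testing methodology. It assumes a plain pooled two-proportion z-test, and the visitor and conversion counts are made up for illustration; the function names are mine. Both hypothetical tests could be reported as “a 53% lift”, but only the absolute numbers tell you whether that lift is distinguishable from noise.

```python
from math import erf, sqrt


def relative_lift(conv_a, n_a, conv_b, n_b):
    """Relative lift of variation B over control A (0.53 means +53%)."""
    return (conv_b / n_b) / (conv_a / n_a) - 1


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)


# Hypothetical counts: (control conversions, control visitors,
#                       variation conversions, variation visitors)
small_test = (15, 500, 23, 500)
large_test = (1500, 50000, 2295, 50000)

for name, counts in (("small", small_test), ("large", large_test)):
    lift = relative_lift(*counts)
    p_value = two_proportion_p_value(*counts)
    print(f"{name}: lift = {lift:.0%}, p-value = {p_value:.4f}")
```

With these made-up counts, the small test does not reach the usual 0.05 significance threshold while the large one clearly does, even though the headline lift is practically the same. A low p-value still says nothing about whether the test was stopped too soon or whether the result would carry over to your site.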

Join the Conversation

  1. Yes, “some things work more often than not”, not just because the control is “stupid”. Sorry, that’s a weak explanation. If people are finding recurring effects across multiple sites/tests over and over, then that just might be a hint of general theory beginning to form.

    Repeatable successes most likely mean that your variations are based on some patterns from the realms of how people tick in general (psychology, perception, usability, cognition, etc). Hence, I believe that occasionally we can still filter out ultra site specific a/b tests and pay attention to some of the more general ones. This opens up the door for reuse – saving time, money, building confidence, or helping optimizers build more accurate test estimations (using better prior data).

    Yes, some of those a/b tests are prone to publication bias, and poor setup. And yes, transparency of process and data is also important. But once those are taken care of, the benefit is quite clear – allowing us to build on the shoulders of giants.

    Don’t forget that imitation of behavior and success is a critical component of success and survival (be it for a business, any 2+ year old kid, or many other animals). :)

    Cheers,
    Jakub

    1. Peep Laja

      I’m sure you feel that way since you’re in the business of selling AB test results ;)

      A site should identify their own issues, and come up with solutions for the specific problems they have – do the proper research. Relying on quick fixes, hacks and copying other sites will not get you far.

      AND – knowing other people’s test results (X won by 343%!) is as useless as knowing how many times a day they go to the bathroom.

  2. True. :) We are both biased in our own ways. Technically I do sell A/B test results of various shapes and sizes with the intent of helping people reach higher conversions (while showing both process & testing ideas). But here, I also think you have to admit your own bias of selling deep discovery/research based projects. :) We both should become aware of the hammers we’re walking around with.

    Assuming a test is of high quality, I’d like to think that each test’s effect may vary along the general-specific spectrum, making some more reusable across sites than others. If this turns out to be true (because we’re still learning of course), then we should still be able to provide low effort, improvement in results.

    I think some sites can initially benefit from common best practices (hopefully backed up by multiple / quality tests). And, I also believe that in order to push through even further, stronger optimization processes come in handy (like your awesome approach, for example). :)

    1. Peep Laja

      Best practices and fixing usability issues etc is where you start, not where you end up. Once you’re there, and have a list of “100 proven tactics”, what do you try? Figuring out what matters (the discovery process) is what one should do: know WHERE the problems are and WHAT they are before even thinking of “which tactic to test”. Once the problem is clear, one should test a breadth of possible solutions (multiple variations, iterative testing) since no one can know what will work – a proven or unproven idea.

      Merely trying out tactics that worked for other people is super inefficient and largely a waste of time (playing the lottery).

    2. I think we’re in way more agreement, than not. Would be nice to discuss the last part about inefficiency over a cold beer some day. :)

  3. Well, the post has the word “Rant” IN THE TITLE, so debates in the comments shouldn’t surprise me.

    But certainly there must be room in the spectrum between “other people’s results are useful as how many times they go to the bathroom” and “100 proven tactics that will work for you guaranteed”.

    Namely, something along the lines of “They had a similar problem, and approached it from this angle, perhaps we can approach it from a similar angle, even though our treatments may be totally different.” Perhaps this is precisely the conclusion of this post.

  4. Nice post! I never trust the numbers these kinds of tests report. I have seen that it is not as easy as blogs usually say. I’ve already changed the same things the blogs were exploring in their tests and… guess what! Nothing happened (of course). Stop thinking about “What” and start thinking about “How” and “Why”: that is the way to get a good A/B test. I totally agree with you. The next step is to find ways to get correct data before running any A/B test.

  5. Hey Peep,

    Great post. Love the main premise – and I definitely agree that you shouldn’t blindly copy an A/B test. But I’m curious to hear your thoughts on a few points.

    As Jakub mentioned in the comments, even though each website is different, many A/B tests are grounded in universal principles — like psychology, cognition, and behavioral economics (to name a small few). Do you think that those A/B tests are useless?

    “You sell Ferraris and I sell used clothing. There’s no overlap in target audience, mindset, cost etc.”

    Because those A/B tests are based on human behavior, I don’t think the domain should matter. Those tests may not be grounded in data directly related to your site, but you still have a hypothesis.

    So I just wanted to get a clarification. Can there be no such thing as a universal list of “50 A/B Tests to Try” — even if those A/B tests are grounded in psychology or some other field?

    1. Peep Laja

      Websites are specific, layouts are specific, context is specific. If the principles were universal, why even test them – they should ALWAYS work on EVERYONE, right? You see how silly that idea is? If it were SO easy to sell to people, why aren’t all sales people good at it? Wouldn’t it be hard to put together a list of universal sales tactics? Doesn’t exist!

      Humans are complicated, and contexts are way different.

      “50 A/B tests to try” is a totally silly concept, is a huge waste of time should you ever try it. An optimization program has to be efficient. You measure the success of that program by A) the number of tests you run, B) the percentage of tests that win and C) the impact per successful experiment. If you just “try” a bunch of ideas, you will score REALLY LOW on the percentage of winning tests and on the impact as well. You’ll be fired from a job pretty soon if that’s your approach as a CRO.

      It’s very attractive to want to believe that CRO is just a list of tactics you test / implement one by one. Alas, the world is so much more complicated than that.

      On the other hand, the fact that there are so many people believing in the naive dream, means that the people actually putting in the hard work are winning big and eventually kicking everyone else out of business.

    2. Quote: “’50 A/B tests to try’ is a totally silly concept, is a huge waste of time should you ever try it.”

      Challenge accepted.

      I think those articles could be helpful (if positioned properly). With a large list of psychology-backed CRO tactics, you can accomplish the three goals that you mentioned:

      A) Number of Tests You Run
      With a large list, you can find tactics that can be implemented earlier in your funnel so that you have a larger sample (thus being able to conduct more tests)

      B) Percentage of Tests That Win
      If those tactics are grounded in psychological hypotheses, you’re more likely to get a higher percentage of winning tests. Will they ALWAYS work on EVERYONE? No. But I don’t know anyone in CRO who expects to convert everyone. You just try to put the odds in your favor. And psychology can help.

      C) Impact Per Successful Experiment
      With a large list, you can find tactics that target the areas in your funnel that need the most improvement (and would give you the most value).

      I should have the article done by next week. I’d love to get your thoughts on it – even if you think it’s complete crap.

    3. Peep Laja

      Heh.

      1) Number of tests you run depends first and foremost on your execution capacity. Not a list of tactics.
      2) No. The number one reason people don’t have success with AB testing is exactly that – they test random ideas from blog posts. That’s not how the world works. The tests that work more often are the ones targeting the specific problems the specific users have with the specific website that sells specific products/services. One needs to conduct proper qualitative + quantitative analysis, use the data etc.
      3) Again, this will be low impact stuff because it’s generic bullshit, and does not address the actual issues and opportunities a website has.

  6. While Peep Laja’s rant has some good observations that I agree with, it is *way* overstating some points by claiming that “other people’s A/B tests results are useless.”

    What I agree with:
    1. Many A/B tests out there that report amazing lifts were run badly and the results will not replicate (even on their own site) if run properly.
    2. When a result is too good to be true (e.g., 53% lift), apply Twyman’s law: http://bit.ly/twymanLaw and approach it with great skepticism. There are breakthroughs, but they are rare.
    3. You can’t blindly copy ideas from one domain to another.

    But there is certainly value in studying A/B tests and improving our understanding, and we have done that in a KDD paper last year when we looked for patterns in thousands of A/B tests: http://bit.ly/expRulesOfThumb. Here are three observations from the paper that make the point:
    1. Small changes can have a big impact. Many people assume that high value comes from big projects. We have shown over and over that there are small changes that can have massive impact. When you think of ROI, the R is big and the I is small.
    2. Speed matters a lot. This has been shown over and over across multiple sites. Performance matters, and many times its importance is under-estimated.
    3. Your mileage will vary. We specifically addressed the point that sites like WhichTestWon have problems. That said, such sites can serve as great HYPOTHESIS generators. The ideas *may* carry over. And if the idea is in your domain, it’s more likely to carry over successfully. There are several search engines out there, and good ideas replicate well between them. I can tell you with high confidence that any site with a search box will very likely benefit from autosuggestions (start typing, it offers completions). We reported strong results for open-in-new-tab back in 2008, and many sites have benefited from that in multiple scenarios. Ferraris and clothing sites may not overlap on many axes, but if you have a commerce site, look at other commerce sites that run A/B tests, and the things they ship may be useful. In particular, study Amazon since many things were tested in Weblab (the A/B system I was involved in when I was director of data mining and personalization there).

    — Ronny Kohavi

