At CXL Live 2015 we had an amazing A/B split testing panel featuring statistics and testing gurus Lukas Vermeer from Booking.com, Matt Gershoff from Conductrics and Yuan Wright from Electronic Arts. And the audience asked some of the toughest testing questions ever. All of them got answered.
Watch the video here:
Note: get on the email list to get notified when tickets for CXL Live 2016 become available. People on the list will get them at pre-sale prices.
Peep Laja: Right, okay. Well, these guys don’t need any more introduction because you already got to meet them. So let’s get right to it.
We’ll start off with an easy one. An easy one. When is my test cooked? When should I stop my test? What should be my stopping rule? Lukas.
Lukas Vermeer: You mean when are you done?
Peep: When am I done.
Lukas: You should have decided it up front.
Peep: Good answer. You should have decided it up front.
Lukas: Right, so you decide up front how long you’re going to run the test, you run the test that long . . .
Peep: And how do you make that decision?
Lukas: How long are you going to run?
Lukas: What Yuan was saying, you use a power calculator. You say, “This is my traffic, this is the uplift I expect, and this is how long I’m supposed to run it.”
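The power-calculator arithmetic Lukas describes can be sketched in Python; the baseline rate, expected lift, and traffic figures below are hypothetical, and the standard two-proportion sample-size formula stands in for whatever calculator you use:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, rel_lift, alpha=0.05, power=0.8):
    """Visitors needed per variant to detect a relative lift over a
    baseline conversion rate (standard two-proportion formula)."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # power threshold
    p_bar = (p1 + p2) / 2
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return ceil(n)

def duration_days(n_per_variant, n_variants, daily_visitors):
    """Days needed, rounded up to whole weeks (full business cycles)."""
    days = ceil(n_per_variant * n_variants / daily_visitors)
    return ceil(days / 7) * 7

n = sample_size_per_variant(0.05, 0.20)   # 5% baseline, 20% expected lift
print(n, "per variant,", duration_days(n, 2, 1000), "days at 1,000 visitors/day")
```

Rounding the duration up to whole weeks bakes in the "at least full business cycles" rule Lukas gets to next.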
Peep: All right, so let’s say that I estimate, whatever, 20% lift and I got the result in half an hour. Then what? Implement it?
Lukas: I would run it at least two business cycles, and that depends on what your business is. In our case that’s two weeks.
Peep: How do you determine how long is your business cycle?
Lukas: Well, I assume you’ve asked customers how long it takes them to make a decision and I assume you have data to back that up.
Peep: Don’t assume anything.
Lukas: Okay, well in that case probably ask your customers how long it takes them to make a decision, look at the session data you have, so from the first time they hit the site till they make a booking or make a purchase, how long is that? If that’s a couple of days or a couple of hours, then you could say, “Well, in the beginning when I’m running my experiment, there are going to be people who are buying stuff who actually were not exposed to the experiment in the first place because the experiment wasn’t running when these people were first exposed to my site.”
On the other hand, at the end of the experiment there are going to be people who were exposed to your experiment, but they’ll make a purchase after your test ends. So it’s only people in the middle that were fully, nicely exposed to your experiment where you have the full purchasing cycle.
So you want to have a length of time that’s long enough that you have enough of those people in there. There’s a big difference between days of the week for most companies. Definitely for us there is. So a week is the minimum. If I only do a week, then I’ll have one Monday, one Tuesday, one Wednesday, one Thursday, and one Friday. So that’s why I said two weeks, but if you’re a car website and people take two months to decide, then you’re probably in a longer cycle.
Matt Gershoff: Yeah, just to chime in, I think the issue is that when we’re doing online testing we kind of assume that we’re taking random samples, but unfortunately we never really take a random sample. We’re really taking a convenience sample, because the users are presenting themselves to us and all we’re really doing is randomly allocating users as they appear to us.
So when he’s talking about the cycle or the period of your traffic, we’re really trying to emulate a random sample of your users. That’s one issue, so we have to kind of block in these discrete units.
The second issue to think about is, and I think in your talk you were discussing it, is the notion that maybe things are kind of changing over time. So really you have to think through what the nature is of your traffic behavior, and it may be that, to Lukas’ point about the bandit, that you in a way never want to shut off the test. It really is going to depend upon the case.
Peep: All right. Well, something that you touched upon, Matt: p-values are not considered useful anymore in many scenarios. So the p-value, a statistical confidence, is a stopping rule. Should we even care about that? What’s the deal?
Lukas: Are people seriously doing this? Can I see some hands? Who stops the test at 90%? Come on. No one? Good. Next question.
Peep: We won’t beat you too hard.
Lukas: You should never do this.
Peep: All right.
Matt: Wait, I want to get back to that. That’s a little harsh. Especially because there have been some tools out there alerting customers that they should stop once a threshold was reached. I didn’t say that no one thinks it has value, just that there’s some controversy around it. But again, the standard way to do it is to pre-select how long you’re going to run your test.
Peep: All right.
Lukas: Okay, let me back up on that. The moment I would call a test prematurely, I’m not even looking at the statistics anymore. The difference has got to be so big that statistics are irrelevant. If you have thousands of conversions in one variant and tens of thousands in the other, statistics is useless.
Any tool that will show you a G-test number will basically say 100%. If it keeps saying 100% for a very long time, then at some point you’re going to say, “Well, okay, now it’s time to call the test,” but not at 90.
Peep: Well, so if we have a change like this where the lift is double, the sample is not so large, but let’s say they took four weeks to run this test. What should we conclude here? One hundred percent significance.
Lukas: Yeah, that’s, there are 210 visitors-
Audience: What’s the question?
Peep: Is this a valid winner? Do I have a winner and I declare that, “Yeah, I’m going to go with A”?
Lukas: Wait, is this my example that I sent you?
Peep: It is.
Yuan Wright: First, when I look at this, this is an incredibly small sample size. I’d want to know how long this test has been running. Really, I think it’s too small to even call. When we’re using Test&Target for a very large site we definitely try to get 200-plus conversions per recipe before we start to look at the results, because otherwise the sample is too small and you’re just seeing statistical randomness. This doesn’t look like enough sample to call results, from what I can tell.
Lukas: This looks like a sample size that’s too small to even run tests on your website. If this is what you got in a few weeks-
Peep: Let’s say I’m starting to run a test and already three days in it’s like, “Hey, I’m seeing a trend here.” and my CEO is saying, “Yeah, we’re agile. We need to test faster.” So can I spot a trend? You can just say no.
Matt: You can think maybe you have a trend. I don’t know, but again this just gets back to the fact that we take convenience samples. If you haven’t sampled over at least one full period of your traffic, then you really don’t have enough data yet to assess marginal efficacy.
Peep: Awesome, so when should I use an A/B test? When should I do a multivariate test? When should I do bandits?
Yuan: Yeah, I can talk a little bit about that one, and feel free to add in there. Bandit, I will leave it to Lukas.
For A/B testing I try to use something fairly simple, just two recipes, or I actually try to use it on major site changes, ones that touch a lot of things. A multivariate test wouldn’t be able to scale to all the changed elements, so I would just use a simple A/B, but then leverage segmentation too, as I think you both talked about: look at new versus repeat, marketing channel, different segments of customers and how they behave.
Whether I use multivariate usually depends on the tool. For example, Test&Target gives you a contribution per element, but it doesn’t really give you the full picture; with a multivariate test you then have to run a follow-up A/B combining everything, because it uses a Taguchi method, which is sampling. For example, with three elements at three levels the full factorial is 27 combinations, but Taguchi only tests a subset of recipes.
So it gives you the contribution of each successful element, but it doesn’t necessarily test one recipe combining all the successes. So you have to do another A/B test, and it’s complicated. I try to just run A/B tests. I would rather have something simple I can act on than a really complicated thing whose results I can’t read. That’s just my personal preference.
Peep: All right, so on the eCommerce side, lots of different types of pages.
Lukas: Are you in a hurry, Peep?
Peep: I am. We need to get through lots and lots of questions. On an eCommerce site I have lots of pages: product page, cart, whatever. So if I have multiple tests running on all these different types of pages, but we’re measuring the same goal, does that make sense?
Peep: I had a test where B beat A, and I implemented it, but six months down the line I’m like, “Is the uplift still there a year later?”
Lukas: You should be worried about C. That beats B and A.
Peep: Well, I have a SaaS product and I’m spending money to acquire customers, and there’s a free trial signup, but actually I’m making money when the trial period ends and they become a paid conversion, actually paying user. So how do I know that these trial signups actually become paying customers? How can I run tests that measure that?
Lukas: You either test for a very long time or you find a proxy that indicates whether people are going to stay. So I don’t have a SaaS product, but I can imagine that if people used the product for the entire month or maybe the first five days, then that’s a good indication that they will buy in the end.
So you can test for a proxy metric rather than the end goal and you can test faster, but ideally you want to test for a metric that’s really indicative of your bottom line. So in this case you would want to run longer.
Netflix runs tests for three months because they want people to stay with Netflix. That’s their target and they use proxies to decide whether they’re going to keep the test running or kill it early, but in the end that’s what they’re optimizing for.
Yuan: If I can add one more thing here: it looks like a free trial, but you wanted to look at the paid conversions, right? This is customer lifetime value, so you want to look over a longer time. This is where attribution kicks in. You really want to watch: after 30 days, do they return? Do they churn? Do they continue to buy? What’s the lifetime value of the customer? I would look a little bit further, beyond the A/B testing results, at your back-end system to understand the long-term value or engagement of such a customer.
Peep: All right. I have a page. Let’s say it’s a product page. Can I run an A/B test on the same page while I bandit? Let’s say I’m bandit testing button color, or whatever, and A/B testing copy at the same time on the same page. Can I do it and do you recommend it?
Yuan: Yeah, for this one I would suggest one layout, one call to action, and make it a 2×2, because that’s basically ABCD results you can get pretty quickly. I would do a 2×2 multivariate test and get this done.
Peep: Let’s say that you’re testing an element where there’s no learning to be had. Let’s say I’m testing a photo. I don’t know which photo is going to work the best, but whatever the photo is going to be about I am not necessarily going to learn much about it. So does it make sense that I’m running bandits on those things and then A/B testing stuff that I can learn from?
Lukas: These questions are interesting, but what if your competitor is running a bandit? He’s influencing conversion on your site. Do you worry about your competitor running a test? No, you don’t. You randomize samples, and yes, there are issues with that, but you can’t take all these things into account. Feel free to disagree.
Matt: I don’t know about that. In spirit I agree with you, but you could possibly have confounding going on, depending on how your traffic is flowing in and on the nature of the bandit you’re running. Let’s say there’s tremendous efficacy within the test that you’re adaptively running, call a bandit test an adaptive test, and the distribution is changing. I’m not sure how that might affect the variance in the results, especially if the traffic is coming in at different rates. Then again, since users should be randomly allocated to both, it may average out. I’m not sure, but I think you’re right.
Lukas: Yeah, your randomization of your sampling should solve that, just like it should solve more people coming from Asian countries because, random. There are always things going on that will influence the composition of your traffic, and that’s why you randomize allocation to a base or variant, because there are things going on in the outside world that you have no control over that might influence the results. You hope that by randomizing that sample, you somehow-
Matt: Now what you could do, you could just spawn off a feature that lets the system know what the bandit selection was, and then you could cross that against your A/B test and you could do an implicit multi-variant test that way. You could analyze it that way if you wanted to.
Lukas: So just like you would segment your results by, say, language or geographic location, you would segment by what the bandit said.
Matt: Yeah, you could do that.
Lukas: During that test. So you can look at the results of the A/B test as a whole, but also per arm that the bandit selected.
Peep: All right, next question. We talked about pre-calculating sample size, if I know how many transactions, but if my KPI is RPV, revenue per visitor, does anything change?
Lukas: Yeah, that’s not binomial, so I don’t know how to do that.
Yuan: So for this one, if I may add: revenue per visitor has two pieces to it, the average order value and the conversion rate. Do you get more people to buy, or do you get people to buy more? You really need to get into that and figure out which lever you’re going to pull, because the calculator could be built on AOV or it could be built on conversion, and those two multiply together to give revenue per visitor. So break it up. Understand what you’re truly going to measure and truly going to manage; then you can put either conversion or average order value into the calculator to work out how long the test needs to run. Yeah.
There are two things in there, so plugging RPV straight into a calculator isn’t going to get what you’re looking for, because the metric is made up of two parts. You really need to peel the onion to figure out what exactly you’re trying to measure. Which one is the impacting factor of revenue per visitor? It could be average order value or it could be conversion.
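Yuan's decomposition, RPV as the product of two levers rather than a single metric, works out like this (the rates and order values below are made up for illustration):

```python
def revenue_per_visitor(conversion_rate, avg_order_value):
    # RPV is the product of its two levers, not their sum,
    # so you size the test on whichever lever you expect to move.
    return conversion_rate * avg_order_value

# Two very different shops can share the same RPV:
print(revenue_per_visitor(0.02, 100.0))  # 2% conversion, $100 AOV -> $2.00
print(revenue_per_visitor(0.04, 50.0))   # 4% conversion, $50 AOV  -> $2.00
```

That is why "break it up" matters: a test that lifts conversion while shrinking basket size can leave RPV flat.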
Peep: All right, so lots of tests are invalid because it wasn’t set up properly. So in your experience how often does that happen?
Matt: With us, never.
Lukas: It depends on the dev. I can’t give you any names, but I’ve had multiple teams where some devs are better than others. When you’re setting up a test you’re like, “This test is really important, so I’d rather have Bob set it up.”
Peep: Awesome. So running tests for everybody and then doing segmentation after? Or running tests targeting specific segments?
Lukas: Do you want them to be in sync or not? It depends on the decision, right? So are you going to put them both full on or would you consider putting one full on and the other not?
Peep: I’m testing the same thing for all of them, for all the segments, and it’s 100% per segment. Or should I just run the traffic test for 100% of traffic and do post-test segmentation to analyze how the test performed across segments?
Lukas: No, so this setup depends on the decision you’re willing to take. If you are going to make a go/no-go decision for all platforms anyway, then why run them as separate tests? The separate test is only useful if you say, “Well, maybe I want to do it on desktop but not on tablet.”
Peep: Well, let’s say my mobile traffic is 20% and desktop is 80%. After two weeks the desktop sample size is perfectly adequate, but I can’t stop the test because the mobile segment still needs like three more months. So there’s regret, because now I could be testing something else on desktop, but I can’t because I need the sample size for mobile.
Lukas: But you already decided that you wanted to put them full on at the same time.
Peep: Well, I was stupid.
Lukas: So basically your mobile is limiting your desktop now.
Peep: All right. Now, questions from the audience. Leho, [sounds like 00:18:07] do we have questions?
Leho: We have a super interesting first question. Can Lukas be a tiny bit nicer? So what is the nicer version of Lukas?
Lukas: Who said that?
Leho: Twelve people at the very least.
Matt: Is that a question or a request?
Leho: I’m not sure. Either way.
Leho: There you have it. That’s a valid answer. The next highest voted question: what is your stance on changing the percentage traffic allocation to a test cell during an A/B test if it looks promising? Okay, it’s a long question.
Peep: Can we have the Q&A thing displayed on the screen, please? Aivar maybe, can you help?
Leho: Yeah, that would actually make sense. I think what they’re asking is, changing traffic allocation in the middle of a test towards the winner essentially.
Lukas: So that’s Simpson’s paradox, right? You risk biasing your result by doing that.
Matt: Well, but you would have the same problem if you were doing a bandit and you were doing probabilistic matching or value matching. You’re reallocating. So I’d say: why not? If you’re trying to run a statistical test and you’re under that formalism, then no, you probably shouldn’t. But if you’re treating it more as a bandit test, then it’s kind of like a bandit problem, right? You’re kind of reallocating.
Lukas: I actually have an example of this, but I need to put up a computer. Peep?
Lukas: I have an example of this exact thing, but can I walk over there and show it?
Peep: Yes, please. A live demo.
Lukas: A live demo.
Leho: This is Lukas being nice. [side conversation 00:20:11-00:20:25]
Peep: Yes, meanwhile another question.
Leho: VWO versus Optimizely, pros and cons.
Matt: No comment.
Yuan: Which one? What’s the first?
Leho: VWO, visual website optimizer or-
Yuan: Yeah, I didn’t use the first one, but I used the second, Optimizely. It’s fairly lightweight. I really think it would be good for doing some of the marketing-campaign type of optimization, but I haven’t found it to be very robust.
If you’re going to be changing your navigation and search results and integrating into your back-end system, I think Optimizely is a little bit lightweight for that. I call it a pistol tool, and I call tools like Test&Target cannon tools. Definitely, depending on your purpose, you want to choose differently.
Leho: Test&Target is for large organizations.
Yuan: Right, it’s very development-heavy, but it’s also very robust, while Optimizely is for rapid innovation. It’s a marketing-driven type of tool. It doesn’t necessarily need a developer: the setup is simple, the code can be generated by itself, and the measurement is pretty lightweight. It’s just a lightweight tool.
Peep: All right, Lukas, what about your demo?
Lukas: Yeah. I think this was the question. Can I change the allocation? So this is an example where you ran a test for one week and you were cautious because you thought this thing was going to bomb, so you assigned 99% of the traffic to variant A and 1% to B. You ran it for a week, and you see these results.
That’s a statistically significant difference, so you ramp up the traffic for B. Can you go to the next one? So you ramp up the traffic and you now assign 50/50 in the second week. You can see that if you add up all the data from the first and second week, it looks like now A is actually better than B. So the trend has completely reversed. Both of these are statistically significant, right?
The problem is that when you add up the two weeks, most of the traffic for A went through in the first week, while the second week was split 50/50. So in the complete data set, about two-thirds of the data you have for A is from the first week, but nearly all of the traffic for B is from the second week.
So if you go to the next slide.
If we take only the second week, we see that B was actually consistently 10% better than A, but the overall global conversion rate dropped by a point-something. This could easily happen in our business, for instance if this is the week before Easter and the week after Easter. Conversion rates change over time.
So if you change the assignment or the way you split between two variants, you could easily run into a problem like this where the global changes in conversion rates start affecting the results you see. So if you change the distribution, reset the data, reset the test, because you’re basically doing a new test.
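Lukas's slide can be reproduced with hypothetical numbers: B beats A by 10% in each week, but because almost all of B's traffic falls in the lower-converting second week, the pooled totals flip the other way:

```python
# (visitors_A, conversions_A, visitors_B, conversions_B) per week.
# Week 1: 99/1 split at a 10% base rate; week 2: 50/50 at 5%.
weeks = [
    (99_000, 9_900, 1_000, 110),
    (50_000, 2_500, 50_000, 2_750),
]

for i, (va, ca, vb, cb) in enumerate(weeks, 1):
    print(f"week {i}: A={ca / va:.1%}  B={cb / vb:.1%}")  # B wins both weeks

# Pooling across the allocation change reverses the conclusion:
pooled_a = sum(w[1] for w in weeks) / sum(w[0] for w in weeks)
pooled_b = sum(w[3] for w in weeks) / sum(w[2] for w in weeks)
print(f"pooled: A={pooled_a:.1%}  B={pooled_b:.1%}")  # now A looks better
```

This is Simpson's paradox in miniature, and it is why the advice is to reset the data whenever you change the split.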
Matt: But you have to be. . .
Peep: Let’s move on. Thank you. About the VWO question: who is our event partner? VWO. Okay, next question. Guys, can you see the screen here?
Lukas: Can Matt not respond?
Peep: Low traffic websites cannot really run A/B tests. So what should they do instead?
Matt: I don’t know.
Peep: Talk to me later, how about it?
Lukas: Go work for McDonald’s?
Leho: Run a test that lasts for a year? Is that better than nothing?
Lukas: Put those people on the phone and call more customers. I don’t know. If it’s difficult to run tests, if your sample sizes are so small, you really have to question whether that’s worth it. It’s very similar to if you have a salesperson who calls a million people and gets one sale, you question the value of having a salesperson.
Lukas: This is the same thing.
Peep: Okay. Well, we were talking about, sorry.
Yuan: Yeah, the first question. If you don’t have enough traffic to run A/B testing, I would just look at before and after. Why bother running an A/B if you can’t get the results?
Peep: Exactly my point. All right, so A/A testing. What equation do you use to remove the noise from the result of an A/A/B test?
Yuan: First of all, on removing the noise. The reason you want to run the A/A, from my understanding, and definitely correct me if I’m wrong, whoever raised the question, is to understand what is noise and what are the true results. That’s the reason you run it. So I would consider any difference between the two As to be noise. I cross my fingers that there’s no difference, or a difference with no significance, meaning there’s really no noise, right? But the reason you want to run A/A/B is to understand what’s noise and what’s the true lift. Did that answer the question? Okay, great. Thank you.
Lukas: You’re sticking your finger in the wind and trying to gauge how bad the noise is. That’s what you’re doing. It’s really to get a sense of the noise, not to mathematically deduct it from the result.
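Lukas's "gauge the noise" point can be simulated: run many identical A/A splits through a pooled z-test and roughly alpha of them will look "significant" purely by chance. The test counts and rates below are invented for the sketch:

```python
import random
from statistics import NormalDist

def aa_false_positive_rate(n_tests=400, visitors=2_000, p=0.05,
                           alpha=0.05, seed=7):
    """Fraction of identical A/A tests that cross the significance
    threshold anyway; it should hover around alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_tests):
        # Both arms draw from the SAME true conversion rate p.
        ca = sum(rng.random() < p for _ in range(visitors))
        cb = sum(rng.random() < p for _ in range(visitors))
        pooled = (ca + cb) / (2 * visitors)
        se = (2 * pooled * (1 - pooled) / visitors) ** 0.5
        if se > 0 and abs(ca - cb) / visitors / se > z_crit:
            hits += 1
    return hits / n_tests

print(aa_false_positive_rate())  # roughly alpha, i.e. around 0.05
```

In other words, an A/A test doesn't remove noise with an equation; it shows you how often your setup cries wolf on its own.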
Peep: All right. Quality assurance of tests. Any tools you guys use or have used or have heard about anyone using?
Yuan: I don’t know what Booking.com uses. At Dell we actually started with a proprietary QA script of our own, but I really feel that regardless of what tool you use, please use human eyes, because scripts don’t catch everything. Use human eyes just a couple of times to look at all the browsers, different versions of the browsers, different devices, making sure it’s what you’re looking for. Nothing is more accurate than people actually looking at it.
Leho: How scalable is that?
Yuan: What’s that?
Leho: How scalable is that?
Yuan: We have a script to run this already, so a lot of times by the time it gets to the people a lot of the issues are already weeded out. So it’s quite scalable. Now, of course you can build the team differently, you can build it in a lower-cost region so you have regional coverage, 24-hour coverage. There’s a way you can scale that for sure. I don’t know how Booking.com . . .
Audience: Is there a basic tool for statistical sampling between sample sizes that anyone out there uses?
Peep: Evan Miller’s tool is good.
Yuan: For estimating traffic-
Lukas: Evan Miller for the win.
Peep: All right, next question. I guess this is about Yuan’s presentation. Any tips on how to build an evaluation framework to determine if and when the previous test needs to be run again to ensure that the lift is still being achieved?
Yuan: Yeah. I think that’s what I was talking about as well, right? A lot of times I use that just to get a gut feel for whether the revenue I booked is still there. Sometimes I use A/B testing to push the winner as a temporary implementation and then flip it back to see what’s there. As for frequency, I would probably do it once a quarter, because it’s just a validation effort; I wouldn’t want to spend a lot of time on it. But it’s good to have, so you know whether the lift is still there or not. So I would try to flip it over probably once a quarter. Not everything, just randomly picking certain things to see what the impact is. Yeah.
Lukas: Interesting. I would only run, we call this a negative test. I would only do it if I want to base a decision on it to turn it off. So if there’s no more lift and there’s no technical debt, so there’s no cost attached to having this on our site, I wouldn’t even run the test because, what am I going to change?
So if this is a feature that requires a lot of computation on the server or it requires some sort of script to run or database or whatever, then you could argue, “Well, removing it will remove technical debt, and that’s worth running a negative test.” But if this is a button color or copy change or something that doesn’t really matter whether it’s A or B, then I wouldn’t waste time on a negative test. That is eating into my testing time for other things.
Yuan: I completely agree.
Peep: Perfect. Let’s take the next question. I guess it’s for Yuan again. Any significant mobile wins? On Booking.com, well, he can’t disclose anything. Dell.com, Office Depot? I’ve been hearing a lot about mobile eCommerce implementations that did not produce better mobile conversions.
Lukas: They are definitely different things. We do see wins on desktop being translated into mobile and the other way around. So we have different teams running them and they are constantly in touch to, “Hey, that worked. Can you try this also?” But it’s not one-to-one. The stuff that works on desktop doesn’t always work on mobile, and the other way around.
Yuan: I agree. This is one of the things that resonates with what Amy mentioned this morning. The mobile conversion rate is tiny; if I remember correctly it’s 0.02%. I won’t name names, but it’s a very small conversion rate. A lot of what you’re using on desktop, you use on mobile and you just don’t see it. When your base is 0.01%, where do you see the difference? What does it matter? So there’s really a need to understand the purpose. I really think a lot of mobile is discovery; it’s not a buying platform. It could be me waiting at the bus stop and browsing. So it’s a different usage model. Correct me if I’m wrong, a lot . . .
Lukas: We just crossed 100,000 bookings a day on mobile last month.
Yuan: Oh, yeah. So it’s different in business. I apologize. Yeah.
Lukas: People are buying on mobile.
Lukas: A lot.
Yuan: Great point. It’s probably a different type of business, right? People just don’t buy a $500 computer on their phone.
Lukas: But they do book $500 resorts.
Yuan: Yeah, so you can see it’s kind of an interesting discussion to have a different business and have a different model. One opportunity I would love for us to see, which, I don’t see that many platforms get there, is building the multi-channel together, to understand is Yuan on a desktop and also on mobile? That I think is the next generation insight I would love to get. With that, then you can understand the usage model between desktop and mobile. That’s just my thoughts. I personally haven’t seen a lot of winners on the mobile for me.
Lukas: Conversion is definitely lower on mobile.
Lukas: Depending on the platform, too. For us, on tablet it’s a lot higher than on m-dot.
Peep: I’ve heard a number being thrown around that a typical mobile conversion rate is 25% of desktop. Is that similar to your experience or not at all?
Yuan: No. Yeah, I think it’s a little lower. For us it’s a little lower. From what I’ve seen it’s actually a little lower than that.
Peep: All right, awesome. Moving on, testing calculators. We mentioned Evan Miller’s sample size calculator. Any other tools you’d recommend?
Matt: You can actually, as a little exercise for yourself, take the t-test, do a little algebra, and make your own in Excel. So if this is your thing, I kind of recommend doing that.
We used to do that back when I was in the agency back in the ’90s. We would do tests when we were sending out mail campaigns and whatnot. That’s what we would do. We would just write our own little thing. It’s just a little bit of algebra actually.
Peep: Life’s easy when you’re a PhD machine learning dude, isn’t it?
Matt: No, I think it’s worthwhile because then you actually get a better sense of what’s happening and you can understand that actually as you alter your N, as you increase your sample size, you can make almost anything significant. So I do recommend actually, that is a worthwhile little couple hours to do. It’s not that hard.
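Matt's "little bit of algebra" boils down to the pooled two-proportion z-test, and his point about N is easy to demonstrate: the same observed rates flip from insignificant to significant as the sample grows. The counts below are invented:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates,
    using the pooled z-test (the normal-approximation version of
    the t-test Matt mentions)."""
    pa, pb = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (pb - pa) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Identical observed rates (5.0% vs 5.2%), very different sample sizes:
print(two_proportion_p_value(500, 10_000, 520, 10_000))              # not significant
print(two_proportion_p_value(50_000, 1_000_000, 52_000, 1_000_000))  # significant
```

Seeing the same 0.2-point gap cross the threshold purely because N grew is exactly the "you can make almost anything significant" lesson.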
Peep: Screen, please.
Yuan: Whoever wants that tool: I know there are a couple of really smart Microsoft guys, actually one of the evangelists who started very early on, about 10 years ago. They actually had a calculator on their website. Whoever asked that question, send me an email and I will send you the link so you can have the calculator. But to your point, you can absolutely build-
Matt: First you have to show your work. You’ve got to send your work to her and then she’ll give you the cleaned up version.
Audience: [inaudible 00:33:00]
Audience: [inaudible 00:33:03]
Peep: VWO? Yeah, Optimizely has one as well. Yeah, for sure.
Okay, next question. What is your recommendation for storing and organizing hundreds of testing documents? I want my team to be able to reference old test learnings quickly and easily.
Lukas: You talked about shelf life. Why do you want this?
Peep: To learn from past tests. “Hey, we learned this and this about our customers. So maybe they respond well to these things.”
Lukas: Do you think that’s still true?
Audience: [inaudible 00:33:40]
Yuan: Yeah, so, some things on that. In a large organization we ran about 400-500 tests a year, so to your point, there’s a lot of learning. Not necessarily winners, but what didn’t work? Let’s not repeat the things we knew didn’t work in the past. So there are two ways you can do it. You can use tools like Jira, just put it in Jira, or you can use Confluence pages, but build a good indexing scheme: “Here’s the test idea, here’s the site, here’s the concept, and here are the results.” Easily searchable, right?
So that would be my suggestion: put it in something like a knowledge-base tool, and structure it well. Don’t just attach the slide deck so people have to go there and dig; that’s not scalable. Really structure it in a way people can search and look things up very quickly. That would be my suggestion. That’s what we’ve done.
Audience: [inaudible 00:34:32-00:34:38]
Lukas: Yeah, we use our internal A/B testing tool. Then everything is in one place, so when you start a test you have to write your own hypothesis, and when you end or full-on a test you have to explain why. All these results can be found by anyone in the company, so if you want to know what tests the hotel page team has been running, you can see what they’ve been working on, what worked, and what didn’t. So that’s all custom-built. I don’t know of any tools that will help you do that.
Peep: Awesome. Okay, next question. Regarding unusual variants in A/A tests, have you seen this more or less frequently on different testing platforms like Adobe Target, Optimizely, etc.?
Lukas: Yes. We run lots of A/A tests continuously to check whether the tool is still working. Sometimes they go out of whack: they'll be more positive or negative than you expect. Usually it's a bug. It happens. That's why you run these tests. You run A/A tests to figure out whether the tool is still working.
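The continuous A/A checks Lukas describes can be sketched as a quick simulation. This is a hypothetical illustration, not Booking.com's actual tooling; the sample sizes, rates, and the two-proportion z-test are all assumptions. The idea: with identical traffic in both arms, a healthy tool should flag "significance" only around 5% of the time.

```python
import math
import random

def aa_test(n_per_arm=5_000, true_rate=0.05, seed=0):
    """One simulated A/A test: both arms get identical traffic,
    then a two-proportion z-test compares their conversion rates."""
    rng = random.Random(seed)
    conv_a = sum(rng.random() < true_rate for _ in range(n_per_arm))
    conv_b = sum(rng.random() < true_rate for _ in range(n_per_arm))
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = (conv_a - conv_b) / (n_per_arm * se) if se else 0.0
    return abs(z)

# A working tool should cross |z| > 1.96 only ~5% of the time;
# a rate far above that suggests a bug in the bucketing or tracking.
false_alarms = sum(aa_test(seed=s) > 1.96 for s in range(100))
print(f"A/A 'significant' rate: {false_alarms}%")
```

If that rate drifts well above 5%, the split or the logging is broken, which is exactly the kind of bug these continuous checks are there to catch.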
Matt: I think the question is, they’re trying to pick between this limited set of tools based upon the A/A tests.
Peep: Conductrics, anyone?
Matt: I’m just kidding. I don’t know. [inaudible 00:35:52]
Peep: Poor Lukas. Okay, let's give the other guys one question without Lukas. How do you ensure that your winning tests actually translate into an overall conversion increase for bottom-line growth?
Yuan: If I may clarify that question, winning test, depending on what KPI-
Peep: Alex, can you clarify?
Alex: Yeah, as all your tests add up, sometimes the overall conversion rate doesn't increase to match.
Yuan: So by the way, it's kind of funny. We've actually done that before. I won't name names. We were doing that: here are the five or six winning tests, adding them up I should see 80%. No, I don't see 80%. I see single digits. This is where I think there is a canceling effect and there is a shelf life. So don't add it up. Make it a learning: I just need one better than the other. I don't know if it's 20% better, but I just know it's better. So you can't take the 20% and stack the others on top of it. It never stacks up together. I completely agree. Yeah.
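Yuan's point that winning lifts never stack up has a statistical face: selection bias. A hedged sketch (all numbers invented for illustration): even when every variant is truly 5% better, the tests you crown as big winners will, on average, report a much larger lift than the truth, so summing reported lifts overstates bottom-line growth.

```python
import random

def measured_lift(true_lift=0.05, base=0.05, n=5_000, rng=None):
    """Simulate one A/B test and return the noisy *measured* lift."""
    rng = rng or random.Random()
    conv_a = sum(rng.random() < base for _ in range(n))
    conv_b = sum(rng.random() < base * (1 + true_lift) for _ in range(n))
    return (conv_b - conv_a) / conv_a if conv_a else 0.0

rng = random.Random(42)
results = [measured_lift(rng=rng) for _ in range(200)]

# Crown only the tests whose reported lift cleared a 10% bar.
winners = [m for m in results if m > 0.10]
avg_winner = sum(winners) / len(winners)
print(f"true lift: 5%, average reported lift among 'winners': {avg_winner:.0%}")
```

Every "winner" reports at least 10% even though nothing was ever more than 5% better; adding those reported numbers up promises growth that will never show on the bottom line.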
Peep: Hey, Lukas. You’re a good boy. You can come back now. All right, Lukas’ presentation talked about the importance of multi-arm bandit testing, but then discussed the need to not continually change test percentages. Could he clarify which he prefers and why?
Lukas: You can answer that.
Matt: That’s directly to you.
Lukas: They asked me not to answer any questions.
Matt: [inaudible 00:37:23]
Peep: Go ahead, Lukas.
Lukas: Sorry, I can’t actually see the whole question here.
Matt: The question is why he used a bandit with varying allocations but then says not to keep changing the test percentages. The one thing that gave me pause in your answer, and it gets back to this notion of drift or whatnot, is that your response was, "Well, look, if we do it a week before Easter or the week after, it could be different." But that's just as true if you ran it for two weeks and those two weeks were right before Easter, or three weeks before Easter, and then you're going to go live forever with those selective results.
The environment can still change out from underneath you anyway. So you're always kind of facing that risk. In a way the whole methodology is predicated on this type of problem, because we run a randomized allocation across a certain time period. Then we make a discrete choice, and then we play one, and that's the most extreme version of altering it.
So in terms of a decision problem where we’re just trying to select one from another, I guess it’s okay if we assume that we have a sort of stationary environment. If you want to interpret your statistical results, if that P value or your confidence interval, your standard error is going to have meaning, then you can’t do that, I think.
I think if you're going to interpret the results under the null hypothesis testing framework, then you probably shouldn't, but I'm not certain. But if you just want to use it as a sort of selection criterion, then you're exposed to that risk anyway, always. The point about shelf life is the same as the notion that your environment changes.
Peep: Okay, so one magical test for every website that always works. What is it?
Matt: Yeah, give it away for free.
Peep: Make everything free?
Matt: Yeah, that works.
Peep: All right. I need to hire and train a conversion optimization guy to work in-house. How would I pick the right person for my company? What tools and resources do you recommend?
The second skill set that's important is the data science and analytics element. So these are the two skill sets I would look for in in-house expertise. I wouldn't look for program management, because process can kick in later; you've got to let the tools do their job and have people who know how to interpret things. I would go for that before the process itself. Does that answer your question?
Peep: All right, next question. Tell me more about an upper and lower funnel test.
Lukas: This is for Yuan.
Peep: This was in Yuan’s presentation.
Yuan: Yeah, when I say upper funnel, a lot of times I mean the browse layer: the home pages, the navigation pages, product details, and category pages. I call that upper funnel. The lower funnel is, for example, the conversion funnel in the store, like the cart and checkout. The question is, "Can I run these concurrently?" I think Lukas already answered that question. Yes, you can, but before you do, definitely run, like I said, an A/A in the upper funnel and a couple of A/As in the lower funnel to make sure the traffic is really splitting equally between the recipes. That way you can cancel out the noise as you go down the funnel.
Audience: [inaudible 00:41:26]
Peep: Hold the mic closer, please.
Yuan: Oh yeah. I was just saying that when you run upper and lower funnel tests together, before you run them, run an A/A in the upper funnel and a couple of A/As in the lower funnel to make sure that when you split traffic 50/50, it's still coming down evenly between the two recipes. That means the noise is canceling out. Not necessarily all the time, so you want to watch for that. Once that has proven out to be accurate, then you can run the upper and lower funnel simultaneously without too much worry about the noise in there.
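Yuan's pre-flight check on the traffic split amounts to a sample-ratio test. A minimal sketch, assuming a 50/50 intended split and a simple binomial z-check; the function name and threshold here are my own, not a specific tool's:

```python
import math

def split_looks_even(n_a, n_b, z_crit=1.96):
    """Under a true 50/50 split, the count in arm A is Binomial(n, 0.5).
    Flag the split as suspect if it is too lopsided to be chance."""
    n = n_a + n_b
    z = (n_a - n / 2) / math.sqrt(n * 0.25)
    return abs(z) <= z_crit

print(split_looks_even(10_050, 9_950))  # mild wobble: True, looks fine
print(split_looks_even(10_600, 9_400))  # lopsided: False, investigate
```

Running this on the visitors who actually reach the lower funnel, before trusting any lower-funnel test, is the check Yuan is describing.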
Peep: All right. Two last questions. What are the biases which you are aware of that are influencing your test results or testing strategies? Biases.
Lukas: You mean human biases or statistical biases?
Peep: Confirmation bias? That’s what a bias is. Who asked the question? Jakub?
Jakub: Just anything that you're aware of that's influencing your ability to set up your tests.
Lukas: I think one of the speakers already mentioned confirmation bias. That's definitely one, because most of the biases that are influencing the results are human biases, right? So things like confirmation bias or the IKEA effect, the endowment effect: "I built this, so it must be awesome."
Peep: Don’t mind me.
Lukas: I’m scared.
Peep: They asked me to. They made me.
Lukas: I don’t like this game.
Peep: Okay, last question. How do you convince a client to keep testing after several inconclusive tests? The client is saying, “This shit’s not working.” So how do you convince them to keep on trying?
Yuan: If I can give one suggestion, I think it's a good question. I think we should talk a little bit more. First of all, A/B testing is not a sprint, it's a marathon. There is that win rate of about 30%. That's a really important statistic. Don't set the expectation way up here; start from down here. You can never over-communicate, especially if you're trying to sell it, and above all, run more tests, because the more you run, the more chances you have to win. If you only run three, chances are that none of them work.
So it's setting expectations and using common industry practices and best practices. The level setting, the marathon nature, the non-sprint nature, is really important. Just don't oversell, really. Yeah.
Lukas: Yeah, that 30% estimate is actually high. For us it's much lower, because we've been doing this for seven, eight years now. At some point every pixel has been touched so many times. We don't get near 30% success. I wish we had a 30% success rate. That would be awesome.
Yuan: That’s a really excellent point. When you’re having to start from there, you have a lot of opportunity, you have an easy fix. By the time you get seven to eight years everything is pretty good already. It’s hard to find out what-
Lukas: You run out of low hanging fruit.
Yuan: Exactly. Yeah.
Peep: Well, people want more. You guys are doing a good job, so we’ll keep on going. Hey everybody, if anybody needs to leave, go ahead.
Lukas: So first you spank me and then you want more?
Matt: That’s usually how it goes.
Peep: Okay, okay. Why do all A/B test trend line patterns look the same at the beginning? Why are they always moving around in the beginning, the graphs?
Yuan: That’s an interesting question.
Matt: Why is there so much variance?
Matt: Because you have small sample sizes so that any deviation that comes in is going to radically move the average around. So if you think about, I don’t know, when you were back in elementary school and you had your weekly spelling tests that maybe you failed every week, so you’d try to go to bed before your father would come home so you wouldn’t get in trouble, but . . .
Peep: Which is why you got a PhD.
Matt: On your first couple of exams your average score can move around a lot, but then as you keep taking exams you're just going to converge to a fixed average. So that's why. You know what? This is going to sound a little flippant, but actually the number one takeaway I would have for everyone in this room is literally: go and build your own sample size calculator from the t-test, the t-score, because if you do that you will see why this happens. You'll see why you have that variance, because it all comes from the fact that we're looking at the standard error, and this is going to get a little technical, but it's divided by the square root of N.
So all of this stuff, all of our confidence, all of the notions of the variance of our samples, it’s all driven by one over the square root of N. That’s the key thing, and I believe that there’s no one in here who can’t just go through that exercise. You’ll be empowered. You won’t have to keep asking anyone, you’ll understand it deeply, and you’ll just be much more comfortable doing all this. So I do think it’s actually a very good exercise. I hadn’t really thought about it before, but I really recommend doing that.
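Matt's homework assignment fits in a dozen lines. A sketch under the usual normal approximation, with z-values hard-coded for the common alpha = 0.05, 80% power case (exact formulas vary by source):

```python
import math

def standard_error(p, n):
    """SE of a conversion-rate estimate: shrinks as 1 / sqrt(n)."""
    return math.sqrt(p * (1 - p) / n)

def sample_size_per_arm(base_rate, lift, z_alpha=1.96, z_beta=0.84):
    """Rough two-proportion sample size for a given relative lift."""
    p1, p2 = base_rate, base_rate * (1 + lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# The early wobble in every test graph: at n=100 the standard error is
# 10x what it is at n=10,000, so early averages swing wildly.
print(standard_error(0.05, 100) / standard_error(0.05, 10_000))
print(sample_size_per_arm(0.05, 0.20))  # visitors per arm for a 20% lift
```

The 1/sqrt(N) term is exactly why every trend line thrashes at the start and settles down later: the same wobble, divided by an ever-larger square root.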
Lukas: You can stop asking us.
Yuan: Yeah, that’s a great suggestion. That’s an excellent point.
Peep: All right. How much do conversion services cost and what do you charge your clients? Well, these guys are not consultants. They're not charging anything. They're being paid by their employers. So if you want to ask me how we charge, approach me later. For Yuan: we are looking at Adobe's Test&Target. It's called Adobe Target now, I think.
Yuan: Yeah, it is.
Peep: How valuable do you see Test&Target, and do you believe you have achieved an ROI with it? So cost versus outcome.
Yuan: Yeah, like I said, it depends on what your organization is, how big, and how robust. Test&Target, which they call Adobe Target now, is a very technical tool. So you definitely need a developer to be able to code a lot of this, but it's very robust. You can test a lot of things. We use it to test things in cart and checkout, which is incredibly hard; a lot of tools can't test those areas. It's very robust.
Now, do I believe in ROI of the tool? I believe in ROI of A/B testing, regardless of the tool. I think it’s the mentality that continuous iteration, believing that A/B testing will make the customer experience better, is going to give you the ROI. Yeah, the tool is just a part of that equation, right? You have tool, you need the people, you need the time, you name it. There is definitely cost associated with that. Yeah.
Peep: Awesome. What does it mean when two treatments cross over each other in your testing tool? Is that a validity threat? Can you clarify the question, anonymous person?
Matt: I think they’re asking about overlapping tests.
Peep: So two tests on the same page running simultaneously? Is that . . . ?
Audience: [inaudible 00:48:41-00:48:50]
Yuan: So the trend is just switching left and right.
Peep: So A was winning, B is winning, A is winning, B is winning.
Yuan: Yeah, so that means the difference between those two is not statistically significant. There's no real difference; that's why they're switching back and forth. If that happens you won't see the P value anywhere near 0.05. The reason it's switching is that it couldn't tell the difference. Today it's better, tomorrow it's worse, and you won't get to confidence there. I've seen something like that in . . .
Lukas: Unless you have some very weird seasonality thing going on, but that’s very unlikely.
Yuan: Or today A is better than B, tomorrow B is better than A. So basically it’s . . .
Lukas: But it’s probably random.
Yuan: Yeah, statistically insignificant results, basically. Yeah.
Peep: Any thoughts on the new Optimizely stats engine?
Matt: My understanding . . . I don't know if I should say anything. You can look it up. I think it's based on Wald's sequential likelihood-ratio test. So you can wiki that. I guess I'm not going to talk about Optimizely.
My sense, though, is that in general you really don't want to be using the P value as a stopping rule. Whether or not that's what's happening here with the sequential testing, I'm not sure, but if it is, there's an additional risk called "magnitude bias". One of the issues with using a P value as a stopping rule, along with all the other reasons, is that you're basically putting a threshold on how large an effect size you can see between the two, right? And that's really what we want to estimate: our lift.
It’s like, let’s say, the difference between A and B. If you stop it using the P value at a certain point, you’ll only see magnitudes that are quite large because those are the only differences that would be statistically significant. Because of that, results that you get from doing an early stop with a P value, on average, will have higher magnitude or a higher effect size than if you didn’t because they’re conditioned on it.
So that is probably one of the reasons why we also tend to see, if you do that, you see these regressions back to the mean. There may be an effect, but whatever effect you see is going to be biased, in absolute value, much larger. So that’s an issue. I understand even in sequential tests, that’s still an issue.
Lukas: You’re pretty much always overestimating the effect size.
Matt: You have to overestimate the effect size.
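The magnitude bias Matt and Lukas are describing is easy to reproduce. A hedged simulation (the peeking rule, batch sizes, and rates are all invented for illustration): stop the moment the z-test crosses significance, and the lifts you record will dwarf the true 2% effect.

```python
import random

def peeked_test(true_lift=0.02, base=0.05, batch=500, max_n=10_000, rng=None):
    """Peek after every batch; stop the moment |z| > 1.96.
    Returns (stopped_early, measured_lift_at_stop)."""
    rng = rng or random.Random()
    conv_a = conv_b = n = 0
    while n < max_n:
        conv_a += sum(rng.random() < base for _ in range(batch))
        conv_b += sum(rng.random() < base * (1 + true_lift) for _ in range(batch))
        n += batch
        p_a, p_b = conv_a / n, conv_b / n
        pooled = (conv_a + conv_b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se and abs(p_b - p_a) / se > 1.96:
            return True, (p_b - p_a) / p_a if p_a else 0.0
    return False, (conv_b - conv_a) / conv_a if conv_a else 0.0

rng = random.Random(1)
runs = [peeked_test(rng=rng) for _ in range(100)]
early_wins = [lift for stopped, lift in runs if stopped and lift > 0]
avg_early_win = sum(early_wins) / len(early_wins) if early_wins else 0.0
print(f"true lift 2%; average lift recorded at early stops: {avg_early_win:.0%}")
```

At small n only a huge measured difference can cross the threshold, so the recorded "winners" are inflated by construction: the upward bias Lukas sums up as always overestimating the effect size.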
Lukas: My understanding of what Stats Engine is doing is that they adjust the P value as the test is going on. So they lower the threshold. The thing is, "statistically significant" doesn't mean "business-wise significant", right? You can have something that's statistically significant and completely irrelevant to your business. So that's one.
The other thing is that by doing this they're controlling for false positives. So you have fewer false positives, and that's good for them, because that means when a test is positive, it's more likely a real result. Now, that's what Optimizely wants, because they're selling you a tool that helps you do testing, and they want you to find results that are real. What it also does is increase the false negatives. So you have fewer false positives, fewer times when you thought there was an effect when there wasn't, but you're also missing more real results.
As a scientist I think this is great, because I want real results. As a bandit I think this is terrible because I am missing things that are actually helping my business because I am being a dick about my P value.
Matt: I’m going to just gently disagree with you.
Lukas: Come on, violently. Come on. We can do this.
Matt: No, no, but you do have this magnitude bias though, regardless. I’m not going to talk about Optimizely, but if you’re doing that type of process, it’s just something to keep in mind if you want to use it, because if you’re forecasting you’re going to be upward-biased. I’ll leave it at that.
Lukas: Yeah, I agree.
Peep: All right. I think we have achieved the stopping rule because people want to know what’s for dinner. So thank you so much, guys.
Matt: Thank you.
Lukas: Thank you.