When To Do Multivariate Tests Instead of A/B/n Tests

When should you use multivariate testing, and when is A/B/n testing best?

The answer is at once simple and complex.

Of course, A/B testing is the default for most people, as it is more common in optimization. But there is a time and a place for multivariate testing (MVT), as well, and it can add a lot of value.

Before we get into the nuances, let’s briefly go over the differences.

What is multivariate testing?

Multivariate testing is, in a sense, a more complex form of testing than A/B testing. A/B testing is fairly straightforward: you split traffic between two versions of a page and see which one performs better.

You can also measure the performance of three or more variations of a page with A/B/n tests. As Yaniv Navot of Dynamic Yield wrote, “High-traffic sites can use this testing method to evaluate performance of a much broader set of variations and to maximize test time with faster results.”

Conceptually, an A/B/C/D test splits traffic between four versions of the same page: the control (A) and three challengers (B, C, and D).

A/B testing usually involves fewer combinations with more extreme changes, whereas multivariate tests have a large number of variations that usually have subtle differences.

Lars Nielson of Sitecore described it as the following:

“Multivariate testing, opposes the traditional scientific notion. Multivariate testing is the process of testing more than one component on the web site in a live environment. Essentially, it can be described as running multiple A/B/n tests on the same page, at the same time.”

The Case For A/B/n Tests

Should you use MVT or A/B/n tests?

If you have enough traffic, use both. They both serve different yet important purposes. In general, A/B tests should be your default, though.

With A/B testing:

  • You can test more dramatic design changes
  • Tests usually take far less time than MVTs
  • Advanced analytics can be installed and evaluated for each variation (e.g., mouse tracking, phone call tracking, analytics integration)
  • Individual elements and interaction effects can still be isolated for learning and customer theory building
  • You typically get bigger gains (since you often test bigger changes)

A/B testing tends to get meaningful results faster. The changes between pages are more drastic, so it’s easier to tell which page is more effective.

So A/B testing harnesses the power of large changes, not just tweaking colors or headlines as is sometimes the case with MVT. Optimizers usually start all engagements with A/B testing, because that’s where the bigger gains are possible.

Yaniv Navot, Director of Online Marketing at Dynamic Yield, also mentioned that MVT is mainly used for smaller tweaks. He also mentioned that A/B tests are better for multi-page and multi-scenario experiences:

Yaniv Navot:

“Multivariate testing tends to encourages marketers to focus on small elements with little, or no impact at all. Instead, marketers should focus on running programmatic and dynamic A/B tests that enable them to serve segmented experiences to multiple cohorts across the site. This cannot be achieved using traditional multivariate testing.”

Something else to worry about with MVT: the amount of traffic you get.

How Much Traffic Do You Get?

Because of the additional variations, multivariate tests require a lot of traffic or, failing that, a high conversion rate.

For example, a 3×2 test (testing 2 different versions of 3 design elements) would require the same amount of traffic as an A/B test with 8 variations (2^3). 3×2 is a typical MVT.
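The arithmetic behind that count can be sketched in a few lines of Python. This is only an illustration: the element names and the helper functions `count_cells` and `list_cells` are made up for the example. It shows that the number of combinations is the product of the number of versions of each element, so the count depends on which number is the exponent.

```python
from itertools import product
from math import prod

def count_cells(versions_per_element):
    """Number of combinations in a full factorial test."""
    return prod(versions_per_element)

def list_cells(elements):
    """Every combination of element versions, one dict per combination."""
    names = list(elements)
    return [dict(zip(names, combo)) for combo in product(*elements.values())]

# Hypothetical page elements; the names and versions are illustrative only.
elements = {
    "headline": ["current", "new"],
    "hero_image": ["current", "new"],
    "cta_button": ["current", "new"],
}

cells = list_cells(elements)
print(len(cells))              # 2 versions of 3 elements: 2 * 2 * 2 = 8
print(count_cells([3, 3]))     # 3 versions of 2 elements: 3 * 3 = 9
print(1 / len(cells))          # share of traffic per combination if split evenly
```

Adding one more two-version element doubles the number of combinations, which is why traffic needs grow so quickly.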

In a full factorial multivariate test, your traffic is divided evenly among all variations, which multiplies the amount of traffic necessary for statistical significance. As Leonid Pekelis, Statistician at Optimizely, said, this results in a longer test run:

“All together, the main requirement becomes running your multivariate test long enough to get enough visitors to detect many, possibly nuanced interactions.”

Claire Vo, co-founder of Experiment Engine, also said that MVT is more difficult to execute because of the extra traffic and resources it requires:

Claire Vo:

“MVT tests require significantly more investment on the technology, design, setup, and analysis side, and certainly full-factorial MVT testing can burn through significant traffic (if you even have the traffic to support this testing method.) This means MVT testing can be a big burden on your conversion “budget”–whether that’s time, people, resources, or internal support.”

A rule of thumb: if your traffic is under 100,000 uniques/month, you’re probably better off doing A/B testing instead of MVT. The only exception would be the case where you have high-converting (10% to 30% CR) lead gen pages.

In addition, if you’re an early stage startup and you’re still doing customer development, it’s too early for MVT. You may end up with the best performing page, but you won’t learn much. By doing everything at once, you miss out on the ups and downs of understanding the behavior of your audience.

That said, there are definitely some high-impact use cases for MVT.

When Should You Use a Multivariate Test?

Multivariate tests are about measuring interaction effects between independent elements to see which combination works best. As Ton Wesseling, founder of Testing Agency, put it:

Ton Wesseling:

“When to use MVT? There’s only one answer: if you want to learn about interaction effects. An A/B-test with more than 1 change could not be winning because of interaction effects. A winning new headline could be unnoticed because the new hero shot is pointing attention to a different location on the page. If you want to learn real fast which elements on your page create impact: do a MVT with leaving in and out current elements.”

Paras Chopra from VWO said he’d use MVT for optimizing several variables, but not expecting a huge lift. More for incremental improvements on multiple elements:

Paras Chopra:

“I’d use multivariate test when I’m doing optimization with several variables, not hoping for a wild swing (that we expect in A/B test). I think the right way is to use A/B test for large changes (such as overhauling entire design) and such. A/B test could be followed up with MVT to further optimize headlines, button texts, etc.”

The Benefits of Multivariate Tests

MVT is awesome for follow-up optimization on the winner from an A/B test, once you’ve narrowed the field.

While A/B testing doesn’t tell you anything about the interaction between variables on a single page, MVT does. This can help your redesign efforts by showing you where different page elements will have the most impact.

This is especially useful when designing landing page campaigns, for example, as the data about the impact of a certain element’s design can be applied to future campaigns, even if the context of the element has changed.

Andrew Anderson, Head of Optimization at Malwarebytes, explained that MVT is used to figure out which item on the page is most influential and then go much deeper on it:

Andrew Anderson:

“It is not about ‘I want to see what happens with 3 pieces of copy, 4 images, and a small CTA.’ The question should be what matters most, the copy, the image, or the CTA, and whatever matters most I am going to test out 10 versions (and learn something important).

AB Testing can never tell you influence, MVT can when it is done right. ANOVA analysis gives you mathematical influence, or the relative amount one factor influences behavior relative to others.”

So a big goal of multivariate testing is to let you know which elements on your site play the biggest role in achieving your objectives.

ANOVA? A Quick Definition

ANOVA (analysis of variance) is a “collection of statistical models used to analyze the differences among group means and their associated procedures.”

In simple terms, when comparing two samples, we can use the t-test – but ANOVA is used to compare the means of more than two samples.
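To make the idea concrete, here is a minimal sketch of the F statistic behind a one-way ANOVA, written with only the Python standard library. The function name `one_way_anova_f` and the sample numbers (order values for three page variations) are invented for illustration. A real analysis would normally rely on a statistics package and would also report a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of samples."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical order values for three variations of a page.
variations = [
    [10, 12, 11, 13],
    [14, 15, 13, 16],
    [10, 11, 9, 12],
]
f_stat = one_way_anova_f(variations)
print(f_stat)
```

The larger the F statistic, the more the group means differ relative to the noise within each group. It is compared against an F distribution with (k - 1, N - k) degrees of freedom to judge significance.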

So if there are certain use cases for multivariate tests, then there are certain ways to execute them. What are the conditions and requirements of running successful multivariate tests?

Multivariate Testing: How To Do It Right

The one big condition for running MVT, according to Paras Chopra: “Lots and lots of traffic.” Running MVT accurately is therefore largely a matter of understanding your traffic needs and avoiding false positives.

Common Mistakes with running MVT

Though many of the common mistakes of MVT aren’t unique (many apply to A/B testing as well), some are specific to multivariate methods. But they’re pretty much as you’d guess:

  1. Not enough traffic.
  2. Not accounting for increased chance of false positives.
  3. Not using MVT as a learning tool.
  4. Not using MVT as a part of a systemized approach to optimization.

1. Not Enough Traffic

We already talked about traffic above, but to reiterate: MVT requires lots of traffic. Fractional factorial methods mitigate this, but there are some questions as to the accuracy of this method.

The increased traffic requirement also raises the question of how long you should expect the test to run. This is especially true if you’re using MVT as a way to throw things at the wall and see what sticks (inefficient).

One thing you should definitely do is estimate the traffic needed for significant results before you start. A sample size calculator will do this for you.
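As a rough idea of what such a calculator does, here is a sketch based on the textbook normal-approximation formula for comparing two conversion rates. Everything specific in it is an assumption for illustration: the function names, the 5% baseline conversion rate, the 10% relative lift, and the defaults of 5% significance and 80% power. It also ignores any correction for multiple comparisons, so treat the output as a lower bound for an MVT.

```python
from math import ceil, sqrt
from statistics import NormalDist

def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH variation to detect the given lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def total_visitors(num_variations, baseline_rate, relative_lift):
    """Total traffic if every variation gets an equal share."""
    return num_variations * visitors_per_variation(baseline_rate, relative_lift)

per_cell = visitors_per_variation(0.05, 0.10)
print(per_cell)                      # visitors needed per variation
print(total_visitors(2, 0.05, 0.10))  # simple A/B test
print(total_visitors(8, 0.05, 0.10))  # MVT with 8 combinations
```

Dividing the total by your monthly traffic to the page gives a first estimate of how long the test would have to run.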

Leonid from Optimizely discussed ways to get around the need for crazy amounts of traffic, including the fractional factorial method (we’ll discuss more below):

Leonid Pekelis:

“There’s another approach to reducing the need for more visitors in a multivariate test – examine fewer interactions (e.g. only 2-way interactions). This is where things like fractional factorial designs come in. You can reduce the required number of visitors by quite a lot if you use fractional factorial instead of full factorial, but you only get to see part of the interaction picture. Things get complicated pretty quickly when you look at all the different design methods out there.

One other use of multivariate tests if you don’t have tons of traffic: start by running a full factorial just to check that none of your changes interact to break your site, you’ll notice those pretty quickly, and then switch to running A/B/n tests to see which changes outperform their baseline.”

Matt Gershoff, CEO of Conductrics, countered that an MVT does not necessarily require more data than a related set of simple A/B tests. In fact, he says, for the same number of treatments to be evaluated and the same independence assumptions that are implicitly made when running separate A/B tests, an MVT actually requires less data. He continues:

Matt Gershoff:

“Regardless of the type of test you decide to run, there are always two steps: 1) Data Collection; and 2) Data Analysis. One can always collect the data in a multivariate way (Full Factorial), and then analyze the data assuming that there are no interactions (Main Effects), or with interactions (we can even pick the degree of the interaction, based on the number of dimensions of the test).

This is why, collecting the data using a full factorial design is nice, because we can analyze it with any degree of interaction we choose – including zero interactions. The only cost, at least in the digital environment, is that we need to have more cells in our database to hold all of the test combinations. If we collect the data in fractional manner our analysis will be constrained based on the nature of the fractional design we used.

Unfortunately, there is no free lunch. Many who balk at the use of main effects MV tests because of concerns about test interactions happily recommend running separate AB tests – which also IMPLICITLY ASSUMES NO INTERACTION EFFECTS (independence) and requires even more data to evaluate.”
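The distinction between collecting data in a full factorial way and analyzing it with or without interactions can be sketched as follows. This is an illustrative example, not a description of any particular tool: the 2×2 results, the factor names, and the helper functions `main_effects` and `interaction_2x2` are all made up.

```python
from collections import defaultdict

# Hypothetical 2x2 full factorial results:
# (headline version, image version) -> (visitors, conversions)
cells = {
    ("A", "A"): (1000, 50),
    ("A", "B"): (1000, 60),
    ("B", "A"): (1000, 70),
    ("B", "B"): (1000, 80),
}
factor_names = ("headline", "image")

def main_effects(cells, factor_names):
    """Conversion rate per factor level, pooling over the other factors.

    This is the "no interactions" analysis: each level is judged on all
    the traffic that saw it, whatever the other elements looked like.
    """
    totals = defaultdict(lambda: [0, 0])  # (factor, level) -> [visitors, conversions]
    for levels, (visitors, conversions) in cells.items():
        for factor, level in zip(factor_names, levels):
            totals[(factor, level)][0] += visitors
            totals[(factor, level)][1] += conversions
    return {key: conv / vis for key, (vis, conv) in totals.items()}

def interaction_2x2(cells):
    """Difference of differences: does the image effect depend on the headline?"""
    rate = {key: conv / vis for key, (vis, conv) in cells.items()}
    image_effect_with_b = rate[("B", "B")] - rate[("B", "A")]
    image_effect_with_a = rate[("A", "B")] - rate[("A", "A")]
    return image_effect_with_b - image_effect_with_a

effects = main_effects(cells, factor_names)
print(effects)
print(interaction_2x2(cells))
```

In the main effects analysis, each headline estimate rests on 2,000 visitors rather than the 1,000 in a single cell, because the same traffic is reused for every factor. That reuse is the reason the same data can go further than separate A/B tests would.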

2. Not accounting for increased chance of false positives.

According to Leonid, the most common mistake in running multivariate tests is not accounting for the increased chance of false positives. His thoughts:

Leonid Pekelis:

“You’re essentially running a separate A/B Test for each interaction. If you’ve got 20 interactions to measure, and your testing procedure has a 5% rate of finding false positives for each one, you all of a sudden expect about 1 interaction to be detected significant completely by chance.

There are ways to account for this, they’re generally called multiple testing corrections, but again, the cost is you tend to need more visitors to see conclusive results.”
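The numbers in that quote are easy to check. The sketch below assumes the 20 tests are independent, which is a simplification, and uses the Bonferroni correction as one simple example of a multiple testing correction (the quote does not name a specific method). The function names are illustrative.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Average number of false positives when no real effects exist."""
    return num_tests * alpha

def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_alpha(num_tests, alpha=0.05):
    """Stricter per-test threshold that keeps the overall rate near alpha."""
    return alpha / num_tests

print(expected_false_positives(20))   # about 1 false positive expected
print(family_wise_error_rate(20))     # roughly a 64% chance of at least one
print(bonferroni_alpha(20))           # per-test threshold of about 0.0025
```

The stricter threshold is what drives up the visitor requirement: each individual comparison now has to clear a much higher bar.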

We’ve written a full account of the multiple comparison problem before.

3. Not using MVT as a learning tool.

As we mentioned in a previous article, optimization is really about “gathering information to inform decisions.” MVT is best used as a learning tool. Using it as a way to drive incremental change and throw stuff at the wall is inefficient and takes time away from more impactful A/B tests. Andrew Anderson put it well in an article on his blog:

Andrew Anderson:

“The less you spend to reach a conclusion, the greater the ROI. The faster you move, the faster you can get to the next value as well, also increasing the outcome of your program. What is more important is to focus on the use of multivariate as a learning tool ONLY, one that was used to tell us where to apply resources. One that frees us up to test out as many resources for feasible alternatives on the most valuable or influential factor, while eliminating the equivalent waste on factors that do not have the same impact. The goal is to get the outcome, getting overly caught up in doing it in one massive step as opposed to smaller easier steps, is fool’s gold.”

4. Not using MVT as a part of a systemized approach to optimization

Similarly, many MVT mistakes come from people not knowing what they’re planning to do, or not having a testing plan at all. As Paras Chopra put it:

Paras Chopra:

“The biggest mistake is not knowing what they expect out of an MVT. Are they expecting to see best combination of changes or they want to know which element (headline, button) had the maximum impact?”

Andrew Anderson puts it in perspective, saying that if you’re using either A/B testing or MVT just to throw stuff against the wall or to validate hypotheses, this will only lead to a personal optimum (i.e., ego fulfillment). He continues, saying that “tools used correctly to maximize results and maximize resource allocation for future efforts leads to organizational and global maximum.”

Now, I mentioned above that there were different statistical methods for MVT. There’s a bit of a debate between them. Does it matter?

Full Factorial, Fractional Factorial…Does it Matter?

There are a few different methods of multivariate testing:

  • Full factorial
  • Fractional factorial
  • Taguchi

There’s a bit of an ideological debate between the methods, as well.

Full Factorial Multivariate Testing

A full factorial experiment is “an experiment whose design consists of two or more factors, each with discrete possible values or “levels”, and whose experimental units take on all possible combinations of these levels across all such factors.”

In other words, full factorial MVT tests all combinations with equal amounts of traffic. That means that it:

  • is more thorough, statistically.
  • requires a ton of traffic.

Paras Chopra wrote in Smashing Magazine a few years ago:

“If there are 16 combinations, each one will receive one-sixteenth of all the website traffic. Because each combination gets the same amount of traffic, this method provides all of the data needed to determine which particular combination and section performed best. You might discover that a certain image had no effect on the conversion rate, while the headline was most influential. Because the full factorial method makes no assumptions with regard to statistics or the mathematics of testing, I recommend it for multivariate testing.”

Fractional Factorial Multivariate Testing

Fractional factorial designs are “experimental designs consisting of a carefully chosen subset (fraction) of the experimental runs of a full factorial design.”

So fractional factorial experiments show visitors only a subset of the possible combinations and estimate the effect of each element from that subset. Because of that, they require less traffic.
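A small sketch shows both the saving and the price. It builds a standard half fraction of a three-element, two-version test by keeping only the runs where the product of the coded levels is +1. The -1/+1 coding and the function names are illustrative conventions, not something prescribed by any testing tool.

```python
from itertools import product

def full_design(num_factors):
    """All runs of a two-level design, coded -1 (current) and +1 (new)."""
    return list(product((-1, 1), repeat=num_factors))

def half_fraction(num_factors):
    """Keep only the runs where the product of all factor levels is +1."""
    runs = []
    for run in full_design(num_factors):
        sign = 1
        for level in run:
            sign *= level
        if sign == 1:
            runs.append(run)
    return runs

full = full_design(3)     # 8 combinations
half = half_fraction(3)   # 4 combinations
print(len(full), len(half))

# In the half fraction, factor A's column is identical to the B*C
# interaction column, so the two effects cannot be told apart.
aliased = all(a == b * c for a, b, c in half)
print(aliased)
```

Half the combinations means roughly half the traffic, but the effect of the first element is now mixed up ("aliased") with the interaction between the other two. This is the sense in which a fractional design only shows part of the interaction picture.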

Though, an Adobe blog post likened fractional factorial design to a barometer, saying “a barometer measures atmospheric pressure, but its value is not so much in the precise measurement as the notification that there is a directional change in pressure.”

The same article then also said:

“I question how valuable it is to spend 5 months running 1 single test for learnings that may no longer be applicable by the time the test has completed and the data pumped through analysis. Instead, why not take the winnings and learnings of your week-long fractional-factorial multivariate test and then run another test that builds off that new and improved baseline?”

Taguchi Multivariate Testing

This is a bit more esoteric, so it’s best not to worry about it. As Paras wrote in Smashing Magazine:

“It’s a set of heuristics, not a theoretically sound method. It was originally used in the manufacturing industry, where specific assumptions were made in order to decrease the number of combinations needing to be tested for QA and other experiments. These assumptions are not applicable to online testing, so you shouldn’t need to do any Taguchi testing. Stick to the other methods.”

Further reading: The Truth About Taguchi by Tim Ash

So does it matter?

As mentioned above, most of the debate lies in the murkier statistics of the fractional factorial method. Many of the optimizers I talked to said they only recommend full factorial. As Paras explains, “A lot of ‘fractional factorial’ methods out there are pseudo scientific, so unless the MVT method is properly explained and justified, I’d stick to full factorial.”

However, some, like Andrew Anderson, hold that these debates in general are misguided. As he explains:

Andrew Anderson:

“Debating which is better, partial or full factorial, at that point is useless because you are just arguing over what shade of green is one leaf in the large forest. MVT should be used to look for influence and focus future resources, in which case it is just a fit and data accessibility question. Any other use of MVT missed that boat completely and just highlights the lack of discipline and understanding of optimization.”

So does it really matter? I don’t know. If you have enough traffic, I think full factorial is harder to mess up. That said, you’re making business decisions that are time critical, so if a full factorial test will take you 6 months to complete, it’s probably not worth the accuracy.

Conclusion

If you have enough traffic, use both types of tests. Each one has a different and specific impact on your optimization program, and used together, they can help you get the most out of your site. Here’s how:

  • Use A/B testing to determine the best layouts.
  • Use MVT to polish the layouts to make sure all the elements interact with each other in the best possible way.

As I said before, you need to get a ton of traffic to the page you’re testing before even considering MVT.

Test major elements like value proposition emphasis, page layout (image vs copy balance, etc), copy length and general eyeflow via A/B testing, and it will probably take you 2-4 test rounds to figure this out. Once you’ve determined the overall picture, now you may want to test interaction effects using MVT.

However, make sure your priorities align with your testing program. Peep once said, “most top agencies that I’ve talked to about this run ~10 A/B tests for every 1 MVT.”

Comments

  1. Great post, many thanks.
    To add to the complexity of this discussion: What approach do you recommend for testing something like a loan calculator with different possible default values?
    – A/B/N Test in parallel, e.g. one round with A=control vs. B=different default in field 1 vs. C=different default in field 2 vs. B+C
    – A/B/N Test in sequence, e.g. 1st round: A=control vs. B, 2nd round: Winner 1st round vs. C
    – MVT is probably not applicable since you can’t hide fields completely

  2. Hi Alex,
    It’s helpful reading about the pros and cons of the A/B test and Multivariate Tests. The article is detailed and it’s cool learning new insights from it. At least I now understand what it takes to do a multivariate test. The examples are revealing!
    Having said that, I think my best takeaway in this post comes from the concluding part:

    If you have enough traffic, use both types of tests. Each one has a different and specific impact on your optimization program, and used together, can help you get the most out of your site.

    I left the above comment in kingged.com as well

  3. Hey nice article. Interesting to see how different people approach it.

    I want to mention that ANOVA would be a rather crude tool for analyzing the results of an MVT. It is only a comparison between means, so a lot of information is being lost. Additionally, conversions are a binary variable and with ANOVA you are comparing their transformations (a conversion rate), meaning you lose some signal. These problems show up huge in your DoE when calculating sample sizes – you need some ridiculous numbers there. To top all of this off, comparing means will only show you if there are differences in the distributions, not how big they are. Many people miss this and this leads to a lot of ridiculous blog posts claiming insane conversion lifts, which are simply not there. If you choose to include effect sizes, that will make your DoE even more complex, but worse – you risk losing a lot of information, especially in a multivariate scenario like this one.

    Some other ways to overcome the above problems:

    1) Using MANOVA (https://en.wikipedia.org/wiki/Multivariate_analysis_of_variance) instead. This is an extension of ANOVA meant specifically for such scenarios. It will fix the sample sizes problem up to a point. Additionally, you can include other variables besides the testing elements – (i.e. channel) to unlock even more insight, however those need to have some logic behind them or you risk overfitting.

    2) Logistic regression. That would be my go-to tool for the job. It addresses all the problems I outlined and you can add additional variables to it too (e.g., channel). You can quantify the effect every component/combination has, which is simply not possible with means tests. There are other pros, like for example testing only major variations and getting info about combos you did not think of and allocating more sample to them mid-flight. You can go full-nuts mode and test incredibly diverse scenarios using a nested extension, if your heart desires.

    3) CART would be an especially good option for the scenarios Andrew Anderson mentions.

    Simple ANOVA would be the tool I chose for A/B/n tests in simultaneous flight. A lot of people mention how you should only be having 1 test on a page, because of poor performance of t-tests. That’s the answer they are looking for.

    One thing I want to add is that when doing factorial designs you most definitely need to screen the combinations. More often than not there would be some that simply don’t make sense, so you can save a lot of testing sample by removing them.

    Would love to know what you think or want to expand on something.

    (I wrote this post on GH thread first, but it was considered spam, so I figured I’d post it here instead)

    1. Hi Momchil -long time ;)
      I am not sure I fully understand your comment, especially the suggestion to use MANOVA, but let me take a crack at it. I think you are raising three possible issues:
      1) Limited Dependent Variables: When running tests, you can have different measures: continuous outcomes (like sales amount per order), count (0,1,2,3…), and binary (convert (yes, no)) etc. If you have a binary dependent variable, then logistic regression can be a good choice, since it a) outputs probability scores (in the range [0,1]) and b), though this is really not that important, removes a source of heteroscedastic errors that fitting a simple linear model will suffer from. For online testing, I am not sure of how much extra value this is, since one will now have the issue of explaining/interpreting log odds. That might be a deal breaker for most organizations.
      2) Multiple correlated Dependent Variables: I am not sure if this is what you were getting at, but since you mentioned MANOVA, I am assuming that is what you mean. Most folks are not looking to test multiple outcome variables jointly, but yeah, you could do this, but MANOVA, like ANOVA, is going to assume homoscedastic error terms. So it won’t solve the limited dependent variable issue, if that is something you are really worried about.
      3) Control for nuisance / contextual independent variables (this is your ‘channel’ example)– if there are external aspects of the problem domain outside of our direct control, that can explain variance in our outcome variables, we can improve our results by accounting for them directly in our model. This is something that one could do with either an ANCOVA model, or with direct regression approaches. As you know, ANOVA, is essentially analogous to linear regression with dummy variable encodings of the treatment effects. ANCOVA is basically the same thing, but allows for the inclusion of continuous variables, along with our treatment variables, and is just like linear regression with the dummy treatment encoding and added continuous variables on the right hand side – Outcome=f(treatment(dummy) + ‘channel’ + ‘day part’ + ‘device type’ ….).
      If you wanted to blend this stuff together, I guess you could use MANCOVA, or even use Zellner’s Seemingly Unrelated Regression (SUR) – the error structure across equations is assumed to be correlated, or whatever is in use today (not my expertise). I guess it depends on what you want to test really.
      I am afraid I don’t see how CART/CHAID etc is going to be of much use for testing. It for sure can be useful for predicting and learning a mapping between customer attributes and outcomes. In fact back in the ‘90s when I was in Database Marketing, I would often use a tree model rather than logistic regression, just because clients often never really understood the regression model, and without understanding, you almost never got client buy in – which is true today, and I think the main take away. 
      All that said, for most basic test situations, it is hard for me to see when using a factorial ANOVA isn’t going to be robust enough to come up with a good result in almost all basic testing situations (esp since under the cover it is essentially linear regression.) All else being equal, East to West, Least Squares is BEST ;-)

  4. Hey Alex,

    Great information here, but the details went over my head LOL

    But I didn’t know about MVT. I use A/B testing for my opt in forms on my blog and it looks like I had incorporated MVT. I run a contest between forms and which form gets the most opt ins, I’ll use it, copy the form, and make minor changes. This is where the MVT comes in.

    But according to your article, on average a site should be getting 100,000 unique visitors/month in order to use MVT efficiently. I’m far from having that many per month and I might be confused about the difference between A/B testing versus MVT.
