You run an A/B test, and it’s a winner. Or maybe it’s flat (no difference in performance between variations). Does it mean that the treatments that you tested didn’t resonate with anyone? Probably not.
If you target all visitors with the A/B test, it reports only overall results and ignores what happens within individual segments of your traffic.
Why is conducting post-test segmentation important?
Your users are different. What resonates with one person may not work for another.
If you segment your A/B test results by browser, you might discover that customers using Safari are converting much better than the average.
You might also notice that people using the Firefox browser hardly convert at all when they see variation B: this could mean that there are some technical issues with the front-end code that make the treatments not work on Firefox.
Noticing which segments respond well (or not at all) to particular treatments can make the difference between making money and not.
This is especially true of big businesses:
Knowing these finer details can help meaningfully shift your metrics in a positive direction, where they may have once been plateauing.
Analyzing data in your post-test analysis
A/B/n test results cover all of the visitors that were targeted by the test. If the target audience was large enough, they give you an idea of the general trends, but not of specific trends within certain groups.
Aggregate data can show no significant difference over time in site data on average; however, when you break that data down into groups, there often is a significant difference in a particular user segment – proving once again all data in aggregate is crap.
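To make the point concrete, here is a minimal sketch of a post-test breakdown. The visitor and conversion counts are invented for illustration, and the pooled two-proportion z-test is just one common way to compare variations, not necessarily what your testing tool uses. In this made-up data set the overall result is perfectly flat, while each browser segment shows a clear (and opposite) effect.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: (visitors, conversions) per browser and variation.
results = {
    "Safari":  {"A": (4000, 200), "B": (4000, 260)},
    "Firefox": {"A": (4000, 200), "B": (4000, 140)},
}

def segment_p_values(results):
    """Return a p-value per segment plus an 'All visitors' aggregate row."""
    totals = {"A": [0, 0], "B": [0, 0]}
    p_values = {}
    for segment, arms in results.items():
        (n_a, c_a), (n_b, c_b) = arms["A"], arms["B"]
        p_values[segment] = two_proportion_p_value(c_a, n_a, c_b, n_b)
        for arm, (n, c) in arms.items():
            totals[arm][0] += n
            totals[arm][1] += c
    (n_a, c_a), (n_b, c_b) = totals["A"], totals["B"]
    p_values["All visitors"] = two_proportion_p_value(c_a, n_a, c_b, n_b)
    return p_values

p_values = segment_p_values(results)
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Here variation B lifts Safari from 5% to 6.5% and drops Firefox from 5% to 3.5%, so the two effects cancel out in the aggregate.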
Jakub Linowski elaborates on the distinction between certain groups that segmentation in post-test analysis can pinpoint.
Tim Stewart offers an Internet Explorer browser example to illustrate how data segmentation after your test can tell you about your site’s functionality:
Conducting QA on your A/B tests is a must, but bugs can still slip in. Post-test segmentation can help you discover whether any of the treatments are still buggy.
Chad Sanderson explains that you must have a methodology for segmentation:
Once you’ve asked yourself these questions and developed some answers, you can scour your data and divide along your chosen lines.
Avinash Kaushik offers an excellent guide on how to segment your users between source, behavior, and outcome.
A word of caution when segmenting, though: the more segments that are compared together, the higher the probability of error, so choose wisely and make sure your data is relevant. No need to compare apples to oranges.
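A rough sketch of why the error probability climbs: if each segment comparison is judged at a 5% significance level and the comparisons are independent (a simplifying assumption), the chance of at least one false positive grows quickly with the number of segments. The Bonferroni adjustment shown here is one common, conservative correction; it is offered as an illustration, not as the method any of the quoted experts prescribe.

```python
def familywise_error_rate(num_segments, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_segments

def bonferroni_alpha(num_segments, alpha=0.05):
    """Stricter per-segment threshold that keeps the overall rate near alpha."""
    return alpha / num_segments

for k in (1, 5, 10, 20):
    print(
        f"{k:>2} segments: "
        f"false-positive risk {familywise_error_rate(k):.1%}, "
        f"per-segment threshold {bonferroni_alpha(k):.4f}"
    )
```

With ten segments the risk of at least one spurious “winner” is already around 40%, which is why it pays to pick a handful of meaningful segments rather than slicing every way possible.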
Segmenting your data: before or after your test?
Many optimizers don’t like segmenting in post-test analysis, preferring segmented tests from the get-go.
Chad Sanderson explains why he prefers segmenting beforehand, which he calls “pre-registration.”
Segmenting tests from the beginning doesn’t always help you with your discovery process. The goal of a test is to figure out which segments respond to which treatments, and often that’s hard to do if you divide them before you even start testing.
Tim Stewart prefers segmenting beforehand but says that errors can occur both before and after the test.
How to test to ensure valid results
As Tim said, whenever you decide to test, it’s crucial to follow testing process guidelines to ensure valid results. Setting up a test is not the time to go rogue and throw the rule book out the window.
Claire Keser of WiderFunnel offers some tips for testing:
The elephant in the room: sample size
In general, a good guideline is to stop an A/B test when three conditions have been met:
- Large enough sample size (based on pre-test sample size calculations)
- Long enough test duration (minimum 2 business cycles, so 2-4 weeks)
- Statistical significance 95% or better
When data is segmented, however, this can divide up your sample size into chunks that simply aren’t large enough.
If the sample’s too small, it doesn’t give you the full picture: you’re only seeing a small percentage of your visitors, and your data can’t be counted on for statistical validity. (You want uniform sampling to avoid Simpson’s Paradox.)
Tim Stewart agrees that you need to make sure the sample size of your smallest segment is large enough to detect the expected difference.
Your experimental controls, Stewart says, need to be equal: sample size, performance range, and distribution of outlier behavior. Control inequalities can introduce different weighted averages within a segment, between segments, and against the whole, which gives you inaccurate data.
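The weighted-average problem is easiest to see with numbers. The sketch below uses invented counts in which traffic was split unevenly between variations within each device segment. Variation B converts better than A on both desktop and mobile, yet looks worse overall, which is Simpson’s Paradox in action.

```python
# Hypothetical data: (visitors, conversions) per device and variation.
# Note the uneven allocation: B received mostly low-converting mobile traffic.
data = {
    "desktop": {"A": (4000, 400), "B": (1000, 110)},
    "mobile":  {"A": (1000, 30),  "B": (4000, 140)},
}

def rate(visitors, conversions):
    """Conversion rate for one cell."""
    return conversions / visitors

def aggregate_rate(data, variation):
    """Conversion rate for a variation across all segments combined."""
    visitors = sum(seg[variation][0] for seg in data.values())
    conversions = sum(seg[variation][1] for seg in data.values())
    return conversions / visitors

for name, seg in data.items():
    print(f"{name}: A = {rate(*seg['A']):.1%}, B = {rate(*seg['B']):.1%}")

print(
    f"overall: A = {aggregate_rate(data, 'A'):.1%}, "
    f"B = {aggregate_rate(data, 'B'):.1%}"
)
```

When each variation gets the same share of every segment, this reversal cannot happen, which is the practical reason for insisting on equal controls.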
Changes within a small segment often aren’t reflected in the overall data, because the numbers involved are usually too small to affect the overall lift. Without segmentation, insights like these would be missed. Segmentation gives context.
Follow up tests
In order to account for the decreased sample size, you should run your original A/B test for double the amount of time you normally would, especially if you know in advance that you’ll be segmenting your results.
Jakub Linowski explains that when you decide to segment determines whether you should re-test:
How big should your sample sizes be for follow-up tests?
You need to calculate the sample size in advance for that particular segment. You can set the expected uplift to what you were seeing in your original test.
For example: if in the original test you saw a big uplift inside that segment (e.g., +30%), you don’t need as many people in the test to achieve statistical validity.
The caveat, though, is that if the lift is small (e.g., +5%), you need a much larger sample size.
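As a rough sketch of that calculation, the function below uses the standard normal-approximation formula for comparing two proportions. The 5% baseline conversion rate, 95% significance level, and 80% power are assumptions chosen for illustration; plug in your own segment’s baseline and the uplift you observed.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

baseline = 0.05  # hypothetical conversion rate inside the segment
n_big_lift = sample_size_per_variation(baseline, 0.30)
n_small_lift = sample_size_per_variation(baseline, 0.05)

print(f"+30% lift: about {n_big_lift:,} visitors per variation")
print(f" +5% lift: about {n_small_lift:,} visitors per variation")
```

Under these assumptions, detecting the +5% lift takes roughly thirty times as many visitors as detecting the +30% lift, so a small segment with a small effect may never reach a usable sample.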
Looking past biases and focusing on data
Making a priori assumptions about why the data looks the way it does can get you into trouble.
Tim Stewart discusses some testing problems he’s encountered in his career:
“Also common is a device split, sometimes with potentially valuable insight being dismissed because of prior accepted ‘knowledge.’
People report, ‘We know conversion to sale is lower on mobile; consumers research on mobile, then buy on their desktop. [We see this pattern all the time] outside of a test.’
Then, seeing desktop variant outperform the control, the mobile version shows no difference, or clear small negative, but a net gain on the overall. So, it’s declared a winner.
But more useful would be to ask and test why mobile is ‘accepted’ to perform worse. Why is it easier to detect 10% vs 12% desktop than 5% vs 6% mobile?
Ask: Why is the baseline lower? Why was the effect less/more pronounced – because of actual change, or because lower baseline means more sensitivity to volatility?
Was the variant designed mobile-first? Or was the concept for desktop crammed into less real estate, where it genuinely works less well?
Or is the main lever in the test, the hypothesis you are exploring, simply clearer on a larger screen, barely visible on a mobile screen?
Are the mobile users differently motivated? More time pressure, different part of the buying cycle, different day-of-week or time-of-week pattern?
Is the mobile sample large and different enough to report as ‘significant,’ but not representative enough of a different buying, research, or user motivation cycle?
I’ve worked with several clients who had a clear pattern in logged-in user tests, where we can track and consistently test the same user experience across devices.
Patterns like: a customer researching on a desktop during the week, then purchasing via mobile when at the desired location on the weekend. Or even web surfing from the sofa during the evenings, followed by a desktop purchase at lunchtime the following day.
There was also the pattern of a consumer purchasing on desktop during the week, then checking the site on mobile for a different purpose when at their desired location (flights/ticketed events).
There are lots of scenarios where the user would be counted in test but has either bought on a different device or has a different motivation for their mobile visit (check delays, confirm details, access boarding pass, etc.) which means they won’t buy.
That creates a big imbalance in signal to noise on one device or another, which needs to be considered. Ideally in planning, but also in post-test segmentation.
The motivation for visit, user’s desired outcome, and unrepresentative samples can also be a big factor in marketing channel segments.
An example of one of our segmented tests, for illustrative purposes.
Is the variant doing a better job of making a key offer prominent and then shown to a Paid search audience that has a high proportion of users visiting for that offer?
Segmenting helps uncover these sorts of patterns, errors and omissions, but this is something to establish prior to test and plan accordingly.
Conduct post-test analysis and interpret segments with this in mind. But it is statistically questionable to declare these as definitive if the hypothesis and sample weren’t consciously built for this detail.
In those scenarios, where you can ensure the user is allocated to the same experience, you can sometimes run a hybrid test – a different treatment for the same concept between devices/screen sizes/offers/expected user motivation.
But the main hypothesis then becomes: ‘Did we get the balance right?’
You’re second-guessing different motivations/segments and developing different solutions for context. So, the test and analysis is reporting if that hybrid approach had merit.
Segmentation lets you look at things like whether a win on mobile cancels out a loss on a desktop (or vice versa), or whether a campaign or email meant the sample was unfairly biased in the test period.
Post-test segmentation can help identify these areas, which feeds into pre-test planning and potentially testing specific areas with appropriate context and sample planning.
You can then plan and run completely different tests with device-specific hypotheses, concepts, sample sizes to account for the different levels of noise, effect size and user motivation.
But ultimately whether this is worth it depends on the cost to do this. Calculating the opportunity cost, risk exposure, test cadence vs your ability to implement segment specific outcomes, and the value in doing so.
Because if you can’t implement (or it is prohibitive to do so), then testing a context-specific treatment is useful intelligence, but not revenue-positive.”
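Stewart’s question about why 10% vs 12% on desktop is easier to detect than 5% vs 6% on mobile can be checked with a little arithmetic. Both are the same 20% relative lift, but the lower baseline produces a weaker signal at the same traffic level. The sketch below assumes 3,000 visitors per variation on each device (an invented figure) and uses a pooled two-proportion z-score as the measure of detectability.

```python
from math import sqrt

def z_score(rate_a, rate_b, n_per_variation):
    """Pooled two-proportion z-score, assuming equal sample sizes."""
    pooled = (rate_a + rate_b) / 2
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_variation)
    return (rate_b - rate_a) / std_err

n = 3000  # hypothetical visitors per variation on each device

desktop_z = z_score(0.10, 0.12, n)
mobile_z = z_score(0.05, 0.06, n)

# 1.96 is the usual two-sided cutoff for 95% significance.
print(f"desktop 10% vs 12%: z = {desktop_z:.2f}")
print(f"mobile   5% vs  6%: z = {mobile_z:.2f}")
```

With identical traffic, the desktop comparison clears the 95% threshold while the mobile one does not, even though the relative lift is the same. A “no difference” verdict on mobile may therefore say more about sensitivity than about the treatment.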
What do you do once you find that a particular treatment works better for a segment?
If the segments in question are big enough, the typical answer is personalization: provide different segments a different experience depending on what works best for them.
It’s difficult to manage this by creating manual rules inside your personalization tool. Machines are way better than humans at this. For example, Conductrics’ machine-learning-based algorithm can learn these personalization rules by itself (which particular segments respond to which treatment), and adjust traffic between various experiences automatically.
Chad Sanderson details some of the tools and the importance of zeroing in on a specific audience:
User segmentation can be vital to gaining insights and maximizing revenue. When segmenting, keep in mind that you need large enough, valid sample sizes for each of the segments you’re analyzing.
Whether you go for a pre-registration or a post-test analysis approach should be determined by your business needs and the issues specific to your web property.