Statistical significance & other A/B test pitfalls
16 November 09

Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads.
A ridiculous experiment (yes, I really did it) with a ridiculous conclusion, yet I sometimes see similarly unreliable analysis in A/B testing.
It’s logical and laudable that designers should seek data in our quest for verifiability and return on investment. But data must be handled with care, and mathematical rigour isn’t a common part of a designer’s repertoire.
Here’s an example from ABTests.com, a worthwhile project that I feel slightly bad to pick on.

The two versions are subtly different:
- Version A: Upload button bold, Convert button bold, Convert button has a right arrow
- Version B: All buttons regular weight, no right arrow on Convert button
Although minor changes can cause major surprises, I wouldn’t expect these small differences to improve the form’s usability. With the caveat that I don’t know the users or product, I’d even speculate that Version B could perform worse since it reduces the priority of the calls to action and removes the signifier of progression.
The designer claims that version B showed a 30.4% conversion improvement in an A/B test. Here’s why this isn’t quite accurate.
The role of chance
Any A/B test is a trial, so called because we’re observing evidence gained by trying something out. I can never truly know that there’s a 50% chance of a coin landing as a head or a tail – I can only run trials and observe the evidence. Similarly, we can never truly know that a design leads to higher conversion – we can only run trials and observe the evidence. If that empirical evidence is strong enough, we conclude that the design is an improvement. If not, we don’t.
To be valid, trials need to be sufficiently large. By tossing my coin 100 or 1000 times I reduce the influence of chance, but even then I’ll still get slightly different results with each trial. Similarly, a design may have 27.5% conversion on Monday, 31.3% on Tuesday and 26.0% on Wednesday. This random variation should always be the first cause considered of any change in observed results.
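To get a feel for how much chance alone can move the numbers, here is a minimal sketch that computes exact binomial probabilities for a fair coin tossed 100 times. The function name and the 40 to 60 range are illustrative choices, not part of the original experiment.

```python
from math import comb

def prob_heads_between(low, high, tosses=100, p_heads=0.5):
    """Exact probability of getting between low and high heads (inclusive)."""
    return sum(
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(low, high + 1)
    )

# How often does a fair coin land within ten heads of the expected 50?
within_ten = prob_heads_between(40, 60)

# How often does a fair coin give 51 or more heads, shirt colour aside?
at_least_51 = prob_heads_between(51, 100)

print(f"P(40 to 60 heads in 100 tosses)   = {within_ten:.3f}")
print(f"P(51 or more heads in 100 tosses) = {at_least_51:.3f}")
```

A fair coin gives 51 or more heads in a bit under half of all 100-toss trials, so the red-shirt result is exactly what chance alone would be expected to produce.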
The null hypothesis
Statisticians use something called a null hypothesis to account for this possibility. The null hypothesis for the A/B test above might be something like this:
The difference in conversion between Version A and Version B is caused by random variation.
It’s then the job of the trial to disprove the null hypothesis. If it does, we can adopt the alternative explanation:
The difference in conversion between Version A and Version B is caused by the design differences between the two.
To determine whether we can reject the null hypothesis, we use certain mathematical equations to calculate the likelihood that the observed variation could be caused by chance. These equations are beyond the scope of this post but include Student’s t test, χ-squared and ANOVA (Wikipedia links given for the eager). Here’s a site that does the calculations for you, assuming a standard A/B conversion test with a clear Yes or No outcome.
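For the eager, here is a rough sketch of one of these calculations: a χ-squared test on a two-by-two table, applied to the coin tosses from the opening paragraph. It assumes no continuity correction and uses only the standard library; the function name is hypothetical, and a statistics package would normally do this for you.

```python
from math import erfc, sqrt

def chi_squared_2x2(successes_a, trials_a, successes_b, trials_b):
    """Pearson chi-squared test for two groups with a Yes/No outcome.

    Returns the test statistic and its p-value (one degree of freedom,
    no continuity correction).
    """
    observed = [
        [successes_a, trials_a - successes_a],
        [successes_b, trials_b - successes_b],
    ]
    total = trials_a + trials_b
    row_totals = [trials_a, trials_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if the groups did not differ.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With one degree of freedom, the chi-squared tail probability
    # reduces to erfc(sqrt(statistic / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Plain shirt: 49 heads in 100 tosses. Red shirt: 51 heads in 100 tosses.
statistic, p_value = chi_squared_2x2(49, 100, 51, 100)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.2f}")
```

The p-value comes out far above 0.05, so the coin data gives no reason to reject the null hypothesis.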
Statistical significance
If the arithmetic shows that a difference this large would be very unlikely to arise from random variation alone (usually a probability below 5%), we reject the null hypothesis. In effect we’re saying “it’s very unlikely that this result is down to chance. Instead, it’s probably caused by the change we introduced” – in which case we say the results are statistically significant. Note that we still can’t guarantee that this is the right interpretation – significance is about proof only beyond reasonable doubt.
Running the calculations on the above data shows that the results aren’t statistically significant: the evidence isn’t strong enough to reject the null hypothesis that the difference in conversion is simply down to luck. The main problem is the small sample size (128 and 108 users respectively), so I would advise the designer, Johann, to repeat the test with more users. Assuming the observed conversion rates didn’t change (a big assumption), a sample size of approximately 200 users per variant should be sufficient for significance. He could then either reject the null hypothesis or the results would remain inconclusive, in which case there’s no evidence the design has made a difference. In Johann’s defence, he recently posted that he takes the point about significance, and I’m looking forward to seeing more conclusive data for this intriguing test.
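The figures in this section can be roughly reproduced with a two-proportion z-test, sketched below. The conversion counts (40 of 128 and 44 of 108) are an assumption, inferred from the quoted sample sizes and the 31.3% and 40.7% rates; the function names are illustrative. The sample-size function only asks when the same observed rates would just cross the 5% threshold, and ignores statistical power.

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def users_per_variant_for_significance(rate_a, rate_b, alpha=0.05):
    """Smallest equal group size at which these exact rates reach significance."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    pooled = (rate_a + rate_b) / 2
    return ceil(z_crit**2 * 2 * pooled * (1 - pooled) / (rate_b - rate_a) ** 2)

# Assumed counts, back-calculated from the published percentages.
conv_a, n_a = 40, 128   # Version A: about 31.3%
conv_b, n_b = 44, 108   # Version B: about 40.7%

p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
needed = users_per_variant_for_significance(conv_a / n_a, conv_b / n_b)

print(f"p-value with the original samples: {p_value:.3f}")
print(f"users per variant needed at the same rates: {needed}")
```

Under these assumptions the p-value lands above 0.05, and the required group size comes out close to the 200 users per variant mentioned above.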
Percentage confusion
Significance isn’t the only slippery problem A/B tests face. For starters, quoting conversion improvements is always fraught with difficulty. Since conversion is usually measured in percentages (in this example, 31.3% and 40.7%) there are two ways to quote improvements. We can say that conversions increased by:
- 9.4% – the difference between the two
- 30.4% – the amount that 40.7% is bigger than 31.3%*
Any percentage improvement quoted in isolation should be challenged: which of these two calculations has been used? It’s dangerously easy to assume the wrong figure without sufficient context.
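The two calculations are easy to tell apart once written out. This sketch uses the rounded rates quoted above, plus conversion counts of 40 of 128 and 44 of 108, which are an assumption inferred from the published sample sizes; the function names are illustrative.

```python
def absolute_uplift(rate_a, rate_b):
    """Difference between two rates, in percentage points."""
    return rate_b - rate_a

def relative_uplift(rate_a, rate_b):
    """How much bigger rate_b is than rate_a, as a percentage of rate_a."""
    return (rate_b / rate_a - 1) * 100

# Rounded conversion rates as quoted in the text.
rounded_a, rounded_b = 31.3, 40.7

points = absolute_uplift(rounded_a, rounded_b)
relative_from_rounded = relative_uplift(rounded_a, rounded_b)

# Assumed raw counts; the ratio is the same whether rates are
# expressed as fractions or as percentages.
relative_from_counts = relative_uplift(40 / 128, 44 / 108)

print(f"absolute difference: {points:.1f} percentage points")
print(f"relative, from rounded rates: {relative_from_rounded:.1f}%")
print(f"relative, from assumed counts: {relative_from_counts:.1f}%")
```

The relative figure shifts slightly depending on whether rounded or unrounded rates go in, which is presumably what the “subject to rounding” footnote refers to.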
The A/B death spiral
A/B tests also suffer from a common quantitative problem, in that they tell us what but not why. I’ve written about this previously in What if the design gods forsake us. It’s wise to back up numerical tests with qualitative evaluation (e.g. a guerrilla usability test) so we can make informed decisions if data suggests we need to rethink a design.
Even with backup, sometimes A/B tests are simply the wrong tool for the job. They can provide powerful insight in some cases, but in the wrong place they can be a blind alley or, worse, a weapon of disempowerment. Logical positivism and design don’t mix – not everything we do can be empirically verified – yet some businesses fall back on A/B testing in lieu of genuine design thinking. I call this the “A/B death spiral”, and it plays out something like this:
Designer: Here’s a new design for this screen. You’ll see it has a new navigation style, tweaked colour palette and I’ve moved the main interactions to a tabbed area.
Product owner: Wow, those are pretty big changes for such a high-risk screen. I tell you what: let’s test them individually to see which of these changes works and which doesn’t…
As the proverb suggests, sometimes you can’t jump a twenty foot chasm in two ten foot leaps. Cherry-picking only those design elements that are “proven” by an A/B test can be a route to fragmented, incoherent design. It may earn marginally more money in the short term, but it becomes hard to avoid a descent into poor UX and the long-term harm this causes.
Being faithful to data
Given the potential hazards, I’m concerned about the naïveté with which some designers approach quantitative testing. The world of statistics rewards an honest search for the truth, not dilettantism, and I’d advise any designer moving in statistical circles to pick up some basic stats theory, or at least partner with someone knowledgeable.
A flawed A/B test, be it statistically insignificant, misapplied or misquoted, is nothing more than anecdotal evidence. It’s the same crime as making a website red on the feedback of one user. Yet an impatient designer, seeing the example I quoted above, could quickly jump to a false conclusion: “I should remove arrows from continue buttons: it’s 30.4% better.” Perhaps this designer deserves what he gets. It’s likely he’s only really interested in shortcuts to good UX, and linkbait lists of “Twelve ways to make your site more usable.” Since he understands neither the mathematics nor the context of this trial (timescales, userbase, surrounding task) he will inevitably grab the wrong end of the stick. Nonetheless, he is out there.
Don’t let yourself be that designer.
Photo: snellgrove
* subject to rounding.
25 comments on Statistical significance & other A/B test pitfalls
-
solle on 16 November 09:
good stuff. A/B testing is just one of many tools and should only ever be an indicator – and if you want to hold great sway by it yes get the help of a decent marketeer who has some idea about statistics. too many think they will make some fabulous $1,000,000 tweak….
-
Arjan Haring on 16 November 09:
Great post Cennydd, this is exactly the stuff I want to discuss within the “conversion family” of DfC. I am very glad I have asked you to become a Team Captain for our conference in NYC in 2010.
1. I completely agree with you when you say there are quite some naive designers that know little about statistics, I would also want to point out that there are even a lot of naive web metrics “professionals” out there.
2. I cannot agree with you that a savvy designer has to follow his instincts (quoting your other post)… what the f***? First you sound like a higher level skeptic, keeping true to their standards, and then you bluntly switch to becoming someone that thinks homeopathic remedies work.
We have a saying in Holland, “Zachte heelmeesters maken stinkende wonden”, which translates as something like “desperate diseases require desperate remedies”. Which honourable scientific theory confirms the fact that a designer instinct is more important than proof?
I would say, let’s try a little harder and see if we can make evidence-based design work.
But again, thanks for this very insightful post.
Ps. can we use this post on our conference blog to stimulate a conversation?
-
John Romadka on 17 November 09:
Another great resource for measuring usability, complete with calculators: http://www.measuringusability.com
The site was created and developed by Jeff Sauro, whose PhD is from Stanford in Statistics, but works at Oracle in their usability department. He’s done a lot of work on creating reliable, valid usability calculators for small sample sizes.
He tries very hard to make these concepts easy to understand and bring good statistical methods to the usability community.
I’m not his PR guy, just a fan, and wanted to pass the good word.
-
Arjan Haring on 17 November 09:
Hi Cennydd, of course… the world is more nuanced than I made it look. But we have to try to stay as true to what we know to be true as possible. So that’s why I think your part science, part art answer doesn’t help much to develop a more mature design discipline.
I am not saying that behaviourist psychologists are right. I am more generally stating that we have to take a look into what science has to offer to designers. “God” gave us colors so we could differentiate rotten apples from ripe ones.
I would like to suggest that any designer that doesn’t know those kind of basics gets thrown out of the basket, before the whole bunch gets rotten.
-
Claudiu on 17 November 09:
Thanks for the post Cennydd. Just yesterday I was having some people waving around some A/B testing results that just looked too fishy. Putting the test in the Statistical Significance tool provided by you it proved my point: they were fishy. :)
I want to have a better understanding of statistics and how they can be applied to a/b testing and marketing, but kind of don’t know where to start from. Can you point out some resources, please?
-
Harry B on 17 November 09:
Conversion rate uplift percentages can be confusing, for sure. I did a post on this not so long ago (shameless plug): http://bit.ly/zZUHr
I prefer to refer to Design Research as “detective work” rather than science. Sure, some elements involve scientific methods, but on their own, scientific methods cannot create good design.
-
Jerry Steele on 18 November 09:
Eep…I didn’t know that there were people out there doing A/B tests without statistical significance…
I also report Confidence Intervals, which basically says I’m 95% sure that the conversion percentage is x +/- y. I like confidence intervals because you’re reporting what the data IS, rather than making a statement on what it’s NOT. The Measuring Usability site has a nice CI calculator.
-
Jonathan Boutelle on 18 November 09:
This article sparked a lively conversation within our company (SlideShare)! Thanks for writing it. Now to nitpick.
GWO and other tools have hard-core multivariate statistics baked in. The tools give you a specific confidence estimate, and tell you very clearly when your trial isn’t big enough to yield real data yet. Given that these tools are the easiest way by far to run an A/B test, I doubt that the problem of statistical cluelessness is as bad as you describe. Not because we all understand the statistics, but because in general we’ve outsourced that understanding to tools that do.
-
Labnotes » Vanity: Experiment Driven Development for Rails on 20 November 09: [...] This experiment will complete when a) there are at least 1000 participants for each alternative, that’s a big enough sample size, and b) one alternative stands out with probability of 95% or more. (You want to read more about probabilities and interpreting the results) [...]
-
Hugh Gage on 29 November 09:
Hi Cennydd,
An interesting post. In reference to the percentage confusion that you mention when I’m dealing with this I normally refer to the instance in your first example (9.4%) in terms of percentage points. This is distinct from a simple percentage increase as calculated by the difference between the two.
About the significance, in cases where increasing the sample size may be difficult, what about running the test several times to establish whether a pattern in improved conversion can be observed in favour of one design or the other?
Regards, Hugh
-
Tim Peter on 3 December 09:
Hi Cennydd,
While I agree with your basic premise, I don’t know that I agree with your specific method. The problem I have with PRCOnline’s statistical calculator is that it doesn’t show the confidence level achieved. In some business contexts, the cost of going from a 90% confidence level to a 95% confidence level is higher than the potential return (either in actual cost of sampling a larger population or in opportunity cost from not choosing a new champion). For instance, for observed improvements greater than 20%, a business can call a winner with a 90% confidence level with as few as 100 samples. Yes, there’s a risk that the small sample size might be masking random chance. And, yes, I agree that the particular test you’ve selected would benefit from receiving over 100 conversions before calling it in favor of one challenger. And, I definitely have seen many shoddy A/B comparisons. But, when we advocate higher bars than are really necessary to reduce uncertainty and make a beneficial business decision, I think we risk turning businesses off to valid testing methods altogether.
-
Al on 3 December 09:
An issue I often see ignored in gurus’ stats tools is data contamination during the test period. E.g., say the control has been running for 1 week w/ 10,000 visits, then we kick in test B. After 3 weeks w/ 30,000 additional visits we check the stats. B wins. But wait – what if we check only the last 3 weeks when conditions were *exactly* identical? Looks like the control wins… Basic science – keep all other factors identical during a test.
-
Justin Hunter on 7 December 09:
Cennydd,
Great post. I understand what you’re saying and enthusiastically agree with it.
Having said that, I personally feel the much bigger problem is that way too few people are conducting A/B tests or multi-variate tests. I am a huge believer in the value of such tests. (This is no doubt influenced by the fact that I grew up listening to rapturous soliloquies on the value of Design of Experiments at the dinner table from my dad, William G. Hunter, one of the co-authors of “Statistics for Experimenters”).
At any rate, if I had the choice of using a website designed by (a) people who used A/B testing (without any concept of statistical significance) or (b) people who designed the website without any use of A/B testing, I would choose “choice a” in a heartbeat.
Having said that, if I had a chance to speak to the designers running the A/B tests, would I point out the significant issue of statistical significance to ensure they didn’t draw conclusions too quickly from random variation? Sure, I’d probably even point them to this article.
I’ve written a blog post summarizing the best web video presentation I’ve seen on A/B testing and multi-variate testing (given by Kohavi). If anyone is dealing with skeptics at their place of work who do not feel A/B testing is worth pursuing, I’d recommend it as a fantastic source of useful, eye-opening examples. http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/
- Justin
-
Jay Harlow on 6 January 10:
Cennydd, I’ve come across separate posts on your blog now twice today from different sources. Great stuff. Regarding this one…
YES, yes, and yes! I’d also add that another, perhaps even more fundamental, problem with A/B tests is that they presume that there is in fact a single, binary variable to be tested; or that the variable in question is the right one to test.
Here, as you say, is where the designer’s instinct comes into play. Or, perhaps, intuition is a better word — as this is not innate ability, but the thoughtful application of experience. ABtests.com is full of examples of designers asking absolutely the wrong question. Why fret about nuances like the color or position of a button, when typography is so bad, or hierarchy so confusing, that a user might have no idea why to push it?
-
Johann on 10 January 10:
Hello,
first of all, I’d have appreciated if you had linked to media.io so people could check out the site instead of two old screenshots.
You are right that the numbers I presented at ABTests.com might not have been statistically relevant.
What is important to me is that I decided to stop the test and implement the changes and that the conversion rate increased (~25 % better if I’m not wrong). And this is what I’m interested in. I don’t care about the theory behind MV tests.
Also, you can search for perfect data as long as you want, but on the web, it’s a futile effort that prevents you from making decisions.
Johann
-
Tim Watson on 12 January 10:
A great opening experiment with shirt colour change. It is so typical of common practice today. I’m with you Cennydd.
It is far too rare, when I talk to people about their test results and read published tests, that significance is taken into account. Results are almost always published with no reference to sample size or level of significance. I have to conclude it’s because it’s not been considered.
I fear testing is going to get a bad name if common practice doesn’t change, as worse performing versions are picked in some cases.
In the A/B example given above the significance is 80%. Weak but worthy of a further test with larger sample to verify or otherwise.
A/B tests also suffer from assuming total independence of the elements under test. If you change a page headline too much and not the supporting text and images then it’s not a fair test. Multi-variate testing should become more widespread to address this.
Another issue often not considered is fairness of test. Are there environmental factors that means the traffic or data in use will have bias?
-
Jason on 14 January 10:
Speaking about noise in an A|B test, I enjoyed this post and think there is some great information regarding designing and executing tests. However, if I can be honest, I found myself getting turned off by the noise of “us vs. them”.
These tests often fail because, be it the designer or the analyst, the groups fail to come together. The designer doesn’t want his trade and experience to be questioned by a numbers geek. The analyst doesn’t want some artist telling him how to analyze user experience.
I’d like to see a post about the flip side that shows examples of how designers going with their “instincts” caused thousands of dollars in lost revenue. If you need examples of this, I have seen it many times, I would be more than happy to share.
-
Madlib-style forms increase conversion by 30%? Well, maybe … on 3 March 10: [...] As you can see, unlike the traditional web signup form, it’s asking you to fill in the blanks of a short paragraph or sentence with the required information. At the time it was released, it was seen more as a novelty and a curiosity than an innovation in web form design, but recently an A/B test of a similar form against a more traditional form by the team at Vast.com, as blogged by Luke Wroblewski, should make you think twice about it. The test apparently showed a 25-40% increase in conversion from the storytelling forms as opposed to the traditional, stacked field style form. You can see an example of the new madlib form here on Vast.com. Blogging about the results, Jeremy Keith says: That seems to be a statistically-significant result, even accounting for Cennydd’s reality-check on A/B testing. [...]