Statistical significance & other A/B test pitfalls
16 November 09

Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads.
A ridiculous experiment (yes, I really did it) with a ridiculous conclusion, yet I sometimes see similarly unreliable analysis in A/B testing.
It’s logical and laudable that designers should seek data in our quest for verifiability and return on investment. But data must be handled with care, and mathematical rigour isn’t a common part of a designer’s repertoire.
Here’s an example from ABTests.com, a worthwhile project that I feel slightly bad to pick on.

The two versions are subtly different:
- Version A: Upload button bold, Convert button bold, Convert button has a right arrow
- Version B: All buttons regular weight, no right arrow on Convert button
Although minor changes can cause major surprises, I wouldn’t expect these small differences to improve the form’s usability. With the caveat that I don’t know the users or product, I’d even speculate that Version B could perform worse since it reduces the priority of the calls to action and removes the signifier of progression.
The designer claims that version B showed a 30.4% conversion improvement in an A/B test. Here’s why this isn’t quite accurate.
The role of chance
Any A/B test is a trial, so called because we’re observing evidence gained by trying something out. I can never truly know that there’s a 50% chance of a coin landing as a head or a tail – I can only run trials and observe the evidence. Similarly, we can never truly know that a design leads to higher conversion – we can only run trials and observe the evidence. If that empirical evidence is strong enough, we conclude that the design is an improvement. If not, we don’t.
To be valid, trials need to be sufficiently large. By tossing my coin 100 or 1000 times I reduce the influence of chance, but even then I’ll still get slightly different results with each trial. Similarly, a design may have 27.5% conversion on Monday, 31.3% on Tuesday and 26.0% on Wednesday. Random variation should always be the first explanation considered for any change in observed results.
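To see this variation without wearing out a coin, here is a minimal Python sketch that simulates repeated trials of a fair coin. The function name trial_rates, the seed and the choice of 20 trials per size are illustrative assumptions. The point is only that the gap between the luckiest and unluckiest trial tends to narrow as trials get bigger, without ever vanishing.

```python
import random

def trial_rates(tosses, trials, seed=2009):
    """Simulate `trials` separate trials of `tosses` fair-coin tosses each.

    Returns the observed proportion of heads in every trial.
    """
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatable runs
    rates = []
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(tosses))
        rates.append(heads / tosses)
    return rates

# Compare the spread of results for small, medium and large trials.
for tosses in (100, 1000, 10000):
    rates = trial_rates(tosses, trials=20)
    print(f"{tosses:>6} tosses per trial: "
          f"lowest {min(rates):.1%} heads, highest {max(rates):.1%} heads")
```

The coin never changes between trials, so every difference this prints is pure chance.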
The null hypothesis
Statisticians use something called a null hypothesis to account for this possibility. The null hypothesis for the A/B test above might be something like this:
The difference in conversion between Version A and Version B is caused by random variation.
It’s then the job of the trial to disprove the null hypothesis. If it does, we can adopt the alternative explanation:
The difference in conversion between Version A and Version B is caused by the design differences between the two.
To determine whether we can reject the null hypothesis, we use statistical tests to calculate how likely it is that chance alone would produce a difference at least as large as the one observed. The details of these tests are beyond the scope of this post, but they include Student’s t-test, χ-squared and ANOVA (Wikipedia links given for the eager). Here’s a site that does the calculations for you, assuming a standard A/B conversion test with a clear Yes or No outcome.
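For a simple yes/no conversion test, one common choice is a two-proportion z-test, which for a two-by-two comparison is closely related to the χ-squared test. The sketch below is a minimal version using only Python’s standard library. The function name is hypothetical, and the pooled, two-sided form is an assumption rather than a description of what any particular online calculator does. It is applied to the coin experiment from the top of this post: 49 heads in 100 tosses, then 51 in 100 in the red shirt.

```python
from math import erfc, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test.

    conv_a, conv_b: number of conversions (or heads) in each variant
    n_a, n_b:       number of users (or tosses) in each variant
    """
    # Under the null hypothesis both variants share one underlying rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data: everyone converted, or nobody did")
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # Probability of a gap at least this large, in either direction, by chance alone.
    return erfc(abs(z) / sqrt(2))

coin_p = two_proportion_p_value(49, 100, 51, 100)
print(f"Red shirt experiment p-value: {coin_p:.2f}")
```

The p-value here comes out far above the usual 0.05 threshold, so the red shirt gets no credit.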
Statistical significance
If the arithmetic shows that chance alone would be very unlikely to produce a result like this (usually a probability below 5%), we reject the null hypothesis. In effect we’re saying “it’s very unlikely that this result is down to chance. Instead, it’s probably caused by the change we introduced” – in which case we say the results are statistically significant. Note that we still can’t guarantee that this is the right interpretation – significance offers proof only beyond reasonable doubt.
Running the calculations on the above data shows that the results aren’t statistically significant: the evidence isn’t strong enough to reject the null hypothesis that the difference in conversion is simply down to luck. The main problem is the small sample size (128 and 108 users respectively), so I would advise the designer, Johann, to repeat the test with more users. Assuming the observed conversion rates didn’t change (a big assumption), a sample of approximately 200 users per variant should be sufficient for significance. He could then either reject the null hypothesis or the results would remain inconclusive, in which case there’s no evidence the design has made a difference. In Johann’s defence, he recently posted that he takes the point about significance, and I’m looking forward to seeing more conclusive data for this intriguing test.
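Here is a rough reconstruction of that calculation in Python. The conversion counts are an assumption: 40 of 128 and 44 of 108 are the whole numbers that reproduce the 31.3% and 40.7% rates quoted in this post, assuming the 128-user sample belongs to Version A. The test used is a pooled, two-sided two-proportion z-test, which may differ in detail from the calculator linked above, and the function name is hypothetical.

```python
from math import erfc, sqrt

def p_value(rate_a, rate_b, n_a, n_b):
    """Two-sided p-value from a pooled two-proportion z-test on conversion rates."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return erfc(abs(rate_b - rate_a) / std_err / sqrt(2))

# Assumed counts, inferred from the quoted rates and sample sizes.
RATE_A = 40 / 128   # Version A, about 31.3%
RATE_B = 44 / 108   # Version B, about 40.7%

observed_p = p_value(RATE_A, RATE_B, 128, 108)
print(f"Observed test: p = {observed_p:.2f}")

# Hold the rates fixed (the big assumption) and grow both samples
# until the same difference would count as significant at 5%.
users_needed = 2
while p_value(RATE_A, RATE_B, users_needed, users_needed) >= 0.05:
    users_needed += 1
print(f"Users needed per variant: about {users_needed}")
```

On these assumptions the observed p-value sits well above 0.05, and the search stops close to the 200 users per variant suggested above.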
Percentage confusion
Significance isn’t the only slippery problem A/B tests face. For starters, quoting conversion improvements is always fraught with difficulty. Since conversion is usually measured in percentages (in this example, 31.3% and 40.7%) there are two ways to quote improvements. We can say that conversions increased by:
- 9.4% – the difference between the two (strictly speaking, 9.4 percentage points)
- 30.4% – the amount that 40.7% is bigger than 31.3%*
Any percentage improvement quoted in isolation should be challenged: which of these two calculations has been used? It’s dangerously easy to assume the wrong figure without sufficient context.
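The two calculations are easy to put side by side. In this short Python sketch the function names are hypothetical, and the raw counts (40 of 128 and 44 of 108) are an assumption inferred from the quoted rates and sample sizes. It also shows where the rounding caveat comes from: the figures shift slightly depending on whether you start from the rounded rates or the raw counts.

```python
def absolute_lift(rate_a, rate_b):
    """Difference between two conversion rates, in percentage points."""
    return (rate_b - rate_a) * 100

def relative_lift(rate_a, rate_b):
    """How much bigger rate_b is than rate_a, as a percentage of rate_a."""
    return (rate_b - rate_a) / rate_a * 100

# Starting from the rounded rates quoted in the post.
print(f"Rounded rates: {absolute_lift(0.313, 0.407):.1f} points, "
      f"{relative_lift(0.313, 0.407):.1f}% relative")

# Starting from the assumed raw counts.
print(f"Raw counts:    {absolute_lift(40 / 128, 44 / 108):.1f} points, "
      f"{relative_lift(40 / 128, 44 / 108):.1f}% relative")
```

Whichever figure you quote, say which one it is.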
The A/B death spiral
A/B tests also suffer from a common quantitative problem, in that they tell us what but not why. I’ve written about this previously in What if the design gods forsake us. It’s wise to back up numerical tests with qualitative evaluation (e.g. a guerrilla usability test) so we can make informed decisions if data suggests we need to rethink a design.
Even with backup, sometimes A/B tests are simply the wrong tool for the job. They can provide powerful insight in some cases, but in the wrong place they can be a blind alley or, worse, a weapon of disempowerment. Logical positivism and design don’t mix – not everything we do can be empirically verified – yet some businesses fall back on A/B testing in lieu of genuine design thinking. I call this the “A/B death spiral”, and it plays out something like this:
Designer: Here’s a new design for this screen. You’ll see it has a new navigation style, tweaked colour palette and I’ve moved the main interactions to a tabbed area.
Product owner: Wow, those are pretty big changes for such a high-risk screen. I tell you what: let’s test them individually to see which of these changes works and which doesn’t…
As the proverb suggests, sometimes you can’t jump a twenty foot chasm in two ten foot leaps. Cherry-picking only those design elements that are “proven” by an A/B test can be a route to fragmented, incoherent design. It may earn marginally more money in the short term, but it becomes hard to avoid a descent into poor UX and the long-term harm this causes.
Being faithful to data
Given the potential hazards, I’m concerned about the naïveté with which some designers approach quantitative testing. The world of statistics rewards an honest search for the truth, not dilettantism, and I’d advise any designer moving in statistical circles to pick up some basic stats theory, or at least partner with someone knowledgeable.
A flawed A/B test, be it statistically insignificant, misapplied or misquoted, is nothing more than anecdotal evidence. It’s the same crime as making a website red on the feedback of one user. Yet an impatient designer, seeing the example I quoted above, could quickly jump to a false conclusion: “I should remove arrows from continue buttons: it’s 30.4% better.” Perhaps this designer deserves what he gets. It’s likely he’s only really interested in shortcuts to good UX, and linkbait lists of “Twelve ways to make your site more usable.” Since he understands neither the mathematics nor the context of this trial (timescales, userbase, surrounding task) he will inevitably grab the wrong end of the stick. Nonetheless, he is out there.
Don’t let yourself be that designer.
Photo: snellgrove
* subject to rounding.
59 comments on Statistical significance & other A/B test pitfalls
Good stuff. A/B testing is just one of many tools and should only ever be an indicator – and if you want to set great store by it, yes, get the help of a decent marketeer who has some idea about statistics. Too many think they will make some fabulous $1,000,000 tweak…
Great post Cennydd, this is exactly the stuff I want to discuss within the “conversion family” of DfC. I am very glad I have asked you to become a Team Captain for our conference in NYC in 2010.
1. I completely agree with you when you say there are quite a few naive designers that know little about statistics. I would also want to point out that there are even a lot of naive web metrics “professionals” out there.
2. I cannot agree with you that a savvy designer has to follow his instincts (quoting your other post)… what the f***? First you sound like a higher-level skeptic, keeping true to their standards, and then you bluntly switch to becoming someone who thinks homeopathic remedies work.
We have a saying in Holland, “Zachte heelmeesters maken stinkende wonden”, which translates as something like “desperate diseases require desperate remedies”. Which honourable scientific theory confirms the fact that a designer’s instinct is more important than proof?
I would say, let’s try a little harder and see if we can make evidence-based design work.
But again, thanks for this very insightful post.
Ps. can we use this post on our conference blog to stimulate a conversation?
Hi Arjan – please do post the link on the blog; be great to stimulate some further debate.
As for instinctual design, I think perhaps you’ve read my other post differently to how I intended. There are some things that simply aren’t measurable. Can I ever know that a particular colour scheme increases a user’s feeling of excitement about a forthcoming product? Is it possible to prove that a particular grouping reinforces the understanding that Service X is aimed at a particular market? I’d say no. I don’t agree with the behaviourist psychologists who say we are simple machines of stimulus and response; and therefore I believe there will always be (dare I say) ineffable aspects to design.
UX is a fascinating mix of science and art, where instinct and evidence are inextricably linked. Neither is more important. I think any attempt to label design to either a ‘scientific’ or ‘creative’ pursuit is to miss the richness the other approach can bring.
Another great resource for measuring usability, complete with calculators: http://www.measuringusability.com
The site was created and developed by Jeff Sauro, whose PhD is from Stanford in Statistics, but works at Oracle in their usability department. He’s done a lot of work on creating reliable, valid usability calculators for small sample sizes.
He tries very hard to make these concepts easy to understand and bring good statistical methods to the usability community.
I’m not his PR guy, just a fan, and wanted to pass the good word.
Hi Cennydd, of course… the world is more nuanced than I made it look. But we have to try to stay as true to what we know to be true as possible. So that’s why I think your part science, part art answer doesn’t help much to develop a more mature design discipline.
I am not saying that behaviourist psychologists are right. I am more generally stating that we have to take a look into what science has to offer to designers. “God” gave us colors so we could differentiate rotten apples from ripe ones.
I would like to suggest that any designer who doesn’t know those kinds of basics gets thrown out of the basket, before the whole bunch gets rotten.
Thanks for the post Cennydd. Just yesterday I was having some people waving around some A/B testing results that just looked too fishy. Putting the test in the Statistical Significance tool provided by you it proved my point: they were fishy. :)
I want to have a better understanding of statistics and how they can be applied to a/b testing and marketing, but kind of don’t know where to start from. Can you point out some resources, please?
Conversion rate uplift percentages can be confusing, for sure. I did a post on this not so long ago (shameless plug): http://bit.ly/zZUHr
I prefer to refer to Design Research as “detective work” rather than science. Sure, some elements involve scientific methods, but on their own, scientific methods cannot create good design.
Eep…I didn’t know that there were people out there doing A/B tests without statistical significance…
I also report Confidence Intervals, which basically says I’m 95% sure that the conversion percentage is x +/- y. I like confidence intervals because you’re reporting what the data IS, rather than making a statement on what it’s NOT. The Measuring Usability site has a nice CI calculator.
This article sparked a lively conversation within our company(SlideShare)! Thanks for writing it. Now to nitpick.
GWO and other tools have hard-core multivariate statistics baked in. The tools give you a specific confidence estimate, and tell you very clearly when your trial isn’t big enough to yield real data yet. Given that these tools are the easiest way by far to run an A/B test, I doubt that the problem of statistical cluelessness is as bad as you describe. Not because we all understand the statistics, but because in general we’ve outsourced that understanding to tools that do.
[...] This experiment will complete when a) there are at least 1000 participants for each alternative, that’s a big enough sample size, and b) one alternative stands out with probability of 95% or more. (You want to read more about probabilities and interpreting the results) [...]
Hi Cennydd,
An interesting post. In reference to the percentage confusion that you mention when I’m dealing with this I normally refer to the instance in your first example (9.4%) in terms of percentage points. This is distinct from a simple percentage increase as calculated by the difference between the two.
About the significance, in cases where increasing the sample size may be difficult, what about running the test several times to establish whether a pattern in improved conversion can be observed in favour of one design or the other?
Regards, Hugh
Hugh – yes, rerunning the test can be an easy way to increase the sample size. The only caveat is that you need to control other variables, i.e. ensure that the context is as close as possible to the earlier tests. (Similar people, same task, no huge changes to the environment, e.g. a different browser).
Claudiu – I’m sure there are plenty of basic statistics primers out there on the web, but I don’t have any links to hand myself. Should be easy enough to Google. Most of my statistical knowledge comes from A-levels, aged 18. A good teacher is a big help.
Jonathan – it’s great that GWO has significance built in. I’d expect nothing less from a company with such a mathematical mindset :) I still think there’s a danger in people blindly trusting the analysis without at least a basic appreciation of the concept of sampling, but from what you say I needn’t worry too much about flagrant misuse of GWO data. Sadly, I have seen plenty of incorrect straight A/B comparisons, and continue to do so.
Hi Cennydd,
While I agree with your basic premise, I don’t know that I agree with your specific method. The problem I have with PRCOnline’s statistical calculator is that it doesn’t show the confidence level achieved.
In some business contexts, the cost of going from a 90% confidence level to a 95% confidence level is higher than the potential return (either in actual cost of sampling a larger population or in opportunity cost from not choosing a new champion). For instance, for observed improvements greater than 20%, a business can call a winner with a 90% confidence level with as few as 100 samples. Yes, there’s a risk that the small sample size might be masking random chance. And, yes, I agree that the particular test you’ve selected would benefit from receiving over 100 conversions before calling it in favor of one challenger. And, I definitely have seen many shoddy A/B comparisons. But, when we advocate higher bars than are really necessary to reduce uncertainty and make a beneficial business decision, I think we risk turning businesses off to valid testing methods altogether.
An issue I often see ignored in gurus’ stats tools is data contamination during the test period. E.g., say the control has been running for 1 week w/ 10,000 visits, then we kick in test B. After 3 weeks w/ 30,000 additional visits we check the stats. B wins. But wait – what if we check only the last 3 weeks when conditions were *exactly* identical? Looks like the control wins… Basic science – keep all other factors identical during a test.
Cennydd,
Great post. I understand what you’re saying and enthusiastically agree with it.
Having said that, I personally feel the much bigger problem is that way too few people are conducting A/B tests or multi-variate tests. I am a huge believer in the value of such tests. (This is no doubt influenced by the fact that I grew up listening to rapturous soliloquies on the value of Design of Experiments at the dinner table from my dad, William G. Hunter, one of the co-authors of “Statistics for Experimenters”).
At any rate, if I had the choice of using a website designed by (a) people who used A/B testing (without any concept of statistical significance) or (b) people who designed the website without any use of A/B testing, I would choose “choice a” in a heartbeat.
Having said that, if I had a chance to speak to the designers running the A/B tests, would I point out the significant issue of statistical significance to ensure they didn’t draw conclusions too quickly from random variation? Sure, I’d probably even point them to this article.
I’ve written a blog post summarizing the best web video presentation I’ve seen on A/B testing and multi-variate testing (given by Kohavi). If anyone is dealing with skeptics at their place of work who do not feel A/B testing is worth pursuing, I’d recommend it as a fantastic source of useful, eye-opening examples. http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/
- Justin
Cennydd, I’ve come across separate posts on your blog now twice today from different sources. Great stuff. Regarding this one…
YES, yes, and yes! I’d also add that another, perhaps even more fundamental, problem with A/B tests is that they presume that there is in fact a single, binary variable to be tested; or that the variable in question is the right one to test.
Here, as you say, is where the designer’s instinct comes into play. Or, perhaps, intuition is a better word — as this is not innate ability, but the thoughtful application of experience. ABtests.com is full of examples of designers asking absolutely the wrong question. Why fret about nuances like the color or position of a button, when typography is so bad, or hierarchy so confusing, that a user might have no idea why to push it?
Hello,
first of all, I’d have appreciated it if you had linked to media.io so people could check out the site instead of two old screenshots.
You are right that the numbers I presented at ABTests.com might not have been statistically relevant.
What is important to me is that I decided to stop the test and implement the changes and that the conversion rate increased (~25 % better if I’m not wrong). And this is what I’m interested in. I don’t care about the theory behind MV tests.
Also, you can search for perfect data as long as you want, but on the web, it’s a futile effort that prevents you from making decisions.
Johann
Hi Johann, I linked to the screenshots because they illustrate the blog post, giving both ‘before’ and ‘after’ state. Your current site does not.
There’s no such thing as perfect data. Statistics simply doesn’t bend that way. All I care about is good enough data, and the numbers say that your data doesn’t pass that threshold. If that doesn’t concern you, no problem. It’s your site and your opinion, as this is mine.
A great opening experiment with shirt colour change. It is so typical of common practice today. I’m with you Cennydd.
It is far too rare, when I talk to people about their test results and read published tests, that significance is taken into account. Results are almost always published with no reference to sample size or level of significance. I have to conclude it’s because it’s not been considered.
I fear testing is going to get a bad name if common practice doesn’t change, as worse performing versions are picked in some cases.
In the A/B example given above the significance is 80%. Weak but worthy of a further test with larger sample to verify or otherwise.
A/B tests also suffer from assuming total independence of the elements under test. If you change a page headline too much and not the supporting text and images then it’s not a fair test. Multi-variate testing should become more widespread to address this.
Another issue often not considered is fairness of test. Are there environmental factors that mean the traffic or data in use will have bias?
Speaking about noise in an A|B test, I enjoyed this post and think there is some great information regarding designing and executing tests. However, if I can be honest, I found myself getting turned off by the noise of “us vs. them”.
These tests often fail because, be it the designer or the analyst, the groups fail to come together. The designer doesn’t want his trade and experience to be questioned by a numbers geek. The analyst doesn’t want some artist telling him how to analyze user experience.
I’d like to see a post about the flip side that shows examples of how designers going with their “instincts” caused thousands of dollars in lost revenue. If you need examples of this, I have seen it many times, I would be more than happy to share.
[...] As you can see, unlike the traditional web signup form, it’s asking you to fill in the blanks of a short paragraph or sentence with the required information. At the time it was released, it was seen more as a novelty and a curiosity than an innovation in web form design, but recently an A/B test of a similar form against a more traditional form by the team at Vast.com, as blogged by Luke Wroblewski, should make you think twice about it. The test apparently showed a 25-40% increase in conversion from the storytelling forms as opposed to the traditional, stacked field style form. You can see an example of the new madlib form here on Vast.com. Blogging about the results, Jeremy Keith says: That seems to be a statistically-significant result, even accounting for Cennydd’s reality-check on A/B testing. [...]
Excellent post! Your format of multiple concise topics made it very easy to read. My favorite line was your warning about cherry picking: “Sometimes you can’t jump a twenty foot chasm in two ten foot leaps.” The trick is finding the sweet spot in between cherry picking and huge sweeping changes.
[...] if you’re really into the math, Cennydd Bowles takes a deep look at statistical significance and other testing pitfalls. Heavy, but surprisingly readable. I’d also call attention to my comment on the post for a [...]
Great post. I recommended it to my workgroup as a “best primer” on A/B testing. This post is the best “one stop shopping” resource for understanding A/B data pitfalls I have been able to find. Thanks!
[...] assumptions are made that may be incorrect. I don’t even want to get into significance (as often discussed by others), because I assume everyone here knows how to handle that: you do need enough observations [...]
Those having a tough time with A/B tests might also like to have a look at the phenomenon of “repeated significance testing errors” – something that can be even more problematic than the issue of statistical significance outlined in Mr Bowles’s post:
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
“If instead of deciding ahead of time, ‘this experiment will collect exactly 1,000 observations,’ you say, ‘we’ll run it until we see a significant difference,’ all the reported significance levels become meaningless.”
With millions of dollars now pouring into multivariate testing (e.g. SiteSpect et al.), repeated significance testing errors raise the stakes for test-driven design rather high.
[...] The best way to produce significant results is by using a large sample size. I recommend reading Cennydd Bowles’s article on AB testing for more [...]
[...] (I don’t know where the 65th one went.) This leads me into the biggest A/B testing trap, statistical significance. Crazy things can happen by chance, so it’s important to be rigorous about making sure your [...]