<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Statistical significance &amp; other A/B test pitfalls</title>
	<atom:link href="http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/</link>
	<description>Design, technology, doing things differently.</description>
	<lastBuildDate>Fri, 12 Mar 2010 21:00:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Madlib-style forms increase conversion by 30%? Well, maybe &#8230;</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-13752</link>
		<dc:creator>Madlib-style forms increase conversion by 30%? Well, maybe &#8230;</dc:creator>
		<pubDate>Wed, 03 Mar 2010 05:28:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-13752</guid>
		<description>[...] As you can see, unlike the traditional web signup form, it&#8217;s asking you to fill in the blanks of a short paragraph or sentence with the required information. At the time it was released, it was seen more as a novelty and a curiosity than an innovation in web form design, but recently an A/B test of a similar form against a more traditional form by the team at Vast.com, as blogged by Luke Wroblewski, should make you think twice about it. The test apparently showed a 25-40% increase in conversion from the storytelling forms as opposed to the traditional, stacked field style form. You can see an example of the new madlib form here on Vast.com. Blogging about the results, Jeremy Keith says: That seems to be a statistically-significant result, even accounting for Cennydd’s reality-check on A/B testing. [...]</description>
		<content:encoded><![CDATA[<p>[...] As you can see, unlike the traditional web signup form, it&#8217;s asking you to fill in the blanks of a short paragraph or sentence with the required information. At the time it was released, it was seen more as a novelty and a curiosity than an innovation in web form design, but recently an A/B test of a similar form against a more traditional form by the team at Vast.com, as blogged by Luke Wroblewski, should make you think twice about it. The test apparently showed a 25-40% increase in conversion from the storytelling forms as opposed to the traditional, stacked field style form. You can see an example of the new madlib form here on Vast.com. Blogging about the results, Jeremy Keith says: That seems to be a statistically-significant result, even accounting for Cennydd’s reality-check on A/B testing. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jason</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-8793</link>
		<dc:creator>Jason</dc:creator>
		<pubDate>Thu, 14 Jan 2010 20:34:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-8793</guid>
		<description>Speaking about noise in an A&#124;B test, I enjoyed this post and think there is some great information regarding designing and executing tests; however, if I can be honest, I found myself getting turned off by the noise of &quot;us vs. them&quot;.

These tests often fail because, be it the designer or the analyst, the groups fail to come together.  The designer doesn&#039;t want his trade and experience to be questioned by a numbers geek.  The analyst doesn&#039;t want some artist telling him how to analyze user experience.  

I&#039;d like to see a post about the flip side that shows examples of how designers going with their &quot;instincts&quot; caused thousands of dollars in lost revenue.  If you need examples of this, I have seen it many times and would be more than happy to share.</description>
		<content:encoded><![CDATA[<p>Speaking about noise in an A|B test, I enjoyed this post and think there is some great information regarding designing and executing tests; however, if I can be honest, I found myself getting turned off by the noise of &#8220;us vs. them&#8221;.</p>
<p>These tests often fail because, be it the designer or the analyst, the groups fail to come together.  The designer doesn&#8217;t want his trade and experience to be questioned by a numbers geek.  The analyst doesn&#8217;t want some artist telling him how to analyze user experience.  </p>
<p>I&#8217;d like to see a post about the flip side that shows examples of how designers going with their &#8220;instincts&#8221; caused thousands of dollars in lost revenue.  If you need examples of this, I have seen it many times and would be more than happy to share.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim Watson</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-8550</link>
		<dc:creator>Tim Watson</dc:creator>
		<pubDate>Tue, 12 Jan 2010 08:49:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-8550</guid>
		<description>A great opening experiment with the shirt colour change. It is so typical of common practice today. I&#039;m with you, Cennydd.

It is far too rare, when I talk to people about their test results and read published tests, that significance is taken into account. Results are almost always published with no reference to sample size or level of significance. I have to conclude it&#039;s because it has not been considered.

I fear testing is going to get a bad name if common practice doesn&#039;t change, as worse-performing versions are picked in some cases.

In the A/B example given above the significance is 80%. Weak, but worthy of a further test with a larger sample to verify or otherwise.

A/B tests also suffer from assuming total independence of the elements under test. If you change a page headline too much and not the supporting text and images, then it&#039;s not a fair test. Multi-variate testing should become more widespread to address this.

Another issue often not considered is fairness of the test. Are there environmental factors that mean the traffic or data in use will be biased?</description>
		<content:encoded><![CDATA[<p>A great opening experiment with the shirt colour change. It is so typical of common practice today. I&#8217;m with you, Cennydd.</p>
<p>It is far too rare, when I talk to people about their test results and read published tests, that significance is taken into account. Results are almost always published with no reference to sample size or level of significance. I have to conclude it&#8217;s because it has not been considered.</p>
<p>I fear testing is going to get a bad name if common practice doesn&#8217;t change, as worse-performing versions are picked in some cases.</p>
<p>In the A/B example given above the significance is 80%. Weak, but worthy of a further test with a larger sample to verify or otherwise.</p>
<p>A/B tests also suffer from assuming total independence of the elements under test. If you change a page headline too much and not the supporting text and images, then it&#8217;s not a fair test. Multi-variate testing should become more widespread to address this.</p>
<p>Another issue often not considered is fairness of the test. Are there environmental factors that mean the traffic or data in use will be biased?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cennydd Bowles</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-8380</link>
		<dc:creator>Cennydd Bowles</dc:creator>
		<pubDate>Sun, 10 Jan 2010 16:36:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-8380</guid>
		<description>Hi Johann, I linked to the screenshots because they illustrate the blog post, giving both &#039;before&#039; and &#039;after&#039; states. Your current site does not.

There&#039;s no such thing as perfect data. Statistics simply doesn&#039;t bend that way. All I care about is good enough data, and the numbers say that your data doesn&#039;t pass that threshold. If that doesn&#039;t concern you, no problem. It&#039;s your site and your opinion, as this is mine.</description>
		<content:encoded><![CDATA[<p>Hi Johann, I linked to the screenshots because they illustrate the blog post, giving both &#8216;before&#8217; and &#8216;after&#8217; states. Your current site does not.</p>
<p>There&#8217;s no such thing as perfect data. Statistics simply doesn&#8217;t bend that way. All I care about is good enough data, and the numbers say that your data doesn&#8217;t pass that threshold. If that doesn&#8217;t concern you, no problem. It&#8217;s your site and your opinion, as this is mine.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Johann</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-8376</link>
		<dc:creator>Johann</dc:creator>
		<pubDate>Sun, 10 Jan 2010 15:29:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-8376</guid>
		<description>Hello,

First of all, I&#039;d have appreciated it if you had linked to &lt;a href=&quot;http://media.io&quot; rel=&quot;nofollow&quot;&gt;media.io&lt;/a&gt; so people could check out the site instead of two old screenshots.

You are right that the numbers I presented at ABTests.com might not have been statistically relevant.

What is important to me is that I decided to stop the test and implement the changes, and that the conversion rate increased (~25% better, if I&#039;m not wrong). And this is what I&#039;m interested in. I don&#039;t care about the theory behind MV tests.

Also, you can search for perfect data as long as you want, but on the web, it&#039;s a futile effort that prevents you from making decisions.

Johann</description>
		<content:encoded><![CDATA[<p>Hello,</p>
<p>First of all, I&#8217;d have appreciated it if you had linked to <a href="http://media.io" rel="nofollow">media.io</a> so people could check out the site instead of two old screenshots.</p>
<p>You are right that the numbers I presented at ABTests.com might not have been statistically relevant.</p>
<p>What is important to me is that I decided to stop the test and implement the changes, and that the conversion rate increased (~25% better, if I&#8217;m not wrong). And this is what I&#8217;m interested in. I don&#8217;t care about the theory behind MV tests.</p>
<p>Also, you can search for perfect data as long as you want, but on the web, it&#8217;s a futile effort that prevents you from making decisions.</p>
<p>Johann</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: links for 2010-01-07 &#124; Small Farm Design</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-8055</link>
		<dc:creator>links for 2010-01-07 &#124; Small Farm Design</dc:creator>
		<pubDate>Thu, 07 Jan 2010 18:01:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-8055</guid>
		<description>[...] Statistical significance &amp; other A/B test pitfalls : Cennydd Bowles on user experience Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads. (tags: testing usability analytics a/b) [...]</description>
		<content:encoded><![CDATA[<p>[...] Statistical significance &amp; other A/B test pitfalls : Cennydd Bowles on user experience Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads. (tags: testing usability analytics a/b) [...]</p>
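<p>The coin-toss arithmetic quoted in the excerpt above can be checked by hand. What follows is a minimal sketch, assuming a pooled two-proportion z-test is an acceptable way to compare the two runs; the function name and the choice of test are illustrative assumptions, not necessarily the method used in the original post.</p>

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test (hypothetical helper for illustration).

    Returns the z statistic and the two-sided p-value under the
    normal approximation.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pool both samples to estimate the shared rate under the null hypothesis.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Figures quoted in the excerpt: 49 heads, then 51 heads, 100 tosses each.
heads_before, heads_after, tosses = 49, 51, 100

relative_lift = heads_after / heads_before - 1
z, p_value = two_proportion_z_test(heads_before, tosses, heads_after, tosses)

print(f"relative lift: {relative_lift:.1%}")
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

<p>Under these assumptions the relative lift works out to the quoted 4.1%, while the two-sided p-value is roughly 0.78, so a gap this size between two runs of 100 tosses is entirely consistent with chance.</p>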
]]></content:encoded>
	</item>
	<item>
		<title>By: Jay Harlow</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-7967</link>
		<dc:creator>Jay Harlow</dc:creator>
		<pubDate>Wed, 06 Jan 2010 21:42:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-7967</guid>
		<description>Cennydd, I&#039;ve come across separate posts on your blog now twice today from different sources. Great stuff. Regarding this one...

YES, yes, and yes! I&#039;d also add that another, perhaps even more fundamental, problem with A/B tests is that they presume that there is in fact a single, binary variable to be tested; or that the variable in question is the right one to test.

Here, as you say, is where the designer&#039;s instinct comes into play. Or, perhaps, intuition is a better word -- as this is not innate ability, but the thoughtful application of experience. ABtests.com is full of examples of designers asking absolutely the wrong question. Why fret about nuances like the color or position of a button, when typography is so bad, or hierarchy so confusing, that a user might have no idea why to push it?</description>
		<content:encoded><![CDATA[<p>Cennydd, I&#8217;ve come across separate posts on your blog now twice today from different sources. Great stuff. Regarding this one&#8230;</p>
<p>YES, yes, and yes! I&#8217;d also add that another, perhaps even more fundamental, problem with A/B tests is that they presume that there is in fact a single, binary variable to be tested; or that the variable in question is the right one to test.</p>
<p>Here, as you say, is where the designer&#8217;s instinct comes into play. Or, perhaps, intuition is a better word &#8212; as this is not innate ability, but the thoughtful application of experience. ABtests.com is full of examples of designers asking absolutely the wrong question. Why fret about nuances like the color or position of a button, when typography is so bad, or hierarchy so confusing, that a user might have no idea why to push it?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Justin Hunter</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-5642</link>
		<dc:creator>Justin Hunter</dc:creator>
		<pubDate>Mon, 07 Dec 2009 01:54:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-5642</guid>
		<description>Cennydd, 

Great post.  I understand what you&#039;re saying and enthusiastically agree with it. 

Having said that, I personally feel the much bigger problem is that way too few people are conducting A/B tests or multi-variate tests.  I am a huge believer in the value of such tests.  (This is no doubt influenced by the fact that I grew up listening to rapturous soliloquies on the value of Design of Experiments at the dinner table from my dad, William G. Hunter, one of the co-authors of &quot;Statistics for Experimenters&quot;).  

At any rate, if I had the choice of using a website designed by (a) people who used A/B testing (without any concept of statistical significance) or (b) people who designed the website without any use of A/B testing, I would choose &quot;choice a&quot; in a heartbeat.

Having said that, if I had a chance to speak to the designers running the A/B tests, would I point out the significant issue of statistical significance to ensure they didn&#039;t draw conclusions too quickly from random variation?  Sure, I&#039;d probably even point them to this article.  

I&#039;ve written a blog post summarizing the best web video presentation I&#039;ve seen on A/B testing and multi-variate testing (given by Kohavi).  If anyone is dealing with skeptics at their place of work who do not feel A/B testing is worth pursuing, I&#039;d recommend it as a fantastic source of useful, eye-opening examples.  http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/

- Justin</description>
		<content:encoded><![CDATA[<p>Cennydd, </p>
<p>Great post.  I understand what you&#8217;re saying and enthusiastically agree with it. </p>
<p>Having said that, I personally feel the much bigger problem is that way too few people are conducting A/B tests or multi-variate tests.  I am a huge believer in the value of such tests.  (This is no doubt influenced by the fact that I grew up listening to rapturous soliloquies on the value of Design of Experiments at the dinner table from my dad, William G. Hunter, one of the co-authors of &#8220;Statistics for Experimenters&#8221;).  </p>
<p>At any rate, if I had the choice of using a website designed by (a) people who used A/B testing (without any concept of statistical significance) or (b) people who designed the website without any use of A/B testing, I would choose &#8220;choice a&#8221; in a heartbeat.</p>
<p>Having said that, if I had a chance to speak to the designers running the A/B tests, would I point out the significant issue of statistical significance to ensure they didn&#8217;t draw conclusions too quickly from random variation?  Sure, I&#8217;d probably even point them to this article.  </p>
<p>I&#8217;ve written a blog post summarizing the best web video presentation I&#8217;ve seen on A/B testing and multi-variate testing (given by Kohavi).  If anyone is dealing with skeptics at their place of work who do not feel A/B testing is worth pursuing, I&#8217;d recommend it as a fantastic source of useful, eye-opening examples.  <a href="http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/" rel="nofollow">http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/</a></p>
<p>- Justin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: links for 2009-12-04 &#171; Köszönjük, Emese!</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-5437</link>
		<dc:creator>links for 2009-12-04 &#171; Köszönjük, Emese!</dc:creator>
		<pubDate>Fri, 04 Dec 2009 11:53:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-5437</guid>
		<description>[...] Statistical significance &amp; other A/B test pitfalls It’s logical and laudable that designers should seek data in our quest for verifiability and return on investment. But data must be handled with care, and mathematical rigour isn’t a common part of a designer’s repertoire. (tags: testing a/b statistics analysis) [...]</description>
		<content:encoded><![CDATA[<p>[...] Statistical significance &amp; other A/B test pitfalls It’s logical and laudable that designers should seek data in our quest for verifiability and return on investment. But data must be handled with care, and mathematical rigour isn’t a common part of a designer’s repertoire. (tags: testing a/b statistics analysis) [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Al</title>
		<link>http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/comment-page-1/#comment-5397</link>
		<dc:creator>Al</dc:creator>
		<pubDate>Thu, 03 Dec 2009 19:28:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.cennydd.co.uk/?p=1386#comment-5397</guid>
		<description>An issue I often see ignored in gurus&#039; stats tools is data contamination during the test period. E.g., say the control has been running for 1 week w/ 10,000 visits, then we kick in test B. After 3 weeks w/ 30,000 additional visits we check the stats. B wins. But wait - what if we check only the last 3 weeks when conditions were *exactly* identical? Looks like the control wins... Basic science - keep all other factors identical during a test.</description>
		<content:encoded><![CDATA[<p>An issue I often see ignored in gurus&#8217; stats tools is data contamination during the test period. E.g., say the control has been running for 1 week w/ 10,000 visits, then we kick in test B. After 3 weeks w/ 30,000 additional visits we check the stats. B wins. But wait &#8211; what if we check only the last 3 weeks when conditions were *exactly* identical? Looks like the control wins&#8230; Basic science &#8211; keep all other factors identical during a test.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
