Why Many A/B Tests Aren’t Worth It

Note: This post originally appeared as a guest feature on TechCruch to announce our new Test Significance website.

At RJMetrics, we believe in data-driven decisions and that means we do a lot of testing.  However, one of the most important lessons we’ve learned is this: not all tests are worth running.

In a data-driven organization, it’s very tempting to say things like “let’s settle this argument about changing the button font with an A/B test!”  Yes, you certainly could do that.  And you would likely (eventually) declare a winner.  However, you will also have squandered precious resources in search of the answer to a bike shed question.  Testing is good, but not all tests are.  Conserve your resources.  Stop running stupid tests.

The reason for this comes from how statistical confidence is calculated.  The formulas that govern confidence in hypothesis testing reveal an important truth:

Tests where a larger change is observed require a smaller sample size to reach statistical significance.

(If you’d like to dig into why this is the case, a good place to start is Wikipedia’s articles on hypothesis testing and the binomial distribution.)

In other words, the bigger the impact of your change, the sooner you can be confident that the change is not just statistical noise.  This is intuitive but often ignored.  And the implications for early-stage companies are tremendous.

If your site has millions of visitors per month, this isn’t a big deal.  You have enough traffic to hyper-optimize and test hundreds of small changes per month.  But what if, like most start-ups, you only have a few thousand visitors per month?  In these cases, testing small changes can invoke a form of analysis paralysis that prevents you from acting quickly.

Consider a site that has 10,000 visitors per month and has a 5.0% conversion rate.  The table below shows how long it will take to run a “conclusive” test (95% confidence) based on how much the change impacts conversion rate.

Starting
Conversion
Rate
New
Conversion
Rate
Total
Participants
Required
Test
Duration
Required
5.00% 5.20% 185,926 1.5 years
5.00% 5.40% 47,340 5 months
5.00% 5.60% 21,420 2 months
5.00% 5.80% 12,262 37 days
5.00% 6.00% 7,982 24 days
5.00% 6.20% 5,638 17 days
5.00% 6.40% 4,210 12 days
5.00% 6.60% 3,276 10 days
5.00% 6.80% 2,630 8 days
5.00% 6.90% 2,378 7 days
5.00% 7.00% 2,162 6.5 days
5.00% 7.20% 1,814 5.4 days
5.00% 7.40% 1,548 4.6 days
5.00% 7.60% 1,338 4.0 days
5.00% 7.80% 1,170 3.5 days
5.00% 8.00% 1,034 3.1 days

(Data assumes a Bernoulli Trial experiment with a two-tailed hypothesis test and all traffic being split 50/50 between the test groups.)

As you can see, your visitors are precious assets.  Too many start-ups will run that “button font” test, expecting full well that in a best-case scenario it will only impact conversion by a quarter of a percent.  What they don’t appreciate up-front is that this may block their ability to run certain other tests for a year and a half (assuming they don’t end the test prematurely).

When you can’t run many tests, you should test big bets.  A homepage redesign.  A pricing change.  A new “company voice” throughout your copy.  Not only will these tests potentially have a bigger impact, you’ll have confidence sooner if they do.

I found myself making this argument a lot recently here at RJMetrics, so I developed a tool to calculate the required population size for a significant test.  We’ve shared that tool with the world free of charge at Test Significance.  Just input your current conversion rate and your desired confidence interval and it will generate a table like the one above.

TestSigScreenshot-Better

We hope this tool helps a few companies out there learn the lessons we have about when to test and what to expect in terms of finding a conclusive result.

 

  • http://www.firebrandtraining.co.uk/ marcus austin

    Nice tool but you may want to change the page description tag on the testsignificance.com home page. It currently reads “A free tool for translating MySQL queries into MongoDB. Helpful for SQL users who want to learn about MongoDB by building on their existing knowledge.” And it’s none of these things ;-)
    marcus

  • vinay

    The participants size above looks a way different then the example sighted by google researchers at https://support.google.com/analytics/answer/2844870?hl=en

    Briefly, for a conversion improvement from 4% to 5% the google link says it requires 22,330 observations, where as testsignificance.com says only 6,600 (both are for 95% confidence level). Am I missing something?

    Thanks,
    Vinay

    • Robert J. Moore

      I’m not totally sure where Google’s 22,330 number is coming from. Our data assumes a Bernoulli Trial experiment with a two-tailed hypothesis test and all traffic being split 50/50 between the test groups. I just ran these numbers again and our 6,600 number is definitely valid for those parameters.

      My only thought on how Google’s number might be so much higher is what’s actually being tested. Note that in most A/B tests, what you are testing is not that B is 1% better than A, but simply that B is better than A in any way. The observed difference simply happens to be 1%. It’s possible that Google is testing not just that B is superior to A, but that it’s superior by at least a threshold of 1%. This would certainly increase the number of required data points.