Understanding Your A/B Testing Data And When To Declare A Winner

Hey there! Justin Christianson here, the co-founder and president of Conversion Fanatics. And today, I want to talk about how to analyze your split-test data and when to declare a test a winner.

One of the biggest mistakes people make is not gathering a big enough sample size. Maybe you have a test that’s winning by 20 or 30% or more early on, and you’ve only run it for a few days. Maybe you’ve got forty or so conversions on that test and you immediately think you’ve got a winner. It’s showing improvement, so you make it the new control.

Well, the problem is that you don’t have a big enough sample size. To get statistically significant, or statistically confident, results, you need a lot more data than that.

For example, we recently had a test that was winning by 20 or 30% (I can’t recall the exact figure). But as we gathered more data and let it run longer, it trended downward and ended up losing by a handful of percentage points (around five or six).

So if we had assumed that test was a winner, we would have led ourselves down the wrong path and actually hurt our conversions long-term.

What we generally recommend is not even looking at the data until you have at least fifty conversions per variation and at least a full calendar week of traffic, because weekend traffic converts differently than weekday traffic. Once you have a big enough sample size, run the test to statistical significance (at least 95%).
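If you want to gut-check the numbers yourself instead of just trusting a tool’s dashboard, here’s a minimal sketch in Python of the kind of calculation a significance check involves, using a standard two-proportion z-test. The traffic and conversion numbers are hypothetical, and this isn’t the exact math any particular testing tool (or our spreadsheet) uses; it’s just to show how confidence and sample size relate.

```python
import math

def confidence_level(visitors_a, conversions_a, visitors_b, conversions_b):
    """Approximate two-sided confidence that variation B's conversion rate
    differs from A's, using a standard two-proportion z-test."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 0.0
    z = (rate_b - rate_a) / std_err
    # Normal CDF via the error function; confidence = 1 - two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return 1 - p_value

# Hypothetical numbers: 2,000 visitors per variation, 100 vs. 126 conversions
# (about a 26% lift). Also check the fifty-conversions-per-variation rule of thumb.
confidence = confidence_level(2000, 100, 2000, 126)
enough_conversions = min(100, 126) >= 50
print(f"confidence: {confidence:.1%}, enough conversions: {enough_conversions}")
```

Notice that even a 26% lift with more than fifty conversions on each side can come in under 95% confidence in a calculation like that, which is exactly why we keep tests running longer than feels necessary.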

Some of the testing software out there is pretty liberal in how it calculates statistically confident results. Case in point: we recently had a test that showed a 90-plus percent improvement and 99 percent confidence after around three days.

Well, that raised some red flags for us, because three days isn’t a big enough window and we only had about 50 conversions on the winning variation. Yes, it looked like a massive improvement. But when we let the test run for about ten days instead of three, the data leveled off and we ended up with roughly a 30% improvement, not 90%.

Yes, it was still an improvement, but if we had called it at three days, we wouldn’t have had accurate data to inform our future tests.

So what we generally look at when we A/B test is statistical confidence. But if it’s on a short timeframe or the sample size is too small, we’ll let the test run longer.

We look for trends in the data. If something stays consistently up over the course of an experiment, chances are it’s a winner. But if it’s up and down or variable, you might need a bigger sample size. And if it’s dead even or fluctuates by only a percentage point or two, you might want to cut that test and move on, because reaching significance and getting statistically sound data is going to be difficult.
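To give a sense of why those barely-moving tests are usually worth cutting, here’s a rough sample-size sketch, again in Python and using a standard textbook approximation rather than our own spreadsheet. The 5% baseline conversion rate is just an assumption for illustration.

```python
import math

def visitors_per_variation(baseline_rate, relative_lift,
                           z_alpha=1.96, z_beta=0.8416):
    """Rough visitors needed per variation to detect a given relative lift
    at roughly 95% confidence and 80% power (two-proportion approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assuming a 5% baseline conversion rate:
print(visitors_per_variation(0.05, 0.30))  # ~30% lift: roughly 3,800 visitors each
print(visitors_per_variation(0.05, 0.02))  # ~2% lift: roughly 750,000 visitors each
```

In other words, verifying a one- or two-percent difference can take orders of magnitude more traffic than a thirty-percent one, so it’s usually smarter to move on to the next idea.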

As a side note, when we’re split-testing, we always pass the test data along to our analytics platform so we can get a second set of eyes on it. Then we run our own calculations for statistical confidence.

You can do a search for a statistical significance calculator online. I believe VWO, Convert, and Optimizely all have their own calculators you can use.

Personally, we use a spreadsheet created by an Oxford mathematician to make sure we have all the correct data for our clients.

The biggest thing is making sure you gather a big enough sample size. Be patient with the results. Split-testing is a long game, and it isn’t going to happen miraculously overnight. But if you do it effectively and you follow the data, you’ll get exponential growth.

That’s all I have to share with you today. If you enjoyed this content or know any other people who might benefit from it, be sure to like us and share.

We’ll talk to you again soon. Thanks.