AB Test Overview

A non-exhaustive checklist for running an Online Controlled Experiment

Jak Marshall

Sep 05, 2022

You’ve been thinking about running an AB Test for your game.

What do you need?

Let’s run through a high-level, non-exhaustive list at breakneck speed, shall we?

You need:

A live or soft-launched game with players ready to suffer experimentation.
- I know this seems obvious but I have heard talk about “doing AB Tests” without involving real players.
- Get a game out, people!
An AB Testing platform that can robustly handle the testing you want to do.
- This is a big hurdle, as it requires upfront tech investment, leadership buy-in, and skilled staff that can handle this for your game.
- But you can’t do an AB Test without the infrastructure. Sorry!
- You need to be able to randomly allocate users to test groups, and handle the code changes required for each treatment group (more on this later).
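To make the random allocation part concrete, here is a minimal sketch of one common approach: hashing the player ID together with the test name so each player always lands in the same group. The function name `assign_group` and its parameters are hypothetical, and a real AB Testing platform will do a lot more than this (logging exposures, unequal splits, exclusions and so on).

```python
import hashlib


def assign_group(player_id: str, test_name: str, groups: list[str]) -> str:
    """Deterministically map a player to one of the groups, roughly evenly.

    Hashing the test name with the player ID means the same player can land
    in different groups across different tests, but never changes group
    within a single test.
    """
    key = f"{test_name}:{player_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as a large integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(groups)
    return groups[bucket]


if __name__ == "__main__":
    groups = ["control", "treatment"]
    print(assign_group("player-123", "new_tutorial_test", groups))
```

The nice property is that no lookup table is needed: any server can recompute a player's group from the ID alone.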
A well-defined metric of success for your Test
- e.g. ARPU, CTR, D30
- Make sure everyone knows what these metrics are and how they are defined. Don’t bully people with acronyms like I just did. Put them in a wiki or something.
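One way to pin definitions down is to write them as code next to the wiki entry. The sketch below assumes ARPU means total revenue divided by number of users, CTR means clicks divided by impressions, and D30 means the share of an install cohort that was active exactly 30 days after install. Those are common readings, but your studio may define them differently, which is rather the point. All function names and data shapes here are hypothetical.

```python
from datetime import date, timedelta


def arpu(total_revenue: float, n_users: int) -> float:
    """Average revenue per user: total revenue divided by user count."""
    return total_revenue / n_users if n_users else 0.0


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions if impressions else 0.0


def day_n_retention(
    install_dates: dict[str, date],
    activity: set[tuple[str, date]],
    n: int = 30,
) -> float:
    """Share of installed players who were active exactly n days after install."""
    if not install_dates:
        return 0.0
    retained = sum(
        1
        for player, installed in install_dates.items()
        if (player, installed + timedelta(days=n)) in activity
    )
    return retained / len(install_dates)


if __name__ == "__main__":
    installs = {"a": date(2022, 8, 1), "b": date(2022, 8, 1)}
    seen = {("a", date(2022, 8, 31))}  # only player "a" came back on day 30
    print(day_n_retention(installs, seen))
```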
A pre-meditated target improvement in the success metric
- Say something like “we want D30 to go up by 1 percentage point”.
- Don’t look at the D30 after the test and retroactively claim victory for 0.1 percentage points.
- You also need a good estimate for the current level of the success metric based on real-world data.
- “The D30 is currently 6%”

[Image: “What is A/B testing and what can it be used for?” (Seobility Wiki). Take a break and look at this image.]

Variants to test, and a control group
- The control group is the status quo of your game with no changes.
- The treatment groups (or treatment variants) make changes to the behaviour of your game in an attempt to improve the success metric.
- Your backend testing infrastructure also needs to handle concurrent versions of the game, each running different code depending on which group a player is in, without bugging out. This is no mean feat!
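As a rough illustration of "different code depending on which group each player is in", here is a sketch where each group maps to a set of config overrides and the control group is simply the defaults. The class `GameConfig`, its fields, and the group names are all made up for this example. Real games tend to serve this kind of thing from a remote config service rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameConfig:
    # Hypothetical tunables; the defaults represent the status quo.
    starting_coins: int = 100
    show_daily_bonus: bool = False


# Control changes nothing; each treatment overrides only what it tests.
VARIANTS = {
    "control": GameConfig(),
    "treatment_a": GameConfig(starting_coins=150),
    "treatment_b": GameConfig(show_daily_bonus=True),
}


def config_for(group: str) -> GameConfig:
    """Return the config for a group, falling back to control if unknown."""
    return VARIANTS.get(group, VARIANTS["control"])


if __name__ == "__main__":
    print(config_for("treatment_a"))
    print(config_for("some_group_that_does_not_exist"))
```

Falling back to control for unknown groups is a deliberate choice here: a player with a broken assignment gets the status quo rather than a crash.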
A power analysis
- In a nutshell, it tells you how many users you would need to detect a meaningful change in the success metric.
- “In order to detect a difference from 6% to 7% D30 at the required Test power level, we would need to have X users in each test group”
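For a rate metric like D30, a back-of-the-envelope power analysis can be sketched with the standard normal-approximation formula for comparing two proportions. The significance level (0.05, two-sided) and power (0.8) below are assumed defaults, not values from this article, and the function name `users_per_group` is hypothetical. Treat the result as a starting point for a conversation with your data people, not a final answer.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_group(
    baseline: float, target: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed per group to detect baseline -> target.

    Uses the normal approximation for a two-sided test of two proportions
    with equally sized groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    pooled = (baseline + target) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_power * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


if __name__ == "__main__":
    # The 6% -> 7% D30 example, under the assumed alpha and power.
    print(users_per_group(0.06, 0.07))
```

Note how quickly the requirement grows as the target gets closer to the baseline: halving the difference you want to detect roughly quadruples the number of users you need.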
Post-analysis and communication processes.
- Your data people need time to do due diligence on the results of the Test, as there is a lot of nuance beyond checking p-values.
- They need to check whether surprising or “too good to be true” results can actually be trusted, among other things.
- Give them a public forum to discuss results, even if they are negative or neutral. Learning is the important thing, and a lot of tests will turn up absolute bupkis*. Get used to it!
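To give a flavour of what that due diligence can look like, here is a small sketch with two checks for a rate metric: a two-proportion z-test for the headline p-value, and a sample ratio mismatch check that asks whether a supposedly 50/50 split really came out 50/50. The mismatch check is my example of a trustworthiness check, not something prescribed above, and both function names are hypothetical. Real post-analysis covers far more ground than this.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in rates between groups A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def srm_p_value(n_a: int, n_b: int) -> float:
    """P-value of a chi-square test (1 degree of freedom) for a 50/50 split.

    A tiny value suggests the allocation itself is broken, in which case
    the headline result should not be trusted.
    """
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # For 1 degree of freedom, chi-square is a squared standard normal.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


if __name__ == "__main__":
    print(two_proportion_p_value(600, 10_000, 700, 10_000))
    print(srm_p_value(10_000, 10_000))
```

If the split check fails, stop and investigate the plumbing before anyone celebrates the p-value.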

This is already quite long for a non-exhaustive list, so I’ll stop here. How did you do? Are you ready for AB Tests? If not, bully your colleagues with this article today! Better still, memorise it and take all the credit! I really don’t mind. Laters!

*nothing.

Game Math Done Quick
