Experiments

An Experiment is an A/B test — a scientific way to compare two or more versions of a feature and measure which one performs better. Instead of guessing whether a change is good, you let real users tell you through their behavior.


What Is an A/B Test?

An A/B test works like this:

  1. You have two (or more) versions of something (Version A and Version B)
  2. You split your users into groups — some see Version A, others see Version B
  3. You measure which version produces better results (more purchases, more signups, etc.)
  4. The winning version is rolled out to everyone

FlagPal makes A/B testing easy by handling the user splitting, feature delivery, and metric tracking for you.


A Real-World Example

Your team has two ideas for a "Buy Now" button:

  • Version A: A button that says "Buy Now"
  • Version B: A button that says "Add to Cart"

You want to know which one leads to more purchases. Here's what you do:

  1. Create a Feature Flag called buy_button_text (String type), optionally with rules for the values "Buy Now" and "Add to Cart"
  2. Create a Metric called Purchase Completed (Boolean type) — records when a purchase happens
  3. Create an Experiment with:
    • Targeting Rules: buy_button_text equals (empty) (an important rule that ensures only users with no previous value enter the experiment; otherwise, users who already entered this Experiment would be enrolled again)
    • Variant A: buy_button_text = "Buy Now" (weights default to 50% of users)
    • Variant B: buy_button_text = "Add to Cart" (weights default to 50% of users)
    • Attach two Metrics: Experiment Started (your exposure or impression metric) and Purchase Completed (your goal metric)
  4. Run the experiment for a couple of weeks
  5. FlagPal shows you the results — which button led to more purchases
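On the application side, the steps above reduce to reading the flag value and recording metrics. A minimal sketch, assuming a hypothetical SDK — `DemoFlagClient` and its methods are stand-ins for whatever feature-flag client you use, not FlagPal's documented API:

```python
# Hypothetical sketch: class and method names are illustrative stand-ins,
# NOT FlagPal's documented API.
class DemoFlagClient:
    def __init__(self):
        self.recorded = []

    def get_flag(self, flag, user_id, default):
        # In production the experiment decides this value per user;
        # here one variant's value is hard-coded for illustration.
        return "Add to Cart"

    def record_metric(self, metric, user_id):
        self.recorded.append((metric, user_id))

client = DemoFlagClient()

# 1. Read the flag -- the experiment has already bucketed this user.
label = client.get_flag("buy_button_text", user_id="user-42", default="Buy Now")

# 2. Record the exposure metric when the button is shown.
client.record_metric("experiment_started", user_id="user-42")

# 3. Record the goal metric when the user completes a purchase.
client.record_metric("purchase_completed", user_id="user-42")
```

The key point is that the app only reads a value and reports events; the bucketing, targeting, and analysis all happen on the experiment side.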

The Components of an Experiment

Name

A descriptive name for your experiment. Example: "Buy Button Color Test Q1 2024"

Description

What you're testing and why. This is helpful for your team to understand the context.

Active Toggle

Start and stop the experiment without deleting it.

Traffic Percentage

The percentage of total available traffic that enters this experiment. This is different from how traffic is split between variants.

Example:

  • Traffic Percentage = 50% → only half your users will be in the experiment at all
  • Of those, 50% see Variant A and 50% see Variant B
  • The other 50% of your users are unaffected and see the default experience
  • If a second Experiment targets 100% of users and its rules exclude users from the first Experiment, it only has the remaining 50% of traffic to work with.

Why would you limit traffic? Because you might want to run experiments cautiously — especially for high-risk changes.
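Traffic percentage and variant weights multiply. A quick check with the numbers from the example above (the total user count is made up for illustration):

```python
total_users = 10_000
traffic_pct = 0.50     # half of all users enter the experiment
variant_split = 0.50   # of those, half see each variant

in_experiment = total_users * traffic_pct    # users inside the experiment
per_variant = in_experiment * variant_split  # users per variant
unaffected = total_users - in_experiment     # users who see the default
```

So at 50% traffic with two equal variants, each variant sees only 25% of your total users.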

Variants (Feature Sets)

Each variant in an experiment is called a Feature Set. It defines:

  • A name (e.g., "Control", "Variant A", "Variant B")
  • The Feature Flag values for that variant
  • A weight — how much of the experiment traffic goes to this variant. By default, weight is split across all variants equally.

Example variants for the buy button experiment:

Variant                 | Feature Values                  | Weight
Control (buy now)       | buy_button_text = "Buy Now"     | 1
Variant B (add to cart) | buy_button_text = "Add to Cart" | 1

Equal weights = equal split. You can make the split unequal — for example, give the new variant only 10% of traffic if you want to test it cautiously.
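FlagPal assigns users for you, but the weighting logic can be sketched as a stable hash of the user ID mapped into weighted buckets. This is an illustrative approach, not necessarily FlagPal's exact implementation:

```python
import hashlib

def pick_variant(user_id: str, weights: dict[str, int]) -> str:
    """Map a user into a variant proportionally to its weight.

    Hashing the user ID makes the assignment stable: the same user
    always lands in the same variant.
    """
    total = sum(weights.values())
    # Stable bucket in [0, total) derived from the user ID.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % total
    for name, weight in weights.items():
        if bucket < weight:
            return name
        bucket -= weight
    raise AssertionError("unreachable")

# Equal weights produce a roughly equal split across many users;
# weights like {"Control": 9, "Variant B": 1} would test cautiously.
counts = {"Control": 0, "Variant B": 0}
for i in range(1000):
    counts[pick_variant(f"user-{i}", {"Control": 1, "Variant B": 1})] += 1
```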

Targeting Rules

Like Experiences, you can use rules to target your experiment at specific users. For example, run the experiment only for users in a particular country, or only for users on a specific plan.

IMPORTANT: Even though targeting rules are designed for flexibility and are therefore optional, it is considered best practice to always include at least one targeting rule: the one that ensures only users with no previous value see the experiment. Otherwise, users who already entered this Experiment would be enrolled again, skewing your results.
If you're an advanced user, you can use the flexibility of targeting rules (or skip them entirely) to create more complex experiments that are not possible in alternative tools.

Metrics

Metrics are what you're measuring. You attach one or more Metrics to an Experiment to track which variant wins.

Learn more about Metrics →


Reading Experiment Results

The examples below illustrate the concept. For a full guide, see Reading Experiment Results.

Once your experiment is running and collecting data, you can view the results on the Experiment's View page. You'll see:

Result Summary Table

You'll see results broken down by variant:

Variant   | Exposure metric | Goal metric | Ratio | Probability to be the best
Variant A | 1010            | 30          | 2.97  | 1%
Variant B | 1017            | 51          | 5.01  | 99% (Results are statistically significant!)

What each column means:

  • Variant: The name of the variant you're testing.
  • Exposure metric: The number of times the exposure Metric was recorded, i.e. how many users saw the variant. Use the drop-down to select any Metric collected in this Experiment (e.g., clicked a button, visited a page, or in our case: Entered the Experiment).
  • Goal metric: The number of times the goal Metric was achieved (same as above, you can select any Metric recorded in this Experiment).
  • Ratio: The ratio of the goal metric to the exposure metric for each variant. Using a ratio makes it easier to compare different types of metrics: clicks vs. conversions for a conversion rate, or conversions vs. revenue for average order value (AOV).
  • Probability to be the best: The probability that this variant is the best performing variant.
  • Results are statistically significant!: Indicates that the difference between variants is not due to random chance.
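The Ratio column is simply the goal count divided by the exposure count, shown here as a percentage. Reproducing the numbers from the table above:

```python
results = {
    "Variant A": {"exposures": 1010, "goals": 30},
    "Variant B": {"exposures": 1017, "goals": 51},
}

for name, r in results.items():
    # Goal events per exposure, expressed as a percentage.
    r["ratio"] = round(100 * r["goals"] / r["exposures"], 2)
```

This yields 2.97 for Variant A and 5.01 for Variant B, matching the summary table.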

Charts

Visual charts showing how each variant performed over time. This helps you spot trends — for example, if one variant started strong but faded over time.

Statistical Significance

FlagPal helps you understand whether the difference between variants is real or just random chance. This is called statistical significance. It's tempting to stop an experiment as soon as one variant looks like it's winning — but this can lead to false conclusions. Wait until you have enough data for the results to be statistically significant.
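One common way to compute a "probability to be the best" is Bayesian: sample each variant's conversion rate from a Beta posterior and count how often one variant beats the other. The sketch below applies that approach to the table's numbers; it is an illustration of the idea, not necessarily the exact method FlagPal uses:

```python
import random

def prob_b_beats_a(a_goals, a_exp, b_goals, b_exp, draws=20_000, seed=7):
    """Estimate P(rate_B > rate_A) by Monte Carlo sampling from
    Beta(1 + goals, 1 + exposures - goals) posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        ra = rng.betavariate(1 + a_goals, 1 + a_exp - a_goals)
        rb = rng.betavariate(1 + b_goals, 1 + b_exp - b_goals)
        wins += rb > ra
    return wins / draws

# Numbers from the result summary table above.
p = prob_b_beats_a(30, 1010, 51, 1017)
```

With this data the estimate comes out close to the 99% shown in the table; with far fewer observations the same rates would yield a much less decisive probability, which is why waiting for enough data matters.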


Experiment Best Practices

Define Your Hypothesis First

Before running an experiment, write down: "We believe that [change] will result in [outcome] because [reason]." This keeps you honest and helps you interpret results.

Run One Experiment at a Time or Split Traffic with Targeting Rules

If you're testing two different things on the same page simultaneously, it's hard to know which change caused the result. It's best to either run one experiment at a time or split traffic with Targeting Rules (for example, testing headlines with a US audience and button texts with an EU audience).

Give It Enough Time

Experiments need time to collect enough data for meaningful results. A minimum of one or two weeks is usually recommended (and longer if your traffic is low).

Measure the Right Metrics

The metric you measure should be directly connected to the feature you're testing. Don't try to measure everything — focus on what matters.

Document Your Results

Win or lose, write down what you learned. Over time, this builds institutional knowledge about what works for your users.

Don't Assume

If your hypothesis was correct for one audience, it doesn't mean it will be correct for another. Always test and validate your assumptions with data. This usually means multiple Experiments.


Experiment Lifecycle

  1. Draft — Create the experiment and configure variants
  2. Active — Turn it on and start collecting data
  3. Analysing — Review the results
  4. Complete — Roll out the winner and turn off the experiment
  5. Archived — Keep a record of past experiments

Copying an Experiment

If you want to run a similar experiment to one you've done before, use the Replicate function. This creates a copy of the experiment with all its settings, which you can modify and run again.


Related

  • Feature Flags — the flags used in experiments
  • Metrics — what you measure in experiments
  • Experiences — for rolling out the winning variant
  • Actors — the users being split into variants