How to Measure AI Visibility (and Why Most Numbers Are Wrong)

GEOforge Research · Methodology

We've run 85,585 ChatGPT queries to measure AI visibility properly. Here's why one prompt run tells you almost nothing.

By Paris Childress, GEOforge · 6 min read

85,585 LLM calls run

88 measurement cycles

35–50 repeats per prompt

21 brands measured

Almost every AI visibility number you've seen was produced the same way: someone asked a model a question, looked at the answer, and wrote down whether the brand was there. That method is broken, and it's worth understanding exactly why.

The problem: the thing you're measuring won't hold still

Large language models are probabilistic. Ask the same question twice and you can get two different answers, with different brands named. In our own testing, the brands ChatGPT named in response to a repeated buying question changed 43% of the time. So a single run doesn't capture your visibility. It captures one random draw from a distribution, and you have no idea where in that distribution your draw landed.

Why one run is a sample size of one

Imagine judging a coin as "always lands heads" because you flipped it once and saw heads. That's what a single-prompt visibility check does. Suppose your brand actually appears in 30% of answers to a given question. If it happens to appear in the one run you check, the tool reports 100%. If it doesn't, the tool reports 0%. Both are "true" for that one run, and both are useless as a measure of where you stand. The only way to recover the real rate is to run the prompt many times and count.
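The coin-flip point can be checked with a small simulation. The sketch below is illustrative only: it assumes a brand with a true mention rate of 30%, as in the example above, and uses a random number generator in place of real model calls. The function name `simulate_mention_rate` and the seed are hypothetical, not part of any GEOforge tooling.

```python
import random


def simulate_mention_rate(true_rate: float, runs: int, rng: random.Random) -> float:
    """Return the observed mention rate over `runs` simulated answers."""
    hits = sum(1 for _ in range(runs) if rng.random() < true_rate)
    return hits / runs


rng = random.Random(42)  # fixed seed so the sketch is repeatable
true_rate = 0.30         # assumed true mention rate, for illustration only

# Ten independent single-run checks: each can only report 0.0 or 1.0.
single_checks = [simulate_mention_rate(true_rate, 1, rng) for _ in range(10)]

# Ten independent checks of 50 runs each: each reports a fraction between 0 and 1.
repeated_checks = [simulate_mention_rate(true_rate, 50, rng) for _ in range(10)]

print("1 run per check:  ", single_checks)
print("50 runs per check:", repeated_checks)
```

Every entry in the first list is all-or-nothing, while the second list reports intermediate percentages that can be compared from one check to the next.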

One prompt run

A single random draw
Reports 0% or 100%, nothing in between
Swings wildly between checks
Can't be trended over time
Feels precise, means little

Dozens of runs per prompt

A stable mention rate
Reports a real percentage with a range
Repeatable within a tight band
Can be tracked as a genuine trend
Less precise-looking, far more true

How many runs you actually need

Enough that the number stops moving. The practical answer we've settled on is 35 to 50 repeats per prompt, across a representative set of buyer questions. At that volume the mention rate converges into a tight band you can compare month to month with confidence. Fewer than that and you're still reporting noise. This is why our corpus runs to 85,585 LLM calls across 88 measurement cycles. It isn't volume for its own sake. It's the minimum that makes the number mean something.
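To see how the range around a mention rate narrows as runs are added, one option is a Wilson score interval. This is a standard interval for proportions and an assumption here: the article does not say which interval method GEOforge uses. The 30% observed rate and the run counts below are illustrative values.

```python
import math


def wilson_interval(mentions: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = mentions / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Compare the range at different run counts for a brand seen in about 30% of runs.
for runs in (1, 5, 10, 35, 50):
    mentions = round(0.30 * runs)  # illustrative count, not measured data
    low, high = wilson_interval(mentions, runs)
    print(f"{runs:>3} runs: observed {mentions / runs:.0%}, range {low:.0%} to {high:.0%}")
```

The printed range shrinks as the run count grows, which is the sense in which a repeated measurement converges while a single run does not.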

A single prompt run is a sample size of one. To get an AI Share of Voice number that means the same thing tomorrow as today, we run 35 to 50 repeats per prompt.

What rigorous measurement costs

It costs more than a screenshot, and that's the honest trade-off. Running tens of thousands of model calls has a real bill attached, and doing it on a schedule across a portfolio of brands compounds it. But the alternative is cheaper only in the way a broken speedometer is cheaper than a working one. If your AI visibility number swings 40 points between checks, you can't make a single decision with it. The cost of proper measurement is the price of having a number you can actually act on.
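The bill is simple multiplication, which makes it easy to budget. The sketch below uses entirely hypothetical inputs: the prompt count, repeat count, cycle count, and price per call are placeholders, not GEOforge's figures or any provider's pricing.

```python
def calls_per_cycle(prompts: int, repeats: int) -> int:
    """Model calls needed to run every prompt `repeats` times in one cycle."""
    return prompts * repeats


def estimated_cost(prompts: int, repeats: int, cycles: int, cost_per_call: float) -> float:
    """Total cost of running the full prompt set for `cycles` measurement cycles."""
    return calls_per_cycle(prompts, repeats) * cycles * cost_per_call


# Hypothetical plan for one brand (all values are assumptions for illustration).
prompts = 25          # buyer questions in the prompt set
repeats = 40          # runs per prompt, inside the 35 to 50 range
cycles = 12           # measurement cycles over the period
cost_per_call = 0.01  # assumed price per model call, in your currency

single_run_total = estimated_cost(prompts, 1, cycles, cost_per_call)
repeated_total = estimated_cost(prompts, repeats, cycles, cost_per_call)

print(f"calls per cycle: {calls_per_cycle(prompts, repeats)}")
print(f"one run per prompt: {single_run_total:.2f}")
print(f"{repeats} runs per prompt: {repeated_total:.2f}")
```

The repeated plan costs `repeats` times as much as the single-run plan, so the repeat count is the main lever on the budget.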

What good measurement looks like

Many runs per prompt, not one. Treat each answer as a sample, not a verdict.
A representative prompt set across the buying journey, not one flattering query.
Results reported as a mention rate with a range, never a single yes/no.
A fixed cadence, so you're comparing like with like over time.
Transparency about method and sample size, so the number can be trusted.
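The checklist above can be sketched as a tally that treats each answer as a sample and reports the mention rate alongside its sample size. The brand names, prompts, and answers below are invented for illustration, and whole-word, case-insensitive matching is an assumed way of detecting a mention, not GEOforge's documented method.

```python
import re


def tally_mentions(answers_by_prompt: dict[str, list[str]], brand: str) -> list[tuple[str, int, int]]:
    """For each prompt, return (prompt, answers mentioning the brand, total answers)."""
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    rows = []
    for prompt, answers in answers_by_prompt.items():
        mentions = sum(1 for answer in answers if pattern.search(answer))
        rows.append((prompt, mentions, len(answers)))
    return rows


# Invented sample data: two prompts, four repeated answers each.
answers_by_prompt = {
    "best tool for small teams": [
        "Acme and Globex are popular choices.",
        "Most teams pick Globex or Initech.",
        "Acme is a common recommendation.",
        "Globex is the usual answer.",
    ],
    "most affordable tool": [
        "Initech is the budget option.",
        "Acme has a free tier.",
        "Consider Globex.",
        "Initech or Globex.",
    ],
}

rows = tally_mentions(answers_by_prompt, "Acme")
for prompt, mentions, runs in rows:
    print(f"{prompt}: {mentions}/{runs} runs ({mentions / runs:.0%})")

# Pooled rate across the whole prompt set, reported with its sample size.
total_mentions = sum(m for _, m, _ in rows)
total_runs = sum(n for _, _, n in rows)
print(f"overall: {total_mentions}/{total_runs} runs ({total_mentions / total_runs:.0%})")
```

Printing the counts next to each percentage keeps the sample size visible, so a reader can judge how much weight the number deserves.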

Measuring AI visibility properly is unglamorous and it isn't free. But a number you can't trust is worse than no number at all, because you'll act on it anyway.

Get a visibility number you can actually trust. Then move it.

GEOforge measures your AI Share of Voice across dozens of runs per prompt, reports it with real confidence, and builds the citations to improve it.


Sources & method. All figures verified against the GEOforge measurement database on 16 June 2026. Corpus: 85,585 LLM calls across 88 measurement cycles covering 21 tracked brands and 569 categorised buyer prompts, measured April–June 2026, at 35–50 runs per prompt. The 43% volatility figure is drawn from 1,302 repeated prompt-instances. Figures are ChatGPT-specific; cross-engine results may differ.
