What It Actually Costs to Measure AI Visibility (and Why Most Numbers Are Wrong)

Paris Childress
June 16, 2026

GEOforge Research · Methodology

We've run 85,585 ChatGPT queries to measure AI visibility properly. Here's why one prompt run tells you almost nothing.

85,585 LLM calls run
88 measurement cycles
35–50 repeats per prompt
21 brands measured

Almost every AI visibility number you've seen was produced the same way: someone asked a model a question, looked at the answer, and wrote down whether the brand was there. That method is broken, and it's worth understanding exactly why.

The problem: the thing you're measuring won't hold still

Large language models are probabilistic. Ask the same question twice and you can get two different answers, with different brands named. In our own testing, across 1,302 repeated prompt-instances, the brands ChatGPT named in response to a repeated buying question changed 43% of the time. So a single run doesn't capture your visibility. It captures one random draw from a distribution, and you have no idea where in that distribution your draw landed.

Why one run is a sample size of one

Imagine judging a coin as "always lands heads" because you flipped it once and saw heads. That's what a single-prompt visibility check does. Say your true mention rate for a question is 30%. If you happen to appear in the one answer the tool checks, it reports 100%. If you don't, it reports 0%. Both are "true" for that one run and both are useless as a measure of where you stand. The only way to recover the real rate is to run the prompt many times and count.
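The coin-flip point can be sketched as a small simulation. This is an illustration, not GEOforge's pipeline: it assumes each run is an independent draw with a fixed chance of naming the brand, and the function names and the 30% rate are placeholders chosen for the example.

```python
import random


def single_run_reports(true_rate, checks, seed=0):
    """Simulate `checks` separate one-run visibility checks.

    Assumption: each run independently names the brand with
    probability `true_rate`. Each check reports 100 or 0.
    """
    rng = random.Random(seed)
    return [100 if rng.random() < true_rate else 0 for _ in range(checks)]


def repeated_run_rate(true_rate, runs, seed=0):
    """Simulate one check made of `runs` repeats; return the mention rate in %."""
    rng = random.Random(seed)
    hits = sum(rng.random() < true_rate for _ in range(runs))
    return 100 * hits / runs


# Ten one-run checks of a brand whose assumed true rate is 30%:
# every report is either 0 or 100, never anything in between.
print(single_run_reports(true_rate=0.30, checks=10))

# One check built from 50 repeats lands somewhere between 0 and 100.
print(repeated_run_rate(true_rate=0.30, runs=50))
```

Changing the seed changes which one-run checks read 100 and which read 0, and that is the problem: the report depends on the draw, not on the brand.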

One prompt run
  • A single random draw
  • Reports 0% or 100%, nothing in between
  • Swings wildly between checks
  • Can't be trended over time
  • Feels precise, means little

Dozens of runs per prompt
  • A stable mention rate
  • Reports a real percentage with a range
  • Repeatable within a tight band
  • Can be tracked as a genuine trend
  • Less precise-looking, far more true

How many runs you actually need

Enough that the number stops moving. The practical answer we've settled on is 35 to 50 repeats per prompt, across a representative set of buyer questions. At that volume the mention rate converges into a tight band you can compare month to month with confidence. Fewer than that and you're still reporting noise. This is why our corpus runs to 85,585 LLM calls across 88 measurement cycles. It isn't volume for its own sake. It's the minimum that makes the number mean something.
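One way to see how the number settles as runs increase is the standard margin-of-error formula for a proportion. The sketch below uses the normal approximation at a 95% level (z = 1.96). That choice, and the function name, are assumptions for illustration and not necessarily how the 35 to 50 figure was derived.

```python
import math


def margin_of_error(rate, runs, z=1.96):
    """Approximate half-width of a 95% interval around a mention rate.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which is
    rough for small n or for rates near 0 or 1.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    return z * math.sqrt(rate * (1 - rate) / runs)


# Worst case is a true rate of 50%; print the margin in percentage points.
for runs in (1, 5, 10, 35, 50):
    print(runs, round(100 * margin_of_error(0.5, runs), 1))
```

The margin shrinks with the square root of the run count, so quadrupling the runs halves it. At one run the margin covers almost the whole 0 to 100% scale, which is the sample-size-of-one problem in numeric form.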

A single prompt run is a sample size of one. To get an AI Share of Voice number that means the same thing tomorrow as today, we run 35 to 50 repeats per prompt.

What rigorous measurement costs

It costs more than a screenshot, and that's the honest trade-off. Running tens of thousands of model calls has a real bill attached, and doing it on a schedule across a portfolio of brands compounds it. But the alternative is cheaper only in the way a broken speedometer is cheaper than a working one. If your AI visibility number swings 40 points between checks, you can't make a single decision with it. The cost of proper measurement is the price of having a number you can actually act on.
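The bill is simple arithmetic: prompts times repeats times cycles, times whatever one call costs. The sketch below uses placeholder inputs only. The prompt count, cadence, and per-call price are hypothetical values for illustration, not GEOforge's figures or any provider's actual pricing.

```python
def estimate_calls(prompts, repeats_per_prompt, cycles):
    """Total model calls for a measurement programme."""
    return prompts * repeats_per_prompt * cycles


def estimate_cost(calls, cost_per_call):
    """Total spend given a per-call cost (same currency unit in and out)."""
    return calls * cost_per_call


# Hypothetical example: 100 prompts, 40 repeats each, 12 cycles.
calls = estimate_calls(prompts=100, repeats_per_prompt=40, cycles=12)
print(calls)  # 48000

# Hypothetical per-call price of 0.01; substitute your provider's real rate.
print(round(estimate_cost(calls, cost_per_call=0.01), 2))
```

Swap in your own prompt set, cadence, and pricing to get a realistic figure. The point of the sketch is that repeats multiply the whole bill, which is why rigorous measurement costs more than a single check.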

What good measurement looks like

  • Many runs per prompt, not one. Treat each answer as a sample, not a verdict.
  • A representative prompt set across the buying journey, not one flattering query.
  • Results reported as a mention rate with a range, never a single yes/no.
  • A fixed cadence, so you're comparing like with like over time.
  • Transparency about method and sample size, so the number can be trusted.
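The checklist above can be sketched as a small report function that turns a batch of answers into a mention rate with a range and a stated sample size. It rests on two assumptions. Brand detection here is a naive case-insensitive substring match, which a real system would need to improve on. The range is a 95% Wilson score interval, a common choice for proportions from small samples, not necessarily the one GEOforge uses. The brand names in the example are made up.

```python
import math


def mentions_brand(answer, brand):
    """Naive check (assumption): case-insensitive substring match."""
    return brand.lower() in answer.lower()


def wilson_interval(hits, runs, z=1.96):
    """95% Wilson score interval for a proportion, as (low, high)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def mention_report(answers, brand):
    """Summarise repeated answers to one prompt as a rate with a range."""
    hits = sum(mentions_brand(a, brand) for a in answers)
    low, high = wilson_interval(hits, len(answers))
    return {"runs": len(answers), "rate": hits / len(answers),
            "low": low, "high": high}


# Made-up answers from four repeats of the same prompt.
answers = ["Try Acme or Globex", "Globex is popular", "acme leads", "Initech"]
print(mention_report(answers, "Acme"))
```

Reporting `runs` alongside the rate keeps the sample size visible, so a reader can tell a 50% measured over four answers from a 50% measured over fifty.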

Measuring AI visibility properly is unglamorous and it isn't free. But a number you can't trust is worse than no number at all, because you'll act on it anyway.

Get a visibility number you can actually trust. Then move it.

GEOforge measures your AI Share of Voice across dozens of runs per prompt, reports it with real confidence, and builds the citations to improve it.


Sources & method. All figures verified against the GEOforge measurement database on 16 June 2026. Corpus: 85,585 LLM calls across 88 measurement cycles covering 21 tracked brands and 569 categorised buyer prompts, measured April–June 2026, at 35–50 runs per prompt. The 43% volatility figure is drawn from 1,302 repeated prompt-instances. Figures are ChatGPT-specific; cross-engine results may differ.

Paris Childress
CEO

Paris Childress is the CEO of Hop AI and creator of GEOforge, a platform that helps B2B brands get cited and recommended by AI assistants like ChatGPT, Perplexity, and Gemini. A former Google Country Manager and agency veteran with 20+ years in digital marketing, Paris is focused on helping brands win in the era of AI search.