Most AI visibility platforms tell you your brand's "share of voice" based on a handful of prompts run once. Our research — 15,580 responses across 45 prompts and 5 LLMs — shows why that number is statistically meaningless, and what honest measurement actually requires.
There's a dirty secret in the GEO and AI visibility space: most platforms report your brand's share of voice based on single-run data. One prompt. One response. One data point per LLM. The number looks precise. It's not.
LLMs are generative models. They don't retrieve a fixed answer — they sample from a probability distribution every time they respond. The same prompt, run again five minutes later, can produce a completely different set of cited sources, recommended brands, and narrative framing. This isn't a bug. It's fundamental to how these models work.
The question every serious GEO practitioner should be asking is: how many times do you need to run a prompt before your measurement means something?
We ran the experiment. Here's what we found.
The core problem: A brand's visibility score based on a single prompt run is not a measurement. It's a coin flip with a confidence interval you're not being shown.
We analysed 15,580 responses collected between July 2025 and January 2026, covering Claude Sonnet 4, Gemini 2.5 Pro, GPT-4o Search Preview, GPT-5, and GPT-5 Mini. The prompts focused on drone services and tracked how often a specific brand (structionsolutions.com) appeared in AI-generated answers.
We measured variation three ways: text-level similarity, content novelty, and URL/source diversity. Together they tell a complete story of how unstable LLM output actually is — and what that means for measurement.
We used bag-of-words cosine similarity to compare every pair of responses for the same prompt and model. A score of 1.0 means two responses use identical vocabulary; 0.0 means they share no words at all. Here's what the data showed:
| Model | Avg Similarity | Consistency | Lowest Prompt Similarity |
|---|---|---|---|
| Claude Sonnet 4 | 0.75 | High | 0.66 |
| GPT-4o Search | 0.57 | Moderate | 0.30 |
| Gemini 2.5 Pro | 0.45 | Low | 0.35 |
| GPT-5 Mini | 0.38 | Low | 0.29 |
| GPT-5 | 0.36 | Low | 0.19 |
For four of the five models we tested, a single run shares fewer than half its content vocabulary with another run on the same prompt. For GPT-5, that drops to 36%. You are, quite literally, getting a different answer almost every time.
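The similarity metric behind the table can be reproduced in a few lines. The study's exact tokenisation isn't specified here, so this is a minimal bag-of-words sketch rather than the production pipeline:

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity: 1.0 = identical vocabulary, 0.0 = no shared words."""
    vec_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    vec_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average similarity over every pair of responses to the same prompt."""
    pairs = [(a, b) for i, a in enumerate(responses) for b in responses[i + 1:]]
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Averaging over every response pair for a prompt, then over all prompts for a model, yields per-model figures like those in the table above.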
We shuffled response order and measured how much new content each additional run contributed. If the 50th run is still meaningfully different from everything you've seen before, you're still learning.
The responses never fully converge. Even after 150 runs, Gemini 2.5 Pro still produces responses approximately 45% different from the closest prior match. Novelty drops sharply in the first 10–15 runs, then enters a long plateau where each run still adds something — just with diminishing returns.
This has profound implications. It means LLM output space is not just large — it is practically unbounded for the models with the highest variation. No measurement cadence will ever "see everything." The goal of measurement is not completeness; it's statistical stability.
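The novelty measurement described above can be sketched with a vocabulary-overlap proxy. The study's exact distance metric isn't specified, so `jaccard` here is an illustrative stand-in; in practice you would also average the curve over many shuffled response orders, as the study did:

```python
import re

def vocab(text: str) -> set[str]:
    """Lowercased word set of a response."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap between two vocabularies (1.0 = identical word sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def novelty_curve(responses: list[str]) -> list[float]:
    """Per-run novelty: 1 minus similarity to the closest earlier run.
    A curve that plateaus above zero means new content keeps arriving."""
    vocabs = [vocab(r) for r in responses]
    curve = [1.0]  # the first run is entirely new by definition
    for i in range(1, len(vocabs)):
        closest = max(jaccard(vocabs[i], vocabs[j]) for j in range(i))
        curve.append(1.0 - closest)
    return curve
```

If `novelty_curve` never decays to zero over a long tail of runs, the model's output space is still unexplored at that sample size.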
This is where the measurement problem becomes a business problem. Most GEO platforms track "did your brand appear?" across a sample of prompts. Our data on URL diversity reveals why that sample almost certainly doesn't capture the full picture:
| Model | Avg Unique URLs / Prompt | 90% Coverage @ | 95% Coverage @ | New URLs at Last Run |
|---|---|---|---|---|
| Gemini 2.5 Pro | 284 | ~137 runs | ~152 runs | ~1.0 / run |
| GPT-5 | 148 | ~39 runs | ~42 runs | ~2.4 / run |
| GPT-5 Mini | 139 | ~85 runs | ~92 runs | ~1.2 / run |
Gemini 2.5 Pro averaged 284 unique URLs across 165 runs of the same prompt — and was still discovering approximately one new URL at the very last run. Complete URL coverage is not a goal; it is mathematically unreachable.
What this means for your GEO platform: If your visibility score is based on 3–5 prompt runs, you may have seen fewer than 3% of the URLs Gemini would eventually recommend. Your brand could be appearing far more frequently than you're being told — or far less.
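A coverage analysis like the one in the table can be run on your own response logs. This sketch assumes each run's cited URLs have already been extracted into a set; `mean_coverage_runs` averages over shuffled orderings, since any single collection order is arbitrary:

```python
import random

def coverage_runs(url_sets: list[set[str]], target: float = 0.90) -> int:
    """Runs needed before cumulative unique URLs reach `target` of the
    total observed across every run."""
    total = len(set().union(*url_sets))
    seen: set[str] = set()
    for i, urls in enumerate(url_sets, start=1):
        seen |= urls
        if len(seen) >= target * total:
            return i
    return len(url_sets)

def mean_coverage_runs(url_sets, target=0.90, trials=500, seed=0):
    """Average the coverage point over shuffled run orders."""
    rng = random.Random(seed)
    order = list(url_sets)
    acc = 0
    for _ in range(trials):
        rng.shuffle(order)
        acc += coverage_runs(order, target)
    return acc / trials
```

Note that `total` is only the URLs observed so far — if new URLs are still appearing at the last run, true coverage is lower than this estimate suggests.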
Not all variation is created equal. We also checked how often response pairs exceed the 0.7 and 0.8 cosine similarity thresholds — the range above which two responses are essentially saying the same thing.
For the models that matter most to B2B buyers — GPT-5 and Gemini 2.5 Pro — a handful of prompt runs captures an almost negligible fraction of the response landscape.
We're not arguing for infinite runs. Diminishing returns are real and they kick in relatively quickly. What we are arguing for is methodological honesty: platforms should report the sample size behind every visibility score, not just the score itself.
Based on where novelty curves flatten and URL discovery slows, our evidence-based minimums are 30–50 runs per prompt per model for most models, and 75–100 for Gemini 2.5 Pro.
For brand mention tracking specifically — where you're measuring a rare, bursty signal that appears in only some runs — you need even more. For prompts where mentions do occur, 50–100 runs per prompt per model is the baseline for a statistically reliable estimate of mention rate.
"A brand's share of voice measured across 3 prompt runs is not a KPI. It's an anecdote dressed up in a dashboard."
Our dataset included a real client: Struction Solutions (structionsolutions.com), a drone services company, whose visibility we tracked across all 15,580 responses.
If a platform had run 5 prompts once each across 3 models, the Struction Solutions visibility score could have been anywhere from 0% to a misleadingly high number, depending entirely on which 5 prompts and which 3 models were selected. That's not measurement. That's sampling error presented with false precision.
We're not raising this to point fingers without solutions. We're raising it because the industry is moving fast, clients are making real budget decisions based on these numbers, and the methodological bar needs to rise. Here's what we believe responsible measurement requires:
1. Report sample sizes alongside scores. Every share of voice number should be accompanied by the number of prompt runs behind it, the models included, and the prompt set used. Without this, a "72% share of voice" is uninterpretable.
2. Use confidence intervals, not point estimates. A brand appearing in 3 out of 10 runs has a 30% mention rate with a wide confidence interval. That interval shrinks as runs increase. Platforms should show the interval — Wilson score intervals are appropriate for low-visibility brands, where Wald intervals collapse toward zero width and give false certainty.
3. Model selection is methodology, not configuration. Including or excluding GPT-5 vs. Gemini 2.5 Pro produces fundamentally different datasets. This should be disclosed and standardised, not left to default settings that users don't scrutinise.
4. Prompt sets should be disclosed and reproducible. The prompts you run are as important as the models you run them on. Intent-classified, bottom-of-funnel prompts are not the same as broad awareness queries. Treating them as equivalent inflates scores for some brands and deflates others.
5. Separate mention rate from citation rate. A brand that appears in cited URLs 496 times but in explicit mention text only 102 times has a different kind of presence than its citation count alone suggests. Conflating the two signals produces misleading visibility scores.
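The Wilson interval recommended in point 2 is a standard closed-form formula; a straightforward implementation looks like this:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion.
    Unlike the Wald interval, it behaves sensibly when successes are 0 or rare."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (max(0.0, centre - margin), min(1.0, centre + margin))
```

For 3 mentions in 10 runs this returns roughly (0.11, 0.60); at 30 mentions in 100 runs the same 30% rate tightens to roughly (0.22, 0.40) — exactly the shrinkage a platform should be showing.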
None of this is onerous. It is the baseline of statistical hygiene that any field selling measurement products to paying clients should be held to.
LLMs are not search engines with stable rankings you can crawl once a week and report. They are generative systems whose outputs shift with every run, and whose behaviour varies enormously between models. Gemini 2.5 Pro produces responses 45% different from the closest prior match even after 150 runs. GPT-5 cited 148 unique URLs per prompt across roughly 48 runs — and never once mentioned a brand that GPT-5 Mini was actively recommending.
For most models, 30–50 runs per prompt provides a reliable picture of the response distribution. For Gemini, push to 75–100. For brand mention tracking of rare, bursty signals, the minimum is 50–100 per prompt per model.
Any platform reporting AI visibility scores without disclosing the sample size behind those scores is either not running enough prompts to know, or running enough prompts and choosing not to tell you. Neither is acceptable when real marketing budgets and strategic decisions are downstream of those numbers.
At GEOforge, our measurement architecture is built around statistical validity from the ground up — because the whole point of GEO is to move a number, and you can't move a number you're not measuring correctly.

Paris Childress is the CEO of Hop AI and creator of GEOforge, a platform that helps B2B brands get cited and recommended by AI assistants like ChatGPT, Perplexity, and Gemini. A former Google Country Manager and agency veteran with 20+ years in digital marketing, Paris is focused on helping brands win in the era of AI search.
Our SignalForge module runs statistically valid prompt architectures across 5 LLMs — with confidence intervals, not just scores.