Most AI visibility platforms tell you your brand's "share of voice" based on a handful of prompts run once. Our research — 15,580 responses across 45 prompts and 5 LLMs — shows why that number is statistically meaningless, and what honest measurement actually requires.
There's a dirty secret in the GEO and AI visibility space: most platforms report your brand's share of voice based on single-run data. One prompt. One response. One data point per LLM. The number looks precise. It's not.
LLMs are generative models. They don't retrieve a fixed answer — they sample from a probability distribution every time they respond. The same prompt, run again five minutes later, can produce a completely different set of cited sources, recommended brands, and narrative framing. This isn't a bug. It's fundamental to how these models work.
The question every serious GEO practitioner should be asking is: how many times do you need to run a prompt before your measurement means something?
We ran the experiment. Here's what we found.
The core problem: A brand's visibility score based on a single prompt run is not a measurement. It's a coin flip with a confidence interval you're not being shown.
We analysed 15,580 responses collected between July 2025 and January 2026, covering Claude Sonnet 4, Gemini 2.5 Pro, GPT-4o Search Preview, GPT-5, and GPT-5 Mini. The prompts focused on drone services and tracked how often a specific brand (structionsolutions.com) appeared in AI-generated answers.
We measured variation three ways: text-level similarity, content novelty, and URL/source diversity. Together they tell a complete story of how unstable LLM output actually is — and what that means for measurement.
We used bag-of-words cosine similarity to compare every pair of responses for the same prompt and model. A score of 1.0 means two responses use identical vocabulary; 0.0 means they share no words at all. Here's what the data showed:
| Model | Avg Similarity | Consistency | Lowest Prompt Similarity |
|---|---|---|---|
| Claude Sonnet 4 | 0.75 | High | 0.66 |
| GPT-4o Search | 0.57 | Moderate | 0.30 |
| Gemini 2.5 Pro | 0.45 | Low | 0.35 |
| GPT-5 Mini | 0.38 | Low | 0.29 |
| GPT-5 | 0.36 | Low | 0.19 |
For four of the five models we tested, a single run shares fewer than half its content vocabulary with another run on the same prompt. For GPT-5, that drops to 36%. You are, quite literally, getting a different answer almost every time.
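The similarity metric behind the table can be reproduced in a few lines. The study's exact tokenisation isn't specified here, so this is a minimal bag-of-words sketch rather than the production pipeline:

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity: 1.0 = identical vocabulary, 0.0 = no shared words."""
    vec_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    vec_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average similarity over every pair of responses to the same prompt."""
    pairs = [(a, b) for i, a in enumerate(responses) for b in responses[i + 1:]]
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Averaging over every response pair for a prompt, then over all prompts for a model, yields per-model figures like those in the table above.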
We shuffled response order and measured how much new content each additional run contributed. If the 50th run is still meaningfully different from everything you've seen before, you're still learning.
The responses never fully converge. Even after 150 runs, Gemini 2.5 Pro still produces responses approximately 45% different from the closest prior match. Novelty drops sharply in the first 10–15 runs, then enters a long plateau where each run still adds something — just with diminishing returns.
This has profound implications. It means LLM output space is not just large — it is practically unbounded for the models with the highest variation. No measurement cadence will ever "see everything." The goal of measurement is not completeness; it's statistical stability.
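The novelty measurement described above can be sketched with a vocabulary-overlap proxy. The study's exact distance metric isn't specified, so `jaccard` here is an illustrative stand-in; in practice you would also average the curve over many shuffled response orders, as the study did:

```python
import re

def vocab(text: str) -> set[str]:
    """Lowercased word set of a response."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap between two vocabularies (1.0 = identical word sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def novelty_curve(responses: list[str]) -> list[float]:
    """Per-run novelty: 1 minus similarity to the closest earlier run.
    A curve that plateaus above zero means new content keeps arriving."""
    vocabs = [vocab(r) for r in responses]
    curve = [1.0]  # the first run is entirely new by definition
    for i in range(1, len(vocabs)):
        closest = max(jaccard(vocabs[i], vocabs[j]) for j in range(i))
        curve.append(1.0 - closest)
    return curve
```

If `novelty_curve` never decays to zero over a long tail of runs, the model's output space is still unexplored at that sample size.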
This is where the measurement problem becomes a business problem. Most GEO platforms track "did your brand appear?" across a sample of prompts. Our data on URL diversity reveals why that sample almost certainly doesn't capture the full picture:
| Model | Avg Unique URLs / Prompt | 90% Coverage @ | 95% Coverage @ | New URLs at Last Run |
|---|---|---|---|---|
| Gemini 2.5 Pro | 284 | ~137 runs | ~152 runs | ~1.0 / run |
| GPT-5 | 148 | ~39 runs | ~42 runs | ~2.4 / run |
| GPT-5 Mini | 139 | ~85 runs | ~92 runs | ~1.2 / run |
Gemini 2.5 Pro averaged 284 unique URLs across 165 runs of the same prompt — and was still discovering approximately one new URL at the very last run. Complete URL coverage is not a goal; it is mathematically unreachable.
What this means for your GEO platform: If your visibility score is based on 3–5 prompt runs, you may have seen fewer than 3% of the URLs Gemini would eventually recommend. Your brand could be appearing far more frequently than you're being told — or far less.
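A coverage analysis like the one in the table can be run on your own response logs. This sketch assumes each run's cited URLs have already been extracted into a set; `mean_coverage_runs` averages over shuffled orderings, since any single collection order is arbitrary:

```python
import random

def coverage_runs(url_sets: list[set[str]], target: float = 0.90) -> int:
    """Runs needed before cumulative unique URLs reach `target` of the
    total observed across every run."""
    total = len(set().union(*url_sets))
    seen: set[str] = set()
    for i, urls in enumerate(url_sets, start=1):
        seen |= urls
        if len(seen) >= target * total:
            return i
    return len(url_sets)

def mean_coverage_runs(url_sets, target=0.90, trials=500, seed=0):
    """Average the coverage point over shuffled run orders."""
    rng = random.Random(seed)
    order = list(url_sets)
    acc = 0
    for _ in range(trials):
        rng.shuffle(order)
        acc += coverage_runs(order, target)
    return acc / trials
```

Note that `total` is only the URLs observed so far — if new URLs are still appearing at the last run, true coverage is lower than this estimate suggests.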
Not all variation is created equal. We also checked how often response pairs exceed the 0.7 and 0.8 cosine similarity thresholds — the range above which two responses are essentially saying the same thing.
For the models that matter most to B2B buyers — GPT-5 and Gemini 2.5 Pro — a handful of prompt runs captures an almost negligible fraction of the response landscape.
We're not arguing for infinite runs. Diminishing returns are real and they kick in relatively quickly. What we are arguing for is methodological honesty: platforms should report the sample size behind every visibility score, not just the score itself.
Based on where novelty curves flatten and URL discovery slows, our evidence-based minimums are 30–50 runs per prompt per model for most models, and 75–100 for Gemini 2.5 Pro.
For brand mention tracking specifically — where you're measuring a rare, bursty signal that appears in only some runs — you need even more. For prompts where mentions do occur, 50–100 runs per prompt per model is the baseline for a statistically reliable estimate of mention rate.
"A brand's share of voice measured across 3 prompt runs is not a KPI. It's an anecdote dressed up in a dashboard."
Our dataset included a real client: Struction Solutions (structionsolutions.com), a drone services company, whose visibility we tracked across all 15,580 responses.
If a platform had run 5 prompts once each across 3 models, the Struction Solutions visibility score could have been anywhere from 0% to a misleadingly high number, depending entirely on which 5 prompts and which 3 models were selected. That's not measurement. That's sampling error presented with false precision.
We're not raising this to point fingers without solutions. We're raising it because the industry is moving fast, clients are making real budget decisions based on these numbers, and the methodological bar needs to rise. Here's what we believe responsible measurement requires:
1. Report sample sizes alongside scores. Every share of voice number should be accompanied by the number of prompt runs behind it, the models included, and the prompt set used. Without this, a "72% share of voice" is uninterpretable.
2. Use confidence intervals, not point estimates. A brand appearing in 3 out of 10 runs has a 30% mention rate with a wide confidence interval. That interval shrinks as runs increase. Platforms should show the interval — Wilson score intervals are appropriate for low-visibility brands, where Wald intervals collapse toward zero width and give false certainty.
3. Model selection is methodology, not configuration. Including or excluding GPT-5 vs. Gemini 2.5 Pro produces fundamentally different datasets. This should be disclosed and standardised, not left to default settings that users don't scrutinise.
4. Prompt sets should be disclosed and reproducible. The prompts you run are as important as the models you run them on. Intent-classified, bottom-of-funnel prompts are not the same as broad awareness queries. Treating them as equivalent inflates scores for some brands and deflates others.
5. Separate mention rate from citation rate. A brand that appears in cited URLs 496 times but in explicit mention text only 102 times has a different kind of presence than its citation count alone suggests. Conflating the two signals produces misleading visibility scores.
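The Wilson interval recommended in point 2 is a standard closed-form formula; a straightforward implementation looks like this:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion.
    Unlike the Wald interval, it behaves sensibly when successes are 0 or rare."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (max(0.0, centre - margin), min(1.0, centre + margin))
```

For 3 mentions in 10 runs this returns roughly (0.11, 0.60); at 30 mentions in 100 runs the same 30% rate tightens to roughly (0.22, 0.40) — exactly the shrinkage a platform should be showing.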
None of this is onerous. It is the baseline of statistical hygiene that any field selling measurement products to paying clients should be held to.
LLMs are not search engines with stable rankings you can crawl once a week and report. They are generative systems whose outputs shift with every run, and whose behaviour varies enormously between models. Gemini 2.5 Pro produces responses 45% different from the closest prior match even after 150 runs. GPT-5 cited 148 unique URLs per prompt across roughly 48 runs — and never once mentioned a brand that GPT-5 Mini was actively recommending.
For most models, 30–50 runs per prompt provides a reliable picture of the response distribution. For Gemini, push to 75–100. For brand mention tracking of rare, bursty signals, the minimum is 50–100 per prompt per model.
Any platform reporting AI visibility scores without disclosing the sample size behind those scores is either not running enough prompts to know, or running enough prompts and choosing not to tell you. Neither is acceptable when real marketing budgets and strategic decisions are downstream of those numbers.
At GEOforge, our measurement architecture is built around statistical validity from the ground up — because the whole point of GEO is to move a number, and you can't move a number you're not measuring correctly.

Paris Childress is the CEO of Hop AI and creator of GEOforge, a platform that helps B2B brands get cited and recommended by AI assistants like ChatGPT, Perplexity, and Gemini. A former Google Country Manager and agency veteran with 20+ years in digital marketing, Paris is focused on helping brands win in the era of AI search.
Our SignalForge module runs statistically valid prompt architectures across 5 LLMs — with confidence intervals, not just scores.