If you're tracking AI search visibility the way you track SEO rankings, you're measuring the wrong thing — and making decisions on unreliable data.
Traditional rank tracking assumes stable, deterministic results. Google shows the same top 10 results to everyone searching "best CRM software." But AI search doesn't work that way. Ask ChatGPT the same question twice, and you'll get different answers. Ask it across five sessions, and your brand might appear in three responses and vanish in two others. This isn't a bug — it's how large language models work.
Most GEO monitoring platforms ignore this reality. They run a prompt once, see your brand mentioned, and report 100% visibility. Run it again tomorrow, get zero mentions, and suddenly your visibility "dropped" to 0%. Neither number is accurate. Both are misleading. And if you're using these metrics to decide which content to prioritize or whether your GEO strategy is working, you're building on quicksand.
This guide introduces a statistically defensible approach to measuring Share of Voice in AI search through repeated sampling and confidence interval analysis. You'll learn why traditional visibility percentages fail in probabilistic systems, how to calculate confidence-adjusted Share of Voice that accounts for LLM variability, and how to systematically prioritize content opportunities based on visibility gaps.
By the end, you'll understand why GEOforge runs 30 different prompts 50 times each before reporting a Share of Voice number — and why competitors who don't are selling you vanity metrics dressed up as insights.
SEO rank tracking works because search engines are deterministic. When you check your ranking for "project management software" in Google, you get a consistent result. Position 4 today means position 4 tomorrow, assuming no algorithm updates or competitor movements. You can track that number daily, plot it on a chart, and make decisions based on directional trends.
AI search breaks this model completely. LLMs are probabilistic systems. They don't retrieve pre-ranked results from an index — they generate new text every time, sampling from probability distributions shaped by their training data, the user's prompt, conversation history, and a temperature setting that introduces controlled randomness.
This means the same prompt produces different outputs across sessions. We've observed this directly in GEOforge's monitoring data: a brand mentioned in 73% of responses for "best identity security platforms" in one 50-run sample, and 68% in another sample taken three days later — with no content changes, no new citations, and no competitor activity. The difference isn't signal. It's statistical noise inherent to how LLMs generate text.
Most GEO monitoring tools run each prompt once per tracking cycle. They query ChatGPT with "top marketing automation platforms," parse the response, check if your brand appears, and report binary visibility: yes or no, 1 or 0, 100% or 0%.
This approach has three fatal flaws:
1. No confidence interval. A single run tells you nothing about the true underlying probability that your brand will be mentioned. If you appear in one response, your actual mention rate could be anywhere from roughly 5% to nearly 100% — you simply don't have enough data to know.
2. Extreme volatility. When you track single-run results over time, your visibility chart becomes a jagged mess of 100% and 0% values. Marketing leaders see this and assume something broke or a competitor launched a major campaign, when in reality it's just sampling variance.
3. False precision. Reporting "you appeared in 4 out of 7 prompts this week" (57% visibility) implies a level of measurement accuracy that doesn't exist. That 57% has a margin of error so wide it's functionally useless for decision-making.
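To make flaws 1 and 3 concrete, here's a minimal Python sketch (assuming scipy is installed) that computes exact Clopper-Pearson confidence intervals, a standard method for binomial proportions that reproduces intervals close to those quoted throughout this guide. The helper name is ours for illustration.

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, confidence: float = 0.90) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion:
    k mentions observed in n independent prompt runs."""
    alpha = 1.0 - confidence
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# One mention in one run, 4 of 7, and 32 of 50: the interval only becomes
# decision-grade as the sample grows (1/1 spans roughly 5% to 100%;
# 32/50 tightens to roughly 51% to 76%).
for k, n in [(1, 1), (4, 7), (32, 50)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{k}/{n} mentions -> 90% CI: {lo:.0%} to {hi:.0%}")
```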
Share of Voice — your brand's visibility relative to competitors — is even more broken when measured with insufficient sample sizes.
Imagine you're tracking three competitors. You run 10 prompts once each. Your brand appears in 6 responses (60% visibility), Competitor A appears in 5 (50%), and Competitor B appears in 4 (40%). Your platform declares you the category leader with the highest Share of Voice.
But run those same 10 prompts again, and the rankings flip. Competitor A now leads with 7 mentions, you drop to 5, and Competitor B stays at 4. Which dataset is correct? Neither. Both are equally unreliable because the sample size is too small to distinguish real differences from random variation.
This is the core problem GEOforge was built to solve: you cannot measure Share of Voice accurately without accounting for the probabilistic nature of LLM outputs. And you cannot account for that without running each prompt enough times to calculate a statistically valid confidence interval.
When measuring probabilistic systems like LLMs, we need to distinguish between what we observe and what we can confidently claim about the underlying reality. This is where confidence levels come in.
A confidence level tells you how certain you can be that your measured result reflects the true behavior of the system. GEOforge targets a 90% confidence level — meaning that if we measure your Share of Voice at 42%, we're 90% confident that your true visibility falls within a calculable range around that number.
Why 90% instead of 95% or 99%? It's a practical balance. Higher confidence levels require larger sample sizes, which means more API calls, longer tracking cycles, and higher costs. A 90% confidence level provides statistical rigor while keeping measurement practical and cost-effective for continuous monitoring.
This is the same confidence level used in many market research studies and A/B testing platforms. It's rigorous enough to make defensible business decisions while pragmatic enough to implement at scale.
GEOforge's measurement methodology is built on a specific sampling framework: 30 different prompts, each run 50 times.
This isn't arbitrary. The combination of 30 prompts and 50 runs per prompt creates a dataset large enough to achieve 90% confidence in your Share of Voice measurements while covering the breadth of queries your potential customers actually ask.
Why 30 prompts? This number provides sufficient coverage of your category's query landscape. Fewer than 30 prompts and you risk missing important buyer journey stages or use cases. More than 30 and you start seeing diminishing returns — the additional prompts don't significantly improve the representativeness of your sample, but they do increase tracking costs and complexity.
Why 50 runs per prompt? This is the sample size needed to achieve 90% confidence in your visibility measurements for each individual prompt. At 50 runs, you can distinguish between a brand that appears 40% of the time versus one that appears 60% of the time with statistical confidence. Below 50 runs, the confidence intervals become too wide to make reliable comparisons.
Together, this creates 1,500 data points per tracking cycle (30 prompts × 50 runs) — a dataset robust enough to calculate Share of Voice that accounts for LLM variability while remaining practical to collect and analyze.
Let's walk through a concrete example to illustrate why this matters.
Suppose you're tracking the prompt "best marketing automation platforms for B2B SaaS." You run it 50 times and your brand appears in 32 of those responses. Your raw visibility is 32/50 = 64%.
But that 64% is a point estimate — a single number that doesn't tell you anything about the range of possible true values. With 50 runs and a 90% confidence level, statistical analysis tells us your true visibility for this prompt likely falls between 51% and 76%.
This range is your confidence interval. It means: "Based on our sample of 50 runs, we're 90% confident that if we ran this prompt 1,000 times, your brand would appear between 51% and 76% of the time."
Now compare this to a competitor who appears in 28 out of 50 runs (56% raw visibility). Their 90% confidence interval is 43% to 69%.
Notice something important: these intervals overlap significantly. While your point estimate (64%) is higher than theirs (56%), the confidence intervals overlap from 51% to 69%. This means the difference might not be statistically meaningful — you could both have similar true visibility rates, and the 8-percentage-point difference you observed is just sampling variance.
This is why single-run measurements are so misleading. Without confidence intervals, you can't tell whether observed differences represent real competitive advantages or just random noise.
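One direct way to answer that question is a two-proportion z-test on the raw counts. A minimal sketch using the 32-of-50 versus 28-of-50 example above (illustrative, not GEOforge's internal code):

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """Two-sided z-statistic for the difference between two binomial proportions,
    using the pooled estimate under the null hypothesis of equal rates."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Your brand: 32 of 50 runs. Competitor: 28 of 50 runs.
z = two_proportion_z(32, 50, 28, 50)
# |z| must exceed ~1.645 for significance at 90% (two-sided); here z is about
# 0.82, so the 8-point gap is indistinguishable from sampling noise.
print(f"z = {z:.2f}")
```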
Share of Voice is only meaningful if it's measured across prompts that represent real user behavior. You can't just track "What is [your product category]?" and call it done. You need a prompt set that mirrors the questions your ICP actually asks AI assistants during their research process.
GEOforge helps you identify 30 prompts across three categories:
Category-defining prompts — Broad questions that establish the competitive landscape:
- "What are the best [product category] platforms for [use case]?"
- "Top [product category] tools for [company size]"
- "How to choose a [product category] solution"

Use-case-specific prompts — Questions tied to specific buyer problems:
- "Best [product category] for [specific pain point]"
- "How to [achieve outcome] with [product category]"
- "[Product category] for [industry] companies"

Comparison prompts — Direct competitive evaluation queries:
- "[Your brand] vs [Competitor A]"
- "Alternatives to [major competitor]"
- "Why choose [your brand] over [competitor]"
These prompts should be sanitized — meaning you strip out brand bias. Don't track "Why is [Your Brand] the best solution?" because that's not how users search. Track "What's the best solution for [use case]?" and measure whether your brand appears in the unbiased answer.
GEOforge's BaseForge module helps generate this prompt set by analyzing your existing content, keyword research data, and customer conversation transcripts to identify the questions your buyers are actually asking.
For each prompt in your 30-prompt set, GEOforge queries the target LLM 50 times to achieve 90% statistical confidence. We default to tracking ChatGPT (GPT-4o) because it holds 92% market share in AI search — the same reason you prioritize Google over Bing in SEO.
These 50 runs must be independent sessions. You can't run the same prompt 50 times in a single conversation thread, because LLMs have conversation memory and will produce correlated outputs. Each run needs to be a fresh session with no prior context.
GEOforge's SignalForge agent automates this by:
- Spawning 50 separate browser sessions per prompt
- Querying the LLM with identical prompt text in each session
- Parsing each response to detect brand mentions and citations
- Storing the raw response data for audit trails
This process runs weekly for each prompt in your tracking set, creating a time-series dataset that shows how your Share of Voice evolves as you publish new content and build citations.
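GEOforge drives isolated browser sessions; if you're prototyping your own tracker against an API instead, the same independence requirement applies. Here's a simplified sketch using the OpenAI Python SDK, where each call sends a fresh message list so no conversation memory carries over. The model name, prompt, and brand string are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_prompt(prompt: str, runs: int = 50, model: str = "gpt-4o") -> list[str]:
    """Run the same prompt `runs` times, each with an empty history,
    so the responses are independent samples rather than a correlated thread."""
    responses = []
    for _ in range(runs):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],  # no prior context
        )
        responses.append(completion.choices[0].message.content or "")
    return responses

responses = run_prompt("top marketing automation platforms")
mentions = sum("YourBrand" in r for r in responses)  # naive; see the parsing note below
print(f"{mentions}/{len(responses)} runs mentioned the brand")
```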
Not all brand appearances are equal. GEOforge distinguishes between three types of visibility:
1. Direct mentions — Your brand name appears in the body of the response text, typically in a list of recommended solutions or a comparative analysis.
2. Citations — Your brand appears as a linked source that the LLM references to support a claim. This is the highest-value visibility type because it signals the LLM trusts your content as authoritative.
3. Indirect references — The LLM describes your product category or use case in a way that aligns with your positioning, but doesn't name your brand. This is tracked separately as "category visibility" and helps measure whether your thought leadership content is shaping how LLMs explain your market.
For Share of Voice calculation, we count direct mentions and citations as "visible" and indirect references as "not visible." The parsing logic uses named entity recognition to detect brand names even when they appear in varied formats (e.g., "GEOforge," "Geo Forge," "GEO Forge platform").
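Full named entity recognition is more than a short example can show, but the core normalization idea is simple: collapse case and whitespace before matching so surface variants map to the same brand. A simplified stand-in for GEOforge's actual parser:

```python
import re

BRAND_VARIANTS = ["GEOforge", "Geo Forge", "GEO Forge platform"]

def normalize(text: str) -> str:
    """Lowercase and strip whitespace so 'Geo Forge' and 'GEOforge' compare equal."""
    return re.sub(r"\s+", "", text.lower())

def mentions_brand(response: str, variants: list[str] = BRAND_VARIANTS) -> bool:
    """True if any known variant of the brand name appears in the response.
    Note: collapsing whitespace can over-match across word boundaries;
    real NER-based parsing is more careful."""
    haystack = normalize(response)
    return any(normalize(v) in haystack for v in variants)

print(mentions_brand("Top picks include Geo Forge and two rivals."))  # True
```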
Once you have 50 responses per prompt, you can calculate a confidence interval that tells you the likely range of your true visibility for that prompt.
Example: Prompt = "Best GEO platforms for B2B SaaS"
- Total runs: 50
- Your brand mentioned: 32 times
- Raw proportion: 32/50 = 0.64 (64%)

Using standard statistical methods for binomial proportions with 90% confidence:
- Lower bound: 51%
- Upper bound: 76%
This means you can be 90% confident your true mention rate for this prompt is between 51% and 76%. The midpoint (64%) is your best estimate, but the range tells you the uncertainty in that estimate.
Repeat this calculation for every prompt in your tracking set. You'll end up with a distribution of confidence intervals across all 30 prompts — some high, some low, reflecting which queries you dominate and which you're invisible in.
The width of these confidence intervals depends on your sample size. With 50 runs, you get reasonably tight intervals. With only 10 runs, the intervals would be much wider and less useful for decision-making. This is why GEOforge standardizes on 50 runs — beyond that point, additional runs yield diminishing precision gains relative to their API cost.
To calculate your overall Share of Voice, GEOforge takes the mean visibility rate across all 30 tracked prompts, weighted by the strategic importance of each prompt.
Example: You track 30 prompts, each run 50 times. Your visibility rates range from 8% to 72% across different prompts. The weighted mean visibility is 42%. Your reported Share of Voice is 42%.
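As a minimal sketch of that aggregation (the visibility rates and weights below are illustrative; GEOforge derives weights from each prompt's strategic importance):

```python
def share_of_voice(visibility: list[float], weights: list[float]) -> float:
    """Weighted mean visibility rate across tracked prompts."""
    return sum(v * w for v, w in zip(visibility, weights)) / sum(weights)

# Three prompts for brevity; a real tracking set would have 30 entries.
visibility = [0.72, 0.40, 0.08]   # per-prompt mention rates from 50 runs each
weights    = [3.0, 2.0, 1.0]      # e.g., bottom-of-funnel prompts weighted highest
print(f"Share of Voice: {share_of_voice(visibility, weights):.0%}")
```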
This means that across your representative set of 30 buyer journey prompts, your brand appears in approximately 42% of AI-generated responses, measured with 90% confidence.
If you're tracking competitors, calculate their visibility rates the same way and compare. If Competitor A has a mean visibility of 35% and Competitor B has 28%, your Share of Voice breakdown is:
- Your brand: 42%
- Competitor A: 35%
- Competitor B: 28%
- Other/None: (remaining percentage)
This is a statistically defensible competitive benchmark. You're not claiming you appear in 42% of all possible AI search responses — you're claiming that across your defined prompt set, measured with 90% confidence and 50 runs per prompt, your visibility rate is 42%.
Share of Voice is a lagging indicator of your GEO efforts. When you publish new content or build citations, it takes 4-8 weeks for LLMs to ingest that content into their training data or retrieval pipelines. This is why GEOforge tracks Share of Voice weekly but reports trends monthly.
The key metric to watch is Visibility Delta — the change in your Share of Voice after a content or citation campaign. GEOforge measures this through a specific workflow:
Baseline measurement: Before you publish new content or launch a citation-building campaign, GEOforge establishes your baseline Share of Voice by running all 30 prompts 50 times each. This gives you a starting point measured with 90% confidence.
Content publication and LLM crawling period: You publish your new content. GEOforge monitors when major LLMs crawl and index your new pages. This typically takes 2-4 weeks for content to appear in LLM retrieval systems, and 6-8 weeks for content to potentially influence model behavior through training data updates.
Post-publication measurement: Eight weeks after publication (allowing time for LLM ingestion), GEOforge re-runs all 30 prompts 50 times each and calculates your new Share of Voice.
Visibility Delta calculation: The difference between your post-publication Share of Voice and your baseline Share of Voice is your Visibility Delta. If you started at 38% and you're now at 47%, that's a +9 percentage point Visibility Delta directly attributable to your GEO efforts.
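In code the delta is a simple difference; the useful discipline is carrying the confidence bounds along so you can see whether a lift clears sampling noise. An illustrative sketch (the interval bounds here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SoVMeasurement:
    point: float   # weighted mean visibility across the prompt set
    lower: float   # lower bound of the 90% confidence interval
    upper: float   # upper bound of the 90% confidence interval

def visibility_delta(baseline: SoVMeasurement, post: SoVMeasurement) -> float:
    """Change in Share of Voice between two tracking cycles, in points."""
    return post.point - baseline.point

baseline = SoVMeasurement(point=0.38, lower=0.34, upper=0.42)  # hypothetical bounds
post     = SoVMeasurement(point=0.47, lower=0.43, upper=0.51)  # hypothetical bounds

delta = visibility_delta(baseline, post)
# Non-overlapping intervals suggest the lift is unlikely to be sampling variance.
clears_noise = post.lower > baseline.upper
print(f"Visibility Delta: {delta:+.0%} (clears noise: {clears_noise})")
```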
GEOforge's dashboard visualizes this as a time-series chart showing both your Share of Voice trend and the confidence bands around each measurement. This prevents you from overreacting to week-to-week noise and helps you identify genuine inflection points where your strategy is working.
The Visibility Delta workflow is what transforms Share of Voice from a vanity metric into an actionable performance indicator. It creates a closed loop: measure baseline → publish content → wait for LLM ingestion → measure again → calculate lift → prioritize next content based on results.
Knowing your Share of Voice is 34% tells you where you stand. It doesn't tell you what to do about it.
This is where most GEO monitoring platforms stop. They give you dashboards full of charts showing which prompts you appear in, which competitors are beating you, and how your visibility has changed over time. But they don't answer the question every marketing leader actually needs answered: Which content should I create next to move the needle fastest?
GEOforge solves this through systematic opportunity analysis that evaluates every prompt in your tracking set across multiple dimensions to identify where new content will have the greatest impact on Share of Voice.
The foundation of content prioritization is understanding where your visibility gaps are largest and most consequential.
GEOforge analyzes each of your 30 tracked prompts to identify:
High-volume, low-visibility prompts — Questions that your target buyers ask frequently, but where your brand rarely appears in AI responses. These represent the biggest opportunities because improving visibility here affects a large portion of your potential audience.
Competitive displacement opportunities — Prompts where competitors consistently appear but you don't. If three competitors are regularly mentioned in responses to "best [category] for [use case]" and you're absent, that's a high-priority gap that's actively costing you mindshare.
Partial visibility prompts — Queries where you appear sometimes (20-40% visibility) but not consistently. These often indicate that you have some relevant content, but it's not strong enough or well-cited enough to make you a default recommendation. These prompts are often easier to improve than zero-visibility prompts because you're already on the LLM's radar.
Zero-visibility prompts with high strategic value — Questions that may not have high search volume but are critical to your positioning or target high-value buyer segments. For example, prompts about specific enterprise use cases or compliance requirements might be low-volume but represent your ideal customers.
Not all visibility gaps are equally addressable. The second dimension of opportunity analysis is evaluating whether you have the proprietary knowledge and expertise to create truly differentiated content.
GEOforge's BaseForge module analyzes your existing knowledge base — customer case studies, product documentation, support tickets, sales call transcripts, proprietary research, and subject matter expert interviews — to determine whether you have unique insights that could address each prompt.
High knowledge base coverage means you have substantial proprietary information that directly addresses the prompt. For example, if the prompt is "how to reduce identity security false positives" and your BaseForge knowledge base contains 15 customer case studies showing specific false positive reduction metrics, detailed technical documentation about your detection algorithms, and expert interviews with your security research team, you have high coverage. Building an AI-Native Knowledge Base is the most effective way to turn this internal data into the high-authority citations that LLMs prioritize.
Low knowledge base coverage means you lack proprietary insights for this prompt. You could create content, but it would largely rehash information already available elsewhere. This content is unlikely to improve your Share of Voice because it doesn't give LLMs a reason to cite you over existing sources.
Medium knowledge base coverage means you have some relevant information but would need to conduct additional research, customer interviews, or data analysis to create truly differentiated content.
This analysis prevents you from wasting resources creating generic content that won't move Share of Voice. If you identify a high-visibility-gap prompt but have low knowledge base coverage, the right action isn't to publish mediocre content — it's to conduct primary research first to build your knowledge base.
The third dimension of opportunity analysis is understanding which sources LLMs are currently citing for each prompt, and whether your path to visibility requires owned content, third-party citations, or both.
For each prompt, GEOforge analyzes the 50 responses to identify:
Citation frequency — What percentage of responses include citations at all? Some prompts generate responses with extensive citations (60-80% of responses include links), while others generate citation-free responses (0-10% include links). This tells you whether citations are important for this query type.
Citation source types — Which categories of sources are being cited? Industry publications (TechCrunch, VentureBeat)? Review platforms (G2, Capterra)? Community sites (Reddit, Quora)? Competitor blogs? Academic research? Each source type requires a different strategy to earn citations.
Citation authority distribution — Are the cited sources high-authority domains (DR 70+) or lower-authority sources? High-authority citation patterns mean you'll need a robust citation-building strategy (SiteForge) to get your brand mentioned in those sources. Lower-authority patterns mean owned content may be sufficient.
Competitor citation presence — Are your competitors being cited as sources, or are citations primarily third-party? If competitors' owned content is being cited, it signals that LLMs will cite branded sources for this query type, making owned content a viable strategy.
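A simplified sketch of the citation-frequency part of this analysis: tally how many responses contain at least one link, and which domains appear. Real responses cite in richer formats than bare URLs, so treat this as a starting point.

```python
import re

URL_RE = re.compile(r"https?://([\w.-]+)")

def citation_stats(responses: list[str]) -> tuple[float, dict[str, int]]:
    """Fraction of responses containing at least one link, plus a per-domain tally."""
    cited = 0
    domains: dict[str, int] = {}
    for response in responses:
        hits = URL_RE.findall(response)
        cited += bool(hits)
        for domain in hits:
            domains[domain] = domains.get(domain, 0) + 1
    return cited / len(responses), domains

responses = [  # stand-in data; in practice, the 50 parsed responses per prompt
    "Top tools are reviewed at https://www.g2.com/categories/crm in depth.",
    "Several platforms exist; opinions vary by use case.",
]
frequency, domains = citation_stats(responses)
print(f"citation frequency: {frequency:.0%}; domains: {domains}")
```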
This analysis determines your content strategy for each prompt: owned content, third-party citation building, or both.
With visibility gaps, knowledge base coverage, and citation patterns analyzed for all 30 prompts, GEOforge ranks content opportunities by expected impact on Share of Voice.
The highest-priority opportunities typically have these characteristics:
- A large visibility gap relative to the leading competitor
- High knowledge base coverage, meaning you hold proprietary insights to draw on
- Addressable citation patterns, where owned content or reachable third-party sources can earn visibility
- High strategic importance to your positioning and buyer journey
The lowest-priority opportunities are typically the mirror image:
- Small or already-closing visibility gaps
- Low knowledge base coverage, where content would rehash what LLMs already know
- Citation patterns dominated by sources you can't realistically influence
- Low strategic value relative to your core buyer journey
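One way to turn these dimensions into a ranked backlog is a composite score. The weights and normalization below are illustrative, not GEOforge's actual prioritization model:

```python
def priority_score(visibility_gap: float, kb_coverage: float,
                   citation_addressability: float, strategic_value: float) -> float:
    """Composite opportunity score in [0, 1]. All inputs normalized to [0, 1]:
    gap to the leading competitor, knowledge base coverage, how addressable
    the citation pattern is, and strategic importance of the prompt."""
    return (0.35 * visibility_gap + 0.30 * kb_coverage
            + 0.20 * citation_addressability + 0.15 * strategic_value)

# Shaped after the worked example that follows: a 24-point gap behind a leader
# at 38%, high coverage, addressable citations, high strategic value.
score = priority_score(visibility_gap=0.24 / 0.38,
                       kb_coverage=0.9,
                       citation_addressability=0.8,
                       strategic_value=0.9)
print(f"priority score: {score:.2f}")  # higher scores get content briefs first
```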
GEOforge's ContentForge module automatically generates content briefs for the top-ranked opportunities, ensuring your content team always works on the prompts that will move Share of Voice most efficiently.
Let's walk through a real example from GEOforge's own content pipeline.
Prompt: "How to measure AI search visibility"
Visibility analysis:
- Current GEOforge visibility: 14% (appears in 7 of 50 runs)
- Competitor A visibility: 38% (appears in 19 of 50 runs)
- Competitor B visibility: 26% (appears in 13 of 50 runs)
- Visibility gap: 24 percentage points behind leading competitor

Knowledge base coverage:
- GEOforge has proprietary methodology (30 prompts × 50 runs, 90% confidence level)
- Detailed documentation of Visibility Delta workflow
- Case studies showing Share of Voice improvements for customers
- Technical documentation of SignalForge measurement system
- Coverage assessment: High — substantial proprietary insights not available from competitors

Citation pattern analysis:
- Citation frequency: 64% of responses include citations (32 of 50 responses)
- Citation sources: Mix of SEO industry publications (Search Engine Journal, Moz), competitor blogs, and marketing technology sites
- Competitor citations: Both competitors cited in 40% of responses that include citations
- Assessment: Addressable — LLMs will cite branded content for this query type; owned content strategy is viable
Strategic importance: High — This prompt represents potential customers actively researching GEO measurement, a core part of the buyer journey
Priority ranking: Critical — Large visibility gap, high knowledge base coverage, addressable citation patterns, high strategic importance
Recommended action: Publish a comprehensive 3,000+ word guide explaining statistical measurement methodology for AI search visibility, grounded in GEOforge's proprietary 30×50 framework. Include specific examples, methodology comparisons, and case studies. Optimize for citation by other industry publications.
This is the content you're reading right now. It was prioritized using the exact framework described above.
Content priorities are dynamic. After you publish content targeting a high-priority prompt, GEOforge re-runs that prompt 50 times weekly and recalculates your visibility rate.
If your Share of Voice for that prompt increases significantly (e.g., from 14% to 45% over 8 weeks), the opportunity priority drops — the gap is closing, so the opportunity is less urgent. Your content resources should shift to the next-highest-priority prompt.
If Share of Voice doesn't improve after 8 weeks, the analysis updates based on the new data.
If the content was strong but visibility didn't improve, this signals that owned content alone isn't sufficient — you need a citation-building strategy (SiteForge) to get your content mentioned in the third-party sources LLMs are already citing.
This closed-loop system ensures you're always working on the highest-leverage opportunities and not wasting resources on content that won't move Share of Voice.
Don't try to track 200 prompts on day one. Start with 30 high-intent queries that represent your core buyer journey. This is the standard GEOforge uses because it provides sufficient coverage of your category's query landscape while keeping measurement practical.
Prioritize prompts in this order:
1. Bottom-of-funnel comparison queries ("best [category] for [use case]")
2. Solution-aware problem queries ("how to [achieve outcome]")
3. Category-defining queries ("what is [product category]")
Work with your sales and customer success teams to identify the questions prospects actually ask during the buying process. Review sales call transcripts, support tickets, and customer interviews. These real-world questions are far more valuable than keyword research tools, which are optimized for traditional search, not conversational AI queries.
The 50-run standard is non-negotiable if you want statistically valid Share of Voice measurements. Running each prompt fewer times produces confidence intervals too wide to make reliable decisions.
Yes, this means 1,500 total LLM queries per tracking cycle (30 prompts × 50 runs). Yes, this has API cost implications. But the alternative — making strategic decisions based on unreliable data — is far more expensive.
GEOforge automates this process through SignalForge, which manages the query execution, session isolation, response parsing, and data storage. If you're building your own tracking system, budget for the API costs and engineering time required to run queries at this scale.
LLM training data updates on a 4-8 week cycle. Running prompts daily creates noise without signal — you'll see random fluctuations that don't reflect real changes in your content's impact.
GEOforge runs each prompt set weekly (30 prompts × 50 runs per week) and aggregates into monthly Share of Voice reports. This cadence balances statistical rigor with actionable reporting frequency.
Weekly runs also create a time-series dataset that helps you identify trends versus noise. A single week's Share of Voice might fluctuate by 3-5 percentage points due to sampling variance. But if you see a consistent upward trend over four consecutive weeks, that's a genuine signal that your content strategy is working.
When presenting Share of Voice to executives, investors, or clients, use conservative estimates that account for measurement uncertainty.
GEOforge reports Share of Voice as the mean visibility rate across your 30-prompt set, but we also calculate confidence intervals around that aggregate number. For external reporting, consider using the lower bound of the confidence interval — the worst-case scenario at your chosen confidence level.
For example, if your measured Share of Voice is 42% with a 90% confidence interval of 38-46%, report "38%+ Share of Voice" externally. This is the defensible, conservative number that accounts for sampling variability.
Save the point estimates and upper bounds for internal analysis and hypothesis testing.
Not every visibility gap can be closed with owned content. If LLMs are citing Reddit, G2, and Quora for a specific prompt, publishing a blog post on your site won't help — you need to get your brand mentioned in those third-party sources.
Use your citation pattern analysis to identify which prompts need owned content (low citation frequency or competitor citations present) versus citation building (high citation frequency with third-party sources only).
For citation-building opportunities, GEOforge's SiteForge module helps you identify which third-party sites to target and creates outreach strategies to earn mentions in those sources.
Absolute Share of Voice is less meaningful than relative Share of Voice. If your visibility increases from 25% to 32% but your top competitor increases from 40% to 55%, you're losing ground in the competitive landscape.
GEOforge tracks up to 10 competitors simultaneously and calculates visibility rates for each using the same 30-prompt, 50-run methodology. This gives you a true competitive benchmark.
When analyzing competitor data, look for:
- Visibility gaps — Prompts where competitors significantly outperform you
- Visibility advantages — Prompts where you outperform competitors (defend these)
- Trend divergence — Competitors whose Share of Voice is growing faster than yours (investigate their content strategy)
Share of Voice measures brand visibility in AI-generated responses. It does not measure referral traffic, conversions, or pipeline impact. Those are secondary KPIs tracked separately in GA4 and your CRM.
The relationship is: Share of Voice → Brand awareness → Search volume for your brand → Direct navigation and conversions. For business outcomes, Share of Voice is a leading indicator, not a lagging one.
Expect a 6-12 week lag between Share of Voice improvements and measurable traffic or conversion impacts. This is why the best GEO platforms for B2B prioritize closed-loop systems that connect these visibility metrics to actual business outcomes.
User behavior evolves. The prompts your buyers asked six months ago may not be the prompts they're asking today. Review your prompt set quarterly and replace low-performing or outdated queries with new high-intent prompts identified through customer interviews, sales call transcripts, and keyword research.
GEOforge's BaseForge module can auto-generate new prompt suggestions based on emerging patterns in your knowledge base and customer conversation data.
When you update your prompt set, maintain at least 20 prompts from the previous set to preserve trend continuity. If you replace all 30 prompts at once, you lose the ability to measure Share of Voice changes over time.
Diagnosis: If you've published content but Share of Voice isn't moving, the likely causes are that your content lacks information gain, that LLMs haven't ingested it yet, or that you need citations in addition to owned content.
Solution: Check three things.
First, verify your content is grounded in proprietary knowledge from your BaseForge knowledge base. Generic content that rehashes existing information won't improve Share of Voice because LLMs already have that information in their training data. Review your published content and ask: "Does this contain insights, data, or perspectives that don't exist elsewhere?" If not, the content lacks information gain.
Second, wait 6-8 weeks after publication. LLMs don't index new content instantly — it takes time for your content to appear in training data updates or retrieval pipelines. GEOforge's Visibility Delta workflow accounts for this by measuring baseline Share of Voice before publication, then re-measuring 8 weeks after publication to allow for LLM ingestion.
Third, analyze citation patterns for your target prompts. If 60%+ of responses include citations and those citations are primarily third-party sources (not competitor blogs), owned content alone won't be sufficient. You need to get your brand mentioned in the sources LLMs are already citing. This is where SiteForge citation-building strategies become critical.
If Share of Voice still doesn't improve after 8 weeks and you've verified your content has information gain, shift to a citation-building strategy. Identify which third-party sources are cited for your target prompts and get your brand mentioned there through contributed content, expert quotes, case study placements, or review platform presence.
Diagnosis: If your confidence intervals are too wide to act on, your sample size is too small to achieve the precision you need.
Solution: Increase runs per prompt from 50 to 75 or 100. Confidence intervals tighten as sample size grows. If you're tracking a high-stakes prompt where precision matters (e.g., a direct competitor comparison that will be presented to your board), run it 100 times to get a narrower confidence band.
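The shrinkage follows a 1/√n law: the margin of error is roughly z·√(p(1−p)/n) under the normal approximation. A quick check of what each increment buys:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.645) -> float:
    """Approximate half-width of a 90% confidence interval (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case at p = 0.5. Doubling runs from 50 to 100 shrinks the margin
# by a factor of about 1/sqrt(2), roughly 30%, not by half.
for n in (50, 75, 100):
    print(f"n={n}: margin of error ±{margin_of_error(0.5, n):.1%}")
```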
The tradeoff is cost and time. Doubling your runs from 50 to 100 per prompt doubles your API costs and query execution time. For most use cases, 50 runs provides sufficient precision. Reserve higher sample sizes for your most strategic prompts.
Alternatively, if you're tracking too many prompts and can't afford to run each one 50+ times, reduce your prompt set. It's better to have 20 prompts with 50 runs each than 50 prompts with 20 runs each. The former gives you reliable data on a focused set of queries; the latter gives you unreliable data on a broader set.
Diagnosis: Competitors may be mentioned in varied formats (e.g., "Competitor Inc." vs "Competitor