Research note 2.2

How GEO Tools Work Under the Hood

Response parsing techniques, statistical methods, multi-model normalization, polling architecture, pre-publication simulation, and open-source implementations.

  • 4 parsing approaches in production
  • 15% accuracy variance at temp=0
  • 353 unique prompts for a 95% CI
  • 50x/day Tryscope polling frequency

1. Response Parsing Techniques

GEO tools must solve a core extraction problem: given an LLM’s free-text response, identify which brands were mentioned, in what order, with what sentiment, and whether they were recommended or merely referenced. Four distinct approaches exist in production.

1.1 Simple String Matching

The most widely deployed approach: case-insensitive substring search. The Bright Data LLM Mentions Tracker implements this directly with target_phrase.lower() in answer.lower(). This produces a binary “mentioned / not mentioned” signal. Fast, deterministic, zero-cost beyond the API call, but cannot detect misspellings, abbreviations, or paraphrased references.
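A minimal sketch of this approach (function name and example strings are illustrative, not from the Bright Data source):

```python
def brand_mentioned(answer: str, target_phrase: str) -> bool:
    """Binary mentioned / not-mentioned signal via case-insensitive substring search."""
    return target_phrase.lower() in answer.lower()

# Detects exact surface forms only:
brand_mentioned("We'd suggest Notion or Coda for team wikis.", "Notion")  # True
brand_mentioned("The popular block-based workspace tool...", "Notion")    # False: paraphrase missed
```

The second call illustrates the stated limitation: a paraphrased reference produces a false negative.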

1.2 Named Entity Recognition (NER) via spaCy

spaCy’s pre-trained NER pipeline recognizes ORG, PRODUCT, and PERSON entities. The spacy-llm package integrates LLMs directly into spaCy pipelines for zero-shot NER without training data.
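The approximate shape of a spacy-llm pipeline config for zero-shot NER is shown below; the registry names and exact sections should be checked against the spacy-llm documentation before use:

```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["ORG", "PRODUCT", "PERSON"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
```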

Limitation: Pre-trained models often misclassify brand names as common words (e.g., “Notion” as a concept, not the product). Fine-tuning on LLM response corpora is needed for production accuracy.

1.3 LLM-as-Judge (Highest Accuracy)

Uses a second LLM call to parse the first LLM’s response into structured data. Sellm extracts four sentiment dimensions: trustworthiness (0–1), authority (0–1), recommendation strength (0–1), and fit for query intent (0–1). LLM Pulse uses a 5-point sentiment scale with topic-level granularity (pricing, features, customer service, reliability).
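The judge call itself is just a second API request; the interesting part is constraining and validating its output. A hedged sketch, with the prompt wording, key names, and mocked reply all assumptions rather than Sellm's actual implementation:

```python
import json

# Hypothetical judge prompt; the four 0-1 dimensions mirror the ones described above.
JUDGE_PROMPT = """You are scoring how a brand appears in an AI answer.
Brand: {brand}
Answer: {answer}
Return only JSON with keys: trustworthiness, authority,
recommendation_strength, intent_fit (each a float from 0 to 1)."""

def parse_judge_output(raw: str) -> dict:
    """Parse the judge model's JSON reply into the four sentiment dimensions,
    clamping each score into [0, 1]."""
    scores = json.loads(raw)
    expected = {"trustworthiness", "authority", "recommendation_strength", "intent_fit"}
    if set(scores) != expected:
        raise ValueError(f"judge returned unexpected keys: {set(scores)}")
    return {k: max(0.0, min(1.0, float(v))) for k, v in scores.items()}

# A mocked judge reply, standing in for the real second LLM call:
mock_reply = ('{"trustworthiness": 0.8, "authority": 0.7, '
              '"recommendation_strength": 0.9, "intent_fit": 0.85}')
scores = parse_judge_output(mock_reply)
```

Validating keys and clamping ranges matters in practice: judge models occasionally return extra fields or out-of-range values.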

1.4 Position / Ranking Extraction

Position matters: first-mentioned brands receive “direct-answer language” while later positions get “other options include” framing. Sellm extracts 1-indexed position ranking. Foundation Inc.’s “Generative Position” metric calculates average position across responses — positions 1–2 indicate strong preferential treatment; position 4+ suggests weak positioning.

1.5 Native Structured Output (2026 Best Practice)

The modern approach bypasses parsing entirely by requesting structured output from the LLM. OpenAI’s .parse(), Gemini’s response_schema, and Anthropic’s tool use all support forcing output into a predefined schema, eliminating regex parsing. Tooling (Pydantic for Python, Zod for TypeScript) has matured.
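In practice, libraries like Pydantic or Zod generate a JSON Schema of this kind under the hood; the exact parameter it is passed through varies by SDK. A hand-written, illustrative schema for brand-mention extraction, with field names that are assumptions rather than any tool's actual contract:

```python
import json

# Illustrative response schema of the kind passed to structured-output APIs:
MENTION_SCHEMA = {
    "type": "object",
    "properties": {
        "brands": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "position": {"type": "integer"},
                    "recommended": {"type": "boolean"},
                },
                "required": ["name", "position", "recommended"],
            },
        }
    },
    "required": ["brands"],
}

# With structured output the model's reply is guaranteed to be valid JSON
# matching the schema, so "parsing" reduces to json.loads:
reply = '{"brands": [{"name": "Notion", "position": 1, "recommended": true}]}'
data = json.loads(reply)
```

The payoff is that downstream code can index into `data["brands"]` without regexes or fallback heuristics.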

2. Statistical Methods for LLM Variance

The Core Problem: Non-Deterministic Responses

LLM responses are non-deterministic even at temperature=0. A study of 5 frontier models across 10 identical runs found accuracy variations of up to 15%, with a worst case of 72% difference (Mixtral on college math). Total Agreement Rate for GPT-4o ranged from 0% to 99.6% across tasks.

Sources of variance at temperature=0 include:

  • Hardware concurrency: batch-level non-determinism from parallel GPU operations
  • Floating-point precision: FP16/BF16 rounding errors differ across hardware
  • Backend changes: model updates, routing, load balancing
  • Output length correlation: longer outputs show more instability
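
This variance is easy to observe empirically: repeat the same prompt and record the brand-mention outcome per run. A sketch with mocked outcomes standing in for real API calls:

```python
from statistics import mean, stdev

# Mention outcomes for one prompt repeated 10 times at temperature=0 (mock data);
# even "deterministic" settings can flip the answer run to run.
runs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # 1 = brand mentioned, 0 = not mentioned

p = mean(runs)        # observed mention rate: 0.8
spread = stdev(runs)  # run-to-run variability
```

A single-shot query would have reported either 0% or 100% visibility here; only the repeated sample reveals the true rate near 80%.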

Sample Size Requirements

The Discovered Labs LLM Eval Calculator provides concrete guidance: at 95% confidence, +/-2% margin, K=3 resamples per prompt: 353 unique prompts, 1,059 total API calls. Tryscope runs every buyer query 50 times/day across major AI models for statistical confidence.

Core Statistical Formulas

Standard Error (Bernoulli): SE = sqrt(p * (1-p) / n)

95% Confidence Interval: CI = p +/- 1.96 * SE

Law of Total Variance: Total Var = Var(x)/n + E[sigma_i^2]/(n*K)

CLT-based confidence intervals fail when n < 100 in LLM evaluation contexts, producing intervals that are “too narrow and overly-confident.” Clustered standard errors can increase SE by 3x when questions within a topic cluster are correlated.
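The Bernoulli SE and 95% CI formulas above translate directly into code. A minimal sketch (function name and sample numbers are illustrative):

```python
from math import sqrt

def bernoulli_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a mention rate p measured over n samples:
    SE = sqrt(p * (1-p) / n), CI = p +/- z * SE."""
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = bernoulli_ci(p=0.30, n=100)
# With only 100 samples the interval spans roughly 0.21 to 0.39: a 9-point
# swing in measured visibility could be pure sampling noise.
```

This is also why the clustering caveat matters: if prompts within a topic are correlated, the effective n is smaller and the true interval is wider than this formula suggests.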

3. Multi-Model Normalization

GEO tools must normalize results across models with fundamentally different architectures and behaviors. Key metrics that enable cross-model comparison:

| Metric | Description | Used By |
| --- | --- | --- |
| Share of Model (SoM) | % of responses mentioning brand for a given query set | Peec AI, Profound |
| Position-Adjusted Word Count | Word count weighted by position (earlier = more weight) | GEO paper (KDD 2024) |
| Generative Position | Average position across responses (1 = best) | Foundation Inc. |
| Citation Frequency | Raw count of URL citations in responses | Profound, Yext |
| Sentiment Score | Multi-dimensional sentiment (0–1 or 1–5 scale) | Sellm, LLM Pulse |

Cross-platform overlap is remarkably low: only 11% of domains are cited by both ChatGPT and Perplexity. Google AI Overviews and AI Mode cite the same URLs only 13.7% of the time. This means per-model scoring is essential — a “total GEO score” across models would be misleading.
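Share of Model, computed per model, is a small aggregation. A sketch (function name and mock responses are illustrative; string matching stands in for whatever parser the pipeline uses):

```python
def share_of_model(responses_by_model: dict[str, list[str]], brand: str) -> dict[str, float]:
    """Per-model Share of Model: fraction of responses mentioning the brand.
    Scores stay per model; averaging across models would hide the low overlap."""
    return {
        model: sum(brand.lower() in r.lower() for r in resps) / len(resps)
        for model, resps in responses_by_model.items()
    }

runs = {
    "chatgpt":    ["Try Notion.", "Use Coda.", "Notion works well."],
    "perplexity": ["Coda is solid.", "Slite or Coda.", "Coda again."],
}
share_of_model(runs, "Notion")  # {'chatgpt': 0.666..., 'perplexity': 0.0}
```

The sample output shows why a blended score misleads: 33% "overall" visibility would mask complete absence from one platform.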

4. Polling Architecture

Commercial Tool Approaches

| Tool | Polling Method | Scale |
| --- | --- | --- |
| Tryscope (Scope) | 50 polls/day per query across 4 models | Pre-publish simulation |
| Profound | Real-time capture from 10+ engines | 15M+ prompts/day, 400M+ conversations |
| Evertune | 1M+ custom prompts per brand/month | 25M user behavior data |
| Bright Data | Headless browser → API endpoint scraping | Open-source reference |
| Sellm | Direct API calls, structured output | <$0.01/prompt, API-only |

Perplexity’s Native Citations

Perplexity Sonar is unique: every response includes a citations field with URLs, making it the only API that natively shows which sources inform recommendations. This eliminates the need for parsing and makes it the easiest model to monitor for brand visibility.

5. Pre-Publication Simulation

Tryscope (Scope) pioneered the concept of testing content before publishing. Their approach: simulate how ChatGPT, Claude, Gemini, and Perplexity would recommend a brand given proposed content changes. This uses persona-based simulation and polls 50x/day.

The CORE paper (Jin et al., 2026) demonstrated that targeting the synthesis stage (rather than retrieval) achieves a 91.4% promotion success rate @Top-5 across GPT-4o, Gemini-2.5, Claude-4, and Grok-3 — validating that pre-publication optimization can meaningfully shift AI recommendations.

6. Open-Source Implementations

| Project | Description | Stack |
| --- | --- | --- |
| Bright Data LLM Mentions Tracker | Complete brand monitoring pipeline | Python, Bright Data proxy |
| spacy-llm | LLM-powered NER in spaCy pipelines | Python, spaCy |
| GPTCache | Semantic caching for LLM responses | Python, Redis |
| Discovered Labs Eval Calculator | Statistical sample size calculator | Web tool |

7. Implications for Bitsy

Architecture recommendation: Use LLM-as-Judge with structured output for parsing (highest accuracy). Implement tiered polling: cheap models (GPT-4.1-nano) for daily brand detection, expensive models (Sonnet, GPT-4o) for deep sentiment analysis only when changes detected. Leverage Perplexity’s native citation format as the baseline truth source.
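
The tiered-polling recommendation can be sketched as a routing decision; the threshold and model labels below are illustrative assumptions, not a tested policy:

```python
def choose_model(previous_mention_rate: float, current_mention_rate: float) -> str:
    """Cheap model for routine daily brand detection; escalate to an expensive
    model for deep sentiment analysis only when visibility shifts."""
    CHANGE_THRESHOLD = 0.10  # assumed: a 10-point swing triggers deep analysis
    if abs(current_mention_rate - previous_mention_rate) > CHANGE_THRESHOLD:
        return "expensive-judge-model"   # e.g. Sonnet / GPT-4o tier
    return "cheap-detector-model"        # e.g. GPT-4.1-nano tier

choose_model(0.40, 0.42)  # 'cheap-detector-model': small drift, stay cheap
choose_model(0.40, 0.60)  # 'expensive-judge-model': big shift, run deep analysis
```

The threshold should itself respect the statistics of Section 2: a "change" smaller than the confidence interval for the polling sample size is noise, not a trigger.
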
Do NOT:

  • Rely on single-shot queries (variance too high).
  • Use simple string matching as the primary parser (misses paraphrased references).
  • Aggregate a single “GEO score” across models — per-model scoring is essential given only 11% domain overlap.