How GEO Tools Work Under the Hood
Response parsing techniques, statistical methods, multi-model normalization, polling architecture, pre-publication simulation, and open-source implementations.
1. Response Parsing Techniques
GEO tools must solve a core extraction problem: given an LLM’s free-text response, identify which brands were mentioned, in what order, with what sentiment, and whether they were recommended or merely referenced. Several distinct approaches exist in production.
1.1 Simple String Matching
The most widely deployed approach: case-insensitive substring search. The Bright Data LLM Mentions Tracker implements this directly with target_phrase.lower() in answer.lower(). This produces a binary “mentioned / not mentioned” signal. Fast, deterministic, zero-cost beyond the API call, but cannot detect misspellings, abbreviations, or paraphrased references.
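The technique fits in a few lines. The helper below is an illustrative sketch of this style of check (the function name and sample response are mine, not Bright Data's code), built around the same target_phrase.lower() in answer.lower() test:

```python
def is_mentioned(answer: str, target_phrase: str) -> bool:
    """Case-insensitive substring check, in the style of the Bright Data tracker."""
    return target_phrase.lower() in answer.lower()

answer = "For note-taking, Notion and Obsidian are popular choices."
print(is_mentioned(answer, "notion"))      # True: case-insensitive hit
print(is_mentioned(answer, "Notion AI"))   # False: no fuzzy or partial matching
```

The second call shows the approach's blind spot: any variant phrasing of the brand name produces a false negative.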
1.2 Named Entity Recognition (NER) via spaCy
spaCy’s pre-trained NER pipeline recognizes ORG, PRODUCT, and PERSON entities. The spacy-llm package integrates LLMs directly into spaCy pipelines for zero-shot NER without training data.
Limitation: pre-trained models often misclassify brand names as common words (e.g., “Notion” parsed as a concept rather than the product). Fine-tuning on LLM response corpora is needed for production accuracy.
1.3 LLM-as-Judge (Highest Accuracy)
Uses a second LLM call to parse the first LLM’s response into structured data. Sellm extracts four sentiment dimensions: trustworthiness (0–1), authority (0–1), recommendation strength (0–1), and fit for query intent (0–1). LLM Pulse uses a 5-point sentiment scale with topic-level granularity (pricing, features, customer service, reliability).
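A minimal sketch of the judge pattern, assuming the judge is instructed to reply with JSON scoring Sellm's four dimensions (the prompt wording and function names here are hypothetical, not Sellm's API):

```python
import json

JUDGE_PROMPT = """Given the assistant response below, score the brand "{brand}"
on four 0-1 dimensions: trustworthiness, authority, recommendation_strength,
fit_for_intent. Reply with JSON only.

Response:
{response}"""

def parse_judge_output(raw: str) -> dict:
    """Parse the judge LLM's JSON reply into the four sentiment dimensions."""
    scores = json.loads(raw)
    expected = {"trustworthiness", "authority",
                "recommendation_strength", "fit_for_intent"}
    missing = expected - scores.keys()
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return {k: float(scores[k]) for k in expected}

# A second LLM call would return something like:
raw = ('{"trustworthiness": 0.8, "authority": 0.7, '
       '"recommendation_strength": 0.9, "fit_for_intent": 0.85}')
scores = parse_judge_output(raw)
```

Validating the judge's keys matters in practice: even at low temperature, judge models occasionally drop or rename fields, and a hard failure is easier to debug than silently missing scores.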
1.4 Position / Ranking Extraction
Position matters: first-mentioned brands receive “direct-answer language” while later positions get “other options include” framing. Sellm extracts 1-indexed position ranking. Foundation Inc.’s “Generative Position” metric calculates average position across responses — positions 1–2 indicate strong preferential treatment; position 4+ suggests weak positioning.
1.5 Native Structured Output (2026 Best Practice)
The modern approach bypasses text parsing altogether by requesting structured output from the LLM. OpenAI’s .parse(), Gemini’s response_schema, and Anthropic’s tool use can all force output into a predefined schema, eliminating regex parsing. Supporting tooling (Pydantic for Python, Zod for TypeScript) has matured.
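The consumer side of this pattern can be sketched with a plain schema class; in production you would typically hand a Pydantic model to OpenAI's .parse(), but a stdlib dataclass shows the idea without extra dependencies (the field names below are my illustration, not a standard schema):

```python
import json
from dataclasses import dataclass

@dataclass
class BrandMention:
    brand: str
    position: int      # 1-indexed order of mention
    recommended: bool  # recommended vs. merely referenced
    sentiment: float   # 0-1

def load_mentions(raw_json: str) -> list[BrandMention]:
    """Deserialize schema-constrained model output; no regex needed."""
    return [BrandMention(**m) for m in json.loads(raw_json)]

# With .parse() / response_schema, the model is forced to emit JSON
# matching the schema, so this always round-trips cleanly:
raw = '[{"brand": "Notion", "position": 1, "recommended": true, "sentiment": 0.9}]'
mentions = load_mentions(raw)
```

Because the model is constrained to the schema, every downstream metric (position, sentiment, recommendation) comes out typed and validated in a single step.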
2. Statistical Methods for LLM Variance
The Core Problem: Non-Deterministic Responses
Sources of variance at temperature=0 include:
- Hardware concurrency: batch-level non-determinism from parallel GPU operations
- Floating-point precision: FP16/BF16 rounding errors differ across hardware
- Backend changes: model updates, routing, load balancing
- Output length correlation: longer outputs show more instability
Sample Size Requirements
The Discovered Labs LLM Eval Calculator gives concrete guidance: at 95% confidence with a ±2% margin and K=3 resamples per prompt, you need 353 unique prompts, or 1,059 total API calls. Tryscope runs every buyer query 50 times per day across the major AI models to reach statistical confidence.
Core Statistical Formulas
Standard Error (Bernoulli): SE = sqrt(p(1 - p) / n)
95% Confidence Interval: CI = p ± 1.96 × SE
Law of Total Variance: Total Var = Var(x)/n + E[sigma_i^2]/(n × K)
where p is the observed mention rate, n is the number of unique prompts, K is the number of resamples per prompt, Var(x) is the between-prompt variance, and E[sigma_i^2] is the mean within-prompt variance.
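The three formulas translate directly into code. A small calculator (function names are mine):

```python
import math

def bernoulli_se(p: float, n: int) -> float:
    """Standard error of a mention-rate estimate from n responses."""
    return math.sqrt(p * (1 - p) / n)

def confidence_interval_95(p: float, n: int) -> tuple[float, float]:
    """95% confidence interval for the true mention rate."""
    half = 1.96 * bernoulli_se(p, n)
    return (p - half, p + half)

def total_variance(var_between: float, mean_within_var: float,
                   n: int, k: int) -> float:
    """Law of total variance: n unique prompts, K resamples per prompt."""
    return var_between / n + mean_within_var / (n * k)

# A brand mentioned in 30% of 100 responses:
lo, hi = confidence_interval_95(0.3, 100)  # roughly (0.21, 0.39)
```

The wide interval at n=100 is the point: a single day's polling cannot distinguish a 25% mention rate from a 35% one, which is why tools resample at the scale described above.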
3. Multi-Model Normalization
GEO tools must normalize results across models with fundamentally different architectures and behaviors. Key metrics that enable cross-model comparison:
| Metric | Description | Used By |
|---|---|---|
| Share of Model (SoM) | % of responses mentioning brand for a given query set | Peec AI, Profound |
| Position-Adjusted Word Count | Word count weighted by position (earlier = more weight) | GEO paper (KDD 2024) |
| Generative Position | Average position across responses (1 = best) | Foundation Inc. |
| Citation Frequency | Raw count of URL citations in responses | Profound, Yext |
| Sentiment Score | Multi-dimensional sentiment (0-1 or 1-5 scale) | Sellm, LLM Pulse |
Cross-platform overlap is remarkably low: only 11% of domains are cited by both ChatGPT and Perplexity. Google AI Overviews and AI Mode cite the same URLs only 13.7% of the time. This means per-model scoring is essential — a “total GEO score” across models would be misleading.
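The two headline metrics above can be computed from raw response logs. A sketch, noting one assumption: the sources do not specify how the 11% overlap figure was computed, so the overlap function below uses Jaccard similarity (shared domains over all cited domains) as one plausible definition:

```python
def share_of_model(responses: list[str], brand: str) -> float:
    """Share of Model: fraction of responses mentioning the brand."""
    hits = sum(brand.lower() in r.lower() for r in responses)
    return hits / len(responses)

def citation_overlap(domains_a: set[str], domains_b: set[str]) -> float:
    """Jaccard overlap of the domain sets cited by two models."""
    union = domains_a | domains_b
    return len(domains_a & domains_b) / len(union) if union else 0.0

responses = ["Use Notion for this.", "Trello is better.", "Notion, hands down."]
print(share_of_model(responses, "Notion"))                       # ~0.667
print(citation_overlap({"a.com", "b.com"}, {"b.com", "c.com"}))  # ~0.333
```

Keeping SoM per model, rather than pooling responses, follows directly from the low-overlap finding: each engine draws on a largely distinct citation pool.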
4. Polling Architecture
Commercial Tool Approaches
| Tool | Polling Method | Scale |
|---|---|---|
| Tryscope (Scope) | 50 polls/day per query across 4 models | Pre-publish simulation |
| Profound | Real-time capture from 10+ engines | 15M+ prompts/day, 400M+ conversations |
| Evertune | 1M+ custom prompts per brand/month | 25M user-behavior data points |
| Bright Data | Headless browser → API endpoint scraping | Open-source reference |
| Sellm | Direct API calls, structured output | <$0.01/prompt, API-only |
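A minimal polling loop in the spirit of these tools, assuming an `ask` callable that wraps one provider's API (the stubbed model and k=50 default echo Tryscope's cadence; everything here is an illustrative sketch, not any vendor's implementation):

```python
import math
import random

def poll_query(ask, query: str, brand: str, k: int = 50) -> dict:
    """Poll one query k times and aggregate mention rate with its standard error.
    `ask` maps a query string to one model response (one API call)."""
    hits = sum(brand.lower() in ask(query).lower() for _ in range(k))
    p = hits / k
    return {"query": query,
            "mention_rate": p,
            "se": math.sqrt(p * (1 - p) / k)}

# Stubbed, non-deterministic model standing in for a real provider API:
def fake_model(query: str) -> str:
    return random.choice(["I'd recommend Asana.", "Try Trello or ClickUp."])

report = poll_query(fake_model, "best project management tool?", "Asana")
```

In a production scheduler this loop would run per query, per model, per day, with the per-query standard error feeding the confidence-interval math from section 2.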
Perplexity’s Native Citations
Perplexity’s API returns a citations field with source URLs, making it the only major API that natively exposes which sources inform its recommendations. This eliminates the need for citation parsing and makes Perplexity the easiest model to monitor for brand visibility.
5. Pre-Publication Simulation
Tryscope (Scope) pioneered the concept of testing content before publishing. Their approach: simulate how ChatGPT, Claude, Gemini, and Perplexity would recommend a brand given proposed content changes. This uses persona-based simulation and polls 50x/day.
The CORE paper (Jin et al., 2026) demonstrated that targeting the synthesis stage (rather than retrieval) achieves a 91.4% promotion success rate @Top-5 across GPT-4o, Gemini-2.5, Claude-4, and Grok-3 — validating that pre-publication optimization can meaningfully shift AI recommendations.
6. Open-Source Implementations
| Project | Description | Stack |
|---|---|---|
| Bright Data LLM Mentions Tracker | Complete brand monitoring pipeline | Python, Bright Data proxy |
| spacy-llm | LLM-powered NER in spaCy pipelines | Python, spaCy |
| GPTCache | Semantic caching for LLM responses | Python, Redis |
| Discovered Labs Eval Calculator | Statistical sample size calculator | Web tool |
7. Implications for Bitsy
Sources
- Bright Data — Build an LLM Mentions Tracker (2025)
- spaCy Documentation — NER
- explosion/spacy-llm — GitHub
- Sellm — Extract Brand Sentiment API
- Foundation Inc. — GEO Metrics (2026)
- LLM Response Variance Study (arXiv 2408.04667)
- Cameron Wolfe — Stats for LLM Evals
- Discovered Labs — LLM Eval Calculator
- Tryscope
- CORE Paper — Controlling Output Rankings (2026)
- Vincent Schmalbach — Temperature 0 Determinism
- DEV Community — LLM Structured Output in 2026