Research note 2.1

How LLMs Decide What to Mention

Training data pipelines, RLHF, parametric vs. RAG knowledge, frequency effects, recency bias, structured data, and the signals that drive brand recommendations.

80%+ · GPT-3 tokens from CommonCrawl
41% · Signal from authoritative lists
79% · Parametric (not web search)
3x · Recency boost (<14 days)

1. Training Data Pipelines

The CommonCrawl-to-LLM Pipeline

Every major LLM starts with CommonCrawl, the largest open web corpus. GPT-3 derived over 80% of its training tokens from CommonCrawl; LLaMA allocated 67% to CommonCrawl and 4.5% to Wikipedia. Of 47 LLMs reviewed (2019–2023), at least 64% used a filtered version of CommonCrawl.

The CCNet pipeline processes raw web data through five stages: data sourcing from CommonCrawl (a single snapshot is 8.7 TiB compressed), paragraph-level deduplication using SHA1 hashing (~70% removed), language identification via FastText, quality filtering against Wikipedia perplexity scores, and reference content filtering.
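To make the deduplication stage concrete, here is a minimal Python sketch of paragraph-level SHA1 dedup; the normalization step is a crude stand-in for CCNet's actual preprocessing, which this note does not specify.

```python
import hashlib
import re

def normalize(paragraph: str) -> str:
    # Crude normalization before hashing (CCNet's real pipeline does
    # more; this sketch only lowercases and collapses whitespace).
    return re.sub(r"\s+", " ", paragraph.lower()).strip()

def dedupe_paragraphs(paragraphs):
    """Drop paragraphs whose normalized SHA1 digest was already seen.

    CCNet applies this across shards of a CommonCrawl snapshot,
    removing roughly 70% of the raw text.
    """
    seen = set()
    unique = []
    for p in paragraphs:
        digest = hashlib.sha1(normalize(p).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique

docs = ["Breaking news today.", "breaking   news today.", "Original analysis."]
print(dedupe_paragraphs(docs))  # the near-duplicate second item is dropped
```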

NVIDIA’s Nemotron-CC pipeline processes the full CommonCrawl English dataset of 6.3 trillion tokens into 2 trillion high-quality tokens using 28 distinct heuristic filters. Training with Nemotron-CC data boosted MMLU scores by 5.6 points over baseline (59.0 vs. 53.4).

RLHF: How Human Feedback Shapes Recommendations

After pre-training, models are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) in three stages: preference dataset creation (human annotators rate outputs pairwise), reward model training, and RL fine-tuning using PPO.
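A sketch of the pairwise objective at the heart of the second stage, assuming PyTorch and scalar reward scores; this is the standard Bradley-Terry formulation, not necessarily the exact loss any given lab uses.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the
    human-preferred response above the reward of the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: scalar rewards for three preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.5])
print(preference_loss(chosen, rejected))  # shrinks as chosen > rejected
```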

RLHF doesn’t create explicit brand preferences but shapes models toward outputs human raters consider helpful and safe. Models avoid recommending controversial brands due to safety filters. “Digital consensus” matters: when multiple authoritative sources agree on a brand’s category position, RLHF-trained models adopt this as reliable information.

Licensed Data Deals

| Deal | Value | Date |
| --- | --- | --- |
| Google ↔ Reddit | $60M/year | Feb 2024 |
| OpenAI ↔ Reddit | ~$70M/year | May 2024 |
| Reddit total licensing | $203M+ | As of Feb 2024 IPO |

Reddit is the most-cited domain by Google AI Overviews and Perplexity, and the second most-cited by ChatGPT. Google algorithm updates (Aug 2023–Apr 2024) nearly tripled Reddit’s readership from 132M to 346M monthly visitors.

2. Parametric Knowledge vs. RAG

The Dual System: ChatGPT defaults to parametric knowledge (baked into weights) for ~79% of prompts, triggering web search only 21% of the time (primarily for commercial/local intent). However, 46% of ChatGPT interactions now use integrated search.

This creates two optimization timelines:

| Mechanism | Timeline | Volatility |
| --- | --- | --- |
| Training data (parametric) | 18–36 month investment | Durable once encoded |
| RAG (real-time search) | Immediate | Can appear/disappear day to day |
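One way a measurement tool can act on this split is to probe the same prompt with retrieval off and then on: a mention that survives without search is likely parametric. A sketch under that assumption, where `ask` is a hypothetical client callable, not a real API:

```python
from typing import Callable

def classify_mention(brand: str, prompt: str,
                     ask: Callable[[str, bool], str]) -> str:
    """Label a brand mention as parametric, RAG-only, or absent.

    `ask(prompt, web_search)` is a hypothetical client wrapping
    whatever model API is in use; the flag toggles retrieval.
    """
    in_parametric = brand.lower() in ask(prompt, False).lower()
    in_rag = brand.lower() in ask(prompt, True).lower()
    if in_parametric:
        return "parametric"   # durable: encoded in the weights
    if in_rag:
        return "rag-only"     # volatile: can appear/disappear daily
    return "absent"

# Demo with a canned fake client.
fake = lambda prompt, web: "Acme and Globex" if web else "Acme"
print(classify_mention("Globex", "best CRM tools?", fake))  # rag-only
```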

Frequency Drives Memorization

Wang et al. (ICLR 2025) found the strongest memorization effect in factual QA, and the effect grows with model scale. Pre-training data frequency correlates directly with output probability distributions.

Brands frequently mentioned in high-quality content before the training cutoff become part of the model’s neural weights and get mentioned automatically, without web search. When models cannot confidently retrieve lesser-known brands, they substitute competitors with higher probabilistic weight — the “substitution effect.”
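As a toy illustration of the frequency proxy (not how labs actually measure it), counting whole-word brand mentions across a corpus sample:

```python
import re
from collections import Counter

def mention_counts(documents, brands):
    """Count whole-word brand mentions across a corpus sample, a crude
    proxy for the pre-training frequency that drives memorization."""
    counts = Counter()
    for doc in documents:
        for brand in brands:
            pattern = rf"\b{re.escape(brand)}\b"
            counts[brand] += len(re.findall(pattern, doc, re.IGNORECASE))
    return counts

docs = ["Acme tops every list.", "Acme vs Globex: Acme wins."]
print(mention_counts(docs, ["Acme", "Globex"]))
# Counter({'Acme': 3, 'Globex': 1})
```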

3. The Primary Recommendation Signals

| Signal | Weight |
| --- | --- |
| Authoritative list mentions (“Best of” lists, expert roundups) | 41% |
| Awards and accreditations | 18% |
| Online reviews (aggregated sentiment) | 16% |
| Traditional SEO signals (backlinks, DA, keywords) | ~0% |

Critical finding from Onely (Dec 2025): “Traditional SEO signals — backlinks, domain authority, keyword optimization — have near-zero influence on AI recommendations.”

Only 3–4 brands are cited per ChatGPT response (vs. 13 for Perplexity, ~8 for AI Overviews), creating winner-take-all dynamics. 26% of brands have zero AI visibility. The top 50 brands capture 28.9% of all mentions.

4. Role of Recency

Knowledge Cutoff Dates

| Model | Knowledge Cutoff | Web Access |
| --- | --- | --- |
| GPT-5.4 (ChatGPT) | August 31, 2025 | Yes (Bing) |
| Claude 4.6 Opus | ~May 2025 (reliable) | No |
| Gemini 3.1 Pro | January 2025 | Yes (Google) |
| Perplexity Online | Real-time | Always |
| Llama 4 (Meta) | August 2024 | No |
| DeepSeek R1 | October 2023 | No |

Quantified Recency Bias

An ACM SIGIR 2025 study found that “fresh” passages shift the Top-10’s mean publication year forward by up to 4.78 years. Individual items moved by as many as 95 ranks in reranking experiments.

| Content Age | Share of AI Bot Hits |
| --- | --- |
| <1 year | 65% |
| <2 years | 79% |
| <3 years | 89% |
| <5 years | 94% |
| 6+ years | 6% |

Brands with content updated within 14 days appeared in recommendations roughly 3x more often than brands with identical authority but stale content.
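A hypothetical freshness weight consistent in spirit with these figures; the exponential form and the 180-day half-life are assumptions for illustration, not published constants.

```python
import math

def freshness_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Exponential decay on content age. half_life_days is a free
    parameter to tune against observed visibility, not a known value."""
    return math.exp(-math.log(2) * age_days / half_life_days)

for age in (7, 14, 180, 365, 365 * 6):
    print(age, round(freshness_weight(age), 3))
# Content refreshed within 14 days keeps a weight near 1.0; year-old
# content decays to roughly a quarter of that, a ratio in the same
# ballpark as the observed 3x recommendation gap.
```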

5. Role of Frequency

A brand mentioned 10,000 times in low-authority blogs may score below one mentioned 200 times in peer-reviewed publications and established industry reports.

| Metric | Finding | Source |
| --- | --- | --- |
| Brands with <50 high-trust mentions | Fail AI recognition 72% of the time | MetricsRule |
| "Best of" roundup inclusion | 400% more likely to be recommended | MetricsRule |
| Third-party vs. owned mentions | 6.5x more effective from third parties | MetricsRule |
| Third-party share | 85% of brand mentions from third-party pages | AirOps |
| Comparative listicles | 32.5% of all AI citations | Digital Bloom |
| Multi-platform presence (4+ platforms) | 2.8x more likely to appear | Digital Bloom |

6. Structured Data: Schema.org, JSON-LD, FAQ Pages

The nuanced finding: Schema markup improves the quality and accuracy of LLM responses about your entity but does not independently increase citation frequency. The high correlation (81% of cited pages have schema) is likely confounded — well-maintained sites implement both schema and high-quality content.

Search Atlas analyzed 748,425 queries and found “schema markup does not influence LLM citation frequency.” However, controlled experiments show schema improves response quality: in a GetAISO experiment, the schema version scored 8.6/10 vs. 6.6/10 (a 30% improvement), and GPT-4's correct-response rate rose from 16% to 54% when content was backed by structured data.

Only 12.4% of registered domains currently implement Schema.org — a massive opportunity gap for accurate representation, if not frequency.
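For reference, a minimal sketch of Organization and FAQPage markup emitted as JSON-LD from Python; every name, URL, and answer below is a placeholder.

```python
import json

# Placeholder Organization entity; sameAs links help models
# reconcile the brand with its Wikipedia and social profiles.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Example_Co",
        "https://www.linkedin.com/company/example-co",
    ],
}

# Placeholder FAQPage entity for a frequently asked question.
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What does Example Co do?",
        "acceptedAnswer": {"@type": "Answer",
                           "text": "Example Co makes widgets."},
    }],
}

# Each object is embedded in a page inside
# <script type="application/ld+json"> ... </script>.
print(json.dumps(org, indent=2))
print(json.dumps(faq, indent=2))
```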

7. Cross-Platform Fragmentation

| Platform | Key Behavior |
| --- | --- |
| ChatGPT | 47.9% of citations from Wikipedia; 2.37 brands/response avg; 68% market share |
| Perplexity | 46.7% Reddit citations; real-time search on every query; 5–10 inline citations |
| Google AI Overviews | 93.67% cite top-10 organic results; 6.02 brands/response; 2B+ monthly users |
| Claude | Hedges with "options include"; cross-references 3+ sources before surfacing a brand |

Brand visibility varies wildly across models: Ariel detergent has ~24% Share of Model on Llama but <1% on Gemini. Brand mentions disagreed 62% of the time across platforms. Only 11% of domains are cited by both ChatGPT AND Perplexity.

Unlike search engines, which still display less-popular brands on later results pages, AI models are merciless: if your brand doesn't register with an LLM, it simply won't appear at all (INSEAD / HBR).

8. Comprehensive Signal Ranking

| Signal | Evidence | Strength |
| --- | --- | --- |
| Authoritative list mentions | 41% of ChatGPT signal | Very High |
| Brand search volume | 0.334 correlation | Very High |
| Third-party mentions | 6.5x more effective; 85% of mentions | Very High |
| Statistics in content | +37–41% visibility boost (GEO paper) | High |
| Quotations from credible sources | +22–41% visibility boost | High |
| Content freshness (<1 year) | 65% of AI hits; 3x for 14-day updates | High |
| Reviews (4+ stars) | 5.3x more recommendations | High |
| Wikipedia presence | 7.8–47.9% of citations | High |
| Reddit presence | 46.7% of Perplexity citations | High |
| Schema markup | Improves accuracy, not frequency | Medium |
| Domain Rank | 0.25 correlation | Medium |
| Backlinks (traditional) | Weak/neutral correlation | Low |
| Keyword stuffing | -10% negative impact | Harmful |

9. Implications for Bitsy

Build: multi-model polling is non-negotiable (62% cross-platform disagreement); statistical sampling with 50+ samples per day; separate tracking of parametric vs. RAG mentions; freshness scoring; and third-party monitoring, weighted above owned-content monitoring. A sampling sketch follows below.
Do NOT build: a keyword-stuffing optimizer (proven -10% impact). Do not promise deterministic results (there is less than a 1-in-100 chance of the same list appearing twice). Do not treat all models as interchangeable. Do not rely on backlink metrics.
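A sketch of the sampling loop the build list implies, with canned fake clients standing in for real model APIs; the brand names and the 50-sample default are illustrative only.

```python
import random
from collections import defaultdict

def poll_share_of_voice(prompt, brands, clients, samples=50):
    """Estimate per-model brand mention rates by repeated sampling.

    `clients` maps a model name to a hypothetical callable returning a
    response string. Responses are stochastic, so a single query is
    meaningless: hence 50+ samples per prompt per model per day.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for model, ask in clients.items():
        for _ in range(samples):
            response = ask(prompt).lower()
            for brand in brands:
                hits[model][brand] += brand.lower() in response
    return {m: {b: n / samples for b, n in bs.items()}
            for m, bs in hits.items()}

# Demo with fake clients; real ones would call each provider's API.
fake_gpt = lambda p: random.choice(["Acme, Globex", "Acme"])
fake_claude = lambda p: random.choice(["Options include Globex",
                                       "Globex, Initech"])
print(poll_share_of_voice("best CRM?", ["Acme", "Globex", "Initech"],
                          {"gpt": fake_gpt, "claude": fake_claude}))
```

Keeping per-model rates separate (rather than averaging) is the point: the 62% cross-platform disagreement means a blended score would hide exactly the variance the tool exists to surface.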