How LLMs Decide What to Mention
Training data pipelines, RLHF, parametric vs. RAG knowledge, frequency effects, recency bias, structured data, and the signals that drive brand recommendations.
1. Training Data Pipelines
The CommonCrawl-to-LLM Pipeline
Every major LLM starts with CommonCrawl, the largest open web corpus. GPT-3 derived over 80% of its training tokens from CommonCrawl; LLaMA allocated 67% to CommonCrawl and 4.5% to Wikipedia. Of 47 LLMs reviewed between 2019 and 2023, at least 64% trained on a filtered version of CommonCrawl.
The CCNet pipeline processes raw web data through five stages: data sourcing from CommonCrawl (a single snapshot is 8.7 TiB compressed), paragraph-level deduplication using SHA1 hashing (~70% of content removed), language identification via FastText, quality filtering by perplexity under a Wikipedia-trained language model, and reference content filtering.
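The paragraph-level deduplication stage can be sketched in a few lines of Python. This is an illustrative reimplementation of the idea, not CCNet's actual code; the light normalization before hashing is an assumption:

```python
import hashlib

def dedup_paragraphs(documents):
    """CCNet-style sketch: drop any paragraph whose SHA1 hash has
    already been seen anywhere in the corpus."""
    seen = set()
    deduped = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            # Light normalization before hashing so trivial whitespace or
            # casing differences don't defeat deduplication (an assumption
            # here; CCNet applies its own normalization).
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        deduped.append("\n".join(kept))
    return deduped

docs = ["Cookie notice.\nUnique article text.", "Cookie notice.\nAnother article."]
# The boilerplate "Cookie notice." survives only in the first document.
print(dedup_paragraphs(docs))
```

This is why boilerplate (cookie banners, navigation text) repeated across millions of pages contributes almost nothing to a model's training signal.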
NVIDIA’s Nemotron-CC pipeline processes the full CommonCrawl English dataset of 6.3 trillion tokens into 2 trillion high-quality tokens using 28 distinct heuristic filters. Training with Nemotron-CC data boosted MMLU scores by 5.6 points over baseline (59.0 vs. 53.4).
RLHF: How Human Feedback Shapes Recommendations
After pre-training, models are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) in three stages: preference dataset creation (human annotators rate outputs pairwise), reward model training, and RL fine-tuning using PPO.
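The reward-model stage trains on those pairwise ratings with a Bradley-Terry objective: the model should assign the human-preferred response a higher scalar reward. A minimal sketch, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss used to train the RLHF reward model:
    -log sigmoid(r_chosen - r_rejected). Small when the reward model
    already scores the human-preferred response higher, large when it
    disagrees with the annotators' pairwise ranking."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Agreeing with annotators yields a much smaller loss than disagreeing.
agree = preference_loss(2.0, -1.0)
disagree = preference_loss(-1.0, 2.0)
```

Minimizing this loss over millions of comparisons is what nudges the policy toward outputs raters consider helpful and safe, without any brand ever being named explicitly.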
RLHF doesn’t create explicit brand preferences but shapes models toward outputs human raters consider helpful and safe. Models avoid recommending controversial brands due to safety filters. “Digital consensus” matters: when multiple authoritative sources agree on a brand’s category position, RLHF-trained models adopt this as reliable information.
Licensed Data Deals
| Deal | Value | Date |
|---|---|---|
| Google ↔ Reddit | $60M/year | Feb 2024 |
| OpenAI ↔ Reddit | ~$70M/year | May 2024 |
| Reddit total licensing | $203M+ | As of Feb 2024 IPO |
Reddit is the most-cited domain by Google AI Overviews and Perplexity, and the second most-cited by ChatGPT. Google algorithm updates (Aug 2023–Apr 2024) nearly tripled Reddit’s readership from 132M to 346M monthly visitors.
2. Parametric Knowledge vs. RAG
LLMs draw on two distinct knowledge sources: parametric knowledge encoded in the model's weights during training, and retrieval-augmented generation (RAG), which pulls in live search results at query time. This creates two optimization timelines:
| Mechanism | Timeline | Volatility |
|---|---|---|
| Training data (parametric) | 18–36 month investment | Durable once encoded |
| RAG (real-time search) | Immediate | Can appear/disappear day to day |
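The two paths in the table can be pictured as a routing decision: answer from parametric memory when confidence is high, otherwise fall back to live retrieval. All names and the confidence threshold below are hypothetical:

```python
def answer(query, parametric_memory, retrieve, confidence_threshold=0.8):
    """Sketch of the two knowledge paths.

    parametric_memory: dict mapping queries to (answer, confidence) pairs
    baked in at training time (durable, but frozen at the cutoff).
    retrieve: callable hitting live search (immediate, but volatile).
    """
    cached = parametric_memory.get(query)
    if cached and cached[1] >= confidence_threshold:
        return cached[0], "parametric"
    return retrieve(query), "rag"

memory = {"best crm": ("VendorA", 0.9), "best obscure tool": ("VendorB", 0.3)}

def live(query):
    return "VendorC"  # stand-in for a web search result

print(answer("best crm", memory, live))           # high confidence: weights win
print(answer("best obscure tool", memory, live))  # low confidence: live search wins
```

The practical consequence: a brand strong in training data is mentioned even with search disabled, while a brand that exists only in live results can vanish the moment rankings shift.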
Frequency Drives Memorization
Research from Wang et al. (ICLR 2025) found that factual QA showed the strongest memorization effect, increasing alongside model scaling. Pre-training data frequency directly correlates with output probability distributions.
Brands frequently mentioned in high-quality content before the training cutoff become part of the model’s neural weights and get mentioned automatically, without web search. When models cannot confidently retrieve lesser-known brands, they substitute competitors with higher probabilistic weight — the “substitution effect.”
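The substitution effect can be illustrated with a toy model: normalize mention counts into a probability distribution, drop brands below a recall floor, and renormalize, so the lost probability mass flows to the higher-frequency incumbents. The counts and the floor are invented for illustration:

```python
def mention_distribution(mention_counts, recall_floor=0.05):
    """Toy model of the substitution effect: brands whose share of
    training-data mentions falls below the recall floor never surface,
    and their probability mass is redistributed to the rest."""
    total = sum(mention_counts.values())
    probs = {b: c / total for b, c in mention_counts.items()}
    surfaced = {b: p for b, p in probs.items() if p >= recall_floor}
    mass = sum(surfaced.values())
    return {b: p / mass for b, p in surfaced.items()}

counts = {"Incumbent": 9000, "Challenger": 800, "Niche": 200}
dist = mention_distribution(counts)  # "Niche" disappears entirely
```

Note that the incumbent ends up with a larger share than its raw mention count implies, which is exactly the winner-take-all dynamic described later in this report.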
3. The Primary Recommendation Signals
| Signal | Weight |
|---|---|
| Authoritative list mentions (“Best of” lists, expert roundups) | 41% |
| Awards and accreditations | 18% |
| Online reviews (aggregated sentiment) | 16% |
| Traditional SEO signals (backlinks, DA, keywords) | ~0% |
Only 3–4 brands are cited per ChatGPT response (vs. 13 for Perplexity, ~8 for AI Overviews), creating winner-take-all dynamics. 26% of brands have zero AI visibility. The top 50 brands capture 28.9% of all mentions.
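The weights above can be combined into a simple scoring sketch. Only the weights come from the table; the per-brand signal values are hypothetical:

```python
SIGNAL_WEIGHTS = {            # shares from the table above
    "authoritative_lists": 0.41,
    "awards": 0.18,
    "reviews": 0.16,
    "traditional_seo": 0.0,   # backlinks/DA carry ~no weight
}

def visibility_score(brand_signals):
    """Weighted sum of normalized (0-1) per-signal scores."""
    return sum(SIGNAL_WEIGHTS.get(name, 0.0) * value
               for name, value in brand_signals.items())

strong_seo_only = {"traditional_seo": 1.0}
list_heavy = {"authoritative_lists": 0.8, "reviews": 0.5}
```

Under these weights, a brand with perfect traditional SEO but no list mentions scores zero, while even moderate list and review presence scores well.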
4. Role of Recency
Knowledge Cutoff Dates
| Model | Knowledge Cutoff | Web Access |
|---|---|---|
| GPT-5.4 (ChatGPT) | August 31, 2025 | Yes (Bing) |
| Claude 4.6 Opus | ~May 2025 (reliable) | No |
| Gemini 3.1 Pro | January 2025 | Yes (Google) |
| Perplexity Online | Real-time | Always |
| Llama 4 (Meta) | August 2024 | No |
| DeepSeek R1 | October 2023 | No |
Quantified Recency Bias
An ACM SIGIR 2025 study found that “fresh” passages shift the Top-10’s mean publication year forward by up to 4.78 years. Individual items moved by as many as 95 ranks in reranking experiments.
| Content Age | Share of AI Bot Hits |
|---|---|
| <1 year | 65% |
| <2 years | 79% |
| <3 years | 89% |
| <5 years | 94% |
| 6+ years | 6% |
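One way to picture this skew is an exponential freshness decay applied at rerank time. The half-life below is a made-up parameter for illustration, not a figure from the SIGIR study:

```python
def freshness_multiplier(age_years, half_life_years=1.5):
    """Exponential decay sketch: a passage's retrieval score is scaled
    by 0.5 ** (age / half_life). The half-life is an invented parameter,
    chosen so content under a year old keeps most of its score."""
    return 0.5 ** (age_years / half_life_years)

def rerank(passages, half_life_years=1.5):
    """passages: list of (text, relevance, age_years) tuples.
    Returns texts ordered by relevance * freshness."""
    scored = [(rel * freshness_multiplier(age, half_life_years), text)
              for text, rel, age in passages]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [("old-but-relevant", 0.9, 6.0), ("fresh-and-decent", 0.7, 0.5)]
```

Under this decay, the six-year-old passage loses most of its score and the fresher, slightly less relevant one wins, which is consistent with the rank swings the study measured.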
5. Role of Frequency
“A brand mentioned 10,000 times in low-authority blogs may score below one mentioned 200 times in peer-reviewed publications and established industry reports.” (MetricsRule)
| Metric | Finding | Source |
|---|---|---|
| Brands < 50 high-trust mentions | Fail AI recognition 72% of the time | MetricsRule |
| "Best of" roundup inclusion | 400% more likely to be recommended | MetricsRule |
| Third-party vs. owned | 6.5x more effective from third parties | MetricsRule |
| Third-party share | 85% of brand mentions from third-party pages | AirOps |
| Comparative listicles | 32.5% of all AI citations | Digital Bloom |
| Multi-platform presence (4+) | 2.8x more likely to appear | Digital Bloom |
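The MetricsRule quote above implies authority-weighted counting rather than raw volume. A sketch with invented per-tier trust weights (the tiers and numbers are illustrative, not from any published scoring model):

```python
# Hypothetical per-tier trust weights, chosen only to illustrate the
# "200 trusted mentions beat 10,000 blog mentions" claim.
TRUST_WEIGHTS = {
    "peer_reviewed": 50.0,
    "industry_report": 20.0,
    "mainstream_press": 5.0,
    "low_authority_blog": 0.2,
}

def weighted_mentions(mentions_by_tier):
    """Authority-weighted mention score: where mentions appear
    matters far more than how many there are."""
    return sum(TRUST_WEIGHTS.get(tier, 0.0) * n
               for tier, n in mentions_by_tier.items())

blog_heavy = {"low_authority_blog": 10_000}
trusted = {"peer_reviewed": 120, "industry_report": 80}
```

With these weights, 200 high-trust mentions outscore 10,000 blog mentions, matching the pattern in the table above.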
6. Structured Data: Schema.org, JSON-LD, FAQ Pages
Search Atlas analyzed 748,425 queries and found “schema markup does not influence LLM citation frequency.” However, controlled experiments show schema improves response quality: in a GetAISO experiment, the schema version scored 8.6/10 vs. 6.6/10 (a 30% improvement), and GPT-4's rate of correct responses rose from 16% to 54% when the source content was supplied as structured data.
Only 12.4% of registered domains currently implement Schema.org — a massive opportunity gap for accurate representation, if not frequency.
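A minimal Schema.org FAQPage in JSON-LD, of the kind this section describes, can be built and serialized in Python. The question and answer text are placeholders:

```python
import json

# A minimal Schema.org FAQPage in JSON-LD, as it would be embedded in a
# <script type="application/ld+json"> tag on the page.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What does the product do?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "A one-sentence, self-contained answer an LLM can lift verbatim.",
        },
    }],
}

print(json.dumps(faq_jsonld, indent=2))
```

The structure, not the markup itself, is what helps: each question-answer pair is an unambiguous, self-contained unit, which is plausibly why the experiments above show accuracy gains even without citation-frequency gains.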
7. Cross-Platform Fragmentation
| Platform | Key Behavior |
|---|---|
| ChatGPT | 47.9% of citations from Wikipedia; 2.37 brands/response avg; 68% market share |
| Perplexity | 46.7% Reddit citations; real-time search every query; 5-10 inline citations |
| Google AI Overviews | 93.67% cite top-10 organic; 6.02 brands/response; 2B+ monthly users |
| Claude | Hedges with "options include"; cross-references 3+ sources before surfacing |
Brand visibility varies wildly across models: Ariel detergent has ~24% Share of Model on Llama but <1% on Gemini. Brand mentions disagreed 62% of the time across platforms. Only 11% of domains are cited by both ChatGPT AND Perplexity.
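The 11% overlap figure suggests measuring per-platform citation sets directly, for example with Jaccard similarity over cited domains. The domain sets below are hypothetical:

```python
def jaccard(a, b):
    """Share of domains cited by both platforms, relative to all
    domains cited by either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

chatgpt_domains = {"wikipedia.org", "reddit.com", "brand-a.com"}
perplexity_domains = {"reddit.com", "brand-b.com", "brand-c.com"}
overlap = jaccard(chatgpt_domains, perplexity_domains)  # low, per the 11% stat
```

A low score here means visibility on one platform says little about visibility on another, which is why multi-platform presence appears as its own signal in the tables above.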
“Unlike search engines displaying less-popular brands on later pages, AI models are merciless. If your brand doesn't register with an LLM, it simply won't appear at all.” (INSEAD / HBR)
8. Comprehensive Signal Ranking
| Signal | Evidence | Strength |
|---|---|---|
| Authoritative list mentions | 41% of ChatGPT signal | Very High |
| Brand search volume | 0.334 correlation | Very High |
| Third-party mentions | 6.5x more effective; 85% of mentions | Very High |
| Statistics in content | +37–41% visibility boost (GEO paper) | High |
| Quotations from credible sources | +22–41% visibility boost | High |
| Content freshness (<1 year) | 65% of AI hits; 3x for 14-day | High |
| Reviews (4+ stars) | 5.3x more recommendations | High |
| Wikipedia presence | 7.8–47.9% of citations | High |
| Reddit presence | 46.7% of Perplexity citations | High |
| Schema markup | Improves accuracy, not frequency | Medium |
| Domain Rank | 0.25 correlation | Medium |
| Backlinks (traditional) | Weak/neutral correlation | Low |
| Keyword stuffing | -10% negative impact | Harmful |
9. Implications for Bitsy
Sources
- Wikimedia Foundation — Wikipedia's Value in the Age of Generative AI (July 2023)
- Springer Nature — LLM Training Data Analysis (2025)
- NVIDIA Developer Blog — Nemotron-CC Pipeline (2024)
- Onely — How ChatGPT Decides Which Brands to Recommend (Dec 2025)
- Wang et al. — Generalization vs. Memorization (ICLR 2025)
- ACM SIGIR 2025 — Do LLMs Favor Recent Content?
- Seer Interactive — AI Brand Visibility and Content Recency
- Search Atlas — Schema Markup and AI Search
- MetricsRule — Is Your Brand AI-Trainable?
- Digital Bloom — 2025 AI Citation & LLM Visibility Report
- a16z — GEO Over SEO (May 2025)
- INSEAD — Meet the Model: Marketing to LLMs
- PageOnePower — How LLMs Choose Brands