How LLMs Decide What to Mention
Training data pipelines, RLHF, parametric vs. RAG knowledge, frequency effects, recency bias, structured data, and the signals that drive brand recommendations.
1. Training Data Pipelines
The CommonCrawl-to-LLM Pipeline
Every major LLM starts with CommonCrawl, the largest open web corpus. GPT-3 derived over 80% of its training tokens from CommonCrawl; LLaMA allocated 67% to CommonCrawl and 4.5% to Wikipedia. Of 47 LLMs reviewed between 2019 and 2023, at least 64% trained on a filtered version of CommonCrawl.
The CCNet pipeline processes raw web data through five stages: data sourcing from CommonCrawl (a single snapshot is 8.7 TiB compressed), paragraph-level deduplication using SHA1 hashing (~70% of content removed), language identification via FastText, quality filtering by perplexity under a Wikipedia-trained language model, and reference content filtering.
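The paragraph-level deduplication stage can be sketched in a few lines of Python. This is an illustrative reimplementation of the idea, not CCNet's actual code; the light normalization before hashing is an assumption:

```python
import hashlib

def dedup_paragraphs(documents):
    """CCNet-style sketch: drop any paragraph whose SHA1 hash has
    already been seen anywhere in the corpus."""
    seen = set()
    deduped = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            # Light normalization before hashing so trivial whitespace or
            # casing differences don't defeat deduplication (an assumption
            # here; CCNet applies its own normalization).
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        deduped.append("\n".join(kept))
    return deduped

docs = ["Cookie notice.\nUnique article text.", "Cookie notice.\nAnother article."]
# The boilerplate "Cookie notice." survives only in the first document.
print(dedup_paragraphs(docs))
```

This is why boilerplate (cookie banners, navigation text) repeated across millions of pages contributes almost nothing to a model's training signal.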
NVIDIA’s Nemotron-CC pipeline processes the full CommonCrawl English dataset of 6.3 trillion tokens into 2 trillion high-quality tokens using 28 distinct heuristic filters. Training with Nemotron-CC data boosted MMLU scores by 5.6 points over baseline (59.0 vs. 53.4).
RLHF: How Human Feedback Shapes Recommendations
After pre-training, models are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) in three stages: preference dataset creation (human annotators rate outputs pairwise), reward model training, and RL fine-tuning using PPO.
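The reward-model stage trains on those pairwise ratings with a Bradley-Terry objective: the model should assign the human-preferred response a higher scalar reward. A minimal sketch, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss used to train the RLHF reward model:
    -log sigmoid(r_chosen - r_rejected). Small when the reward model
    already scores the human-preferred response higher, large when it
    disagrees with the annotators' pairwise ranking."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Agreeing with annotators yields a much smaller loss than disagreeing.
agree = preference_loss(2.0, -1.0)
disagree = preference_loss(-1.0, 2.0)
```

Minimizing this loss over millions of comparisons is what nudges the policy toward outputs raters consider helpful and safe, without any brand ever being named explicitly.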
RLHF doesn’t create explicit brand preferences but shapes models toward outputs human raters consider helpful and safe. Models avoid recommending controversial brands due to safety filters. “Digital consensus” matters: when multiple authoritative sources agree on a brand’s category position, RLHF-trained models adopt this as reliable information.
Licensed Data Deals
| Deal | Value | Date |
|---|---|---|
| Google ↔ Reddit | $60M/year | Feb 2024 |
| OpenAI ↔ Reddit | ~$70M/year | May 2024 |
| Reddit total licensing | $203M+ | As of Feb 2024 IPO |
Reddit is the most-cited domain by Google AI Overviews and Perplexity, and the second most-cited by ChatGPT. Google algorithm updates (Aug 2023–Apr 2024) nearly tripled Reddit’s readership from 132M to 346M monthly visitors.
2. Parametric Knowledge vs. RAG
LLMs draw on two distinct knowledge sources: parametric knowledge encoded in the model's weights during training, and retrieval-augmented generation (RAG), which pulls in live search results at query time. This creates two optimization timelines:
| Mechanism | Timeline | Volatility |
|---|---|---|
| Training data (parametric) | 18–36 month investment | Durable once encoded |
| RAG (real-time search) | Immediate | Can appear/disappear day to day |
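The two paths in the table can be pictured as a routing decision: answer from parametric memory when confidence is high, otherwise fall back to live retrieval. All names and the confidence threshold below are hypothetical:

```python
def answer(query, parametric_memory, retrieve, confidence_threshold=0.8):
    """Sketch of the two knowledge paths.

    parametric_memory: dict mapping queries to (answer, confidence) pairs
    baked in at training time (durable, but frozen at the cutoff).
    retrieve: callable hitting live search (immediate, but volatile).
    """
    cached = parametric_memory.get(query)
    if cached and cached[1] >= confidence_threshold:
        return cached[0], "parametric"
    return retrieve(query), "rag"

memory = {"best crm": ("VendorA", 0.9), "best obscure tool": ("VendorB", 0.3)}

def live(query):
    return "VendorC"  # stand-in for a web search result

print(answer("best crm", memory, live))           # high confidence: weights win
print(answer("best obscure tool", memory, live))  # low confidence: live search wins
```

The practical consequence: a brand strong in training data is mentioned even with search disabled, while a brand that exists only in live results can vanish the moment rankings shift.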
Frequency Drives Memorization
Research from Wang et al. (ICLR 2025) found that factual QA showed the strongest memorization effect, increasing alongside model scaling. Pre-training data frequency directly correlates with output probability distributions.
Brands frequently mentioned in high-quality content before the training cutoff become part of the model’s neural weights and get mentioned automatically, without web search. When models cannot confidently retrieve lesser-known brands, they substitute competitors with higher probabilistic weight — the “substitution effect.”
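The substitution effect can be illustrated with a toy model: normalize mention counts into a probability distribution, drop brands below a recall floor, and renormalize, so the lost probability mass flows to the higher-frequency incumbents. The counts and the floor are invented for illustration:

```python
def mention_distribution(mention_counts, recall_floor=0.05):
    """Toy model of the substitution effect: brands whose share of
    training-data mentions falls below the recall floor never surface,
    and their probability mass is redistributed to the rest."""
    total = sum(mention_counts.values())
    probs = {b: c / total for b, c in mention_counts.items()}
    surfaced = {b: p for b, p in probs.items() if p >= recall_floor}
    mass = sum(surfaced.values())
    return {b: p / mass for b, p in surfaced.items()}

counts = {"Incumbent": 9000, "Challenger": 800, "Niche": 200}
dist = mention_distribution(counts)  # "Niche" disappears entirely
```

Note that the incumbent ends up with a larger share than its raw mention count implies, which is exactly the winner-take-all dynamic described later in this report.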
3. The Primary Recommendation Signals
| Signal | Weight |
|---|---|
| Authoritative list mentions (“Best of” lists, expert roundups) | 41% |
| Awards and accreditations | 18% |
| Online reviews (aggregated sentiment) | 16% |
| Traditional SEO signals (backlinks, DA, keywords) | ~0% |
Only 3–4 brands are cited per ChatGPT response (vs. 13 for Perplexity, ~8 for AI Overviews), creating winner-take-all dynamics. 26% of brands have zero AI visibility. The top 50 brands capture 28.9% of all mentions.
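The weights above can be combined into a simple scoring sketch. Only the weights come from the table; the per-brand signal values are hypothetical:

```python
SIGNAL_WEIGHTS = {            # shares from the table above
    "authoritative_lists": 0.41,
    "awards": 0.18,
    "reviews": 0.16,
    "traditional_seo": 0.0,   # backlinks/DA carry ~no weight
}

def visibility_score(brand_signals):
    """Weighted sum of normalized (0-1) per-signal scores."""
    return sum(SIGNAL_WEIGHTS.get(name, 0.0) * value
               for name, value in brand_signals.items())

strong_seo_only = {"traditional_seo": 1.0}
list_heavy = {"authoritative_lists": 0.8, "reviews": 0.5}
```

Under these weights, a brand with perfect traditional SEO but no list mentions scores zero, while even moderate list and review presence scores well.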
4. Role of Recency
Knowledge Cutoff Dates
| Model | Knowledge Cutoff | Web Access |
|---|---|---|
| GPT-5.4 (ChatGPT) | August 31, 2025 | Yes (Bing) |
| Claude 4.6 Opus | ~May 2025 (reliable) | No |
| Gemini 3.1 Pro | January 2025 | Yes (Google) |
| Perplexity Online | Real-time | Always |
| Llama 4 (Meta) | August 2024 | No |
| DeepSeek R1 | October 2023 | No |
Quantified Recency Bias
An ACM SIGIR 2025 study found that “fresh” passages shift the Top-10’s mean publication year forward by up to 4.78 years. Individual items moved by as many as 95 ranks in reranking experiments.
| Content Age | Share of AI Bot Hits |
|---|---|
| <1 year | 65% |
| <2 years | 79% |
| <3 years | 89% |
| <5 years | 94% |
| 6+ years | 6% |
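One way to picture this skew is an exponential freshness decay applied at rerank time. The half-life below is a made-up parameter for illustration, not a figure from the SIGIR study:

```python
def freshness_multiplier(age_years, half_life_years=1.5):
    """Exponential decay sketch: a passage's retrieval score is scaled
    by 0.5 ** (age / half_life). The half-life is an invented parameter,
    chosen so content under a year old keeps most of its score."""
    return 0.5 ** (age_years / half_life_years)

def rerank(passages, half_life_years=1.5):
    """passages: list of (text, relevance, age_years) tuples.
    Returns texts ordered by relevance * freshness."""
    scored = [(rel * freshness_multiplier(age, half_life_years), text)
              for text, rel, age in passages]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [("old-but-relevant", 0.9, 6.0), ("fresh-and-decent", 0.7, 0.5)]
```

Under this decay, the six-year-old passage loses most of its score and the fresher, slightly less relevant one wins, which is consistent with the rank swings the study measured.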
5. Role of Frequency
“A brand mentioned 10,000 times in low-authority blogs may score below one mentioned 200 times in peer-reviewed publications and established industry reports.” (MetricsRule)
| Metric | Finding | Source |
|---|---|---|
| Brands < 50 high-trust mentions | Fail AI recognition 72% of the time | MetricsRule |
| "Best of" roundup inclusion | 400% more likely to be recommended | MetricsRule |
| Third-party vs. owned | 6.5x more effective from third parties | MetricsRule |
| Third-party share | 85% of brand mentions from third-party pages | AirOps |
| Comparative listicles | 32.5% of all AI citations | Digital Bloom |
| Multi-platform presence (4+) | 2.8x more likely to appear | Digital Bloom |
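The MetricsRule quote above implies authority-weighted counting rather than raw volume. A sketch with invented per-tier trust weights (the tiers and numbers are illustrative, not from any published scoring model):

```python
# Hypothetical per-tier trust weights, chosen only to illustrate the
# "200 trusted mentions beat 10,000 blog mentions" claim.
TRUST_WEIGHTS = {
    "peer_reviewed": 50.0,
    "industry_report": 20.0,
    "mainstream_press": 5.0,
    "low_authority_blog": 0.2,
}

def weighted_mentions(mentions_by_tier):
    """Authority-weighted mention score: where mentions appear
    matters far more than how many there are."""
    return sum(TRUST_WEIGHTS.get(tier, 0.0) * n
               for tier, n in mentions_by_tier.items())

blog_heavy = {"low_authority_blog": 10_000}
trusted = {"peer_reviewed": 120, "industry_report": 80}
```

With these weights, 200 high-trust mentions outscore 10,000 blog mentions, matching the pattern in the table above.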
6. Structured Data: Schema.org, JSON-LD, FAQ Pages
Search Atlas analyzed 748,425 queries and found “schema markup does not influence LLM citation frequency.” However, controlled experiments show schema improves response quality: in a GetAISO experiment, the schema version scored 8.6/10 vs. 6.6/10 (a 30% improvement), and GPT-4's rate of correct responses rose from 16% to 54% when the source content was supplied as structured data.
Only 12.4% of registered domains currently implement Schema.org — a massive opportunity gap for accurate representation, if not frequency.
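A minimal Schema.org FAQPage in JSON-LD, of the kind this section describes, can be built and serialized in Python. The question and answer text are placeholders:

```python
import json

# A minimal Schema.org FAQPage in JSON-LD, as it would be embedded in a
# <script type="application/ld+json"> tag on the page.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What does the product do?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "A one-sentence, self-contained answer an LLM can lift verbatim.",
        },
    }],
}

print(json.dumps(faq_jsonld, indent=2))
```

The structure, not the markup itself, is what helps: each question-answer pair is an unambiguous, self-contained unit, which is plausibly why the experiments above show accuracy gains even without citation-frequency gains.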
7. Cross-Platform Fragmentation
| Platform | Key Behavior |
|---|---|
| ChatGPT | 47.9% of citations from Wikipedia; 2.37 brands/response avg; 68% market share |
| Perplexity | 46.7% Reddit citations; real-time search every query; 5-10 inline citations |
| Google AI Overviews | 93.67% cite top-10 organic; 6.02 brands/response; 2B+ monthly users |
| Claude | Hedges with "options include"; cross-references 3+ sources before surfacing |
Brand visibility varies wildly across models: Ariel detergent has ~24% Share of Model on Llama but <1% on Gemini. Brand mentions disagreed 62% of the time across platforms. Only 11% of domains are cited by both ChatGPT AND Perplexity.
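The 11% overlap figure suggests measuring per-platform citation sets directly, for example with Jaccard similarity over cited domains. The domain sets below are hypothetical:

```python
def jaccard(a, b):
    """Share of domains cited by both platforms, relative to all
    domains cited by either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

chatgpt_domains = {"wikipedia.org", "reddit.com", "brand-a.com"}
perplexity_domains = {"reddit.com", "brand-b.com", "brand-c.com"}
overlap = jaccard(chatgpt_domains, perplexity_domains)  # low, per the 11% stat
```

A low score here means visibility on one platform says little about visibility on another, which is why multi-platform presence appears as its own signal in the tables above.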
“Unlike search engines displaying less-popular brands on later pages, AI models are merciless. If your brand doesn't register with an LLM, it simply won't appear at all.” (INSEAD / HBR)
8. Comprehensive Signal Ranking
| Signal | Evidence | Strength |
|---|---|---|
| Authoritative list mentions | 41% of ChatGPT signal | Very High |
| Brand search volume | 0.334 correlation | Very High |
| Third-party mentions | 6.5x more effective; 85% of mentions | Very High |
| Statistics in content | +37–41% visibility boost (GEO paper) | High |
| Quotations from credible sources | +22–41% visibility boost | High |
| Content freshness (<1 year) | 65% of AI hits; 3x for 14-day | High |
| Reviews (4+ stars) | 5.3x more recommendations | High |
| Wikipedia presence | 7.8–47.9% of citations | High |
| Reddit presence | 46.7% of Perplexity citations | High |
| Schema markup | Improves accuracy, not frequency | Medium |
| Domain Rank | 0.25 correlation | Medium |
| Backlinks (traditional) | Weak/neutral correlation | Low |
| Keyword stuffing | -10% negative impact | Harmful |
9. Implications for Bitsy
Sources
- Wikimedia Foundation — Wikipedia's Value in the Age of Generative AI (July 2023)
- Springer Nature — LLM Training Data Analysis (2025)
- NVIDIA Developer Blog — Nemotron-CC Pipeline (2024)
- Onely — How ChatGPT Decides Which Brands to Recommend (Dec 2025)
- Wang et al. — Generalization vs. Memorization (ICLR 2025)
- ACM SIGIR 2025 — Do LLMs Favor Recent Content?
- Seer Interactive — AI Brand Visibility and Content Recency
- Search Atlas — Schema Markup and AI Search
- MetricsRule — Is Your Brand AI-Trainable?
- Digital Bloom — 2025 AI Citation & LLM Visibility Report
- a16z — GEO Over SEO (May 2025)
- INSEAD — Meet the Model: Marketing to LLMs
- PageOnePower — How LLMs Choose Brands