AI search is stochastic. Your visibility strategy shouldn't be.
AI search engines now influence purchasing decisions for 58% of consumers, but they behave fundamentally differently from Google. When a buyer asks ChatGPT “best online fashion store in Europe,” the response is stochastic—the same question produces different brand recommendations across runs, across models, and across phrasings. The CMU “LLM Whisperer” study found that synonym replacements alone change brand mention likelihood by up to 78%, and semantically equivalent prompts produce 7.4–18.6% mention differences [1]. Brand mentions disagree 62% of the time across platforms, with less than a 1-in-100 chance of getting the same recommendation list twice [1].
What the simulator measures
We poll ChatGPT, Claude, and Gemini with real buyer questions, multiple times per model at temperature=0 (even at zero temperature, production LLM APIs are not fully deterministic, so repeated sampling still matters). Every response is parsed for: which brands were mentioned, their position in the response, and whether the sentiment was positive, neutral, or negative. This follows the statistical sampling methodology validated by Tryscope (50 samples/day) [2] and the Discovered Labs framework, which shows that 95% confidence requires hundreds of samples across prompts and regenerations.
From raw observations we extract 14 features per brand: mention rate, average position, top-1 rate, sentiment breakdown, model agreement, query coverage, competitive gap, and more. These features map directly to the findings of the GEO paper [3], Profound's 680-million citation study [4], and Yext's 17.2-million citation analysis [5].
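A minimal sketch of that extraction step, assuming each poll has already been parsed into an ordered list of (brand, sentiment) pairs; the record layout and brand names here are illustrative, not our actual schema:

```python
from statistics import mean

# Hypothetical parsed output: each run is one model response reduced to an
# ordered list of (brand, sentiment) pairs.
runs = [
    [("Zalando", "positive"), ("ASOS", "neutral")],
    [("ASOS", "positive"), ("Zalando", "positive"), ("Boohoo", "negative")],
    [("Zalando", "neutral")],
]

def brand_features(brand, runs):
    """A few of the 14 observation features for one brand."""
    positions, sentiments = [], []
    for run in runs:
        for pos, (name, sent) in enumerate(run, start=1):
            if name == brand:
                positions.append(pos)
                sentiments.append(sent)
                break
    n = len(runs)
    return {
        "mention_rate": len(positions) / n,
        "avg_position": mean(positions) if positions else None,
        "top1_rate": sum(1 for p in positions if p == 1) / n,
        "positive_share": (sentiments.count("positive") / len(sentiments))
                          if sentiments else 0.0,
    }

feats = brand_features("Zalando", runs)
```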
Why you need all three models
Yext's analysis of 17.2 million citations concluded: “There is no single AI optimization strategy. Source mix for Gemini visibility does not equal source mix for Claude visibility.” [5] Claude relies on user-generated content (reviews, social media) at rates 2–4x higher than Gemini, which favors authoritative E-E-A-T signals. Perplexity cites ~21 sources per response versus ChatGPT's ~7 — entirely different ecosystems.
Only 11% of domains are cited by both ChatGPT and Perplexity [4]. A strategy optimized for one model may be invisible on another. Our simulator tracks model_agreement (do all models mention you?) and model_spread (how much do they disagree?) as first-class features.
| Model | Avg citations | Key behavior | Source |
|---|---|---|---|
| ChatGPT | ~7 | Parametric-first (79%); web search 21% of the time | [4] |
| Claude | varies | 2-4x higher user-generated content reliance | [5] |
| Gemini | ~5 | Strongest E-E-A-T signals; rewards authority | [5] |
| Perplexity | ~21 | Search-first; returns native citation URLs | [4] |
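As a rough sketch, model_agreement and model_spread can be derived from per-model mention rates; the definitions below (agreement = fraction of models that mention the brand at all, spread = gap between the best and worst model) are simplified stand-ins for the production features, and the numbers are invented:

```python
def model_agreement(rates):
    """Fraction of polled models that mention the brand at all."""
    return sum(1 for r in rates.values() if r > 0) / len(rates)

def model_spread(rates):
    """Gap between the most and least favorable model's mention rate."""
    return max(rates.values()) - min(rates.values())

# Per-model mention rates for one brand (illustrative numbers)
rates = {"chatgpt": 0.8, "claude": 0.2, "gemini": 0.0}
```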
What the research proved works (and what doesn't)
The GEO paper [3] tested 9 content optimization strategies on 10,000 queries across 25 domains. These are the strategies our simulator lets you toggle as what-if scenarios. The results were validated on Perplexity.ai with real-world data.
| Strategy | Visibility change | Note |
|---|---|---|
| Add expert quotations | +41% | Largest single-strategy lift |
| Add statistics & data | +37% | Validated on Perplexity.ai |
| Cite credible sources | +30% | Up to +115% for lower-ranked sites |
| Improve fluency | +28% | Active voice, short sentences, logical flow |
| Use technical terms | +18% | Domain-dependent |
| Authoritative tone | +12% | Best for debate, history, science |
| Keyword stuffing | -10% | Traditional SEO actively hurts AI visibility |
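In the simulator, toggling one of these strategies just flips flags in the brand's content-feature vector before it is fed to the prediction model; the feature names below are illustrative placeholders:

```python
# Illustrative content-feature vector; the binary flags mirror the GEO
# strategies in the table above.
base = {
    "has_expert_quotes": 0, "has_statistics": 0, "cites_sources": 0,
    "fluency_score": 0.6, "uses_technical_terms": 0, "keyword_stuffing": 1,
}

def apply_scenario(features, toggles):
    """Return a copy of the feature vector with what-if toggles applied."""
    scenario = dict(features)
    scenario.update(toggles)
    return scenario

# "What if I add statistics and drop the keyword stuffing?"
what_if = apply_scenario(base, {"has_statistics": 1, "keyword_stuffing": 0})
```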
Freshness is the strongest real-world signal
Ahrefs' analysis of 17 million citations across 7 platforms found that AI-cited content is 25.7% fresher than organic search results [8]. Seer Interactive's 5,000-URL study showed 76.4% of ChatGPT's top-cited pages were updated within 30 days [7]. One client saw +300% AI traffic after refreshing outdated content.
The SIGIR-AP 2025 paper by Fang et al. confirmed this scientifically: prepending artificial publication dates to passages shifted Top-10 rankings forward by up to 4.78 years [6]. This recency bias is systematic across GPT-4o, LLaMA-3, and Qwen-2.5 model families.
This is why source_freshness is a primary feature in our surrogate model, and “Refresh content” is consistently one of the highest-lift recommendations.
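A sketch of how source_freshness might be derived from a page's last-updated date; the 30-day flag mirrors the Seer threshold above, while the 90-day decay constant is an arbitrary illustrative choice:

```python
from datetime import date
from math import exp

def freshness_features(last_updated, today):
    """source_freshness inputs: content age in days plus derived signals."""
    age_days = (today - last_updated).days
    return {
        "age_days": age_days,
        "updated_within_30d": age_days <= 30,      # threshold from the Seer study
        "freshness_score": exp(-age_days / 90.0),  # assumed 90-day decay constant
    }

feats = freshness_features(date(2025, 1, 1), date(2025, 1, 31))
```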
The surrogate model: architecture and theory
Calling LLM APIs for every what-if question costs ~$0.005/query and takes 2-5 seconds. Instead, we build a surrogate model — an XGBoost proxy trained on accumulated daily observations that predicts mention rate from 14 observation features plus 7 content features in ~1ms. Nature Scientific Reports showed XGBoost surrogates achieve up to 10⁶× speedup with R² of 0.97 [12].
Why XGBoost over neural networks
Our daily collection yields 6-30 rows per day (one per brand). Neural networks need thousands. XGBoost trains in sub-second on CPU with 30-180 rows and handles mixed feature types natively. Most importantly, tree-based models enable exact Shapley value computation via TreeSHAP [13] in O(TLD²) time — making real-time explanations practical where neural networks would require expensive approximations.
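A toy training run at exactly this data scale, using scikit-learn's gradient boosting as a stand-in for XGBoost; the synthetic data and coefficients are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for accumulated daily observations:
# ~120 rows (brands x days), 21 features, mention-rate target in [0, 1].
X = rng.uniform(size=(120, 21))
y = np.clip(0.3 + 0.4 * X[:, 0] - 0.2 * X[:, 1]
            + rng.normal(0, 0.05, 120), 0.0, 1.0)

# A small tree ensemble trains in well under a second on CPU at this scale.
surrogate = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                      random_state=0).fit(X, y)

pred = surrogate.predict(X[:1])  # ~1 ms per call vs a 2-5 s API round trip
```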
Prediction intervals: Conformalized Quantile Regression
A point prediction is useless without knowing how uncertain it is. We plan to use Conformalized Quantile Regression (CQR) [14], which provides distribution-free, finite-sample coverage guarantees:
CQR Algorithm (Romano, Patterson, Candes — NeurIPS 2019)
1. Split data into training I1 and calibration I2
2. Train lower quantile q̂lo(x) = q̂α/2(x) and upper q̂hi(x) = q̂1-α/2(x) on I1
3. Conformity scores: Ei = max(q̂lo(Xi) - Yi, Yi - q̂hi(Xi))
4. Q = (1-α)(1 + 1/|I2|)-th quantile of scores
5. Interval: C(X) = [q̂lo(X) - Q, q̂hi(X) + Q]
Guarantee: P(Ynew ∈ C(Xnew)) ≥ 1-α (finite-sample, distribution-free). XGBoost 2.0+ supports native quantile regression via reg:quantileerror. AAAI 2025 work on Conformal Thresholded Intervals [15] produces even tighter intervals via the Neyman-Pearson lemma.
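Steps 3-5 above can be sketched in a few lines of NumPy, assuming the two quantile models are already trained and only their predictions are passed in:

```python
import numpy as np

def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_new, q_hi_new, alpha=0.1):
    """Steps 3-5 of CQR: conformalize already-trained quantile predictions."""
    # Step 3: conformity scores on the calibration set I2
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    # Step 4: the (1-alpha)(1 + 1/|I2|)-th empirical quantile of the scores
    n = len(y_cal)
    level = min(1.0, (1 - alpha) * (1 + 1 / n))
    Q = np.quantile(scores, level, method="higher")
    # Step 5: widen (or shrink, if Q < 0) the raw quantile band by Q
    return q_lo_new - Q, q_hi_new + Q

lo, hi = cqr_interval(
    q_lo_cal=np.array([0.2, 0.3, 0.1]),
    q_hi_cal=np.array([0.6, 0.7, 0.5]),
    y_cal=np.array([0.25, 0.75, 0.4]),
    q_lo_new=0.3, q_hi_new=0.6,
)
```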
Explainability: TreeSHAP-IQ for feature interactions
Standard SHAP tells you “adding statistics contributed +4.2%.” But it misses interactions: does adding statistics help more when combined with freshness? TreeSHAP-IQ (Muschalik et al., AAAI 2024) [16] computes any-order Shapley Interaction indices in a single recursive traversal:
Shapley Interaction Index
φij = ∑S⊆N\{i,j} |S|!(|N|-|S|-2)! / (2(|N|-1)!) · [f(S∪{i,j}) - f(S∪{i}) - f(S∪{j}) + f(S)]
Positive φij = synergistic (combined > sum). Negative = redundant (diminishing returns). Uses interventional SHAP [13]: E[f(x) | do(XS=xS)], respecting causal structure — changing content causes a freshness change, rather than merely correlating with it.
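For intuition, the pairwise index above can be brute-forced on a toy set function (TreeSHAP-IQ exists precisely to avoid this exponential enumeration on real tree ensembles); the toy model f is invented for illustration:

```python
from itertools import combinations
from math import factorial

def shap_interaction(f, n, i, j):
    """Brute-force pairwise SHAP interaction value phi_ij (formula above)."""
    others = [k for k in range(n) if k not in (i, j)]
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            S = set(subset)
            w = factorial(size) * factorial(n - size - 2) / (2 * factorial(n - 1))
            phi += w * (f(S | {i, j}) - f(S | {i}) - f(S | {j}) + f(S))
    return phi

# Toy set function: additive effects plus a +2 synergy bonus when features
# 0 and 1 co-occur (think: statistics combined with freshness).
def f(S):
    return len(S) + (2.0 if {0, 1} <= S else 0.0)

phi_01 = shap_interaction(f, n=3, i=0, j=1)  # positive => synergistic pair
phi_02 = shap_interaction(f, n=3, i=0, j=2)  # no interaction term => zero
```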
Drift detection: ADWIN on daily streams
LLM behavior changes: models update, RAG sources rotate, competitors shift. We plan to use ADWIN (Adaptive Windowing) [17] for online drift detection on each feature stream:
ADWIN (Bifet & Gavalda, 2007)
Partition window W into W0, W1
εcut = √((1/2m) · ln(4/δ)) where m = harmonic mean of |W0|, |W1|
Drift when: |μW0 - μW1| ≥ εcut
Window grows when stationary (more accuracy), shrinks on drift (discard stale data). We monitor data drift (feature distributions), concept drift (importance shifts >2x), and label drift (predicted vs actual). Framework by Hinder et al. [18].
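The core cut test is a one-liner; the sketch below checks a single candidate split using the εcut formula above (taking m as the harmonic mean of the window sizes, per the text), without ADWIN's full bucket machinery:

```python
from math import log, sqrt
from statistics import mean

def adwin_cut(w0, w1, delta=0.002):
    """Single ADWIN cut test for one candidate split of the window."""
    m = 2 / (1 / len(w0) + 1 / len(w1))       # harmonic mean of |W0|, |W1|
    eps_cut = sqrt(log(4 / delta) / (2 * m))
    return abs(mean(w0) - mean(w1)) >= eps_cut

stable = adwin_cut([0.50] * 50, [0.52] * 50)  # tiny shift: no drift
drift = adwin_cut([0.10] * 50, [0.90] * 50)   # large shift: drift detected
```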
Validation: Combinatorial Purged Cross-Validation
Standard walk-forward gives a single score and suffers from “notable shortcomings in false discovery prevention” [19]. We use CPCV (Lopez de Prado, 2017): test all C(N,k) combinations of time-ordered groups with purging (remove overlapping labels) and embargo (exclude h bars after boundaries). Output is a distribution of OOS scores, not a single number — markedly superior for preventing overfitting.
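A minimal CPCV split generator, assuming contiguous equal-sized groups and a fixed-width purge/embargo margin (real implementations purge by label overlap, not a fixed index margin):

```python
from itertools import combinations

def cpcv_splits(n_samples, n_groups=6, k_test=2, embargo=2):
    """All C(N, k) train/test combinations with a purge/embargo margin."""
    bounds = [(g * n_samples // n_groups, (g + 1) * n_samples // n_groups)
              for g in range(n_groups)]       # contiguous time-ordered groups
    for test_groups in combinations(range(n_groups), k_test):
        test, banned = set(), set()
        for g in test_groups:
            lo, hi = bounds[g]
            test |= set(range(lo, hi))
            banned |= set(range(lo - embargo, hi + embargo))  # purge + embargo
        train = [i for i in range(n_samples) if i not in banned]
        yield sorted(test), train

splits = list(cpcv_splits(60))  # C(6, 2) = 15 splits, each with its own OOS score
```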
Multi-model prediction: one surrogate, three targets
Since ChatGPT, Claude, and Gemini behave differently [5], we train a single multi-output XGBoost surrogate: one tree ensemble predicting mention rate for all three models simultaneously, rather than three separate per-model models. Each split optimizes across all targets, capturing shared signal (freshness helps everywhere) while allowing per-model divergence (Claude weights reviews higher). When a user asks “what if I add statistics?” they see three answers: predicted lift on ChatGPT, Claude, and Gemini, all from a single model pass.
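The shared-split idea can be demonstrated with scikit-learn's natively multi-output trees, which behave analogously to XGBoost 2.0's multi_strategy="multi_output_tree"; the data and coefficients below are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 21))

# Three synthetic targets sharing one driver (feature 0 ~ freshness) plus a
# term that matters for only one model (feature 1 ~ review signal, "Claude").
y = np.column_stack([
    0.5 * X[:, 0],                    # "chatgpt" mention rate
    0.3 * X[:, 0] + 0.4 * X[:, 1],    # "claude"
    0.6 * X[:, 0],                    # "gemini"
])

# scikit-learn trees are natively multi-output: each split is chosen to
# reduce error across all three targets at once.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
preds = model.predict(X[:1])          # shape (1, 3): one pass, three lifts
```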
What's novel
No published paper applies surrogate models to AI search visibility prediction. The closest: E-GEO (Columbia/MIT) [10] developed a “lightweight iterative prompt-optimization algorithm” and found a “universally effective” pattern — suggesting the feature space is learnable. Harvard's manipulation study [11] showed strategic text moves products from never-recommended to top position. Recent work on LLMs as surrogates for optimization (Hao et al.) [20] and the digital twin AI framework from Lehigh/Penn/Stanford [21] both validate the core pattern: collect, train proxy, intervene via proxy. Our contribution is applying it where stochasticity [1], cross-model divergence [5], and recency bias [6] are well-documented but no prediction tool exists.
The pipeline
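In code, the daily loop looks roughly like this; every function here is a stub standing in for a real component, included only to show the flow described above:

```python
def poll(model, query):
    """Stub for one LLM API call returning a parsed brand list."""
    return {"model": model, "brands": ["Zalando", "ASOS"]}

def extract_features(brand, responses):
    """Stub for the 14-feature observation extraction step."""
    hits = sum(1 for r in responses if brand in r["brands"])
    return {"brand": brand, "mention_rate": hits / len(responses)}

def daily_run(queries, brands, models=("chatgpt", "claude", "gemini")):
    responses = [poll(m, q) for m in models for q in queries]   # 1. collect
    rows = [extract_features(b, responses) for b in brands]     # 2. featurize
    # 3. append rows to history, retrain the surrogate, run ADWIN drift checks
    return rows

rows = daily_run(["best online fashion store in Europe"], ["Zalando"])
```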
Why this matters now
Gartner predicted a 25% drop in traditional search volume by 2026, and reality is tracking that forecast: Google searches per U.S. user dropped nearly 20% YoY in 2025, Safari searches declined for the first time in 22 years, and AI chatbot platforms grew 721% in monthly traffic. Ahrefs found that AI Overviews reduce organic CTR for position #1 by 58%, with 83% of AI Overview searches ending in zero clicks [8].
The follow-up GEO paper by Chen et al. [9] showed that AI search systems exhibit “systematic and overwhelming bias toward earned media over brand-owned content” — a structural shift from traditional SEO where you could rank by optimizing your own pages. Third-party mentions are 6.5x more effective than owned domain content [4].
The brands that measure and adapt to AI search will win. The ones that assume Google SEO transfers to ChatGPT will lose — the GEO paper proved that keyword stuffing, the foundation of traditional SEO, decreases AI visibility by 10% [3].
References
- [1] Lin, W. et al. "LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses." CHI 2025, ACM. arxiv.org
- [2] Tryscope. AI Visibility Simulation Platform. Launched April 2026, YC-backed. tryscope.app
- [3] Aggarwal, P. et al. "GEO: Generative Engine Optimization." ACM SIGKDD 2024. arxiv.org
- [4] Profound. AI Platform Citation Patterns: 680M+ Citations Analysis. June 2025. www.tryprofound.com
- [5] Yext. AI Citation Behavior Across Models: 17.2M Citations. Q4 2025. www.yext.com
- [6] Fang, H. et al. "Recency Bias in LLM-Based Reranking." SIGIR-AP 2025. arxiv.org
- [7] Seer Interactive. AI Brand Visibility and Content Recency: 5,000+ URL Study. September 2025. www.seerinteractive.com
- [8] Ahrefs. AI Citation Freshness: 17M Citations Across 7 Platforms. 2025. ahrefs.com
- [9] Chen, M. et al. "Generative Engine Optimization: How to Dominate AI Search." September 2025. arxiv.org
- [10] Bagga, P. et al. "E-GEO: A Testbed for Generative Engine Optimization in E-Commerce." Columbia/MIT, November 2025. arxiv.org
- [11] Kumar, A. & Lakkaraju, H. "Manipulating Large Language Models to Increase Product Visibility." Harvard, April 2024. arxiv.org
- [12] Nature Scientific Reports, 2026. XGBoost Surrogate Technique. www.nature.com
- [13] Lundberg, S. et al. "From local explanations to global understanding with explainable AI for trees." Nature Machine Intelligence, 2020. TreeSHAP: interventional vs observational. arxiv.org
- [14] Romano, Y., Patterson, E., Candes, E. "Conformalized Quantile Regression." NeurIPS 2019. arxiv.org
- [15] Luo, R. & Zhou, Z. "Conformal Thresholded Intervals for Efficient Regression." AAAI 2025. ojs.aaai.org
- [16] Muschalik, M. et al. "Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions for Feature Attribution." AAAI 2024. arxiv.org
- [17] Bifet, A. & Gavalda, R. "Learning from Time-Changing Data with Adaptive Windowing." SIAM SDM 2007. ADWIN algorithm. riverml.xyz
- [18] Hinder, F. et al. "One or two things we know about concept drift." Frontiers in Artificial Intelligence, 2024. www.frontiersin.org
- [19] Arian, H. et al. "Backtest overfitting in the machine learning era: CPCV vs walk-forward." Knowledge-Based Systems, 2024. www.sciencedirect.com
- [20] Hao, H. et al. "LLMs as Surrogate Models in Evolutionary Algorithms." arXiv, June 2024. arxiv.org
- [21] Zhou, R. et al. "Digital Twin AI: From LLMs to World Models." Lehigh/Penn/Stanford, January 2026. arxiv.org