Every major AI search system — Google AI Mode, Perplexity, ChatGPT Search, Microsoft Copilot — shares a common foundation: a retrieval pipeline that selects candidate content before a language model generates a single word. Yet no published methodology exists for simulating that retrieval pipeline against a website’s own content in order to measure how visible that content actually is to AI systems.
This paper describes a 22-phase pipeline for doing exactly that. The pipeline crawls a domain, audits the gap between static and JavaScript-rendered content, extracts structured and unstructured content signals, generates a prompt universe across six intent categories, and runs a hybrid lexical-semantic retrieval simulation against that prompt universe. The result is a Coverage Score and Confidence Index that together measure a site’s retrievability — not what AI systems say about it, but whether AI systems could reliably find it.
The methodology was developed during the construction of RetrieveAI, an AI retrieval and visibility audit platform. This paper documents the technical architecture as a standalone research contribution.
This paper covers:
- Why existing RAG evaluation frameworks address system quality, not content retrievability
- The rendering gap problem: why JavaScript-dependent content is invisible to AI crawlers
- Interaction and navigation auditing as a distinct retrievability signal
- Hybrid lexical-semantic retrieval simulation and its scoring mechanics
- Intent-classified prompt universe generation as the benchmark for coverage measurement
- Content gap detection, authority signals, and the composite scoring model
1. The Gap This Framework Fills
The RAG evaluation literature addresses a fundamentally different problem from the one this framework solves. RAGAS (Es et al., EACL 2024) evaluates how well a deployed RAG system retrieves and utilises context — faithfulness, context precision, context recall.[1] ARES (Saad-Falcon et al., NAACL 2024) fine-tunes LLM judges to evaluate retrieval quality for a specific corpus and query set.[2] The BEIR benchmark (Thakur et al., NeurIPS 2021) evaluates retrieval model performance across 18 heterogeneous datasets.[3] All three evaluate a system’s retrieval quality given a known corpus.
This framework inverts the problem. The question is not “how well does this retrieval system perform?” but “how retrievable is this website by any competent retrieval system?” The corpus is the website. The queries must be synthesised from the content. The retrieval system is simulated, not deployed. The output is a site-level visibility score rather than a system-level quality score.
The closest prior work is Penha et al.’s controllable query generation study (WWW 2023), which found that machine-learned retrieval systems exhibit significant retrievability bias — the majority of a corpus remains effectively invisible because the same small set of documents surface for most queries.[4] Kang et al.’s CCQGen (WSDM 2025) introduced concept-coverage-based query generation to ensure comprehensive evaluation of scientific documents.[5] Neither applies these ideas to web content retrieval simulation as a site-level visibility measurement tool.
Gao et al. (EMNLP 2023) established that retrieval recall is a hard upper bound on LLM citation performance: a system can only cite what it retrieves.[6] This framework measures that ceiling — the maximum possible citation visibility a site can achieve given the retrievability of its content.
2. The Problem of Retrieval Invisibility
Empirical research has established a striking fact: traditional search ranking is a poor predictor of AI citation. Ahrefs’ analysis of 15,000 prompts across ChatGPT, Gemini, Copilot, and Perplexity found that only 12% of AI-cited URLs appear in Google’s top 10 organic results, and 80% of AI-cited URLs don’t rank anywhere in Google’s top 100.[7] The Princeton/IIT Delhi GEO study (KDD 2024) found that content strategies improving AI visibility are fundamentally different from, and sometimes the inverse of, traditional keyword optimisation.[8]
Liu et al.’s “Lost in the Middle” finding (TACL 2024) added a second dimension: even when content is retrieved, LLM performance is highest for documents positioned at the beginning or end of the context window, with significant degradation for middle-positioned material.[9] Content visibility is therefore a two-stage problem: content must survive the retrieval gate, then be positioned favourably enough to influence generation. This framework addresses the first stage.
3. Pipeline Architecture: Seven Functional Groups
The pipeline operates as 22 sequential phases across seven functional groups. Each group hands enriched data to the next. Several phases are designated hard-fail points where the audit stops rather than producing scores from incomplete data.
22-Phase Pipeline — Functional Groups
4. Discovery and Context Selection (Phases 1–2)
The pipeline opens with URL discovery through sitemap parsing, with a fallback to robots.txt crawl directives when sitemaps are absent or malformed. The output is a candidate URL list bounded by scope type — from a single page up to a full-site crawl.
Phase 2 introduces an intelligent context selection step that distinguishes this pipeline from naive crawl-everything approaches. Rather than processing all discovered URLs equally, the system constructs a curated context bundle: primary URLs designated by the audit scope, supplemented by semantically related pages — the site’s homepage, about page, related product or category siblings, and associated blog content. This bundle-based approach reflects how AI retrieval systems actually build context: not from isolated pages but from the semantic neighbourhood around a domain’s primary content.
The sitemap step is more significant than it appears. Thiel & Kretschmer’s FAccT 2024 analysis of Common Crawl found that automated URL discovery is biased toward frequently-linked domains, effectively applying a PageRank filter to what enters LLM training data.[10] Cloudflare’s 2025 analysis of AI crawler behaviour confirms that AI bots rarely reach 3rd- and 4th-level pages.[11] The sitemap provides an authoritative URL inventory that corrects for this crawl depth bias.
5. The Crawl and Rendering Gap (Phases 3–6)
Phase 3 is a hard-fail point. The crawl uses a tiered vendor fallback strategy to render each page with JavaScript fully executed, capturing both the pre-execution raw HTML and the post-execution DOM. Both versions are preserved because Phase 4 requires them.
Phase 4 — the rendering gap audit — is one of the most practically significant phases in the pipeline. It systematically compares the raw pre-JavaScript HTML against the fully rendered DOM, quantifying what proportion of a page’s content exists only after script execution. The audit specifically detects JS-only text content, JS-only headings, JS-only product schema markup, JS-dependent checkout flows, and third-party script injections.
The practical stakes are high. Google’s documentation confirms a two-phase crawl-then-render process with variable delays between phases.[12] AI crawlers — GPTBot, ClaudeBot, PerplexityBot — do not execute JavaScript at all: analysis of half a billion GPTBot fetches found zero evidence of script execution.[13] Any content that depends on JavaScript execution is permanently invisible to these crawlers.
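The gap itself can be approximated with nothing more than tag stripping and token sets. A minimal sketch, assuming a naive text extractor (the production audit additionally distinguishes headings, schema markup, and checkout flows, which this sketch omits):

```python
import re

def visible_text(html: str) -> set[str]:
    # Strip scripts/styles, then all remaining tags, and tokenise what is left.
    html = re.sub(r"(?s)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rendering_gap(raw_html: str, rendered_dom: str) -> float:
    """Fraction of rendered-page tokens that exist only after JS execution."""
    raw, rendered = visible_text(raw_html), visible_text(rendered_dom)
    if not rendered:
        return 0.0
    return len(rendered - raw) / len(rendered)

raw = "<html><body><h1>Acme Widgets</h1></body></html>"
dom = "<html><body><h1>Acme Widgets</h1><p>Price: 49 USD, in stock</p></body></html>"
gap = rendering_gap(raw, dom)
```

On this toy pair, five of the seven rendered tokens exist only after script execution, a gap of roughly 71%: exactly the content an AI crawler that fetches only the raw HTML will never see.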
Phase 4 — What Each Agent Class Sees
Phase 5 extends the rendering audit into interactive territory. Using a headless browser, the pipeline attempts to activate interactive elements — accordions, tabs, modals, variant selectors — and records whether substantive content appears behind those interactions. This matters because AI agents cannot click; they receive only the initial page state. Phase 5 flags pages where significant product or informational content is hidden behind interactive states that agents will never reach.
Phase 6 scores static navigability: the degree to which an AI agent could traverse the site using only the links present in raw HTML. Seven signals are evaluated, including whether navigation uses standard href attributes, whether filters and pagination are URL-addressable, and whether search functionality is accessible via HTML forms. A low crawlability score means the site’s content topology is partially or entirely inaccessible to agents that cannot execute JavaScript.
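Three of the seven signals can be sketched as simple checks against the raw HTML. The signal names and regexes below are illustrative assumptions, not the pipeline's internal definitions:

```python
import re

def static_nav_signals(raw_html: str) -> dict[str, bool]:
    """Check a subset of static-navigability signals against raw (pre-JS) HTML."""
    return {
        # Navigation exists as plain <a href="..."> anchors, not JS click handlers.
        "href_navigation": bool(re.search(r'<a\s[^>]*href=["\'][^"\']+["\']', raw_html)),
        # Pagination is URL-addressable (e.g. ?page=2) rather than JS-driven.
        "url_pagination": bool(re.search(r'href=["\'][^"\']*[?&]page=\d+', raw_html)),
        # Site search is reachable via a plain HTML form.
        "html_search_form": bool(re.search(r'<form\s[^>]*action=', raw_html)),
    }

def crawlability_score(raw_html: str) -> float:
    signals = static_nav_signals(raw_html)
    return 100 * sum(signals.values()) / len(signals)

page = ('<nav><a href="/products?page=2">Next</a></nav>'
        '<form action="/search"><input name="q"></form>')
```

A page navigated entirely through `onclick` handlers scores zero on all three checks, which is precisely the topology-invisibility failure the phase is designed to surface.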
6. Content Extraction (Phases 7–9)
Phase 7 is a hard-fail point that parses every crawled page’s HTML to extract structured metadata: title tags, canonical URLs, JSON-LD schema markup, meta tags, and heading hierarchy (H1–H3). This structured extraction feeds multiple downstream phases and provides the raw material for technical audit scoring.
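A stdlib-only sketch of this extraction, covering titles, H1–H3 headings, and JSON-LD blocks (canonical URLs and meta tags are omitted for brevity):

```python
from html.parser import HTMLParser
import json

class MetadataExtractor(HTMLParser):
    """Minimal Phase-7-style extractor: title, H1-H3 hierarchy, JSON-LD payloads."""
    def __init__(self):
        super().__init__()
        self.title, self.headings, self.jsonld = "", [], []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._tag = tag
        elif tag == "script" and ("type", "application/ld+json") in attrs:
            self._tag = "jsonld"

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data.strip()
        elif self._tag in ("h1", "h2", "h3"):
            self.headings.append((self._tag, data.strip()))
        elif self._tag == "jsonld" and data.strip():
            self.jsonld.append(json.loads(data))

    def handle_endtag(self, tag):
        self._tag = None

page = ('<title>Acme</title><h1>Widgets</h1>'
        '<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>')
ex = MetadataExtractor()
ex.feed(page)
```

Run against the raw pre-JavaScript HTML, this is also a direct probe of the rendering gap: whatever the extractor cannot see here, a non-rendering AI crawler cannot see either.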
Phase 8 scans inline JavaScript for API endpoint patterns, identifying REST, GraphQL, and JSON endpoints that could serve machine-readable data to AI agents. This API detection step is the precursor to the Commerce Readiness dimension, which scores a site’s accessibility to autonomous AI purchasing agents.
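A sketch of the endpoint scan. The patterns below are illustrative; the pipeline's actual pattern set is not published and is certainly more extensive:

```python
import re

# Illustrative endpoint patterns for the three classes named above.
ENDPOINT_PATTERNS = {
    "rest": re.compile(r'["\'](/api/[\w/\-]+)["\']'),
    "graphql": re.compile(r'["\']([\w/\-]*/graphql)["\']'),
    "json": re.compile(r'["\'](/[\w/\-]+\.json)["\']'),
}

def detect_endpoints(inline_js: str) -> dict[str, list[str]]:
    """Scan inline JavaScript for machine-readable endpoint candidates."""
    return {kind: sorted(set(p.findall(inline_js)))
            for kind, p in ENDPOINT_PATTERNS.items()}

js = 'fetch("/api/products/42"); fetch("/graphql"); load("/feeds/catalog.json")'
found = detect_endpoints(js)
```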
Phase 9 extracts commerce-specific data from schema.org Product markup already identified in Phase 7: product identifiers, pricing, availability, media, and brand information. This extraction only runs on pages that contain commerce schema, keeping the pipeline efficient for non-commercial content.
The structured data extraction emphasis is grounded in evidence. Microsoft’s Bing/Copilot team confirmed at SMX Munich 2025 that schema markup directly helps their LLMs understand page content.[14] AccuraCast’s analysis of 9,000 AI citation sources found 81% of AI-cited pages include schema markup.[15]
7. The Intelligence Layer (Phases 10–13)
Phases 10–13 constitute the intelligence layer: the pipeline reads the site’s content and produces higher-order signals that feed the analysis and simulation phases.
Phase 10: Intent-Classified Prompt Universe Generation
Phase 10 generates the prompt universe — the benchmark that all subsequent retrieval simulation is measured against. The site’s content is summarised and passed to a language model, which generates up to 200 realistic queries that a user might ask an AI system about the brand or domain. Crucially, these queries are classified across six intent categories: informational, navigational, transactional, commercial, local, and other.
This intent classification is a methodological contribution beyond prior synthetic query generation approaches. InPars (Bonifacio et al., SIGIR 2022) and Promptagator (Dai et al., ICLR 2023) generate queries to train retrieval systems;[16][17] this framework generates queries to evaluate whether a site’s existing content covers the realistic intent distribution of its likely query space. A site that covers only informational queries while generating transactional traffic has a measurable intent coverage gap that this classification makes visible.
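Once the simulation has marked each prompt covered or uncovered, the intent coverage gap falls out of a grouped count. A minimal sketch, assuming each prompt record carries an intent label and a covered flag (the field names are illustrative):

```python
from collections import Counter

INTENTS = ["informational", "navigational", "transactional", "commercial", "local", "other"]

def intent_coverage(prompts: list[dict]) -> dict[str, float]:
    """Per-intent coverage rate over an intent-labelled prompt universe."""
    totals, hits = Counter(), Counter()
    for p in prompts:
        totals[p["intent"]] += 1
        hits[p["intent"]] += p["covered"]  # bool counts as 0/1
    return {i: hits[i] / totals[i] for i in INTENTS if totals[i]}

universe = [
    {"intent": "informational", "covered": True},
    {"intent": "informational", "covered": True},
    {"intent": "transactional", "covered": False},
    {"intent": "transactional", "covered": True},
]
rates = intent_coverage(universe)
```

In this toy universe the site fully covers informational intent but only half of transactional intent, exactly the per-intent asymmetry an aggregate coverage number would hide.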
Alberti et al.’s roundtrip consistency principle (ACL 2019) informs quality control: generated prompts must demonstrate semantic alignment with the content that generated them, filtering hallucinated or off-topic queries before they enter the benchmark.[18]
Phases 11–12: Brand Monitoring and Perception
Phase 11 sends crawled content to a language model to identify brand mentions, their frequency, and associated sentiment. The top-mentioned brands are then cross-referenced against real AI citation data from a live source, bridging the gap between what the site says and what AI systems independently surface about those brands.
Phase 12 produces a multi-dimensional brand perception analysis: overall sentiment, tone, key themes, trust signals, risk signals, and brand voice summary. This output is separate from the retrieval simulation and feeds the audit’s qualitative reporting layer.
Phase 13: Entity Extraction
Phase 13 performs pure statistical text analysis without involving a language model. The full crawled content corpus is tokenised, stopwords removed, and entities counted by frequency across four content surfaces: title tags, headings, body text, and alt text. The top 50 entities by weighted frequency are then verified against Wikidata, producing a resolvability score for each.
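A sketch of the weighted-frequency ranking, with assumed surface weights and a truncated stopword list (the pipeline's actual weights are not published):

```python
import re
from collections import Counter

# Surface weights and stopword list are illustrative assumptions.
SURFACE_WEIGHTS = {"title": 4.0, "heading": 3.0, "alt": 2.0, "body": 1.0}
STOPWORDS = {"the", "a", "an", "and", "of", "for", "in", "to"}

def weighted_entities(surfaces: dict[str, str], top_n: int = 50) -> list[tuple[str, float]]:
    """Tokenise each content surface, drop stopwords, rank by surface-weighted frequency."""
    scores = Counter()
    for surface, text in surfaces.items():
        w = SURFACE_WEIGHTS[surface]
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in STOPWORDS:
                scores[token] += w
    return scores.most_common(top_n)

page = {
    "title": "Acme Widgets",
    "heading": "Widgets for the modern workshop",
    "body": "Acme builds widgets and widget accessories.",
    "alt": "widget close-up",
}
ranked = weighted_entities(page)
```

The top-ranked terms would then be the candidates sent to Wikidata for resolvability checks.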
This Wikidata verification step is significant. Pan et al.’s 2024 roadmap “Unifying Large Language Models and Knowledge Graphs” demonstrated that structured entity knowledge from knowledge graphs directly mitigates LLM hallucination and improves factual accuracy.[19] An entity that resolves to a Wikidata node — which indexes 121+ million entities with 1.65 billion semantic triples — is more likely to be part of an AI system’s structured world model than an unresolvable entity.[20]
8. Analysis: Technical, SKU, and Extractability (Phase 14)
Phase 14 is a hard-fail point running three concurrent sub-analysers. Each produces a 0–100 score with a letter grade; together they feed the Entity Strength component of the composite score.
Phase 14 — Three Concurrent Sub-Analysers
9. Gap Detection and Authority Signals (Phases 15–17)
Phase 15 operationalises the prompt universe as a content audit tool. For each generated prompt, the pipeline checks whether any entity from the Phase 13 entity index appears as a match. Unmatched prompts — those for which the site has no entity-level content — represent content gaps. The top 10 gaps are enriched with competitor analysis data, identifying which domains are already capturing traffic for those unserved intent categories.
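The gap check itself reduces to a set intersection. A sketch using a simplified token-containment matching rule (the pipeline's actual matching logic may be stricter):

```python
def content_gaps(prompts: list[str], entity_index: set[str], top_n: int = 10) -> list[str]:
    """Flag prompts with no match in the entity index as content gaps."""
    gaps = []
    for prompt in prompts:
        tokens = set(prompt.lower().split())
        if not tokens & entity_index:  # no entity-level content serves this prompt
            gaps.append(prompt)
    return gaps[:top_n]

entities = {"acme", "widgets", "pricing"}
prompts = ["what do acme widgets cost", "how do I return a faulty order"]
gaps = content_gaps(prompts, entities)
```

Here the returns-policy prompt surfaces as a gap: the site has no entity-level content for that intent, so it becomes a candidate for competitor enrichment.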
This gap detection approach extends Penha et al.’s retrievability bias finding[4] into a practical audit tool: it does not just identify that bias exists, it identifies which specific query intents a site is failing to serve and what the competitive landscape looks like for those gaps.
Phase 16 converts the top gaps into structured content recommendations, each including a proposed heading hierarchy, key information points, and the schema type most appropriate for that content category. This bridges the measurement function of the pipeline with actionable editorial guidance.
Phase 17 evaluates six authority signals across all crawled pages: heading structure consistency, internal linking density, structured data coverage breadth, meta description coverage, author attribution presence, and content depth. These signals proxy the E-E-A-T factors that ZipTie.dev’s 2025 analysis found to be strong predictors of AI citation — pages with strong authority signals were cited 2.3× more than rank-#1 pages with weak authority.[21]
10. The Retrieval Simulation (Phase 19)
Phase 19 is the most technically distinctive phase and a hard-fail point. It simulates the retrieval gate that determines whether a site’s content would be selected as context by a RAG system for a given query.
The crawled content corpus is first segmented into overlapping chunks with a minimum and maximum character boundary and a word-level overlap between adjacent chunks. Near-duplicate chunks are eliminated through a two-stage deduplication process: exact matches first, then similarity-based near-duplicate detection.
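A sketch of the chunking and deduplication steps. The character bound, overlap width, and similarity threshold are illustrative defaults, and the minimum-size bound is omitted for brevity:

```python
def chunk_text(text: str, max_chars: int = 400, overlap_words: int = 20) -> list[str]:
    """Split text into chunks under a character bound, with word-level overlap."""
    words, chunks, start = text.split(), [], 0
    while start < len(words):
        chunk, end = "", start
        # Always take at least one word, then fill up to the character bound.
        while end < len(words) and (end == start or len(chunk) + len(words[end]) + 1 <= max_chars):
            chunk = f"{chunk} {words[end]}".strip()
            end += 1
        chunks.append(chunk)
        if end >= len(words):
            break
        start = max(end - overlap_words, start + 1)  # overlap with the previous chunk
    return chunks

def dedupe(chunks: list[str], sim_threshold: float = 0.9) -> list[str]:
    """Two-stage dedup: exact matches first, then near-duplicates by token Jaccard."""
    kept, seen = [], set()
    for c in chunks:
        if c in seen:
            continue
        tokens = set(c.lower().split())
        if any(len(tokens & set(k.lower().split())) / len(tokens | set(k.lower().split()))
               >= sim_threshold for k in kept):
            continue
        seen.add(c)
        kept.append(c)
    return kept
```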
For each prompt in the universe, the simulation runs a hybrid lexical-semantic scoring process. The first pass uses Jaccard similarity, a computationally inexpensive lexical overlap measure, to identify candidate chunks across the full corpus. The top candidates from this pass are then scored using dense vector cosine similarity via semantic embeddings. The final score for each prompt-chunk pair blends these two signals: lexical matching contributes 60% and semantic similarity contributes 40%.
A prompt is classified as “covered” if any chunk in the corpus achieves a blended score above the coverage threshold. This hybrid approach reflects how production RAG systems actually work: pure semantic retrieval is expensive and misses exact-match signals that keyword-rich content provides; pure lexical retrieval misses paraphrastic matches that semantic embeddings capture. The blend corrects both failure modes simultaneously.
Phase 19 — Hybrid Retrieval Simulation Architecture
The two-stage design reflects a core tradeoff in IR: a Jaccard scan over the full chunk set is cheap but misses semantic matches, while dense cosine similarity captures meaning but is expensive at full-corpus scale. Running semantic scoring only on Jaccard-shortlisted candidates achieves the benefits of both at manageable compute cost.
The rationale for this hybrid approach is grounded in the information retrieval literature. Karpukhin et al.’s Dense Passage Retrieval work (EMNLP 2020) showed that dense retrieval substantially outperforms BM25 on open-domain question answering,[22] but this advantage assumes a well-optimised dense retrieval system. For the general web content case — where content quality, specificity, and keyword density vary enormously — lexical pre-filtering reduces the risk of semantic retrieval returning plausible-but-wrong matches that high-quality keyword content would correctly reject.
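The scoring loop can be sketched end to end. The character-trigram `embed` function is a runnable stand-in for the dense semantic embeddings a production system would use; the 60/40 blend and two-stage structure follow the description above, while the shortlist size is an assumed parameter:

```python
import math
import re
from collections import Counter

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def embed(text: str) -> Counter:
    # Stand-in "embedding": character-trigram counts. A real pipeline would call
    # a dense semantic embedding model here; this only makes the sketch runnable.
    s = re.sub(r"\W+", " ", text.lower())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_prompt(prompt: str, chunks: list[str], shortlist: int = 20,
                 w_lex: float = 0.6, w_sem: float = 0.4) -> float:
    """Stage 1: Jaccard over the full corpus. Stage 2: cosine re-rank of the
    shortlist. Returns the best blended score (60% lexical / 40% semantic)."""
    ranked = sorted(chunks, key=lambda c: jaccard(prompt, c), reverse=True)[:shortlist]
    qv = embed(prompt)
    return max(w_lex * jaccard(prompt, c) + w_sem * cosine(qv, embed(c)) for c in ranked)

chunks = ["acme widgets cost 49 usd", "our team loves hiking"]
best = score_prompt("how much do acme widgets cost", chunks)
```

A prompt's `best` score against the coverage threshold then decides whether it counts as covered.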
11. Scoring: Coverage Score, Confidence Index, and Composite (Phases 20–21)
Phase 20 is a hard-fail point that assembles the composite AI Visibility Score from all preceding phases. Two primary retrieval metrics are computed first.
The Coverage Score measures retrieval breadth: the proportion of prompts in the universe for which the site has at least one chunk meeting the coverage threshold. A Coverage Score of 0.8 means 80% of the expected query space for this domain is served by existing content. The remaining 20% represents the gap surface fed into Phase 15.
The Confidence Index measures retrieval quality: the mean blended score of the best-matching chunk across all covered prompts. Where Coverage Score measures whether retrieval succeeds, the Confidence Index measures how confidently it succeeds. A site can have broad but shallow coverage (high Coverage, low Confidence) or narrow but deep coverage (low Coverage, high Confidence) — both are distinct failure modes with different remediation strategies.
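Both metrics derive from the same per-prompt best scores. A sketch, with an assumed coverage threshold:

```python
def coverage_and_confidence(best_scores: list[float], threshold: float = 0.35):
    """best_scores[i] is the blended score of the best-matching chunk for prompt i.
    The threshold value here is illustrative, not the pipeline's actual setting."""
    covered = [s for s in best_scores if s >= threshold]
    coverage = len(covered) / len(best_scores) if best_scores else 0.0
    confidence = sum(covered) / len(covered) if covered else 0.0
    return coverage, confidence

# Broad but shallow: most prompts barely clear the threshold.
cov, conf = coverage_and_confidence([0.36, 0.37, 0.38, 0.40, 0.10])
```

This example illustrates the broad-but-shallow failure mode: four of five prompts clear the threshold (Coverage 0.8), but only barely (Confidence about 0.38), signalling thin content rather than missing content.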
Salemi & Zamani (SIGIR 2024) found that coverage-based retrieval evaluation achieves substantially higher Kendall’s τ correlation with actual downstream RAG performance than traditional binary relevance labels — supporting Coverage Score as the primary visibility metric.[23]
Phase 20 — Composite AI Visibility Score Weights
The final score (0–100) is multiplied by a Scope Coverage Multiplier of 0.70–1.00, determined by crawl depth.
Phase 21 scores Commerce Readiness across 11 sub-dimensions covering API accessibility, protocol maturity, checkout readiness, tool-call compatibility, and agentic interaction capability. The output is a five-level Agentic Maturity rating from “No Structured Commerce” through “Reservation and Negotiation Capable.” This dimension is currently weighted at 10% of the composite score, explicitly provisional as agentic commerce matures.
12. Continuity: Snapshots and Volatility Detection (Phase 22)
Phase 22 is the final hard-fail point. It saves a point-in-time record of all scores and compares against the previous snapshot to compute score deltas. Where the delta exceeds defined thresholds, volatility alerts are fired at four severity levels: low, medium, high, and critical.
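A sketch of the delta computation and severity mapping. The threshold values are illustrative assumptions; the pipeline's actual thresholds are not published:

```python
# Illustrative severity thresholds, checked from most to least severe.
SEVERITY_THRESHOLDS = [(30, "critical"), (20, "high"), (10, "medium"), (5, "low")]

def volatility_alerts(previous: dict[str, float], current: dict[str, float]):
    """Compare two score snapshots; emit one alert per metric whose
    absolute delta crosses a severity threshold."""
    alerts = []
    for metric, new in current.items():
        delta = new - previous.get(metric, new)  # metrics new to this snapshot produce no alert
        for limit, severity in SEVERITY_THRESHOLDS:
            if abs(delta) >= limit:
                alerts.append((metric, round(delta, 2), severity))
                break
    return alerts

alerts = volatility_alerts({"coverage": 80, "confidence": 70},
                           {"coverage": 58, "confidence": 66})
```

Here a 22-point coverage drop fires a high-severity alert, while the 4-point confidence movement stays below the lowest threshold and is treated as noise.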
The snapshot mechanism addresses a property of AI visibility that output-monitoring tools have identified but not solved: citation patterns are temporally unstable. Profound’s analysis found that 40–60% of cited sources change monthly.[24] Algaba et al. (NAACL 2025) found that LLMs reinforce citation bias toward recent sources — content freshness signals decay, reducing visibility over time.[25] Without a snapshot history, a score is a measurement; with it, the score becomes a trend.
13. What the Pipeline Does Not Measure
The framework measures retrieval-layer and structure-layer visibility. It does not measure:
- Generation-layer utilisation. Even successfully retrieved content may not be used — the “Lost in the Middle” effect,[9] internal prior dominance,[26] and citation post-rationalisation[27] all reduce the probability that retrieved content influences output. The pipeline measures a ceiling, not a floor.
- Training data visibility. Parametric knowledge in model weights is not modelled. The simulation targets real-time RAG retrieval, which is the dominant architecture for AI search systems as of 2026.
- Platform-specific retrieval architectures. Different AI platforms use different retrieval mechanisms. The hybrid lexical-semantic simulation approximates the general case; platform-specific calibration would require separate models tuned per platform.
- Authority and trust signals beyond on-page content. Backlink profiles, brand search volume, and domain reputation affect AI citation propensity but are not modelled in the retrieval simulation.
Key Takeaways
- No existing evaluation framework addresses retrieval simulation at the content/site level — RAGAS, ARES, and BEIR all evaluate systems, not content retrievability
- The rendering gap audit is the most practically significant finding for most sites: AI crawlers execute no JavaScript, making JS-dependent content permanently invisible to them
- Intent classification of the prompt universe (six categories) is the methodological contribution beyond prior synthetic query generation work — it makes intent coverage gaps visible, not just content gaps
- Hybrid Jaccard + semantic scoring reflects how production RAG systems actually work: pure lexical retrieval misses paraphrastic matches; pure semantic retrieval is expensive and can return plausible-but-wrong results against unoptimised content
- Coverage Score (breadth) and Confidence Index (quality) are distinct metrics because broad-shallow and narrow-deep coverage failure modes require different remediation
- The snapshot mechanism converts a measurement into a trend — essential given that 40–60% of AI-cited sources change monthly
Conclusion
The AI content visibility measurement landscape has a structural problem: output-monitoring tools measure what AI systems say; RAG evaluation frameworks measure how well retrieval systems perform. Neither measures whether a specific website’s content is structurally capable of being retrieved in the first place.
The 22-phase pipeline described in this paper fills that gap. By auditing the rendering and interaction barriers that hide content from AI crawlers, generating an intent-classified prompt universe that reflects realistic query behaviour, running a hybrid lexical-semantic retrieval simulation against that universe, and tracking scores over time via snapshots, the pipeline produces a site-level measure of AI retrievability that is reproducible, interpretable, and grounded in the published retrieval literature.
The methodology was developed while building RetrieveAI and is documented here as a research contribution for others working on AI content visibility measurement. Future work should empirically validate the hybrid scoring blend against actual citation outcomes at scale, extend the prompt universe to multi-turn conversational query patterns, and investigate whether platform-specific retrieval architectures require distinct simulation models.
References
- [1] Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024. arXiv:2309.15217
- [2] Saad-Falcon, J. et al. (2024). "ARES: An Automated Evaluation Framework for RAG Systems." NAACL 2024. arXiv:2311.09476
- [3] Thakur, N. et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models." NeurIPS 2021. arXiv:2104.08663
- [4] Penha, G. et al. (2023). "Improving Content Retrievability in Search with Controllable Query Generation." WWW ’23. arXiv:2303.11648
- [5] Kang, S. et al. (2025). "Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation." WSDM ’25. arXiv:2502.11181
- [6] Gao, T. et al. (2023). "Enabling LLMs to Generate Text with Citations." EMNLP 2023. arXiv:2305.14627
- [7] Ahrefs (Linehan & Guan). (2025). "AI Assistants Don’t Follow the SERPs." ahrefs.com/blog/ai-search-overlap
- [8] Aggarwal, P. et al. (2024). "GEO: Generative Engine Optimization." KDD ’24. arXiv:2311.09735
- [9] Liu, N. F. et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." TACL, 12, 157–173. arXiv:2307.03172
- [10] Thiel, H. & Kretschmer, M. (2024). "A Critical Analysis of Common Crawl." FAccT ’24. DOI:10.1145/3630106.3659033
- [11] Cloudflare (2025). "From Googlebot to GPTBot: Who’s Crawling Your Site in 2025." blog.cloudflare.com
- [12] Google Search Central (2024). "Understand JavaScript SEO Basics." developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics
- [13] Daydream (2025). "How OpenAI Crawls and Indexes Your Website." withdaydream.com — analysis of half a billion GPTBot fetches, zero JS execution observed.
- [14] Canel, F. (Microsoft). SMX Munich, March 2025. searchengineland.com/microsoft-bing-copilot-use-schema-for-its-llms-453455
- [15] AccuraCast (2025). "Does Schema Markup Increase Generative Search Visibility?" accuracast.com
- [16] Bonifacio, L. et al. (2022). "InPars: Unsupervised Dataset Generation for Information Retrieval." SIGIR ’22.
- [17] Dai, Z. et al. (2023). "Promptagator: Few-shot Dense Retrieval From 8 Examples." ICLR 2023. arXiv:2209.11755
- [18] Alberti, C. et al. (2019). "Synthetic QA Corpora Generation with Roundtrip Consistency." ACL 2019.
- [19] Pan, S. et al. (2024). "Unifying Large Language Models and Knowledge Graphs: A Roadmap." IEEE TKDE, 36, 3580–3599. arXiv:2306.08302
- [20] Wikidata (2025). 121M+ entities, 1.65B semantic triples. wikidata.org
- [21] ZipTie.dev (2025). "Google AI Overviews Source Selection." ziptie.dev/blog/google-ai-overviews-source-selection/
- [22] Karpukhin, V. et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP 2020. arXiv:2004.04906
- [23] Salemi, A. & Zamani, H. (2024). "Evaluating Retrieval Quality in RAG." SIGIR ’24. arXiv:2404.13781
- [24] EMARKETER (2026). "FAQ on GEO and AEO." emarketer.com — 40–60% of cited sources change monthly.
- [25] Algaba, A. et al. (2025). "LLMs Reflect Human Citation Patterns with Heightened Bias." NAACL 2025. arXiv:2405.15739
- [26] Wu, K. et al. (2024). "How Faithful Are RAG Models?" arXiv:2404.10198
- [27] Wallat, J. et al. (2024). "Correctness is not Faithfulness in RAG Attributions." arXiv:2412.18004