The AI search visibility market now exceeds $850 million and is projected to reach $7–20 billion by 2031–2034.[1] Yet despite hundreds of tools claiming to measure AI visibility, no published framework defines a weighted composite score combining heterogeneous quality signals — entity strength, retrieval simulation, and commerce readiness — into a single auditable number.
This paper introduces the AI Visibility Score (AVS), a three-dimensional weighted composite scoring model built from first principles during the development of RetrieveAI, an AI retrieval and visibility audit platform. The model assigns 45% weight to Entity Strength, 45% to Prompt Coverage, and 10% to Commerce Readiness, modulated by a Scope Coverage Multiplier (0.70–1.00) determined by crawl depth.
This paper covers:
- Why the existing measurement landscape lacks a composite scoring standard
- The theoretical grounding for each dimension and its weight
- The Scope Coverage Multiplier and why site coverage matters
- Grade thresholds and what scores mean in practice
- Comparison with seven major tools including Peec AI
- How the framework is implemented in RetrieveAI's 17-phase audit pipeline
1. The Measurement Gap
Gartner predicted in 2024 that traditional search engine volume would drop 25% by 2026 because of AI chatbots, a forecast later revised to a potential 50%+ decline by 2028.[2] Pew Research Center's gold-standard behavioral study of 900 US adults across 68,879 searches found that click-through rates fall to 8% when an AI summary appears, versus 15% without one (a 47% reduction), and that only 1% of users click sources cited within AI Overviews.[3]
The foundational academic work — Aggarwal et al.'s "GEO: Generative Engine Optimization" (KDD 2024, Princeton/IIT Delhi) — defines three visibility metrics for generative engines and benchmarks nine optimization strategies across 10,000 queries.[4] But GEO focuses on content-level optimization tactics, not on producing a composite quality score for brand-level AI visibility. The RAG evaluation literature (RAGAS, ARES, BEIR) provides retrieval quality metrics but addresses system evaluation, not site-level visibility scoring.[5][6][7] This gap is what the AVS framework addresses.
2. How Existing Tools Measure Visibility
An analysis of seven major platforms reveals a consistent pattern: all measure what AI systems say about brands; none simulate how AI systems retrieve brand content.
| Tool | Composite Score? | Key Approach | Retrieval Simulation? |
|---|---|---|---|
| Profound | Partial (AEO Content Score) | ML-powered predictive citation likelihood; 400M+ prompt database; 7+ platforms | ✗ |
| AthenaHQ | Partial (GEO Score, ACE) | Citation + sentiment + traffic composite; Shopify/GA4 revenue attribution | ✗ |
| Peec AI | No (three separate axes) | Visibility / Position / Sentiment, each 0–100; separates “used” vs “cited” sources; real UI scraping; 115+ languages | ✗ |
| Otterly AI | Partial (Brand Visibility Score) | 20+ on-page GEO audit factors; Gartner Cool Vendor 2025; 20,000+ users | ✗ |
| Semrush AI Toolkit | Yes (0–100) | 130M+ prompt database; SOV + sentiment + citation + prompt visibility composite | ✗ |
| Ahrefs Brand Radar | No (directional SOV only) | 260M+ search-backed prompts; AI Share of Voice; 6 platforms | ✗ |
| HubSpot AEO Grader | Yes (0–100) | 5-dimension composite: Sentiment, Recognition Depth, SOV, Market Positioning, Presence Quality | ✗ |
Peec AI's three-axis model (Visibility / Position / Sentiment) is the most parsimonious and methodologically honest among output-monitoring tools. It makes no claim to simulate retrieval mechanics, instead providing three clean, separately interpretable scores for what can actually be observed from LLM outputs. It also makes a methodologically important distinction that most tools conflate: separating sources the AI "used" (drew on for context) from sources it explicitly "cited" (named in the response) — a distinction with real significance for understanding where in the retrieval-to-generation pipeline a brand is present or absent.[8]
The existing measurement paradigm asks: "What did AI say about us?" The AI Visibility Score asks the upstream question: "Could AI reliably find, extract, and represent us in the first place?"
3. Theoretical Foundations for the Three Dimensions
3.1 Entity Strength — the knowledge graph foundation
Pan et al.'s 2024 roadmap "Unifying Large Language Models and Knowledge Graphs" demonstrated that KGs provide structured entity knowledge that directly mitigates LLM hallucination and improves factual accuracy.[9] The practitioner evidence is striking: an analysis by Wellows found that pages with 15+ recognised entities show 4.8× higher citation probability (r=0.76 correlation between entity KG density and citation rate).[10] ZipTie.dev's reverse-engineering of AI Overview source selection found that only 38% of AI-cited pages rank in Google's organic top 10 — but pages with strong E-E-A-T signals were cited 2.3× more than rank-#1 pages with weak authority.[11]
Taher et al.'s 2025 NER survey finds that 80% of top queries to a digital library contained at least one named entity.[12] Wikidata now indexes 121+ million entities with 1.65 billion semantic triples;[14] DBpedia covers 228+ million entities from 111 Wikipedia language editions.[15] These serve as the primary coverage reference for entity strength scoring — entities are scored not just on frequency but on resolvability to recognised knowledge graph nodes.
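To make the resolvability criterion concrete, here is a minimal sketch that checks whether an extracted entity mention resolves to a Wikidata node via the public wbsearchentities endpoint. The simple resolved-share metric below is an illustrative assumption, not RetrieveAI's published sub-scoring:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def resolves_to_wikidata(entity: str) -> bool:
    """True if the surface form matches at least one Wikidata entity."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities",  # public entity search endpoint
        "search": entity,
        "language": "en",
        "format": "json",
        "limit": 1,
    }, timeout=10)
    return bool(resp.json().get("search"))

def resolvability_rate(entities: list[str]) -> float:
    """Illustrative metric: share of extracted entities that resolve
    to a recognised knowledge graph node (assumption, not the actual
    Entity Strength formula)."""
    if not entities:
        return 0.0
    return sum(resolves_to_wikidata(e) for e in entities) / len(entities)
```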
3.2 Prompt Coverage — the retrieval simulation foundation
The GEO paper found that citing sources improved visibility by +115% for lower-ranked pages — but keyword stuffing reduced it by 10%.[4] Lewis et al.'s foundational 2020 RAG paper established that retrieval recall serves as a hard upper bound on generation performance: a system can only cite what it can retrieve.[16] The RAGAS framework formalises four retrieval quality metrics, with Context Recall measuring information completeness of retrieved context.[5]
The Prompt Coverage dimension operationalises these concepts at the site level: generating a prompt universe from the site's own content, computing vector embeddings for both prompts and content chunks, and measuring cosine similarity. The resulting Coverage Score (proportion of generated prompts finding a chunk with cosine similarity ≥0.7) and Confidence Index (average similarity quality of matched chunks) combine into the sub-score. The 0.7 threshold aligns with Google Vertex AI's default grounding threshold — the industry standard for production RAG systems.[17]
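A minimal sketch of these two sub-metrics, assuming pre-computed, unit-normalised embedding matrices for prompts and content chunks (prompt-universe generation and the embedding model are out of scope here):

```python
import numpy as np

def prompt_coverage(prompt_embs: np.ndarray,
                    chunk_embs: np.ndarray,
                    threshold: float = 0.7) -> tuple[float, float]:
    """Coverage Score and Confidence Index for one site.

    prompt_embs: (P, d) unit-normalised embeddings of generated prompts.
    chunk_embs:  (C, d) unit-normalised embeddings of content chunks.
    """
    # Cosine similarity reduces to a dot product for unit-normalised vectors.
    sims = prompt_embs @ chunk_embs.T          # (P, C) similarity matrix
    best = sims.max(axis=1)                    # best-matching chunk per prompt
    matched = best >= threshold
    coverage = matched.mean()                  # share of prompts with a match
    confidence = best[matched].mean() if matched.any() else 0.0
    return float(coverage), float(confidence)
```

The hard ≥0.7 cut-off is deliberately strict: a prompt whose best match scores 0.69 counts as uncovered, because the dimension measures retrievability at the production grounding threshold, not topical adjacency.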
3.3 Commerce Readiness — the agentic infrastructure foundation
Gartner projects AI agents will intermediate 90% of B2B purchasing by 2028, commanding over $15 trillion in purchases.[18] McKinsey projects $3–5 trillion in global agentic commerce sales by 2030.[19] Anthropic's Model Context Protocol, released November 2024, reached 10,000+ active servers and 97 million monthly SDK downloads by December 2025, establishing the technical standard for AI-to-commerce connectivity.[20] This dimension is weighted at 10% — explicitly provisional — reflecting that most AI visibility use cases in 2026 remain informational rather than transactional.
4. The AI Visibility Score: Formula and Weights
AVS = (Entity Strength × 0.45 + Prompt Coverage × 0.45 + Commerce Readiness × 0.10) × Scope Multiplier
Figure: AVS dimensional composition. Entity Strength 45% + Prompt Coverage 45% + Commerce Readiness 10%, multiplied by the Scope Multiplier (0.70–1.00); final score 0–100.
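As a minimal sketch, the composite reduces to a few lines of Python; the sub-scores in the worked example are invented for illustration:

```python
def ai_visibility_score(entity_strength: float,
                        prompt_coverage: float,
                        commerce_readiness: float,
                        scope_multiplier: float) -> float:
    """All three dimensions are 0-100 sub-scores; the multiplier is 0.70-1.00."""
    if not 0.70 <= scope_multiplier <= 1.00:
        raise ValueError("scope multiplier must be in [0.70, 1.00]")
    composite = (0.45 * entity_strength
                 + 0.45 * prompt_coverage
                 + 0.10 * commerce_readiness)
    return composite * scope_multiplier

# Worked example: strong entities (80), middling coverage (60), weak
# commerce (40), audited at related-pages scope (0.87):
# 0.45*80 + 0.45*60 + 0.10*40 = 36 + 27 + 4 = 67; 67 * 0.87 = 58.29
print(ai_visibility_score(80, 60, 40, 0.87))  # ≈ 58.29
```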
4.1 Why 45 / 45 / 10?
The weighting draws from two adjacent scoring traditions. The FICO credit score weights its top two dimensions at 35% + 30% because those factors empirically predict default far more strongly than the remaining three.[21] Core Web Vitals uses a "weakest link" composite where overall page status is limited by the worst-performing metric.[22]
Entity Strength and Prompt Coverage receive equal weighting because they are complementary rather than redundant. A site can have high entity strength (well-written, entity-rich content) but poor prompt coverage (content misaligned with how users actually query AI). The inverse is equally common: broad topic coverage with shallow entity depth. Moz's Domain Authority 2.0 transition from linear to neural weighting (combining 40+ signals, achieving r=0.9+ with competitor metrics) demonstrates that complementary signals are best combined through learned weights — the AVS's linear combination is a deliberate simplification for interpretability.[23]
Commerce Readiness is weighted at 10% because Pew's behavioral study confirms that AI more often replaces the need to visit a site than replaces e-commerce transactions: users ended their session after an AI summary in 26% of cases, versus 16% for traditional search.[3] For e-commerce verticals specifically, implementations of the AVS framework should re-weight Commerce Readiness toward 25%, as sketched below.
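One way to implement that re-weighting, under the assumption (not stated in the framework, which fixes only the 45/45/10 default) that the non-commerce remainder stays split equally between the two core dimensions:

```python
def vertical_weights(commerce_weight: float = 0.10) -> dict[str, float]:
    """Weight profile with the non-commerce remainder split equally
    between the two core dimensions (an assumption; the framework
    itself specifies only the 45/45/10 default)."""
    core = (1.0 - commerce_weight) / 2
    return {
        "entity_strength": core,
        "prompt_coverage": core,
        "commerce_readiness": commerce_weight,
    }

# Default informational profile: 0.45 / 0.45 / 0.10
# E-commerce profile: vertical_weights(0.25) -> 0.375 / 0.375 / 0.25
```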
5. The Scope Coverage Multiplier
A score derived from a single-page audit should not be directly comparable to one derived from a full-site audit of 500 pages. Cloudflare's analysis found that AI bots are less likely to reach third- and fourth-level pages,[25] and Thiel and Kretschmer's FAccT 2024 paper found that Common Crawl URL discovery applies a PageRank-like filter to what enters LLM training data, systematically favouring comprehensively linked domains.[24]
Scope Coverage Multiplier Values

| Multiplier | Audit scope | Signal availability |
|---|---|---|
| 0.70× | Single anchor page | Only anchor-page signals available; significant underestimation of domain-wide entity co-occurrence and cross-page retrieval coverage |
| 0.87× | 15–30 semantically related pages | Cross-page entity modeling partially available; global link graph absent |
| 0.94× | Full category crawl | Strong coverage within the vertical; cross-category authority paths limited |
| 1.00× | Complete domain audit | All signals at full weight; enables cross-page entity reinforcement and complete prompt universe generation |
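In implementation terms the multiplier is a simple lookup keyed on crawl scope. A minimal sketch; the enum and its names are illustrative, not RetrieveAI's actual identifiers:

```python
from enum import Enum

class AuditScope(Enum):
    """Crawl depth of the audit, mapped to its Scope Coverage Multiplier."""
    SINGLE_PAGE    = 0.70  # anchor page only
    RELATED_PAGES  = 0.87  # 15-30 semantically related pages
    CATEGORY_CRAWL = 0.94  # full category crawl
    FULL_DOMAIN    = 1.00  # complete domain audit

multiplier = AuditScope.RELATED_PAGES.value  # 0.87
```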
6. Grade Thresholds
Figure: AI Visibility Score grade thresholds.
7. Composite Scoring Precedents
Page and Brin's PageRank (Stanford, 1999) established that a quality score should combine signal quantity with signal quality in a single recursive formula.[31] The FICO credit score demonstrates that fixed-weight combination of heterogeneous signal types can produce a universally adopted composite — weighted by empirical default prediction, not theoretical elegance.[21]
Moz's Domain Authority 2.0 is the most directly analogous precedent: switching from linear to neural weighting, producing a 100-point logarithmic scale with r=0.9+ correlation across competing tools.[23] Kaplan and Norton's Balanced Scorecard (HBR, 1992), used by 53% of companies at peak adoption, established the conceptual foundation: no single metric gives a complete picture, and combining perspectives from different measurement domains into one composite is both defensible and actionable.[32]
8. Limitations and Future Work
The 45/45/10 weighting is theoretically motivated but not yet empirically validated against actual AI citation outcomes at scale. Wan, Wallace and Klein (ACL 2024) found that LLMs rely heavily on query relevance but largely ignore stylistic features — suggesting that specific Entity Strength sub-signals reflecting content clarity may deserve higher weight than richness signals.[33]
The AVS currently produces a platform-agnostic composite, but iPullRank's technical teardowns confirm that Bing Copilot uses hybrid retrieval (BM25 + vector) while Perplexity emphasises real-time web search and Google AI Mode emphasises knowledge graph grounding.[34] Platform-specific score decompositions would require separate retrieval simulation models calibrated per platform.
Algaba et al. (NAACL 2025) found that LLMs reinforce citation bias toward recent, highly cited sources, with only 8.78% of generated references matching ground truth.[35] This suggests the AVS score is not static — content freshness decay and volatility tracking should be incorporated as a fourth component. RetrieveAI's snapshot and volatility alert pipeline captures this dimension in practice but it is not yet included in the composite formula.
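Purely as an illustration of what such a fourth component could look like, here is a hypothetical freshness decay term with an assumed half-life and floor; nothing in it is part of the published formula:

```python
def freshness_multiplier(days_since_update: float,
                         half_life_days: float = 180.0,
                         floor: float = 0.80) -> float:
    """Hypothetical decay term (future work, not in the AVS composite):
    halves every `half_life_days`, never dropping below `floor` so
    stale-but-canonical content is not zeroed out."""
    decay = 0.5 ** (days_since_update / half_life_days)
    return max(floor, decay)
```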
Key Takeaways
- No existing tool produces a weighted composite AI visibility score grounded in retrieval mechanics — all seven major platforms measure LLM outputs, not input-side retrievability
- Peec AI's three-axis model (Visibility / Position / Sentiment) is the most methodologically honest among output-monitoring tools, making the important distinction between "used" and "cited" sources
- Entity Strength and Prompt Coverage receive equal 45% weights because they are complementary — a site can have high entity quality with poor query coverage, or broad coverage with shallow entity depth
- The Scope Multiplier (0.70–1.00) corrects a real bias: single-page audits systematically underestimate domain-wide AI visibility
- The 10% Commerce Readiness weight is explicitly provisional — as agentic commerce matures, this should grow toward 25% for e-commerce verticals
- Established precedents (PageRank, FICO, Domain Authority 2.0, Core Web Vitals, Balanced Scorecard) validate multi-dimensional composite scoring; the AVS applies this tradition to AI visibility for the first time
Conclusion
The AI visibility measurement landscape is crowded with tools answering the wrong question. "How often does ChatGPT mention our brand?" is an important data point — Peec AI, Profound, Semrush and others answer it well. But it is a consequence, not a cause. The more actionable question is: "Is our content structured so that AI retrieval systems can find, extract, and represent it across the likely query space for our domain?"
The AI Visibility Score framework — built through the implementation of RetrieveAI — proposes the first published answer to that second question: a three-dimensional weighted composite (Entity Strength 45%, Prompt Coverage 45%, Commerce Readiness 10%) modulated by a Scope Coverage Multiplier, producing a single auditable 0–100 number with a clear lineage to PageRank, FICO, Domain Authority, and the Balanced Scorecard.
The weights are theoretically motivated but empirically provisional. The framework's value is not that it produces the definitive answer, but that it produces a specific, falsifiable, reproducible answer — one that future research can validate, challenge, and refine. That is what distinguishes a framework from a dashboard.
References
- [1] MarketIntelo (2025). GEO market: $848M to $19.8B by 2034. marketintelo.com/report/generative-engine-optimization-geo-market
- [2] Gartner (2024). "Search Engine Volume Will Drop 25% by 2026." gartner.com/en/newsroom/press-releases/2024-02-19
- [3] Pew Research Center (2025). "Google users are less likely to click on links when an AI summary appears." July 22, 2025. pewresearch.org
- [4] Aggarwal, P. et al. (2024). "GEO: Generative Engine Optimization." KDD '24, ACM. arXiv:2311.09735
- [5] Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024. arXiv:2309.15217
- [6] Saad-Falcon, J. et al. (2024). "ARES: An Automated Evaluation Framework for RAG Systems." NAACL 2024. arXiv:2311.09476
- [7] Thakur, N. et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models." NeurIPS 2021. arXiv:2104.08663
- [8] Peec AI Documentation. "Intro to Peec AI." docs.peec.ai/intro-to-peec-ai
- [9] Pan, S. et al. (2024). "Unifying Large Language Models and Knowledge Graphs: A Roadmap." IEEE TKDE, 36, 3580–3599. arXiv:2306.08302
- [10] Wellows (2025). "Google AI Overviews Ranking Factors: 2026 Guide." wellows.com/blog/google-ai-overviews-ranking-factors/
- [11] ZipTie.dev (2025). "Google AI Overviews Source Selection." ziptie.dev/blog/google-ai-overviews-source-selection/
- [12] Taher, H. et al. (2025). "Recent Advances in Named Entity Recognition." arXiv:2401.10825v3
- [14] Wikidata (2025). 121M+ entities, 1.65B triples. wikidata.org
- [15] Lehmann, J. et al. (2015). "DBpedia — A Large-scale, Multilingual Knowledge Base." Semantic Web Journal, 6(2), 167–195.
- [16] Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 33. arXiv:2005.11401
- [17] Google Vertex AI Documentation. "Dynamic retrieval." Default grounding threshold: 0.7. cloud.google.com/vertex-ai
- [18] Gartner (2025). "Top Predictions for IT Organizations in 2026 and Beyond." October 21, 2025.
- [19] McKinsey (2025). "The Agentic Commerce Opportunity." October 17, 2025. mckinsey.com
- [20] Anthropic (2025). "Donating MCP and establishing the Agentic AI Foundation." December 9, 2025.
- [21] Fair Isaac Corporation. "What's in My FICO Scores." myfico.com/credit-education/whats-in-your-credit-score
- [22] McQuade, B. & Pollard, B. (2025). "How Core Web Vitals Thresholds Were Defined." web.dev/articles/defining-core-web-vitals-thresholds
- [23] Moz (2019). "Domain Authority 2.0." Technical whitepaper. moz.com
- [24] Thiel, H. & Kretschmer, M. (2024). "A Critical Analysis of Common Crawl." FAccT '24. DOI:10.1145/3630106.3659033
- [25] Cloudflare (2025). "From Googlebot to GPTBot: Who's Crawling Your Site in 2025." blog.cloudflare.com
- [26] Canel, F. (Microsoft). SMX Munich, March 2025. searchengineland.com/microsoft-bing-copilot-use-schema-for-its-llms-453455
- [27] AccuraCast (2025). "Does Schema Markup Increase Generative Search Visibility?" accuracast.com
- [28] University of Mannheim / WDC (2024). 51.25% of 2.4B pages have structured data. uni-mannheim.de
- [29] Salemi, A. & Zamani, H. (2024). "Evaluating Retrieval Quality in RAG." SIGIR '24. arXiv:2404.13781
- [30] Gao, T. et al. (2023). "Enabling LLMs to Generate Text with Citations." EMNLP 2023. arXiv:2305.14627
- [31] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). "The PageRank Citation Ranking." Stanford. ilpubs.stanford.edu:8090/422/
- [32] Kaplan, R.S. & Norton, D.P. (1992). "The Balanced Scorecard." Harvard Business Review, 70(1), 71–79.
- [33] Wan, A., Wallace, E., & Klein, D. (2024). "What Evidence Do Language Models Find Convincing?" ACL 2024. DOI:10.18653/v1/2024.acl-long.403
- [34] iPullRank (2025). "AI Search Architecture Deep Dive." ipullrank.com/ai-search-manual/search-architecture
- [35] Algaba, A. et al. (2025). "LLMs Reflect Human Citation Patterns with Heightened Bias." NAACL 2025. arXiv:2405.15739