Research · March 2, 2026 · 20 min read

Towards a Quantitative AI Visibility Score: A Dimensional Framework

The first weighted composite scoring model for AI search visibility — integrating entity strength, retrieval simulation, and commerce readiness into a single, auditable 0–100 score.

Author

Akshay Dahiya

Growth & MarTech Specialist

The AI search visibility market now exceeds $850 million and is projected to reach $7–20 billion by 2031–2034.[1] Yet despite hundreds of tools claiming to measure AI visibility, no published framework defines a weighted composite score combining heterogeneous quality signals — entity strength, retrieval simulation, and commerce readiness — into a single auditable number.

This paper introduces the AI Visibility Score (AVS), a three-dimensional weighted composite scoring model built from first principles during the development of RetrieveAI, an AI retrieval and visibility audit platform. The model assigns 45% weight to Entity Strength, 45% to Prompt Coverage, and 10% to Commerce Readiness, modulated by a Scope Coverage Multiplier (0.70–1.00) determined by crawl depth.

This paper covers:
  • Why the existing measurement landscape lacks a composite scoring standard
  • The theoretical grounding for each dimension and its weight
  • The Scope Coverage Multiplier and why site coverage matters
  • Grade thresholds and what scores mean in practice
  • Comparison with seven major tools including Peec AI
  • How the framework is implemented in RetrieveAI's 17-phase audit pipeline

1. The Measurement Gap

Gartner predicted in 2024 that traditional search engine volume will drop 25% by 2026 due to AI chatbots — later revised to potentially 50%+ by 2028.[2] Pew Research Center's gold-standard behavioral study of 900 US adults across 68,879 searches found that click-through rates with AI summaries are 8% versus 15% without — a 47% reduction — and only 1% of users click sources cited within AI Overviews.[3]

  • 80% of companies don't track AI brand mentions (MaximusLabs, 2025)
  • 57% of marketers are still figuring out AI visibility measurement (BrightEdge, 2025)
  • 80% of LLM-cited URLs don't rank in Google's top 100 (Position Digital, 2026)
  • 11% of domains are cited by both ChatGPT and Perplexity (The Digital Bloom, 2025)

The foundational academic work — Aggarwal et al.'s "GEO: Generative Engine Optimization" (KDD 2024, Princeton/IIT Delhi) — defines three visibility metrics for generative engines and benchmarks nine optimization strategies across 10,000 queries.[4] But GEO focuses on content-level optimization tactics, not on producing a composite quality score for brand-level AI visibility. The RAG evaluation literature (RAGAS, ARES, BEIR) provides retrieval quality metrics but addresses system evaluation, not site-level visibility scoring.[5][6][7] This gap is what the AVS framework addresses.

2. How Existing Tools Measure Visibility

An analysis of seven major platforms reveals a consistent pattern: all measure what AI systems say about brands; none simulate how AI systems retrieve brand content.

  • Profound: partial composite (AEO Content Score). ML-powered predictive citation likelihood; 400M+ prompt database; 7+ platforms
  • AthenaHQ: partial composite (GEO Score / ACE). Citation + sentiment + traffic composite; Shopify/GA4 revenue attribution
  • Peec AI: three-axis model (Visibility / Position / Sentiment, each 0–100). Separates "used" vs "cited" sources; real UI scraping; 115+ languages
  • Otterly AI: partial composite (Brand Visibility Score). 20+ on-page GEO audit factors; Gartner Cool Vendor 2025; 20,000+ users
  • Semrush AI Toolkit: composite (0–100). 130M+ prompt database; SOV + sentiment + citation + prompt visibility
  • Ahrefs Brand Radar: directional SOV only. 260M+ search-backed prompts; AI Share of Voice; 6 platforms
  • HubSpot AEO Grader: composite (0–100). Five dimensions: Sentiment, Recognition Depth, SOV, Market Positioning, Presence Quality

None of the seven performs retrieval simulation.

Peec AI's three-axis model (Visibility / Position / Sentiment) is the most parsimonious and methodologically honest among output-monitoring tools. It makes no claim to simulate retrieval mechanics, instead providing three clean, separately interpretable scores for what can actually be observed from LLM outputs. It also makes a methodologically important distinction that most tools conflate: separating sources the AI "used" (drew on for context) from sources it explicitly "cited" (named in the response) — a distinction with real significance for understanding where in the retrieval-to-generation pipeline a brand is present or absent.[8]

The existing measurement paradigm asks: "What did AI say about us?" The AI Visibility Score asks the upstream question: "Could AI reliably find, extract, and represent us in the first place?"

3. Theoretical Foundations for the Three Dimensions

3.1 Entity Strength — the knowledge graph foundation

Pan et al.'s 2024 roadmap "Unifying Large Language Models and Knowledge Graphs" demonstrated that KGs provide structured entity knowledge that directly mitigates LLM hallucination and improves factual accuracy.[9] The practitioner evidence is striking: an analysis by Wellows found that pages with 15+ recognised entities show 4.8× higher citation probability (r=0.76 correlation between entity KG density and citation rate).[10] ZipTie.dev's reverse-engineering of AI Overview source selection found that only 38% of AI-cited pages rank in Google's organic top 10 — but pages with strong E-E-A-T signals were cited 2.3× more than rank-#1 pages with weak authority.[11]

Taher et al.'s 2025 NER survey finds that 80% of top queries to a digital library contained at least one named entity.[12] Wikidata now indexes 121+ million entities with 1.65 billion semantic triples;[14] DBpedia covers 228+ million entities from 111 Wikipedia language editions.[15] These serve as the primary coverage reference for entity strength scoring — entities are scored not just on frequency but on resolvability to recognised knowledge graph nodes.
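The resolvability-over-frequency principle can be illustrated with a toy sub-score. This is a sketch for exposition, not RetrieveAI's actual scoring code: the 60/40 split between resolvability and density is an assumption, and the 15-entity saturation point is borrowed from the Wellows citation threshold above.

```python
def entity_strength_score(entities):
    """Toy Entity Strength sub-score on a 0-100 scale.

    `entities` is a list of dicts, one per recognised entity on a page:
      - "kg_resolved": True if the entity resolves to a knowledge-graph
        node (e.g. a Wikidata QID or DBpedia resource)

    Resolvability is weighted above raw density, reflecting that entities
    are scored on resolution to recognised KG nodes, not frequency alone.
    """
    if not entities:
        return 0.0
    resolvability = sum(1 for e in entities if e["kg_resolved"]) / len(entities)
    density = min(len(entities) / 15, 1.0)  # saturates at the 15-entity mark
    return round(100 * (0.6 * resolvability + 0.4 * density), 1)
```

For example, a page with 12 resolved and 3 unresolved entities scores 0.6 × 0.8 + 0.4 × 1.0 = 0.88, i.e. 88.0.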

3.2 Prompt Coverage — the retrieval simulation foundation

The GEO paper found that citing sources improved visibility by +115% for lower-ranked pages — but keyword stuffing reduced it by 10%.[4] Lewis et al.'s foundational 2020 RAG paper established that retrieval recall serves as a hard upper bound on generation performance: a system can only cite what it can retrieve.[16] The RAGAS framework formalises four retrieval quality metrics, with Context Recall measuring information completeness of retrieved context.[5]

The Prompt Coverage dimension operationalises these concepts at the site level: generating a prompt universe from the site's own content, computing vector embeddings for both prompts and content chunks, and measuring cosine similarity. The resulting Coverage Score (proportion of generated prompts finding a chunk with cosine similarity ≥0.7) and Confidence Index (average similarity quality of matched chunks) combine into the sub-score. The 0.7 threshold aligns with Google Vertex AI's default grounding threshold — the industry standard for production RAG systems.[17]
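A minimal, dependency-free sketch of that computation (a real implementation would use an embedding model and approximate nearest-neighbour search; here the prompt and chunk embeddings are assumed to be precomputed):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prompt_coverage(prompt_vecs, chunk_vecs, threshold=0.7):
    """Coverage Score and Confidence Index over a prompt universe.

    - Coverage Score: share of prompts whose best-matching content chunk
      reaches cosine similarity >= threshold (0.7 by default, matching
      Vertex AI's default grounding threshold).
    - Confidence Index: mean similarity of those matched chunks.
    """
    best = [max(cosine(p, c) for c in chunk_vecs) for p in prompt_vecs]
    matched = [s for s in best if s >= threshold]
    coverage = len(matched) / len(best) if best else 0.0
    confidence = sum(matched) / len(matched) if matched else 0.0
    return coverage, confidence
```

Note that a prompt counts toward coverage only through its single best chunk, so duplicating near-identical chunks does not inflate the score.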

3.3 Commerce Readiness — the agentic infrastructure foundation

Gartner projects AI agents will intermediate 90% of B2B purchasing by 2028, commanding over $15 trillion in purchases.[18] McKinsey projects $3–5 trillion in global agentic commerce sales by 2030.[19] Anthropic's Model Context Protocol, released November 2024, reached 10,000+ active servers and 97 million monthly SDK downloads by December 2025, establishing the technical standard for AI-to-commerce connectivity.[20] This dimension is weighted at 10% — explicitly provisional — reflecting that most AI visibility use cases in 2026 remain informational rather than transactional.

4. The AI Visibility Score: Formula and Weights

AVS = (Entity Strength × 0.45 + Prompt Coverage × 0.45 + Commerce Readiness × 0.10) × Scope Multiplier

AVS Dimensional Composition

  • Entity Strength (45%): entity density, KG resolvability, co-occurrence
  • Prompt Coverage (45%): vector retrieval simulation, coverage score, confidence index
  • Commerce Readiness (10%): MCP compatibility, API exposure, offer schema, inventory state

× Scope Multiplier (0.70–1.00) · Final score 0–100
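The composite transcribes directly into code. A sketch, assuming the three sub-scores have already been computed on a 0–100 scale:

```python
def ai_visibility_score(entity_strength, prompt_coverage,
                        commerce_readiness, scope_multiplier=1.00):
    """AVS = (ES x 0.45 + PC x 0.45 + CR x 0.10) x scope multiplier.

    Each dimension is a 0-100 sub-score; scope_multiplier reflects crawl
    depth (0.70 for a single page up to 1.00 for a full site), so the
    final score is also on a 0-100 scale.
    """
    if not 0.70 <= scope_multiplier <= 1.00:
        raise ValueError("scope multiplier must lie in [0.70, 1.00]")
    composite = (0.45 * entity_strength
                 + 0.45 * prompt_coverage
                 + 0.10 * commerce_readiness)
    return round(composite * scope_multiplier, 1)
```

For example, a full-site audit scoring 80 / 70 / 40 on the three dimensions yields 0.45 × 80 + 0.45 × 70 + 0.10 × 40 = 71.5.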

4.1 Why 45 / 45 / 10?

The weighting draws from two adjacent scoring traditions. The FICO credit score weights its top two dimensions at 35% + 30% because those factors empirically predict default far more strongly than the remaining three.[21] Core Web Vitals uses a "weakest link" composite where overall page status is limited by the worst-performing metric.[22]

Entity Strength and Prompt Coverage receive equal weighting because they are complementary rather than redundant. A site can have high entity strength (well-written, entity-rich content) but poor prompt coverage (content misaligned with how users actually query AI). The inverse is equally common: broad topic coverage with shallow entity depth. Moz's Domain Authority 2.0 transition from linear to neural weighting (combining 40+ signals, achieving r=0.9+ with competitor metrics) demonstrates that complementary signals are best combined through learned weights — the AVS's linear combination is a deliberate simplification for interpretability.[23]

Commerce Readiness is weighted at 10% because Pew's behavioral study confirms AI is more often replacing the need to visit a site than replacing e-commerce transactions — users "ended sessions after AI" at 26% vs 16% for traditional search.[3] For e-commerce verticals specifically, implementations of the AVS framework should re-weight Commerce Readiness toward 25%.
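One way to express that vertical re-weighting in code. The framework specifies only the Commerce Readiness target; splitting the remaining weight equally between Entity Strength and Prompt Coverage is an assumption made here for illustration:

```python
def vertical_weights(commerce_weight=0.10):
    """Return (entity, prompt, commerce) weights summing to 1.0.

    Defaults to the framework's 0.45 / 0.45 / 0.10. For e-commerce
    verticals, pass commerce_weight=0.25. Splitting the remaining mass
    equally between Entity Strength and Prompt Coverage is an assumption;
    the framework does not specify how the other weights should shrink.
    """
    if not 0.0 <= commerce_weight < 1.0:
        raise ValueError("commerce weight must lie in [0, 1)")
    remainder = (1.0 - commerce_weight) / 2
    return (remainder, remainder, commerce_weight)
```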

5. The Scope Coverage Multiplier

A score derived from a single-page audit should not be directly comparable to a full-site audit of 500 pages. Cloudflare's analysis found that AI bots are less likely to reach 3rd- and 4th-level pages,[25] and Thiel & Kretschmer's FAccT 2024 paper found that Common Crawl URL discovery applies a PageRank-like filter to what enters LLM training data — systematically favouring comprehensively linked domains.[24]

Scope Coverage Multiplier Values

  • Single page (0.70×): only anchor-page signals available. Significant underestimation of domain-wide entity co-occurrence and cross-page retrieval coverage.
  • Context cluster (0.87×): 15–30 semantically related pages. Cross-page entity modeling partially available; global link graph absent.
  • Category (0.94×): full category crawl. Strong coverage within the vertical; cross-category authority paths limited.
  • Full site (1.00×): complete domain audit. All signals at full weight. Enables cross-page entity reinforcement and complete prompt universe generation.
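In code the table above reduces to a lookup; the scope identifiers below are hypothetical names chosen for illustration, not RetrieveAI's internal labels:

```python
# Scope Coverage Multiplier per audit depth (see table above)
SCOPE_MULTIPLIERS = {
    "single_page": 0.70,      # anchor page only
    "context_cluster": 0.87,  # 15-30 semantically related pages
    "category": 0.94,         # full category crawl
    "full_site": 1.00,        # complete domain audit
}

def scope_multiplier(scope):
    """Return the Scope Coverage Multiplier for a given audit depth."""
    try:
        return SCOPE_MULTIPLIERS[scope]
    except KeyError:
        raise ValueError(f"unknown audit scope: {scope!r}") from None
```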

6. Grade Thresholds

AI Visibility Score — Grade Thresholds

A · 85–100 Excellent AI visibility. Strong entity density, well-resolved KG nodes, prompt coverage across the domain's query space, structured data complete. Content likely to appear across multiple AI platforms without further optimisation.
B · 70–84 Good AI visibility. One or two dimension gaps — typically either entity co-occurrence weak or prompt coverage patchy on long-tail queries. Targeted optimisation would yield measurable gains.
C · 50–69 Adequate visibility. May receive occasional AI citation but not reliably retrieved. Entity structure present but schema quality inconsistent. Retrieval simulation shows significant gaps outside core topics.
D · 25–49 Poor AI visibility. Content may rank in traditional search but is structurally under-represented for AI retrieval. JS-dependent content invisible to non-headless crawlers. Retrieval simulation coverage below 50%.
F · 0–24 Effectively AI-invisible. No structured data, sparse entity signals, heavy JS dependency, minimal content chunk semantic coverage. Traditional SEO tactics will not improve this score.
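The grade bands translate mechanically into a threshold cascade:

```python
def avs_grade(score):
    """Map a 0-100 AI Visibility Score to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("AVS must lie in [0, 100]")
    if score >= 85:
        return "A"
    if score >= 70:
        return "B"
    if score >= 50:
        return "C"
    if score >= 25:
        return "D"
    return "F"
```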

7. Composite Scoring Precedents

Page and Brin's PageRank (Stanford, 1999) established that a quality score should combine signal quantity with signal quality in a single recursive formula.[31] The FICO credit score demonstrates that fixed-weight combination of heterogeneous signal types can produce a universally adopted composite — weighted by empirical default prediction, not theoretical elegance.[21]

Moz's Domain Authority 2.0 is the most directly analogous precedent: switching from linear to neural weighting, producing a 100-point logarithmic scale with r=0.9+ correlation across competing tools.[23] Kaplan and Norton's Balanced Scorecard (HBR, 1992), used by 53% of companies at peak adoption, established the conceptual foundation: no single metric gives a complete picture, and combining perspectives from different measurement domains into one composite is both defensible and actionable.[32]

8. Limitations and Future Work

The 45/45/10 weighting is theoretically motivated but not yet empirically validated against actual AI citation outcomes at scale. Wan, Wallace and Klein (ACL 2024) found that LLMs rely heavily on query relevance but largely ignore stylistic features — suggesting that specific Entity Strength sub-signals reflecting content clarity may deserve higher weight than richness signals.[33]

The AVS currently produces a platform-agnostic composite, but iPullRank's technical teardowns confirm that Bing Copilot uses hybrid retrieval (BM25 + vector) while Perplexity emphasises real-time web search and Google AI Mode emphasises knowledge graph grounding.[34] Platform-specific score decompositions would require separate retrieval simulation models calibrated per platform.

Algaba et al. (NAACL 2025) found that LLMs reinforce citation bias toward recent, highly cited sources, with only 8.78% of generated references matching ground truth.[35] This suggests the AVS score is not static — content freshness decay and volatility tracking should be incorporated as a fourth component. RetrieveAI's snapshot and volatility alert pipeline captures this dimension in practice but it is not yet included in the composite formula.

Key Takeaways
  • No existing tool produces a weighted composite AI visibility score grounded in retrieval mechanics — all seven major platforms measure LLM outputs, not input-side retrievability
  • Peec AI's three-axis model (Visibility / Position / Sentiment) is the most methodologically honest among output-monitoring tools, making the important distinction between "used" and "cited" sources
  • Entity Strength and Prompt Coverage receive equal 45% weights because they are complementary — a site can have high entity quality with poor query coverage, or broad coverage with shallow entity depth
  • The Scope Multiplier (0.70–1.00) corrects a real bias: single-page audits systematically underestimate domain-wide AI visibility
  • The 10% Commerce Readiness weight is explicitly provisional — as agentic commerce matures, this should grow toward 25% for e-commerce verticals
  • Established precedents (PageRank, FICO, Domain Authority 2.0, Core Web Vitals, Balanced Scorecard) validate multi-dimensional composite scoring; the AVS applies this tradition to AI visibility for the first time

Conclusion

The AI visibility measurement landscape is crowded with tools answering the wrong question. "How often does ChatGPT mention our brand?" is an important data point — Peec AI, Profound, Semrush and others answer it well. But it is a consequence, not a cause. The more actionable question is: "Is our content structured so that AI retrieval systems can find, extract, and represent it across the likely query space for our domain?"

The AI Visibility Score framework — built through the implementation of RetrieveAI — proposes the first published answer to that second question: a three-dimensional weighted composite (Entity Strength 45%, Prompt Coverage 45%, Commerce Readiness 10%) modulated by a Scope Coverage Multiplier, producing a single auditable 0–100 number with a clear lineage to PageRank, FICO, Domain Authority, and the Balanced Scorecard.

The weights are theoretically motivated but empirically provisional. The framework's value is not that it produces the definitive answer, but that it produces a specific, falsifiable, reproducible answer — one that future research can validate, challenge, and refine. That is what distinguishes a framework from a dashboard.

References

  [1] MarketIntelo (2025). GEO market: $848M to $19.8B by 2034. marketintelo.com/report/generative-engine-optimization-geo-market
  [2] Gartner (2024). "Search Engine Volume Will Drop 25% by 2026." gartner.com/en/newsroom/press-releases/2024-02-19
  [3] Pew Research Center (2025). "Google users are less likely to click on links when an AI summary appears." July 22, 2025. pewresearch.org
  [4] Aggarwal, P. et al. (2024). "GEO: Generative Engine Optimization." KDD '24, ACM. arXiv:2311.09735
  [5] Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024. arXiv:2309.15217
  [6] Saad-Falcon, J. et al. (2024). "ARES: An Automated Evaluation Framework for RAG Systems." NAACL 2024. arXiv:2311.09476
  [7] Thakur, N. et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models." NeurIPS 2021. arXiv:2104.08663
  [8] Peec AI Documentation. "Intro to Peec AI." docs.peec.ai/intro-to-peec-ai
  [9] Pan, S. et al. (2024). "Unifying Large Language Models and Knowledge Graphs: A Roadmap." IEEE TKDE, 36, 3580–3599. arXiv:2306.08302
  [10] Wellows (2025). "Google AI Overviews Ranking Factors: 2026 Guide." wellows.com/blog/google-ai-overviews-ranking-factors/
  [11] ZipTie.dev (2025). "Google AI Overviews Source Selection." ziptie.dev/blog/google-ai-overviews-source-selection/
  [12] Taher, H. et al. (2025). "Recent Advances in Named Entity Recognition." arXiv:2401.10825v3
  [14] Wikidata (2025). 121M+ entities, 1.65B triples. wikidata.org
  [15] Lehmann, J. et al. (2015). "DBpedia — A Large-scale, Multilingual Knowledge Base." Semantic Web Journal, 6(2), 167–195.
  [16] Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 33. arXiv:2005.11401
  [17] Google Vertex AI Documentation. "Dynamic retrieval." Default grounding threshold: 0.7. cloud.google.com/vertex-ai
  [18] Gartner (2025). "Top Predictions for IT Organizations in 2026 and Beyond." October 21, 2025.
  [19] McKinsey (2025). "The Agentic Commerce Opportunity." October 17, 2025. mckinsey.com
  [20] Anthropic (2025). "Donating MCP and establishing the Agentic AI Foundation." December 9, 2025.
  [21] Fair Isaac Corporation. "What's in My FICO Scores." myfico.com/credit-education/whats-in-your-credit-score
  [22] McQuade, B. & Pollard, B. (2025). "How Core Web Vitals Thresholds Were Defined." web.dev/articles/defining-core-web-vitals-thresholds
  [23] Moz (2019). "Domain Authority 2.0." Technical whitepaper. moz.com
  [24] Thiel, H. & Kretschmer, M. (2024). "A Critical Analysis of Common Crawl." FAccT '24. DOI:10.1145/3630106.3659033
  [25] Cloudflare (2025). "From Googlebot to GPTBot: Who's Crawling Your Site in 2025." blog.cloudflare.com
  [26] Canel, F. (Microsoft). SMX Munich, March 2025. searchengineland.com/microsoft-bing-copilot-use-schema-for-its-llms-453455
  [27] AccuraCast (2025). "Does Schema Markup Increase Generative Search Visibility?" accuracast.com
  [28] University of Mannheim / WDC (2024). 51.25% of 2.4B pages have structured data. uni-mannheim.de
  [29] Salemi, A. & Zamani, H. (2024). "Evaluating Retrieval Quality in RAG." SIGIR '24. arXiv:2404.13781
  [30] Gao, T. et al. (2023). "Enabling LLMs to Generate Text with Citations." EMNLP 2023. arXiv:2305.14627
  [31] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). "The PageRank Citation Ranking." Stanford. ilpubs.stanford.edu:8090/422/
  [32] Kaplan, R.S. & Norton, D.P. (1992). "The Balanced Scorecard." Harvard Business Review, 70(1), 71–79.
  [33] Wan, A., Wallace, E., & Klein, D. (2024). "What Evidence Do Language Models Find Convincing?" ACL 2024. DOI:10.18653/v1/2024.acl-long.403
  [34] iPullRank (2025). "AI Search Architecture Deep Dive." ipullrank.com/ai-search-manual/search-architecture
  [35] Algaba, A. et al. (2025). "LLMs Reflect Human Citation Patterns with Heightened Bias." NAACL 2025. arXiv:2405.15739
Author: Akshay Dahiya · Growth & MarTech Specialist

Digital marketing professional with 7+ years of experience in SEO, analytics, and marketing automation. Currently building RetrieveAI, MarAI, and RankScan tools that solve real problems I've run into working in growth and search.