The AI search visibility market now exceeds $850 million and is projected to reach $7–20 billion by 2031–2034.[1] Yet despite hundreds of tools claiming to measure AI visibility, no published framework defines a weighted composite score combining heterogeneous quality signals — entity strength, retrieval simulation, and commerce readiness — into a single auditable number.
This paper introduces the AI Visibility Score (AVS), a three-dimensional weighted composite scoring model built from first principles during the development of RetrieveAI, an AI retrieval and visibility audit platform. The model assigns 45% weight to Entity Strength, 45% to Prompt Coverage, and 10% to Commerce Readiness, modulated by a Scope Coverage Multiplier (0.70–1.00) determined by crawl depth.
This paper covers:
- Why the existing measurement landscape lacks a composite scoring standard
- The theoretical grounding for each dimension and its weight
- The Scope Coverage Multiplier and why site coverage matters
- Grade thresholds and what scores mean in practice
- Comparison with seven major tools including Peec AI
- How the framework is implemented in RetrieveAI's 17-phase audit pipeline
1. The Measurement Gap
Gartner predicted in 2024 that traditional search engine volume would drop 25% by 2026 because of AI chatbots, a forecast later revised to a potential 50%+ decline by 2028.[2] Pew Research Center's gold-standard behavioral study of 900 US adults across 68,879 searches found that click-through rates fall to 8% when an AI summary appears, versus 15% without one (a 47% reduction), and that only 1% of users click sources cited within AI Overviews.[3]
The foundational academic work — Aggarwal et al.'s "GEO: Generative Engine Optimization" (KDD 2024, Princeton/IIT Delhi) — defines three visibility metrics for generative engines and benchmarks nine optimization strategies across 10,000 queries.[4] But GEO focuses on content-level optimization tactics, not on producing a composite quality score for brand-level AI visibility. The RAG evaluation literature (RAGAS, ARES, BEIR) provides retrieval quality metrics but addresses system evaluation, not site-level visibility scoring.[5][6][7] This gap is what the AVS framework addresses.
2. How Existing Tools Measure Visibility
An analysis of seven major platforms reveals a consistent pattern: all measure what AI systems say about brands; none simulate how AI systems retrieve brand content.
| Tool | Composite Score? | Key Approach | Retrieval Simulation? |
|---|---|---|---|
| Profound | Partial (AEO Content Score) | ML-powered predictive citation likelihood; 400M+ prompt database; 7+ platforms | ✗ |
| AthenaHQ | Partial (GEO Score, ACE) | Citation + sentiment + traffic composite; Shopify/GA4 revenue attribution | ✗ |
| Peec AI | No (three separate axes) | Visibility / Position / Sentiment, each 0–100; separates “used” vs “cited” sources; real UI scraping; 115+ languages | ✗ |
| Otterly AI | Partial (Brand Visibility Score) | 20+ on-page GEO audit factors; Gartner Cool Vendor 2025; 20,000+ users | ✗ |
| Semrush AI Toolkit | Yes (0–100) | 130M+ prompt database; SOV + sentiment + citation + prompt visibility composite | ✗ |
| Ahrefs Brand Radar | No (directional SOV only) | 260M+ search-backed prompts; AI Share of Voice; 6 platforms | ✗ |
| HubSpot AEO Grader | Yes (0–100) | 5-dimension composite: Sentiment, Recognition Depth, SOV, Market Positioning, Presence Quality | ✗ |
Peec AI's three-axis model (Visibility / Position / Sentiment) is the most parsimonious and methodologically honest among output-monitoring tools. It makes no claim to simulate retrieval mechanics, instead providing three clean, separately interpretable scores for what can actually be observed from LLM outputs. It also makes a methodologically important distinction that most tools conflate: separating sources the AI "used" (drew on for context) from sources it explicitly "cited" (named in the response) — a distinction with real significance for understanding where in the retrieval-to-generation pipeline a brand is present or absent.[8]
The existing measurement paradigm asks: "What did AI say about us?" The AI Visibility Score asks the upstream question: "Could AI reliably find, extract, and represent us in the first place?"
3. Theoretical Foundations for the Three Dimensions
3.1 Entity Strength — the knowledge graph foundation
Pan et al.'s 2024 roadmap "Unifying Large Language Models and Knowledge Graphs" demonstrated that KGs provide structured entity knowledge that directly mitigates LLM hallucination and improves factual accuracy.[9] The practitioner evidence is striking: an analysis by Wellows found that pages with 15+ recognised entities show 4.8× higher citation probability (r=0.76 correlation between entity KG density and citation rate).[10] ZipTie.dev's reverse-engineering of AI Overview source selection found that only 38% of AI-cited pages rank in Google's organic top 10 — but pages with strong E-E-A-T signals were cited 2.3× more than rank-#1 pages with weak authority.[11]
Taher et al.'s 2025 NER survey finds that 80% of top queries to a digital library contained at least one named entity.[12] Wikidata now indexes 121+ million entities with 1.65 billion semantic triples;[14] DBpedia covers 228+ million entities from 111 Wikipedia language editions.[15] These serve as the primary coverage reference for entity strength scoring — entities are scored not just on frequency but on resolvability to recognised knowledge graph nodes.
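To make the resolvability criterion concrete, here is a minimal sketch that checks whether an extracted entity mention resolves to a Wikidata node via the public wbsearchentities endpoint. The simple resolved-share metric below is an illustrative assumption, not RetrieveAI's published sub-scoring:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def resolves_to_wikidata(entity: str) -> bool:
    """True if the surface form matches at least one Wikidata entity."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities",  # public entity search endpoint
        "search": entity,
        "language": "en",
        "format": "json",
        "limit": 1,
    }, timeout=10)
    return bool(resp.json().get("search"))

def resolvability_rate(entities: list[str]) -> float:
    """Illustrative metric: share of extracted entities that resolve
    to a recognised knowledge graph node (assumption, not the actual
    Entity Strength formula)."""
    if not entities:
        return 0.0
    return sum(resolves_to_wikidata(e) for e in entities) / len(entities)
```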
3.2 Prompt Coverage — the retrieval simulation foundation
The GEO paper found that citing sources improved visibility by +115% for lower-ranked pages — but keyword stuffing reduced it by 10%.[4] Lewis et al.'s foundational 2020 RAG paper established that retrieval recall serves as a hard upper bound on generation performance: a system can only cite what it can retrieve.[16] The RAGAS framework formalises four retrieval quality metrics, with Context Recall measuring information completeness of retrieved context.[5]
The Prompt Coverage dimension operationalises these concepts at the site level: generating a prompt universe from the site's own content, computing vector embeddings for both prompts and content chunks, and measuring cosine similarity. The resulting Coverage Score (proportion of generated prompts finding a chunk with cosine similarity ≥0.7) and Confidence Index (average similarity quality of matched chunks) combine into the sub-score. The 0.7 threshold aligns with Google Vertex AI's default grounding threshold — the industry standard for production RAG systems.[17]
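A minimal sketch of these two sub-metrics, assuming pre-computed, unit-normalised embedding matrices for prompts and content chunks (prompt-universe generation and the embedding model are out of scope here):

```python
import numpy as np

def prompt_coverage(prompt_embs: np.ndarray,
                    chunk_embs: np.ndarray,
                    threshold: float = 0.7) -> tuple[float, float]:
    """Coverage Score and Confidence Index for one site.

    prompt_embs: (P, d) unit-normalised embeddings of generated prompts.
    chunk_embs:  (C, d) unit-normalised embeddings of content chunks.
    """
    # Cosine similarity reduces to a dot product for unit-normalised vectors.
    sims = prompt_embs @ chunk_embs.T          # (P, C) similarity matrix
    best = sims.max(axis=1)                    # best-matching chunk per prompt
    matched = best >= threshold
    coverage = matched.mean()                  # share of prompts with a match
    confidence = best[matched].mean() if matched.any() else 0.0
    return float(coverage), float(confidence)
```

The hard ≥0.7 cut-off is deliberately strict: a prompt whose best match scores 0.69 counts as uncovered, because the dimension measures retrievability at the production grounding threshold, not topical adjacency.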
3.3 Commerce Readiness — the agentic infrastructure foundation
Gartner projects AI agents will intermediate 90% of B2B purchasing by 2028, commanding over $15 trillion in purchases.[18] McKinsey projects $3–5 trillion in global agentic commerce sales by 2030.[19] Anthropic's Model Context Protocol, released November 2024, reached 10,000+ active servers and 97 million monthly SDK downloads by December 2025, establishing the technical standard for AI-to-commerce connectivity.[20] This dimension is weighted at 10% — explicitly provisional — reflecting that most AI visibility use cases in 2026 remain informational rather than transactional.
4. The AI Visibility Score: Formula and Weights
AVS = (Entity Strength × 0.45 + Prompt Coverage × 0.45 + Commerce Readiness × 0.10) × Scope Multiplier
Figure: AVS dimensional composition. Entity Strength 45% + Prompt Coverage 45% + Commerce Readiness 10%, multiplied by the Scope Multiplier (0.70–1.00); final score 0–100.
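As a minimal sketch, the composite reduces to a few lines of Python; the sub-scores in the worked example are invented for illustration:

```python
def ai_visibility_score(entity_strength: float,
                        prompt_coverage: float,
                        commerce_readiness: float,
                        scope_multiplier: float) -> float:
    """All three dimensions are 0-100 sub-scores; the multiplier is 0.70-1.00."""
    if not 0.70 <= scope_multiplier <= 1.00:
        raise ValueError("scope multiplier must be in [0.70, 1.00]")
    composite = (0.45 * entity_strength
                 + 0.45 * prompt_coverage
                 + 0.10 * commerce_readiness)
    return composite * scope_multiplier

# Worked example: strong entities (80), middling coverage (60), weak
# commerce (40), audited at related-pages scope (0.87):
# 0.45*80 + 0.45*60 + 0.10*40 = 36 + 27 + 4 = 67; 67 * 0.87 = 58.29
print(ai_visibility_score(80, 60, 40, 0.87))  # ≈ 58.29
```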
4.1 Why 45 / 45 / 10?
The weighting draws from two adjacent scoring traditions. The FICO credit score weights its top two dimensions at 35% + 30% because those factors empirically predict default far more strongly than the remaining three.[21] Core Web Vitals uses a "weakest link" composite where overall page status is limited by the worst-performing metric.[22]
Entity Strength and Prompt Coverage receive equal weighting because they are complementary rather than redundant. A site can have high entity strength (well-written, entity-rich content) but poor prompt coverage (content misaligned with how users actually query AI). The inverse is equally common: broad topic coverage with shallow entity depth. Moz's Domain Authority 2.0 transition from linear to neural weighting (combining 40+ signals, achieving r=0.9+ with competitor metrics) demonstrates that complementary signals are best combined through learned weights — the AVS's linear combination is a deliberate simplification for interpretability.[23]
Commerce Readiness is weighted at 10% because Pew's behavioral study confirms that AI more often replaces the need to visit a site than replaces e-commerce transactions: users ended their session after an AI summary in 26% of cases, versus 16% for traditional search.[3] For e-commerce verticals specifically, implementations of the AVS framework should re-weight Commerce Readiness toward 25%, as sketched below.
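One way to implement that re-weighting, under the assumption (not stated in the framework, which fixes only the 45/45/10 default) that the non-commerce remainder stays split equally between the two core dimensions:

```python
def vertical_weights(commerce_weight: float = 0.10) -> dict[str, float]:
    """Weight profile with the non-commerce remainder split equally
    between the two core dimensions (an assumption; the framework
    itself specifies only the 45/45/10 default)."""
    core = (1.0 - commerce_weight) / 2
    return {
        "entity_strength": core,
        "prompt_coverage": core,
        "commerce_readiness": commerce_weight,
    }

# Default informational profile: 0.45 / 0.45 / 0.10
# E-commerce profile: vertical_weights(0.25) -> 0.375 / 0.375 / 0.25
```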
5. The Scope Coverage Multiplier
A score derived from a single-page audit should not be directly comparable to one derived from a full-site audit of 500 pages. Cloudflare's analysis found that AI bots are less likely to reach third- and fourth-level pages,[25] and Thiel and Kretschmer's FAccT 2024 paper found that Common Crawl URL discovery applies a PageRank-like filter to what enters LLM training data, systematically favouring comprehensively linked domains.[24]
Scope Coverage Multiplier Values

| Multiplier | Audit scope | Signal availability |
|---|---|---|
| 0.70× | Single anchor page | Only anchor-page signals available; significant underestimation of domain-wide entity co-occurrence and cross-page retrieval coverage |
| 0.87× | 15–30 semantically related pages | Cross-page entity modeling partially available; global link graph absent |
| 0.94× | Full category crawl | Strong coverage within the vertical; cross-category authority paths limited |
| 1.00× | Complete domain audit | All signals at full weight; enables cross-page entity reinforcement and complete prompt universe generation |
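In implementation terms the multiplier is a simple lookup keyed on crawl scope. A minimal sketch; the enum and its names are illustrative, not RetrieveAI's actual identifiers:

```python
from enum import Enum

class AuditScope(Enum):
    """Crawl depth of the audit, mapped to its Scope Coverage Multiplier."""
    SINGLE_PAGE    = 0.70  # anchor page only
    RELATED_PAGES  = 0.87  # 15-30 semantically related pages
    CATEGORY_CRAWL = 0.94  # full category crawl
    FULL_DOMAIN    = 1.00  # complete domain audit

multiplier = AuditScope.RELATED_PAGES.value  # 0.87
```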
6. Grade Thresholds
Figure: AI Visibility Score grade thresholds.
7. Composite Scoring Precedents
Page and Brin's PageRank (Stanford, 1999) established that a quality score should combine signal quantity with signal quality in a single recursive formula.[31] The FICO credit score demonstrates that fixed-weight combination of heterogeneous signal types can produce a universally adopted composite — weighted by empirical default prediction, not theoretical elegance.[21]
Moz's Domain Authority 2.0 is the most directly analogous precedent: switching from linear to neural weighting, producing a 100-point logarithmic scale with r=0.9+ correlation across competing tools.[23] Kaplan and Norton's Balanced Scorecard (HBR, 1992), used by 53% of companies at peak adoption, established the conceptual foundation: no single metric gives a complete picture, and combining perspectives from different measurement domains into one composite is both defensible and actionable.[32]
8. Limitations and Future Work
The 45/45/10 weighting is theoretically motivated but not yet empirically validated against actual AI citation outcomes at scale. Wan, Wallace and Klein (ACL 2024) found that LLMs rely heavily on query relevance but largely ignore stylistic features — suggesting that specific Entity Strength sub-signals reflecting content clarity may deserve higher weight than richness signals.[33]
The AVS currently produces a platform-agnostic composite, but iPullRank's technical teardowns confirm that Bing Copilot uses hybrid retrieval (BM25 + vector) while Perplexity emphasises real-time web search and Google AI Mode emphasises knowledge graph grounding.[34] Platform-specific score decompositions would require separate retrieval simulation models calibrated per platform.
Algaba et al. (NAACL 2025) found that LLMs reinforce citation bias toward recent, highly cited sources, with only 8.78% of generated references matching ground truth.[35] This suggests the AVS score is not static — content freshness decay and volatility tracking should be incorporated as a fourth component. RetrieveAI's snapshot and volatility alert pipeline captures this dimension in practice but it is not yet included in the composite formula.
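Purely as an illustration of what such a fourth component could look like, here is a hypothetical freshness decay term with an assumed half-life and floor; nothing in it is part of the published formula:

```python
def freshness_multiplier(days_since_update: float,
                         half_life_days: float = 180.0,
                         floor: float = 0.80) -> float:
    """Hypothetical decay term (future work, not in the AVS composite):
    halves every `half_life_days`, never dropping below `floor` so
    stale-but-canonical content is not zeroed out."""
    decay = 0.5 ** (days_since_update / half_life_days)
    return max(floor, decay)
```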
Key Takeaways
- No existing tool produces a weighted composite AI visibility score grounded in retrieval mechanics — all seven major platforms measure LLM outputs, not input-side retrievability
- Peec AI's three-axis model (Visibility / Position / Sentiment) is the most methodologically honest among output-monitoring tools, making the important distinction between "used" and "cited" sources
- Entity Strength and Prompt Coverage receive equal 45% weights because they are complementary — a site can have high entity quality with poor query coverage, or broad coverage with shallow entity depth
- The Scope Multiplier (0.70–1.00) corrects a real bias: single-page audits systematically underestimate domain-wide AI visibility
- The 10% Commerce Readiness weight is explicitly provisional — as agentic commerce matures, this should grow toward 25% for e-commerce verticals
- Established precedents (PageRank, FICO, Domain Authority 2.0, Core Web Vitals, Balanced Scorecard) validate multi-dimensional composite scoring; the AVS applies this tradition to AI visibility for the first time
Conclusion
The AI visibility measurement landscape is crowded with tools answering the wrong question. "How often does ChatGPT mention our brand?" is an important data point — Peec AI, Profound, Semrush and others answer it well. But it is a consequence, not a cause. The more actionable question is: "Is our content structured so that AI retrieval systems can find, extract, and represent it across the likely query space for our domain?"
The AI Visibility Score framework — built through the implementation of RetrieveAI — proposes the first published answer to that second question: a three-dimensional weighted composite (Entity Strength 45%, Prompt Coverage 45%, Commerce Readiness 10%) modulated by a Scope Coverage Multiplier, producing a single auditable 0–100 number with a clear lineage to PageRank, FICO, Domain Authority, and the Balanced Scorecard.
The weights are theoretically motivated but empirically provisional. The framework's value is not that it produces the definitive answer, but that it produces a specific, falsifiable, reproducible answer — one that future research can validate, challenge, and refine. That is what distinguishes a framework from a dashboard.
References
- [1] MarketIntelo (2025). GEO market: $848M to $19.8B by 2034. marketintelo.com/report/generative-engine-optimization-geo-market
- [2] Gartner (2024). "Search Engine Volume Will Drop 25% by 2026." gartner.com/en/newsroom/press-releases/2024-02-19
- [3] Pew Research Center (2025). "Google users are less likely to click on links when an AI summary appears." July 22, 2025. pewresearch.org
- [4] Aggarwal, P. et al. (2024). "GEO: Generative Engine Optimization." KDD '24, ACM. arXiv:2311.09735
- [5] Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024. arXiv:2309.15217
- [6] Saad-Falcon, J. et al. (2024). "ARES: An Automated Evaluation Framework for RAG Systems." NAACL 2024. arXiv:2311.09476
- [7] Thakur, N. et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models." NeurIPS 2021. arXiv:2104.08663
- [8] Peec AI Documentation. "Intro to Peec AI." docs.peec.ai/intro-to-peec-ai
- [9] Pan, S. et al. (2024). "Unifying Large Language Models and Knowledge Graphs: A Roadmap." IEEE TKDE, 36, 3580–3599. arXiv:2306.08302
- [10] Wellows (2025). "Google AI Overviews Ranking Factors: 2026 Guide." wellows.com/blog/google-ai-overviews-ranking-factors/
- [11] ZipTie.dev (2025). "Google AI Overviews Source Selection." ziptie.dev/blog/google-ai-overviews-source-selection/
- [12] Taher, H. et al. (2025). "Recent Advances in Named Entity Recognition." arXiv:2401.10825v3
- [14] Wikidata (2025). 121M+ entities, 1.65B triples. wikidata.org
- [15] Lehmann, J. et al. (2015). "DBpedia — A Large-scale, Multilingual Knowledge Base." Semantic Web Journal, 6(2), 167–195.
- [16] Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 33. arXiv:2005.11401
- [17] Google Vertex AI Documentation. "Dynamic retrieval." Default grounding threshold: 0.7. cloud.google.com/vertex-ai
- [18] Gartner (2025). "Top Predictions for IT Organizations in 2026 and Beyond." October 21, 2025.
- [19] McKinsey (2025). "The Agentic Commerce Opportunity." October 17, 2025. mckinsey.com
- [20] Anthropic (2025). "Donating MCP and establishing the Agentic AI Foundation." December 9, 2025.
- [21] Fair Isaac Corporation. "What's in My FICO Scores." myfico.com/credit-education/whats-in-your-credit-score
- [22] McQuade, B. & Pollard, B. (2025). "How Core Web Vitals Thresholds Were Defined." web.dev/articles/defining-core-web-vitals-thresholds
- [23] Moz (2019). "Domain Authority 2.0." Technical whitepaper. moz.com
- [24] Thiel, H. & Kretschmer, M. (2024). "A Critical Analysis of Common Crawl." FAccT '24. DOI:10.1145/3630106.3659033
- [25] Cloudflare (2025). "From Googlebot to GPTBot: Who's Crawling Your Site in 2025." blog.cloudflare.com
- [26] Canel, F. (Microsoft). SMX Munich, March 2025. searchengineland.com/microsoft-bing-copilot-use-schema-for-its-llms-453455
- [27] AccuraCast (2025). "Does Schema Markup Increase Generative Search Visibility?" accuracast.com
- [28] University of Mannheim / WDC (2024). 51.25% of 2.4B pages have structured data. uni-mannheim.de
- [29] Salemi, A. & Zamani, H. (2024). "Evaluating Retrieval Quality in RAG." SIGIR '24. arXiv:2404.13781
- [30] Gao, T. et al. (2023). "Enabling LLMs to Generate Text with Citations." EMNLP 2023. arXiv:2305.14627
- [31] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). "The PageRank Citation Ranking." Stanford. ilpubs.stanford.edu:8090/422/
- [32] Kaplan, R.S. & Norton, D.P. (1992). "The Balanced Scorecard." Harvard Business Review, 70(1), 71–79.
- [33] Wan, A., Wallace, E., & Klein, D. (2024). "What Evidence Do Language Models Find Convincing?" ACL 2024. DOI:10.18653/v1/2024.acl-long.403
- [34] iPullRank (2025). "AI Search Architecture Deep Dive." ipullrank.com/ai-search-manual/search-architecture
- [35] Algaba, A. et al. (2025). "LLMs Reflect Human Citation Patterns with Heightened Bias." NAACL 2025. arXiv:2405.15739