Research · March 9, 2026 · 22 min read

Why the AI Visibility Market is Measuring the Wrong Thing

An analysis of 80+ tools, a structural blind spot shared by every one of them, and why the entire AI visibility industry has confused consequences for causes.

By Akshay Dahiya, Growth & MarTech Specialist

There are now more than 80 tools competing to measure AI visibility. They have raised hundreds of millions in venture capital. They are featured in Gartner reports. They are used by the marketing teams of Fortune 500 companies. And with very few exceptions, every single one of them is measuring the same thing in the same way — and it is the wrong thing.

The thing they are measuring is output: what AI systems say about brands when prompted. How often does ChatGPT mention your company? What sentiment does Perplexity express? What share of AI-generated answers include your product? These are legitimate data points. They are not, however, measurements of AI visibility. They are measurements of AI opinion — downstream consequences of a retrieval and generation process that these tools never examine.

This paper argues that the AI visibility market has made a category error. It has built an industry around measuring effects while ignoring causes. And the cause — whether a site’s content is structurally accessible to the retrieval pipeline that precedes AI generation — remains almost entirely unmeasured.

The argument, in brief:
  • The entire AI visibility industry operates on the output side of the retrieval-generation pipeline
  • Output measurement produces descriptive data, not diagnostic data — it tells you what happened, not why
  • The structural causes of AI invisibility are upstream: rendering gaps, retrieval coverage, entity resolvability, intent alignment
  • The Ahrefs finding that 80% of AI-cited URLs don’t rank in Google’s top 100 proves that traditional proxies for AI visibility are broken
  • The market needs input-side measurement — retrieval simulation against a site’s own content — to move from observation to diagnosis

1. The Market That Built Itself on the Wrong Foundation

The AI visibility measurement market emerged in 2023 and accelerated sharply through 2024 and 2025. By early 2026 it exceeds $850 million and is projected to reach $7–20 billion by 2031–2034 depending on the forecaster.[1] The category attracted serious capital: Profound raised funding at a reported $1 billion valuation; Scrunch AI raised $19 million; Evertune raised $19 million; Peec AI raised approximately €28 million.[2]

The commercial urgency was real and well-founded. Gartner predicted in 2024 that traditional search engine volume would drop 25% by 2026 due to AI chatbots, later revising that estimate upward to potentially 50%+ decline by 2028.[3] Pew Research Center’s behavioral study of 68,879 actual searches found that click-through rates drop from 15% to 8% when AI summaries are present — a 47% reduction — with only 1% of users clicking sources within AI Overviews.[4] Marketers needed to understand their position in this new landscape. The tools that emerged to serve that need are, in many cases, technically sophisticated and genuinely useful.

The problem is not that these tools exist. The problem is the question they are built to answer. Every major tool in the market asks: “What does AI say about this brand?” Not one of them asks: “Can AI actually retrieve this brand’s content reliably?”

  • 80+: tools now competing in AI visibility measurement (market analysis, April 2026)
  • $850M: GEO/AEO market size in 2025 (MarketIntelo, 2025)
  • 0: tools among the 80+ that simulate retrieval against a site’s own content (this analysis)
  • 80%: share of AI-cited URLs that don’t appear in Google’s top 100 (Ahrefs, 2025)

2. A Taxonomy of What the Market Actually Measures

To make the structural argument precisely, it is useful to map the methodologies of the most prominent tools. Across the 80+ platforms in the market, five distinct measurement approaches account for the vast majority of the category.

Approach 1: Share of Voice monitoring

The dominant methodology. A prompt database — ranging from a few thousand to hundreds of millions of queries — is run against AI platforms, and brand mentions are counted, ranked, and trended over time. Ahrefs Brand Radar runs 260 million search-backed prompts monthly across six platforms. Semrush’s AI Toolkit draws on 130 million prompts. BrightEdge tracks citations across Google AI Overviews, ChatGPT, and Perplexity with real-time data.[5]

Share of Voice monitoring answers the question: “How often does AI mention us compared to competitors?” It does not answer: “Why does AI mention them more than us?” or “What would we need to change for AI to mention us more?”
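
Mechanically, this approach reduces to counting. Below is a minimal sketch of the share-of-voice arithmetic, assuming the prompts have already been run and the responses captured; the prompts, responses, and brand names are invented for illustration, and real tools add deduplication, entity matching, and trending on top.

```python
from collections import Counter

# Illustrative inputs: a handful of prompts already run against one AI
# platform, with the raw response text captured for each.
responses = {
    "best crm for small teams": "HubSpot and Zoho both come up often for small teams...",
    "crm with built-in invoicing": "Zoho Books pairs naturally with Zoho CRM for invoicing...",
}
brands = ["HubSpot", "Zoho", "Salesforce"]

# Count how many responses mention each brand at least once.
mentions = Counter()
for text in responses.values():
    for brand in brands:
        if brand.lower() in text.lower():
            mentions[brand] += 1

# Share of voice: mentions as a fraction of all prompts sampled.
for brand in brands:
    share = mentions[brand] / len(responses)
    print(f"{brand}: mentioned in {share:.0%} of sampled responses")
```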

Approach 2: Multi-axis output scoring

More sophisticated tools decompose AI outputs into multiple scored dimensions. Peec AI’s three-axis model (Visibility, Position, Sentiment — each 0–100) is the most methodologically transparent in the market: it measures exactly what it says it measures (what AI outputs contain) and makes no claim to explain why.[6] Peec AI also makes the important methodological distinction between sources an AI “used” (incorporated into context) and sources it “cited” (named in output) — a distinction most tools collapse. HubSpot’s AEO Grader scores five dimensions: Sentiment, Recognition Depth, Share of Voice, Market Positioning, and Presence Quality.
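
To make the multi-axis idea concrete, here is a rough sketch of how a single response can be decomposed into visibility, position, and sentiment scores on a 0–100 scale. The keyword lexicon and scaling below are placeholders, not Peec AI's or HubSpot's actual methods.

```python
# Rough decomposition of one AI response into three output-side axes.
def score_response(text: str, brand: str, competitors: list[str]) -> dict:
    lowered = text.lower()
    mentioned = [b for b in [brand, *competitors] if b.lower() in lowered]
    visibility = 100 if brand in mentioned else 0

    # Position: brands mentioned earlier in the answer score higher.
    position = 0
    if brand in mentioned:
        order = sorted(mentioned, key=lambda b: lowered.index(b.lower()))
        position = round(100 * (1 - order.index(brand) / max(len(order) - 1, 1)))

    # Sentiment: crude keyword polarity standing in for a real classifier.
    positive = sum(word in lowered for word in ["best", "reliable", "recommended"])
    negative = sum(word in lowered for word in ["avoid", "poor", "limited"])
    sentiment = max(0, min(100, 50 + 25 * (positive - negative)))

    return {"visibility": visibility, "position": position, "sentiment": sentiment}

print(score_response("Acme is the most reliable option here; Globex feels limited.",
                     "Acme", ["Globex", "Initech"]))
```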

Approach 3: Predictive citation scoring

A small number of tools attempt to move from descriptive to predictive. Profound’s AEO Content Score uses machine learning trained on historical citation data to estimate the probability that a piece of content will be cited by AI systems. AthenaHQ’s Athena Citation Engine (ACE) combines citation count, sentiment, traffic, and query type signals into a composite score. These are the most ambitious approaches in the output-side category — but they are still trained on outputs, not on retrieval mechanics. They learn correlates of citation, not causes.
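
The modelling pattern is straightforward even if the production systems are not: fit a classifier on output-side features of pages that were or were not cited, then score new pages. The sketch below uses invented features and toy data purely to show the shape of the approach, and it makes the limitation visible: every feature is a correlate of past citation, not a retrieval mechanic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: output-side correlates for pages we already know were
# (1) or were not (0) cited by an AI assistant. Columns are illustrative:
# [historical citation count, monthly organic traffic, commercial-query flag].
X = np.array([
    [12, 5000, 1],
    [0,   300, 0],
    [7,  2100, 1],
    [1,   800, 0],
])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Score a new page from the same correlates. Nothing here touches rendering,
# retrieval coverage, or entity resolvability - the upstream causes.
new_page = np.array([[3, 1500, 1]])
print(f"Estimated citation probability: {model.predict_proba(new_page)[0, 1]:.2f}")
```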

Approach 4: On-page GEO auditing

Tools like Otterly AI score 20+ on-page factors associated with GEO best practices: structured data presence, content clarity, FAQ implementation, citation readiness. This approach is closer to the input side than the first three — it examines the content itself rather than AI outputs about the content. But it evaluates content quality according to published GEO guidelines rather than simulating whether that content would actually be retrieved for a realistic query distribution. A page can score highly on every GEO checklist item and still have 40% of its content hidden behind JavaScript that AI crawlers never execute.
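
A checklist audit of this kind can be sketched in a few lines. The factors below are a small, illustrative subset of the 20+ a real tool scores, and crucially the audit only evaluates whatever HTML it is handed, which is exactly why it cannot detect content that never reaches the HTML in the first place.

```python
from bs4 import BeautifulSoup

# Illustrative page markup; a real audit would fetch and score live pages.
HTML = """<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
</head><body><h1>Widget</h1><p>Short description.</p></body></html>"""

soup = BeautifulSoup(HTML, "html.parser")

# A handful of illustrative checklist items; real tools score many more.
checks = {
    "has_json_ld": soup.find("script", type="application/ld+json") is not None,
    "has_h1": soup.find("h1") is not None,
    "has_faq_markup": '"FAQPage"' in HTML,
    "body_over_300_words": len(soup.get_text(" ", strip=True).split()) > 300,
}

score = 100 * sum(checks.values()) / len(checks)
print(checks)
print(f"GEO checklist score: {score:.0f}/100")
```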

Approach 5: Infrastructure and agent connectivity

The most recently emerged category. Conductor has built MCP server infrastructure enabling its platform to connect to AI tools. Scrunch AI positions itself as an Agent Experience Platform, providing middleware between AI crawlers and enterprise content. These approaches address a real infrastructure layer — whether AI agents can connect to a site at all — but they focus on connectivity rather than content retrievability.

The AI Visibility Landscape — What Each Approach Actually Measures

| Approach | Representative tools | Measures output? | Measures input? | Simulates retrieval? | Explains why? |
| --- | --- | --- | --- | --- | --- |
| Share of Voice monitoring | Ahrefs, Semrush, BrightEdge, Previsible | Yes | No | No | No |
| Multi-axis output scoring | Peec AI, HubSpot AEO Grader, Otterly | Yes | No | No | No |
| Predictive citation scoring | Profound, AthenaHQ | Yes | ~ Partial | No | ~ Correlates only |
| On-page GEO auditing | Otterly, LLMClicks, Goodie AI | No | ~ Checklist only | No | No |
| Infrastructure & agent connectivity | Conductor, Scrunch AI | No | ~ Connectivity only | No | No |
| Retrieval simulation | RetrieveAI | Yes | Yes | Yes | Yes |

3. The Evidence That Output Measurement Is Insufficient

The case against output-only measurement is not theoretical. It is empirical, and the evidence is now substantial.

The Ahrefs 12% finding

Ahrefs’ analysis of 15,000 prompts across ChatGPT, Gemini, Copilot, and Perplexity is the most damaging piece of evidence against the assumption that SEO performance predicts AI citation. Only 12% of URLs cited by AI assistants also appear in Google’s top 10 organic results. 80% of AI-cited URLs don’t rank anywhere in Google’s top 100.[7]

This finding has a precise implication for the AI visibility market: if traditional search ranking does not predict AI citation, then any tool that uses search rank as a proxy for AI visibility — implicitly or explicitly — is building on sand. And many do. Tools that prioritise high-ranking pages for monitoring, that benchmark against organic search positions, or that recommend the same on-page optimisations as traditional SEO are all operating on the assumption that search rank and AI citation are correlated. The Ahrefs data says they are not.

The temporal instability problem

Output monitoring tools face a second empirical problem: the thing they measure changes faster than any monitoring cadence can reliably track. Industry analysis finds that 40–60% of sources cited by AI systems change on a monthly basis.[8] Only 30% of brands remain visible in back-to-back AI responses for the same query.[9] Ahrefs tracked Google AI Overview citations dropping from 76% overlap with the top-10 to 38% overlap following Google’s upgrade to Gemini 3 — a near-halving of the SEO-AI correlation in a single model update.

Output monitoring of a system this volatile is not measurement — it is sampling. You are taking periodic readings of a signal that fluctuates dramatically between readings, driven by model updates, retrieval architecture changes, and training data shifts that the monitoring tool has no visibility into. The output changes because the inputs changed. Output monitoring cannot tell you which inputs changed or why.
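
Quantifying that turnover is simple set arithmetic: compare the cited URLs from two consecutive sampling windows. The URLs below are invented; the point is the calculation, not the data.

```python
# Two samples of URLs cited for the same query a month apart (invented data).
march = {"example.com/guide", "vendor-a.com/pricing", "review-site.com/top-10"}
april = {"example.com/guide", "vendor-b.com/blog", "forum.example.org/thread"}

overlap = len(march & april) / len(march | april)  # Jaccard overlap of citations
print(f"Month-over-month citation overlap: {overlap:.0%}, turnover: {1 - overlap:.0%}")
```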

The citation post-rationalisation problem

Wallat et al.’s 2024 research introduced a distinction that should fundamentally challenge the AI visibility measurement paradigm: the difference between citation correctness (a cited document supports the statement) and citation faithfulness (the model genuinely relied on the document to generate the statement). Their finding: up to 57% of citations are post-rationalised — the model generated the content from parametric knowledge and then selected a supporting citation retrospectively.[10]

This means that a significant proportion of what output-monitoring tools measure as “AI citations” are not evidence of retrieval at all. They are evidence of the model’s training data and its ability to find plausible post-hoc support. A brand that appears frequently in AI citations may be appearing because it is well-represented in training data, not because its current web content is being retrieved. Output monitoring cannot distinguish these two cases. Retrieval simulation can.

The most fundamental problem with output measurement is that it cannot distinguish between a brand that is cited because its content was retrieved and a brand that is cited because it was already in the model’s parametric memory. These require completely different remediation strategies. Output monitoring makes them indistinguishable.

The cross-platform consistency collapse

If AI citation were primarily driven by content quality, we would expect strong cross-platform consistency: a piece of content that is well-structured, entity-rich, and retrieval-accessible should be cited by ChatGPT, Perplexity, Google AI Mode, and Copilot at similar rates. The data shows the opposite. Profound’s analysis of 680 million citations found that only 11% of domains are cited by both ChatGPT and Perplexity for the same query.[11] Yext’s analysis found that Gemini favours brand-owned websites (52% of citations), ChatGPT favours directories, and Perplexity favours niche industry directories — fundamentally different source preferences for the same underlying content landscape.[12]

If citation were driven primarily by content quality and structure, the cross-platform correlation would be much higher. The low correlation is evidence that citation is driven heavily by platform-specific training data, fine-tuning, and retrieval architecture differences — factors that output monitoring cannot access or explain.

4. The Causes That Output Monitoring Cannot See

If output monitoring measures effects, what are the causes? The structural factors that determine whether a site’s content enters an AI retrieval pipeline at all operate upstream of everything output-monitoring tools can observe.

The rendering gap: content that does not exist for AI crawlers

GPTBot, ClaudeBot, and PerplexityBot all operate on raw HTML only. They execute no JavaScript. Daydream’s analysis of half a billion GPTBot fetches found zero evidence of script execution.[13] For the typical modern website built on React, Vue, or Next.js, this means a substantial fraction of the page’s content — products, descriptions, pricing, variant data, dynamically loaded articles — simply does not exist for AI crawlers. They receive the skeleton; they never see the body.

No output-monitoring tool measures this. You could have 30% of your product content hidden behind JavaScript and every output-monitoring platform would report your AI citation rates without any indication that the cause was the rendering gap. The citations you do receive would come from whatever static content is visible — while the JS-dependent content remains permanently invisible, not because it is poor quality but because it is structurally inaccessible.
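
Measuring the rendering gap requires fetching the page twice: once as an AI crawler does (raw HTML, no script execution) and once through a headless browser that runs the JavaScript. A rough sketch, assuming requests, beautifulsoup4, and playwright are installed (with a Chromium build via `playwright install`) and using a placeholder URL; a word-set diff is a crude proxy, but it surfaces the order of magnitude.

```python
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

URL = "https://example.com/product/widget"  # placeholder page

def visible_words(html: str) -> set[str]:
    return set(BeautifulSoup(html, "html.parser").get_text(" ", strip=True).split())

# What GPTBot-style crawlers see: the raw HTML response, no script execution.
raw_words = visible_words(requests.get(URL, timeout=30).text)

# What a browser (and a human) sees: the DOM after JavaScript has run.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    rendered_words = visible_words(page.content())
    browser.close()

# Rough rendering-gap estimate: share of rendered words absent from the raw HTML.
gap = len(rendered_words - raw_words) / max(len(rendered_words), 1)
print(f"~{gap:.0%} of rendered words never appear in the raw HTML AI crawlers fetch")
```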

Upstream causes vs downstream effects — what each measurement type sees

What output monitoring sees

  • Citation frequency per platform
  • Share of voice vs competitors
  • Sentiment in AI outputs
  • Brand mention position
  • Citation rate over time
  • Which competitor is cited instead

What output monitoring cannot see

  • % of content hidden behind JavaScript
  • Prompt universe coverage gaps
  • Entity resolvability to knowledge graphs
  • Retrieval simulation score vs threshold
  • Intent distribution of covered queries
  • Whether citations are retrieved or parametric

The prompt coverage gap: query intents the site does not serve

Penha et al.’s WWW 2023 research found that machine-learned retrieval systems exhibit strong retrievability bias: across a typical corpus, the same small set of documents surface for most queries, leaving the majority of the corpus effectively invisible.[14] This finding — from the information retrieval literature, not the GEO/AEO industry — has direct implications for AI content visibility.

A site may have excellent content for informational queries about its category while having no content that retrieves well for transactional, comparative, or local queries. An output-monitoring tool will report the informational citation performance accurately. It will not tell you that the transactional query space — potentially more commercially valuable — is uncovered, because it is not querying against that intent distribution or measuring whether the site’s content semantically aligns with it.
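
Detecting that kind of gap means classifying the query universe by intent and checking, per intent, whether any of the site's content plausibly serves it. The sketch below uses a naive word-overlap check and invented queries and chunks; a real system would use embeddings and a far larger prompt universe, but the per-intent coverage report is the point.

```python
from collections import defaultdict

# Illustrative intent-labelled prompt universe and site content chunks.
prompts = [
    ("what is a standing desk", "informational"),
    ("standing desk vs sitting desk", "comparative"),
    ("buy electric standing desk under $400", "transactional"),
    ("standing desk store near me", "local"),
]
site_chunks = [
    "A standing desk is a desk designed for working while standing upright.",
    "Our guide compares standing desks with conventional sitting desks.",
]

def best_overlap(query: str) -> float:
    """Naive stand-in for retrieval: word overlap with the best-matching chunk."""
    q = set(query.lower().split())
    return max(len(q & set(chunk.lower().split())) / len(q) for chunk in site_chunks)

coverage = defaultdict(lambda: [0, 0])  # intent -> [covered, total]
for query, intent in prompts:
    coverage[intent][1] += 1
    if best_overlap(query) >= 0.5:  # illustrative coverage threshold
        coverage[intent][0] += 1

for intent, (hit, total) in coverage.items():
    print(f"{intent:<14} coverage: {hit}/{total}")
```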

The entity resolvability gap: content that AI cannot place in context

Pan et al.’s 2024 roadmap “Unifying Large Language Models and Knowledge Graphs” demonstrated that structured entity knowledge directly mitigates LLM hallucination and improves factual accuracy.[15] Entities that resolve to nodes in Wikidata, which indexes 121+ million entities, are more likely to be part of an AI system’s structured world model. Entities that exist only in a site’s own content, with no external knowledge graph presence, are harder for AI systems to contextualise and less likely to be retrieved with high confidence.

Wellows’ analysis of AI Overview citations found that pages with 15+ recognised entities show 4.8× higher citation probability, with an r=0.76 correlation between entity knowledge graph density and citation rate.[16] Output-monitoring tools measure whether a page is cited. They do not measure whether the page’s entities are resolvable, co-occurring correctly, or appropriately dense — the upstream structural factors driving that citation probability.
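
Resolvability itself is straightforward to test: Wikidata exposes a public search API, and an entity that returns no match is one an AI system has no structured node to anchor to. A minimal check, with an illustrative entity list:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def resolves_to_wikidata(entity: str) -> bool:
    """True if the entity name matches at least one Wikidata item."""
    params = {
        "action": "wbsearchentities",
        "search": entity,
        "language": "en",
        "format": "json",
        "limit": 1,
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=10).json()
    return bool(response.get("search"))

# Entities extracted from a page (an illustrative list, not a real audit).
for entity in ["standing desk", "ergonomics", "Acme FlexiRise 3000"]:
    status = "resolvable" if resolves_to_wikidata(entity) else "no Wikidata match"
    print(f"{entity}: {status}")
```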

The interaction visibility gap: content behind clicks

Beyond the rendering gap lies a second layer of invisibility: content that requires user interaction to reveal. Accordions, tabs, modals, variant selectors, expandable FAQs — all of these hide content from AI agents that cannot click. A product page that displays its key specifications behind a “View Details” accordion has effectively hidden that content from every AI crawler, regardless of how well that content would perform if retrieved.

This is not a fringe case. The modern UX pattern of progressive disclosure — hiding secondary information behind interactive elements to reduce visual complexity — is nearly universal in e-commerce. It is also a systematic AI visibility failure that no output-monitoring tool measures, because the output-monitoring tool never sees the content that was never retrieved.
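
A crude heuristic can at least flag the pattern: find toggles that control panels, then check whether each panel's content ships in the served HTML or arrives only after a click. The markup below is illustrative.

```python
from bs4 import BeautifulSoup

# Served HTML for a product page (illustrative). The first accordion panel
# ships with its text inline; the second is an empty shell that JavaScript
# fills in only after a click.
HTML = """
<button aria-expanded="false" aria-controls="specs">View Details</button>
<div id="specs" hidden>Height range 60-120 cm. Max load 80 kg.</div>
<button aria-expanded="false" aria-controls="warranty">Warranty</button>
<div id="warranty" hidden></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

for toggle in soup.select("[aria-controls]"):
    panel = soup.find(id=toggle["aria-controls"])
    text = panel.get_text(" ", strip=True) if panel else ""
    label = toggle.get_text(strip=True)
    if text:
        print(f"'{label}': collapsed but present in the HTML ({len(text.split())} words)")
    else:
        print(f"'{label}': empty shell, content only arrives on interaction")
```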

5. Why the Market Built Itself on Outputs

The output-measurement bias is not irrational. It is the product of several legitimate constraints.

Outputs are observable without access. Any tool can send prompts to ChatGPT and record the response. Measuring retrieval-layer inputs requires crawling the site, parsing its content, embedding it, and simulating the retrieval pipeline against a generated prompt universe. The technical barriers to output monitoring are low; the barriers to input-side simulation are significantly higher.

Outputs are what customers can see. A marketing director can open ChatGPT, type a query, and verify whether the tool’s share of voice data matches what they observe. Retrieval simulation produces scores that abstract away from observable outputs, requiring more interpretation. Selling a number that corresponds to a directly observable experience is easier than selling a number that corresponds to a structural property of content that never appears in any visible output.

Outputs are what investors can demo. A dashboard showing your brand’s citation rate across seven AI platforms, with trending lines and competitor comparison, is a compelling visual. A retrieval simulation score requires explanation. The funding rounds that have built this market were raised on the strength of the former.

These are real advantages. They explain why the market built the way it did. They do not make the category error less significant.

6. The Diagnostic vs Descriptive Distinction

The clearest way to articulate what is missing is the distinction between descriptive measurement and diagnostic measurement.

Descriptive measurement records what happened: your brand was cited in 34% of relevant ChatGPT responses last month, down from 41% the month before. This is useful. It tells you the situation has changed. It does not tell you why it changed, which part of your content infrastructure changed, or what action would reverse the decline.

Diagnostic measurement identifies why: your prompt coverage score declined because a JavaScript framework migration moved 28% of your product description content behind a rendering gap; your entity resolvability score declined because three key product entities were removed from their Wikipedia pages; your retrieval confidence on transactional queries is 0.38, below the effective retrieval threshold, explaining why competitor citations are increasing in purchase-intent query categories.

The medical analogy is precise. Blood pressure monitoring is descriptive: it tells you the reading is elevated. It does not tell you whether the cause is dietary, cardiovascular, renal, or pharmacological. Diagnosis requires examining the underlying systems, not just their observable outputs. An industry that has only blood pressure monitors and no diagnostic tools is not equipped to improve health — only to observe its deterioration.

Descriptive (what exists)

  • “Your citation rate dropped 18% last month”
  • “Competitor X is cited 2.4× more often for [category]”
  • “Perplexity sentiment is neutral; ChatGPT is positive”
  • “You rank 4th in AI share of voice in your category”

Diagnostic (what causes it)

  • “31% of your product content is JS-dependent, invisible to AI crawlers”
  • “Transactional query coverage: 42% — you serve informational but not purchase-intent queries”
  • “18 of your top 50 entities have no Wikidata resolvability”
  • “Retrieval confidence score 0.41 — below the effective retrieval threshold for your category”

7. What Measurement Should Actually Look Like

A complete AI visibility measurement system needs both layers. Output monitoring is not wrong — it is incomplete. Knowing that your citation rate declined is valuable. Knowing that it declined because your last site migration created a rendering gap that hid 28% of your content from AI crawlers is actionable. The industry has built the first layer thoroughly. The second is largely absent.

The input-side measurement layer requires different technical foundations. Rather than sending prompts to AI platforms and recording outputs, it requires: crawling the site with full JavaScript rendering to identify the rendering gap; auditing whether interactive content is accessible without clicks; extracting and scoring entity density, resolvability, and co-occurrence against external knowledge graphs; generating a prompt universe that reflects realistic intent distribution; and running retrieval simulation against that universe to produce a Coverage Score (what proportion of realistic queries find relevant content) and Confidence Index (how strongly that content matches).
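
To make that concrete, here is a minimal sketch of the retrieval-simulation idea, not RetrieveAI's actual pipeline: embed the site's chunks and a small intent-spanning prompt universe, then derive a Coverage Score (share of queries whose best match clears a retrieval threshold) and a Confidence Index (how strongly the covered queries match). The model, threshold, chunks, and prompts are all illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Assume the site has already been crawled (with rendering) and chunked.
site_chunks = [
    "The FlexiRise electric standing desk adjusts from 60 to 120 cm.",
    "Our comparison guide covers standing desks versus conventional desks.",
    "Free shipping on orders over $200; 30-day returns on all desks.",
]
# A tiny intent-spanning prompt universe; a real one would be far larger.
prompt_universe = [
    "how tall does an electric standing desk go",
    "standing desk vs regular desk which is better",
    "best standing desk under $400",
    "standing desk showroom near me",
]
RETRIEVAL_THRESHOLD = 0.45  # illustrative: below this, retrieval is unlikely

chunk_vecs = model.encode(site_chunks, convert_to_tensor=True)
prompt_vecs = model.encode(prompt_universe, convert_to_tensor=True)

# For each simulated query, keep the best-matching chunk's cosine similarity.
best_scores = util.cos_sim(prompt_vecs, chunk_vecs).max(dim=1).values

covered = [s.item() for s in best_scores if s.item() >= RETRIEVAL_THRESHOLD]
coverage_score = len(covered) / len(prompt_universe)                 # share of queries served
confidence_index = sum(covered) / len(covered) if covered else 0.0   # match strength when served

print(f"Coverage Score: {coverage_score:.0%}  Confidence Index: {confidence_index:.2f}")
```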

RetrieveAI was built to provide this input-side layer. Its 22-phase pipeline produces both output-side signals (brand monitoring, brand perception, entity analysis) and input-side retrieval simulation — the Coverage Score and Confidence Index that measure not what AI systems say about a site, but whether a site’s content is structurally accessible to the retrieval pipeline that determines what AI systems can say about it.

8. The Implications for Practitioners

For practitioners using AI visibility tools, the category error has three practical consequences.

You may be optimising for the wrong variable. If your citation rate is low because 35% of your content is hidden behind JavaScript, then no amount of prompt optimisation, content strategy revision, or structured data addition to visible pages will fix the underlying problem. You are optimising outputs while the cause is structural. The tools you are using will show your outputs improving if you add cited sources to visible pages — but the rendering gap remains, and the invisible content remains invisible.

You cannot distinguish traction from noise. When your citation rate improves, output monitoring tells you it improved. It does not tell you whether the improvement is because your content quality genuinely increased, because a model update happened to favour your training-data footprint, because a competitor’s content became less retrievable, or because the random variation inherent in stochastic model outputs happened to run in your favour. Without input-side measurement, you cannot attribute causes to effects.

Your most important content gaps are invisible. Output monitoring shows you the queries where competitors are cited and you are not. It does not show you the queries where nobody is cited — where the entire category has a retrieval gap that represents an uncontested opportunity. Prompt coverage analysis against a complete intent-classified query universe reveals the shape of the uncovered space, not just the competitive position within the covered space.

Key Takeaways
  • Every major AI visibility tool operates on the output side of the retrieval-generation pipeline — measuring what AI says, not why AI says it
  • The Ahrefs 12% finding destroys the assumption that search ranking predicts AI citation — the two are largely independent signals requiring independent measurement
  • Up to 57% of AI citations are post-rationalised from parametric memory rather than from live retrieval — output monitoring cannot distinguish these two cases
  • The rendering gap (JS-dependent content invisible to AI crawlers), interaction gap (content behind clicks), prompt coverage gap, and entity resolvability gap are all upstream causes that output monitoring cannot see
  • The distinction between descriptive and diagnostic measurement is the key framing: the market has thorough blood pressure monitors and almost no diagnostic tools
  • A complete measurement system requires both layers — output monitoring for what AI systems say, input-side retrieval simulation for why they say it

Conclusion

The AI visibility market is not broken. It is incomplete. It has built a sophisticated layer of output measurement — share of voice, sentiment analysis, citation tracking, multi-axis scoring — that accurately describes the downstream consequences of AI retrieval and generation. What it has not built is the upstream layer that explains those consequences.

The paradigm flaw is not that output measurement is wrong. It is that the market treats output measurement as sufficient when it is only half the picture. The other half — whether content is structurally accessible to AI retrieval pipelines, whether it covers the realistic intent distribution of its domain’s query space, whether its entities are resolvable, whether its retrieval confidence exceeds the threshold that determines whether it enters a context window at all — has been left unmeasured.

This is a solvable problem. The retrieval simulation methodology exists. The evaluation literature provides the theoretical grounding. The practical gap between a citation rate dashboard and a structural content retrievability audit is the opportunity the next phase of this market needs to fill.

References

  [1] MarketIntelo (2025). GEO/AEO market: $848M growing to $19.8B by 2034. marketintelo.com/report/generative-engine-optimization-geo-market; IntelMarketResearch (2025): $1.01B → $17B by 2034.
  [2] Crunchbase / press coverage: Profound valuation (2025); Scrunch AI Series A $19M (2024); Evertune $19M (2024); Peec AI €28M (2024–2025).
  [3] Gartner (2024). "Search Engine Volume Will Drop 25% by 2026." gartner.com/en/newsroom/press-releases/2024-02-19; updated 2025 estimate: 50%+ by 2028.
  [4] Pew Research Center (2025). "Google users are less likely to click on links when an AI summary appears." July 22, 2025. 68,879 searches, 900 participants. pewresearch.org
  [5] Ahrefs (2026). Brand Radar methodology. ahrefs.com; Semrush AI Toolkit documentation. semrush.com; BrightEdge (2025). "AI Search in 2025." brightedge.com
  [6] Peec AI Documentation. "Intro to Peec AI — Visibility, Position, Sentiment model." docs.peec.ai/intro-to-peec-ai
  [7] Ahrefs Research (Linehan & Guan) (2025). "AI Assistants Don’t Follow the SERPs." 15,000 prompts, 4 platforms. ahrefs.com/blog/ai-search-overlap
  [8] EMARKETER (2026). "FAQ on GEO and AEO." 40–60% source turnover monthly. emarketer.com
  [9] Superlines (2026). "AI Visibility Benchmark Report." 30% brand retention across consecutive AI responses. superlines.com
  [10] Wallat, J. et al. (2024). "Correctness is not Faithfulness in RAG Attributions." arXiv:2412.18004. Up to 57% citations post-rationalised.
  [11] Profound Research (2025). 680M citations, Aug 2024–Jun 2025. 11% cross-platform domain overlap. tryprofound.com/blog
  [12] Yext Research (2025). 6.8M citations, 1.6M responses. Gemini: 52% brand-owned; ChatGPT: directory-dominant; Perplexity: niche industry. yext.com/blog
  [13] Daydream (2025). "How OpenAI Crawls and Indexes Your Website." Half a billion GPTBot fetches analysed, zero JS execution observed. withdaydream.com
  [14] Penha, G. et al. (2023). "Improving Content Retrievability in Search with Controllable Query Generation." WWW ’23. arXiv:2303.11648
  [15] Pan, S. et al. (2024). "Unifying Large Language Models and Knowledge Graphs: A Roadmap." IEEE TKDE, 36, 3580–3599. arXiv:2306.08302
  [16] Wellows (2025). "Google AI Overviews Ranking Factors: 2026 Guide." 15+ entities = 4.8× higher citation probability, r=0.76. wellows.com/blog/google-ai-overviews-ranking-factors/
  [17] Aggarwal, P. et al. (2024). "GEO: Generative Engine Optimization." KDD ’24. arXiv:2311.09735. Princeton/IIT Delhi.
  [18] Liu, N. F. et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." TACL, 12, 157–173. arXiv:2307.03172
  [19] Wu, K. et al. (2024). "How Faithful Are RAG Models?" arXiv:2404.10198. 94% of model errors corrected by accurate retrieved context.
  [20] ZipTie.dev (2025). "E-E-A-T for AI Search." Strong E-E-A-T pages cited 2.3× more than rank-#1 pages with weak authority. ziptie.dev/blog/eeat-for-ai-search/
Author
Akshay Dahiya

Growth & MarTech Specialist

Digital marketing professional with 7+ years of experience in SEO, analytics, and marketing automation. Currently building RetrieveAI, MarAI, and RankScan tools that solve real problems I've run into working in growth and search.