The emergence of AI agents as autonomous purchasers rather than passive research assistants creates an urgent and largely unmeasured infrastructure problem for e-commerce operators. While the AI visibility research community has developed sophisticated frameworks for measuring brand citation rates, no existing tool or framework addresses the upstream question: can an AI agent actually execute a transaction on a given website?
This paper presents the Commerce AI Readiness Framework (CARF), a four-dimension scoring model developed through the construction of RetrieveAI, an AI retrieval and visibility audit platform. The four dimensions (MCP Compatibility, API Readiness, Tool-Call Compatibility, and Inventory Simulation), each weighted at 25%, produce a composite Commerce AI Readiness Score (CARS) between 0 and 100.
This paper covers:
- Why existing AI visibility frameworks are insufficient for agentic commerce
- The four CARF dimensions and scoring logic
- Weighting rationale: why an equal 25/25/25/25 distribution
- Grade thresholds and benchmark scoring
- How the framework is implemented inside RetrieveAI's audit pipeline
- Competitive positioning vs 80+ existing tools
1. The Agentic Commerce Inflection
On November 25, 2024, Anthropic released the Model Context Protocol (MCP), an open standard for connecting AI systems to external data and tools.[1] Within sixteen months, MCP had accumulated over 10,000 active public servers, 97 million monthly SDK downloads, and adoption by OpenAI, Google DeepMind, Microsoft, and every major AI infrastructure provider.[3]
On March 26, 2025, OpenAI's Sam Altman stated: "People love MCP and we are excited to add support across our products."[4] Shopify shipped four MCP servers, making every one of its 5.6 million merchant stores natively queryable by AI agents.[6] OpenAI and Stripe co-developed the Agentic Commerce Protocol (ACP), enabling purchase transactions directly inside ChatGPT.[7] Google released the Universal Commerce Protocol (UCP) in partnership with Shopify, Etsy, Wayfair, Target, and Walmart.[8]
These are not projections about a distant future. During Cyber Week 2025, AI and agents drove $67 billion in sales, influencing 20% of all purchases.[12] Nearly 60% of Americans now use generative AI for online shopping.[17]
In this context, we identified a critical gap while building RetrieveAI: no metric existed to measure whether a website was actually accessible to an AI agent attempting a transaction. The entire AI visibility research community measures LLM outputs: what AI systems say about brands. Nobody measures the inputs: whether an AI agent can find, parse, understand, and act on a product.
2. Why Existing Frameworks Are Insufficient
The state of AI readiness measurement in e-commerce can be summarised as follows: frameworks exist either at the infrastructure layer (can AI agents technically connect?) or the visibility layer (does AI mention my brand?), but none bridge the two with a unified commerce-specific scoring model.
The GEO research community, including the foundational Princeton/IIT Delhi paper "GEO: Generative Engine Optimization",[21] focuses on optimising for LLM citation frequency in response outputs. Tools like Profound track share of voice across 10+ AI platforms. AthenaHQ measures entity signal density. Otterly scores 25+ on-page factors. None contain a dimension for MCP compatibility, tool-call response structure, or inventory data completeness.
When building the commerce phase of RetrieveAI's 17-phase audit pipeline, we found no existing scoring rubric that answered the question an AI agent actually asks: "Is this product retrievable, parseable, priced, and purchasable through a machine interface?" The Commerce AI Readiness Framework was built to answer that question.
3. The Commerce AI Readiness Framework (CARF)
CARF comprises four dimensions, each contributing 25% to a composite Commerce AI Readiness Score (CARS) between 0 and 100. The equal weighting reflects a deliberate design choice: all four dimensions are necessary conditions for full agentic accessibility, and weakness in any single dimension substantially degrades agent performance regardless of the others.
CARF Dimension Weighting Commerce AI Readiness Score (CARS)
CARS = (D1 + D2 + D3 + D4) / 4 · Each dimension scored 0–100 · Composite 0–100
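The composite formula above can be sketched in a few lines. This is a minimal illustration, not RetrieveAI's implementation; the names `DimensionScores` and `compute_cars` are hypothetical, and the range check is an assumption about how out-of-range inputs should be handled.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class DimensionScores:
    """The four CARF dimension scores, each on a 0-100 scale."""
    mcp_compatibility: float
    api_readiness: float
    tool_call_compatibility: float
    inventory_simulation: float


def compute_cars(scores: DimensionScores) -> float:
    """Return the equal-weighted composite: (D1 + D2 + D3 + D4) / 4."""
    values = astuple(scores)
    for value in values:
        # Assumption: reject inputs outside the documented 0-100 range.
        if not 0 <= value <= 100:
            raise ValueError(f"dimension score out of range: {value}")
    return sum(values) / len(values)


print(compute_cars(DimensionScores(100, 80, 60, 40)))  # 70.0
```

Note that under a plain arithmetic mean, a site scoring 100 on three dimensions and 0 on the fourth still receives a composite of 75, so the dimension-level breakdown should be read alongside the composite.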
Dimension 1: MCP Compatibility (25%)
Measures whether the domain exposes a functional Model Context Protocol endpoint that AI agents can use to discover and retrieve structured product data. MCP is the emerging universal standard for AI-to-data connectivity, with 10,000+ public servers and adoption by every major AI platform as of December 2025.[3]
High scores require: a discoverable MCP server endpoint, support for product listing and detail tool calls, valid JSON-RPC 2.0 response structure, and schema.org/Product entity alignment in tool responses. Shopify confirmed all 5.6M+ stores expose MCP endpoints by default since Summer 2025.[6] Outside Shopify, MCP adoption in e-commerce remains near zero, making this dimension the single most discriminating signal between agentic-ready and agentic-blind operators.
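One of the requirements above, a valid JSON-RPC 2.0 response structure, can be checked mechanically. The sketch below applies the response rules of the JSON-RPC 2.0 specification (a `jsonrpc` member equal to "2.0", an `id` member, and exactly one of `result` or `error`). The function name `is_valid_jsonrpc_response` is hypothetical, and the check says nothing about how RetrieveAI actually scores this signal.

```python
import json


def is_valid_jsonrpc_response(raw: str) -> bool:
    """Check that a raw response body is a structurally valid JSON-RPC 2.0 response."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(message, dict) or message.get("jsonrpc") != "2.0":
        return False
    if "id" not in message:
        return False
    has_result = "result" in message
    has_error = "error" in message
    # The specification requires exactly one of "result" or "error".
    if has_result == has_error:
        return False
    if has_error:
        error = message["error"]
        if not isinstance(error, dict):
            return False
        code = error.get("code")
        # bool is a subclass of int in Python, so exclude it explicitly.
        code_ok = isinstance(code, int) and not isinstance(code, bool)
        return code_ok and isinstance(error.get("message"), str)
    return True


ok = '{"jsonrpc": "2.0", "id": 1, "result": {"tools": []}}'
bad = '{"id": 1, "result": {"tools": []}}'
print(is_valid_jsonrpc_response(ok), is_valid_jsonrpc_response(bad))  # True False
```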
Scoring signals
Dimension 2: API Readiness (25%)
Measures whether the site exposes machine-readable product data endpoints that AI agents can programmatically query without MCP: the pre-MCP layer of agentic accessibility. A site without MCP can still partially serve AI agents through RESTful or GraphQL product APIs.
High scores require: at least one publicly accessible product data endpoint (e.g. /products.json, /api/products, /graphql), JSON response format with structured product objects, and endpoint density above a minimum threshold. OpenAPI/Swagger specification presence is scored as a strong positive signal. RetrieveAI's API detection phase (Phase 4a) extracts endpoints from JavaScript bundles, network request patterns, link-rel headers, and sitemap metadata.
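As an illustration of endpoint extraction from JavaScript bundles, the sketch below scans bundle text for quoted paths resembling the examples named above (`/products.json`, `/api/...`, `/graphql`). The regular expression and the function name `extract_endpoints` are assumptions made for this example; the actual detection logic in Phase 4a is not described in enough detail to reproduce.

```python
import re

# Assumption: endpoints appear as quoted string literals in the bundle.
# Group 1 captures the path; an optional query string is matched but discarded.
ENDPOINT_PATTERN = re.compile(
    r"""["'`](/(?:api/[\w\-./]+|graphql|products\.json))(?:\?[^"'`]*)?["'`]"""
)


def extract_endpoints(js_source: str) -> list[str]:
    """Return candidate API paths in order of first appearance, without duplicates."""
    found: list[str] = []
    for match in ENDPOINT_PATTERN.finditer(js_source):
        path = match.group(1)
        if path not in found:
            found.append(path)
    return found


bundle = """
fetch("/api/products?limit=10");
client.query('/graphql');
const feed = "/products.json";
fetch("/api/products");
"""
print(extract_endpoints(bundle))  # ['/api/products', '/graphql', '/products.json']
```

A pattern match only shows that a path is referenced in client code. Whether the endpoint is publicly accessible and returns structured product objects still has to be confirmed by requesting it.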
Scoring signals
Dimension 3: Tool-Call Compatibility (25%)
Measures the structural quality of data returned in tool-call responses, the dimension most frequently overlooked by existing readiness frameworks. An AI agent that successfully calls a product endpoint but receives malformed, incomplete, or poorly typed data cannot make a reliable purchase decision.
High scores require: consistent parameter typing (string/number/boolean without mixed types), enumerated value sets for categorical fields, complete price fields with currency codes, product identifiers (SKU/GTIN/MPN) present and parseable, and response latency below the agent timeout threshold. A data.world benchmark found that LLMs grounded in knowledge graphs achieve 300% higher accuracy than those relying on unstructured text.[20]
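Three of these requirements (consistent typing, complete price fields, and parseable identifiers) lend themselves to simple structural checks. The sketch below is illustrative only: the field names `price`, `currency`, `sku`, `gtin`, and `mpn` are assumed keys for a hypothetical product payload, and the three-uppercase-letter currency test is a shape check rather than validation against the ISO 4217 code list.

```python
from typing import Any

IDENTIFIER_FIELDS = ("sku", "gtin", "mpn")  # assumed key names


def json_type(value: Any) -> str:
    """Map a Python value to its JSON type name (int and float are both 'number')."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "array" if isinstance(value, list) else "object"


def mixed_type_fields(products: list[dict[str, Any]]) -> set[str]:
    """Return field names whose JSON type differs between product records."""
    seen: dict[str, set[str]] = {}
    for product in products:
        for key, value in product.items():
            seen.setdefault(key, set()).add(json_type(value))
    return {key for key, types in seen.items() if len(types) > 1}


def has_complete_price(product: dict[str, Any]) -> bool:
    """True when price is numeric and currency looks like a 3-letter uppercase code."""
    price, currency = product.get("price"), product.get("currency")
    return (
        json_type(price) == "number"
        and isinstance(currency, str)
        and len(currency) == 3
        and currency.isalpha()
        and currency.isupper()
    )


def has_identifier(product: dict[str, Any]) -> bool:
    """True when at least one of SKU/GTIN/MPN is a non-empty string."""
    return any(
        isinstance(product.get(name), str) and bool(product[name].strip())
        for name in IDENTIFIER_FIELDS
    )


products = [
    {"sku": "A1", "price": 19.99, "currency": "USD"},
    {"sku": "A2", "price": "24.00", "currency": "USD"},  # price typed as string
]
print(mixed_type_fields(products))  # {'price'}
```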
Scoring signals
Dimension 4: Inventory Simulation (25%)
Measures whether real-time inventory state is machine-accessible, the final link in the agentic purchase chain. An AI agent comparing products needs not just price and description, but current availability, variant-level stock counts, and estimated shipping windows. Without this, agent recommendations are potentially stale at transaction time.
High scores require: schema.org/Offer with availability property present and up-to-date, variant-level inventory data accessible, and a cart or checkout simulation endpoint that confirms actual purchase feasibility. Sites where availability signals are embedded only in JavaScript-rendered state, and are therefore invisible to non-headless agents, receive significantly penalised scores. RetrieveAI's rendering gap audit (Phase 3.5) specifically detects this.
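To make the raw-HTML case concrete, the following sketch reads schema.org availability values from JSON-LD blocks without executing any JavaScript, which approximates what a non-headless agent can see. It uses only the Python standard library. The class and function names are hypothetical, and the sketch handles only top-level `Product` objects, not `@graph` wrappers, microdata, or freshness checks.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is ignored in this sketch


def raw_html_availability(html: str) -> list[str]:
    """Return Offer availability values visible in the raw (unrendered) HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found: list[str] = []
    for block in parser.blocks:
        for item in block if isinstance(block, list) else [block]:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers", [])
            for offer in offers if isinstance(offers, list) else [offers]:
                if isinstance(offer, dict) and isinstance(offer.get("availability"), str):
                    found.append(offer["availability"])
    return found
```

An empty result on a product page whose rendered view shows stock status would be one indication that availability lives only in JavaScript-rendered state.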
Scoring signals
4. Weighting Rationale: Why Equal Distribution?
The equal 25/25/25/25 weighting requires explicit justification, as alternative schemes are plausible: one might argue MCP compatibility should carry greater weight given its status as the emerging universal standard.
We arrived at equal weighting through three arguments tested against RetrieveAI's pipeline behaviour:
The complementarity argument: The four dimensions are not substitutes but complements. A score of 100 on MCP Compatibility provides no commercial benefit if Inventory Simulation scores 0: the agent can discover products it cannot verify as available. This interdependence argues against any single dimension dominating the composite.
The temporal stability argument: MCP is the current dominant standard, but OpenAI's ACP, Google's UCP, and Anthropic's MCP are all gaining commercial adoption simultaneously. An overweight on MCP today risks penalising sites that implement ACP or UCP equally well.
The pipeline argument: In RetrieveAI's implementation, each dimension corresponds to a distinct data collection phase. Equal weighting decouples the scoring model from temporal market conditions, making the framework more stable as a long-term benchmark.
5. Scoring Grades and Benchmark Thresholds
CARF produces a composite CARS between 0 and 100. The following grade thresholds are derived from the score distribution produced by RetrieveAI's commerce phase:
CARS Grade Thresholds
6. What Existing Tools Miss
Mapping CARF against the most prominent AI visibility tools as of April 2026 reveals a consistent pattern: every existing tool either monitors LLM outputs or audits content quality, but none score all four dimensions of commerce agent accessibility.
| Tool | MCP Compatibility | API Readiness | Tool-Call Compatibility | Inventory Simulation |
|---|---|---|---|---|
| RetrieveAI CARF | ✓ Full | ✓ Full | ✓ Full | ✓ Full |
| Profound | ✗ | ✗ | ✗ | ✗ |
| AthenaHQ | ✗ | ✗ | ~ Partial | ✗ |
| Conductor | ~ Outbound only | ~ Limited | ✗ | ✗ |
| LLMClicks.ai | ✗ | ~ Checklist | ✗ | ✗ |
| Goodie AI | ✗ | ✗ | ✗ | ~ Schema only |
The comparison reveals that no existing commercial tool implements all four CARF dimensions. Conductor is the closest, having built MCP server infrastructure, but this enables Conductor's platform to connect to AI tools, not to score a website's readiness to serve AI agents. These are fundamentally different problems.
7. The Structured Data–Agent Performance Relationship
A central theoretical assumption of CARF is that structured, machine-readable product data improves AI agent task performance. This assumption has strong empirical support from multiple research streams.
Microsoft Research's "Table Meets LLM" study (WSDM '24, 91 citations) found that HTML-formatted data outperformed CSV/TSV by 6.76% on structured task benchmarks, demonstrating that format and structure meaningfully affect agent performance.[24] Research from data.world found that LLMs grounded in knowledge graphs achieve 300% higher accuracy versus unstructured text.[20]
Microsoft's Bing/Copilot team explicitly confirmed at SMX Munich in March 2025 that "schema markup helps Microsoft's LLMs understand content" and that freshness of structured data is specifically valued.[16] AccuraCast's analysis of 9,000 AI citation sources found that 81% of AI-cited pages include schema markup.[14]
An AI agent that can discover a product via MCP but cannot parse its pricing structure due to poor tool-call response formatting is no more useful than one that cannot connect at all. Commerce AI readiness is a chain; its strength is determined by the weakest link.
8. Implementation in RetrieveAI
The Commerce AI Readiness Framework is fully implemented in RetrieveAI's audit pipeline as Phase 19, following 16 prior phases that establish the content, entity, and structural context required for accurate commerce scoring.
- Phase 3 (Headless Crawl): Playwright-based rendering that executes JavaScript to surface dynamically loaded product data, cart states, and API calls invisible to lightweight crawlers. Essential for scoring Dimension 4's JS-independence signal.
- Phase 3.5 (Rendering Gap Audit): Explicit comparison between raw HTML and rendered content. Sites with high rendering gaps on product data receive Dimension 4 penalties regardless of schema quality.
- Phase 4a (API Detection): Extraction of API endpoints from JavaScript bundles, network patterns, and sitemap metadata. Feeds Dimension 2 scoring.
- Phase 4b (Commerce Data Extraction): Extraction of product schema (JSON-LD and microdata), offer data, pricing, availability, SKU, and variant structure. Feeds Dimensions 3 and 4.
- Phase 19 (Commerce Audit): Assembly of the four CARF dimension scores into a weighted composite CARS, alongside dimension-level breakdowns and failed service flags.
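The raw-versus-rendered comparison in Phase 3.5 can be illustrated with a simple ratio. The metric below, the share of product fields visible after rendering that are absent from the raw HTML, is an assumption made for this example; the source does not define how RetrieveAI quantifies a rendering gap, and `rendering_gap` is a hypothetical name.

```python
def rendering_gap(raw_fields: set[str], rendered_fields: set[str]) -> float:
    """Fraction of rendered product fields that are missing from the raw HTML.

    0.0 means everything visible after rendering is also in the raw HTML;
    1.0 means every field depends on JavaScript execution.
    """
    if not rendered_fields:
        return 0.0  # nothing was rendered, so there is no gap to measure
    missing = rendered_fields - raw_fields
    return len(missing) / len(rendered_fields)


raw = {"name", "price"}
rendered = {"name", "price", "availability", "sku"}
print(rendering_gap(raw, rendered))  # 0.5
```

In this example, availability and SKU appear only after rendering, so half of the product data would be invisible to a crawler that does not execute JavaScript.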
Within RetrieveAI's overall AI Visibility Score, the CARS contributes 10% (alongside Entity Strength at 45% and Prompt Coverage at 45%). Commerce readiness is treated as a necessary but insufficient condition for AI visibility, reflecting the current reality that most AI visibility use cases remain informational rather than transactional. As agentic commerce grows, this weighting will be revisited.
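The 45/45/10 split can be written out as a weighted sum. The sketch assumes that Entity Strength and Prompt Coverage are reported on the same 0 to 100 scale as CARS, which the text does not state; the function name `ai_visibility_score` is hypothetical.

```python
# Weights as stated in the text: Entity Strength 45%, Prompt Coverage 45%, CARS 10%.
WEIGHTS = {"entity_strength": 0.45, "prompt_coverage": 0.45, "cars": 0.10}


def ai_visibility_score(entity_strength: float, prompt_coverage: float, cars: float) -> float:
    """Weighted sum of the three components (all assumed to be on a 0-100 scale)."""
    return (
        WEIGHTS["entity_strength"] * entity_strength
        + WEIGHTS["prompt_coverage"] * prompt_coverage
        + WEIGHTS["cars"] * cars
    )


# 0.45 * 80 + 0.45 * 60 + 0.10 * 40 = 36 + 27 + 4 = 67
print(round(ai_visibility_score(80, 60, 40), 2))  # 67.0
```

At a 10% weight, moving CARS from 0 to 100 shifts the overall score by at most 10 points, which matches the text's framing of commerce readiness as a minor contributor for now.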
Key Takeaways
- No existing AI visibility tool implements all four dimensions of commerce agent accessibility
- MCP compatibility is the single most discriminating signal: adoption outside Shopify is near zero
- Tool-call response quality is distinct from endpoint presence and has significant practical consequences
- JavaScript-dependent inventory data is effectively invisible to non-headless AI agents
- Equal 25/25/25/25 weighting reflects complementarity: weakness in any one dimension degrades the whole chain
- McKinsey projects $3–5T global agentic commerce by 2030; CARF provides the first systematic tool for measuring readiness
Conclusion
The question is no longer whether AI agents will purchase on behalf of consumers. The question is whether your store can be found, understood, and transacted with when they do.
The Commerce AI Readiness Framework (CARF) presents the first structured scoring model for measuring e-commerce accessibility to autonomous AI agents, built through the implementation of RetrieveAI. The framework makes three original contributions: defining MCP compatibility as a first-class readiness dimension, distinguishing tool-call response quality from endpoint presence, and embedding commerce scoring within a multi-phase audit that accounts for JavaScript-dependent content invisible to non-headless agents.
A brand that scores an F on CARS in 2026 will be structurally excluded from the agentic commerce channel as it grows toward Bain's projected 25% of US e-commerce by 2030.[11] The cost of that exclusion compounds annually.
References
- [1] Anthropic. "Introducing the Model Context Protocol." November 25, 2024. anthropic.com/news/model-context-protocol
- [3] Anthropic. "Donating MCP and establishing the Agentic AI Foundation." December 9, 2025. anthropic.com
- [4] Altman, S. (OpenAI). Quoted in TechCrunch, March 26, 2025.
- [6] Shopify. "About Storefront MCP." shopify.dev/docs/apps/build/storefront-mcp
- [7] Stripe / OpenAI. "Agentic Commerce Protocol." September 2025. stripe.com
- [8] Google Developers Blog. "Universal Commerce Protocol (UCP)." developers.googleblog.com
- [9] Gartner. "60% of Brands Will Use Agentic AI." January 15, 2026. gartner.com
- [10] Gartner. "Top Predictions 2026." October 21, 2025. gartner.com
- [11] Bain & Company (via Digital Commerce 360). "Agentic AI: 25% of US e-commerce by 2030." December 2025.
- [12] Salesforce. "Cyber Week 2025." December 5, 2025. salesforce.com
- [13] McKinsey. "The Agentic Commerce Opportunity." October 17, 2025. mckinsey.com
- [14] AccuraCast. "Does Schema Markup Increase Generative Search Visibility?" December 2025. accuracast.com
- [16] Canel, F. (Microsoft). SMX Munich, March 2025. searchengineland.com
- [17] Omnisend Survey. "60% of Americans Use Gen AI for Shopping." July 2025. prnewswire.com
- [20] data.world benchmark. "LLMs + knowledge graphs: 300% higher accuracy." 2023.
- [21] Aggarwal et al. "GEO: Generative Engine Optimization." Princeton/IIT Delhi. arXiv:2311.09735.
- [24] Sui et al. (Microsoft Research). "Table Meets LLM." WSDM '24. arXiv:2305.13062.