The hosting industry has spent 2026 talking about an “agent-ready web.” Vendors advertise crawler-level controls, AI-aware infrastructure, and machine-readable publishing standards. The messaging suggests a mature ecosystem where companies actively manage how AI systems interact with their content.
To verify that assumption, we stopped looking at marketing and inspected the actual machine-readable declarations these companies publish themselves.
We collected and analyzed robots.txt files from 736 companies across hosting providers, CDNs, cloud platforms, registrars, and related infrastructure services. The goal was simple: determine what these companies explicitly say to automated systems about AI crawling and training.
What we found is not a coordinated strategy, but a widespread absence of one.
Scope of the Audit
On June 11, 2026, we requested robots.txt from 736 domains. Of these, 573 (about 78%) returned readable, valid responses suitable for analysis.
Each file was evaluated across three signals:
- Whether it explicitly names AI crawlers (such as GPTBot, ClaudeBot, Google-Extended, etc.)
- Whether it includes a Content-Signal declaration (a newer structured policy format for AI usage)
- Whether it references an llms.txt file (a machine-readable catalog designed for AI agents)
These three elements form the basic “AI posture” layer: how a site declares what AI systems may do with its content.
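The three checks above can be sketched as a small classifier. This is a minimal illustration, not the audit's actual tooling: the token list is a partial, assumed sample of AI crawler names, the function name `ai_posture` is hypothetical, and treating any mention of "llms.txt" (including in comments) as a reference is a simplifying assumption.

```python
# Assumed, partial list of AI crawler user-agent tokens (illustrative only).
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot")


def ai_posture(robots_txt: str) -> dict:
    """Report which of the three AI signals appear in a robots.txt body."""
    signals = {
        "names_ai_crawler": False,
        "content_signal": False,
        "references_llms_txt": False,
    }
    for raw in robots_txt.splitlines():
        # Assumption: a mention anywhere on the line, even in a comment, counts.
        if "llms.txt" in raw.lower():
            signals["references_llms_txt"] = True

        # Drop trailing comments before reading the directive itself.
        directive = raw.split("#", 1)[0].strip()
        if ":" not in directive:
            continue
        field, value = (part.strip() for part in directive.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if any(tok.lower() in value.lower() for tok in AI_CRAWLER_TOKENS):
                signals["names_ai_crawler"] = True
        elif field == "content-signal":
            signals["content_signal"] = True

    # A file counts toward the "any AI-specific policy" total if any signal is set.
    signals["any_ai_signal"] = any(signals.values())
    return signals


sample = (
    "User-agent: GPTBot\n"
    "Disallow: /\n"
    "\n"
    "User-agent: *\n"
    "Content-Signal: search=yes, ai-train=no\n"
    "Allow: /\n"
    "# Catalog: https://example.com/llms.txt\n"
)
print(ai_posture(sample))
```

A file with only conventional rules, such as a single "User-agent: *" group, comes back with every signal set to False, which is what the audit counts as silence.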
The Core Finding: Near Silence
Across the 573 readable robots.txt files, explicit AI policy is rare.
- Only 47 mention any AI crawler at all
- Only 25 include a Content-Signal declaration
- Only 8 reference llms.txt
- In total, just 58 companies (around 10%) show any AI-specific machine-readable policy
That leaves roughly 90% of the analyzed industry with no explicit statement about AI crawling or training behavior in robots.txt.
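For readers who want to check the proportions, the arithmetic behind these figures can be reproduced directly from the counts reported above. The helper name `share` is ours; the numbers are the article's.

```python
retrieved = 736   # domains requested
readable = 573    # readable, valid robots.txt files

counts = {
    "names an AI crawler": 47,
    "Content-Signal declaration": 25,
    "references llms.txt": 8,
}
any_signal = 58   # files with at least one of the three signals


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(share(readable, retrieved))      # 77.9
for label, n in counts.items():
    print(label, share(n, readable))   # 8.2, 4.4, 1.4
print(share(any_signal, readable))     # 10.1

silent = readable - any_signal
print(silent, share(silent, readable)) # 515 89.9

# The three per-signal counts sum to 80, but only 58 files carry any signal,
# so some files must carry more than one signal.
overlap = sum(counts.values()) - any_signal
print(overlap)                         # 22
```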
The absence is most visible among large, mainstream hosting brands—precisely the companies most heavily marketing “AI-ready” infrastructure.
What the Silence Actually Looks Like
Silence is not uniform. It takes different forms:
- Some companies publish a full robots.txt but never mention AI systems at all
- Others include outdated crawler rules that predate modern AI agents
- Some provide completely empty files that implicitly allow all access
- A smaller group has no robots.txt file at all
From the perspective of machine-readable policy, all of these result in the same outcome: no declared AI governance.
The Small Minority That Does Something
The 58 companies with any AI-related configuration fall into several distinct behavioral patterns rather than a single shared standard.
1. Open Access Models
A small set of infrastructure providers explicitly allow broad AI crawling. These companies treat AI systems as general web clients and do not separate training from retrieval. The logic is simple: if content is public, it is available.
2. Selective Restriction
Another group allows general access but differentiates between use cases. They may permit indexing or real-time answering while restricting training. This reflects a more nuanced legal and product distinction between “reading” and “learning.”
3. Hard Refusal
A subset explicitly blocks AI training crawlers while still allowing normal search bots. These files often list major AI agents individually and deny them access based on policy rather than technical limitation.
4. Catalog-First Strategy
Instead of focusing on access control, some companies publish llms.txt files that summarize their products, pricing, and documentation in structured form. This shifts the emphasis from restriction to controlled presentation of information.
5. Fully Outsourced Policy
A notable pattern is infrastructure-driven configuration. Some hosting brands do not define their own AI rules at all; instead, their CDN or edge provider supplies a standardized robots policy. In these cases, “AI posture” is effectively inherited rather than designed.
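To make the difference between hard refusal and selective restriction concrete, here is a hedged sketch using Python's standard urllib.robotparser. Both robots.txt bodies are invented examples, not files from the audit. The Content-Signal line follows the "search=yes, ai-train=no" style used by the Content Signals proposal, and the helper names `can_fetch` and `content_signals` are ours.

```python
from urllib.robotparser import RobotFileParser

# Invented example of pattern 3: AI crawlers denied by name, everyone else allowed.
HARD_REFUSAL = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# Invented example of pattern 2: access stays open, training use is declared off.
SELECTIVE = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str = "https://example.com/docs") -> bool:
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def content_signals(robots_txt: str) -> dict:
    """Collect key=value pairs from Content-Signal lines (the stdlib parser ignores them)."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("content-signal:"):
            for pair in line.split(":", 1)[1].split(","):
                key, _, value = pair.partition("=")
                signals[key.strip()] = value.strip()
    return signals


print(can_fetch(HARD_REFUSAL, "GPTBot"))     # False
print(can_fetch(HARD_REFUSAL, "Googlebot"))  # True
print(can_fetch(SELECTIVE, "GPTBot"))        # True
print(content_signals(SELECTIVE))            # {'search': 'yes', 'ai-train': 'no'}
```

The last two lines show why selective restriction is a different kind of statement: a conventional parser still sees the content as fetchable, and the training restriction exists only as a declaration that a crawler has to choose to honor.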

The Distribution Is Heavily Skewed
When broken down, the imbalance is stark:
- Roughly 10% of companies show any AI-related signal
- Roughly 90% show none
- Most meaningful configurations come from a small cluster of infrastructure providers rather than mainstream hosts
Even more striking is where configuration does not appear: major hosting brands, hyperscale cloud providers, and large SaaS platforms overwhelmingly remain silent.
This includes many of the companies most frequently referenced in “AI-ready hosting” marketing.
The Naming Reality: Who Gets Recognized
Among files that do name AI crawlers, the distribution is uneven.
Certain crawlers appear repeatedly across the dataset, especially OpenAI’s GPTBot, which is the most frequently referenced.
Other crawlers named in these files include:
- Google’s AI-related extended crawler
- Common Crawl infrastructure bots
- Anthropic’s ClaudeBot
- Meta’s external agent crawlers
- Various Perplexity and indexing bots
One pattern stands out: most companies that name AI crawlers treat OpenAI’s GPTBot as the default reference point. Others appear inconsistently, often depending on whether a company explicitly chose to enumerate them.
This suggests that “AI crawler policy” is still reactive rather than standardized.
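A tally of this kind can be produced by counting, per file, which user agents are named. The sketch below is an assumption about method, not the audit's code: the three sample files are made up, the function names are ours, and each agent is counted once per file so that a long robots.txt does not inflate the totals.

```python
from collections import Counter


def named_agents(robots_txt: str) -> set:
    """Return the lowercased user-agent tokens a file names, excluding the wildcard."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agent = value.strip()
            if agent and agent != "*":
                agents.add(agent.lower())
    return agents


def tally(files) -> Counter:
    """Count how many files name each agent (one count per file per agent)."""
    counts = Counter()
    for text in files:
        counts.update(named_agents(text))
    return counts


# Made-up sample files, for illustration only.
files = [
    "User-agent: GPTBot\nDisallow: /\n",
    "User-agent: gptbot\nUser-agent: ClaudeBot\nDisallow: /\n",
    "User-agent: *\nAllow: /\n",
]
print(tally(files).most_common())  # [('gptbot', 2), ('claudebot', 1)]
```

Lowercasing the tokens matters in practice, since robots.txt user-agent matching is case-insensitive and files spell the same crawler in different ways.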
No Industry Layer Is Actually Ahead
When segmented by infrastructure category—CDNs, cloud providers, PaaS platforms, managed WordPress hosts, and mass-market hosting—the result is consistent:
No segment shows meaningful leadership.
- CDNs are not more configured than shared hosting providers
- Cloud platforms do not outperform registrars
- Developer platforms show the same level of silence as legacy hosting
Technical sophistication does not correlate with policy clarity.
The Catalog Layer Is More Active Than Policy
An unexpected pattern emerges when comparing robots.txt policy with llms.txt adoption.
More companies publish llms.txt catalogs than explicitly reference them in robots.txt. However, many of these files appear without discovery links or standard integration.
These catalogs fall into four categories:
- Well-structured, useful documentation indexes
- Automatically generated files produced by SEO tools
- Minimal text descriptions without actionable links
- Large, overloaded dumps that exceed practical size for AI use
This creates a paradox: companies are more willing to describe their content for machines than to define access rules for them.
The Structural Gap Between Marketing and Reality
The key contradiction is not technical—it is declarative.
Companies actively market AI-readiness, yet:
- Most do not define how AI systems may access their content
- Most do not specify training permissions
- Most do not publish machine-readable intent beyond basic crawler defaults
In practice, “AI-ready infrastructure” is rarely backed by an explicit, machine-readable AI policy on the vendor’s own site.
What the Data Actually Suggests
Several conclusions emerge from the dataset:
- Explicit AI policy in hosting infrastructure is still uncommon
- When it exists, it is usually recent and unevenly adopted
- Infrastructure providers often define policy indirectly for downstream brands
- Most companies have not yet formalized how they want AI systems to interact with their content
- Silence is the dominant default, not an active choice
The Real Divide Is Not Technical
The industry is often described as split between “open” and “closed” approaches to AI crawling. The data does not support that framing.
The real divide is between:
- Companies that have explicitly thought about AI interaction and documented it
- Companies that have not yet addressed it at all in machine-readable form
Everything else—openness, restriction, cataloging—is secondary to that first step.
Closing Observation
The hosting industry is building infrastructure for an agent-driven web while most of its own properties do not declare how they participate in it.
Only a small minority has written machine-readable rules for AI systems. Even fewer have converged on consistent standards. And in many cases, the policy is not written by the company at all, but inherited from upstream infrastructure.
If the agentic web is arriving, it is doing so faster than the industry is documenting its boundaries.