We Examined the AI Crawl Policies of Hosting Companies. The Industry Is Mostly Silent. – PurleyHosting

admin
January 30, 2026

The hosting industry has spent 2026 talking about an “agent-ready web.” Vendors advertise crawler-level controls, AI-aware infrastructure, and machine-readable publishing standards. The messaging suggests a mature ecosystem where companies actively manage how AI systems interact with their content.

To verify that assumption, we stopped looking at marketing and inspected the actual machine-readable declarations these companies publish themselves.

We collected and analyzed robots.txt files from 736 companies across hosting providers, CDNs, cloud platforms, registrars, and related infrastructure services. The goal was simple: determine what these companies explicitly say to automated systems about AI crawling and training.

What we found is not a coordinated strategy, but a widespread absence of one.

Scope of the Audit

On June 11, 2026, we retrieved robots.txt files from 736 domains. Of these, 573 returned readable, valid responses suitable for analysis.

Each file was evaluated across three signals:

Whether it explicitly names AI crawlers (such as GPTBot, ClaudeBot, or Google-Extended)
Whether it includes a Content-Signal declaration (a newer structured policy format for AI usage)
Whether it references an llms.txt file (a machine-readable catalog designed for AI agents)

These three elements form the basic “AI posture” layer: how a site declares what AI systems may do with its content.
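As a rough illustration of how the three signals could be checked against a single file, here is a minimal Python sketch. It is not the audit's actual tooling: the token list is a short, non-exhaustive assumption, the function name `ai_posture` is hypothetical, and counting an llms.txt mention inside a comment as a reference is a choice made for this example.

```python
# Illustrative, non-exhaustive token list (an assumption, not the audit's list).
AI_CRAWLER_TOKENS = (
    "gptbot", "claudebot", "google-extended",
    "ccbot", "perplexitybot", "meta-externalagent",
)


def ai_posture(robots_txt: str) -> dict:
    """Report which of the three AI-posture signals a robots.txt body shows."""
    names_ai = has_signal = mentions_llms = False
    for raw in robots_txt.splitlines():
        lowered = raw.lower()
        # Assumption: any mention of llms.txt counts, even inside a comment.
        if "llms.txt" in lowered:
            mentions_llms = True
        # Drop trailing comments, then split "Field: value".
        field, sep, value = lowered.split("#", 1)[0].partition(":")
        if not sep:
            continue
        field, value = field.strip(), value.strip()
        if field == "user-agent" and any(t in value for t in AI_CRAWLER_TOKENS):
            names_ai = True
        elif field == "content-signal":
            has_signal = True
    return {
        "names_ai_crawler": names_ai,
        "content_signal": has_signal,
        "references_llms_txt": mentions_llms,
        "any_ai_signal": names_ai or has_signal or mentions_llms,
    }
```

A file that only contains generic rules such as "User-agent: *" and "Disallow: /admin" comes back with all four values false, which is what "silent" means in this audit.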

The Core Finding: Near Silence

Across the 573 readable robots.txt files, explicit AI policy is rare.

Only 47 mention any AI crawler at all
Only 25 include a Content-Signal declaration
Only 8 reference llms.txt
In total, just 58 companies (around 10%) show any AI-specific machine-readable policy

That leaves roughly 90% of the analyzed industry with no explicit statement about AI crawling or training behavior in robots.txt.
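The percentages follow directly from the counts above. The short Python sketch below only redoes that arithmetic; the variable and function names are ours, and no figures beyond those already reported are introduced.

```python
RETRIEVED = 736  # domains requested
READABLE = 573   # files that returned a readable, valid response

counts = {
    "names an AI crawler": 47,
    "Content-Signal declaration": 25,
    "references llms.txt": 8,
    "any AI-specific signal": 58,
}


def share(part: int, whole: int) -> str:
    """Format part/whole as a percentage with one decimal place."""
    return f"{part / whole:.1%}"


for label, n in counts.items():
    print(f"{label}: {n}/{READABLE} = {share(n, READABLE)}")

# Files with no AI-specific signal at all.
silent = READABLE - counts["any AI-specific signal"]
print(f"silent: {silent}/{READABLE} = {share(silent, READABLE)}")

# The three per-signal counts add up to more than 58 because some files
# show more than one signal; this is the size of that double counting.
per_signal_sum = 47 + 25 + 8
double_counted = per_signal_sum - counts["any AI-specific signal"]
print(f"per-signal sum: {per_signal_sum}, double counted: {double_counted}")
```

So 58 of 573 is 10.1%, and the remaining 515 files are 89.9%, which is where the "around 10%" and "roughly 90%" figures come from.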

The absence is most visible among large, mainstream hosting brands—precisely the companies most heavily marketing “AI-ready” infrastructure.

What the Silence Actually Looks Like

Silence is not uniform. It takes different forms:

Some companies publish a full robots.txt but never mention AI systems at all
Others include outdated crawler rules that predate modern AI agents
Some provide completely empty files that implicitly allow all access
A smaller group has no robots.txt file at all

From the perspective of machine-readable policy, all of these result in the same outcome: no declared AI governance.

The Small Minority That Does Something

The 58 companies with any AI-related configuration fall into several distinct behavioral patterns rather than a single shared standard.

1. Open Access Models

A small set of infrastructure providers explicitly allow broad AI crawling. These companies treat AI systems as general web clients and do not separate training from retrieval. The logic is simple: if content is public, it is available.

2. Selective Restriction

Another group allows general access but differentiates between use cases. They may permit indexing or real-time answering while restricting training. This reflects a more nuanced legal and product distinction between “reading” and “learning.”
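One way this distinction can be expressed is a Content-Signal line with separate values per use case. The Python sketch below reads such a line into a dictionary. The signal names used in the example (search, ai-input, ai-train) and the yes/no values are taken from the published Content Signals proposal as we understand it, not from any specific file in our dataset, and the parser deliberately ignores which User-agent group the line belongs to.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect key=yes/no pairs from every Content-Signal line in a file."""
    signals: dict[str, bool] = {}
    for raw in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        field, sep, value = raw.split("#", 1)[0].partition(":")
        if not sep or field.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            setting = setting.strip().lower()
            # Skip anything that is not a plain key=yes or key=no pair.
            if eq and setting in ("yes", "no"):
                signals[key.strip().lower()] = setting == "yes"
    return signals


# Hypothetical example of a "read, but do not train" declaration.
SELECTIVE = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
"""

print(parse_content_signals(SELECTIVE))
```

Under these assumptions, the example file permits search indexing and real-time answering while declining training use, which is the pattern described above.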

3. Hard Refusal

A subset explicitly blocks AI training crawlers while still allowing normal search bots. These files often list major AI agents individually and deny them access based on policy rather than technical limitation.
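To make the pattern concrete, the Python sketch below checks a hypothetical hard-refusal file with the standard library's urllib.robotparser. The file contents, the helper name `may_fetch`, and the example.com URL are illustrative assumptions, not taken from any company in the dataset.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical hard-refusal file: AI crawlers listed one by one and denied,
# everyone else allowed.
HARD_REFUSAL = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, agent: str,
              url: str = "https://example.com/docs/") -> bool:
    """Return whether the given user agent may fetch the URL under this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(agent, may_fetch(HARD_REFUSAL, agent))
```

The two named AI crawlers are refused while a search bot that is not listed falls through to the wildcard group and is allowed. Note that robots.txt only states a preference; it does not technically prevent access.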

4. Catalog-First Strategy

Instead of focusing on access control, some companies publish llms.txt files that summarize their products, pricing, and documentation in structured form. This shifts the emphasis from restriction to controlled presentation of information.

5. Fully Outsourced Policy

A notable pattern is infrastructure-driven configuration. Some hosting brands do not define their own AI rules at all; instead, their CDN or edge provider supplies a standardized robots policy. In these cases, “AI posture” is effectively inherited rather than designed.

The Distribution Is Heavily Skewed

When broken down, the imbalance is stark:

Roughly 10% of companies show any AI-related signal
Roughly 90% show none
Most meaningful configurations come from a small cluster of infrastructure providers rather than mainstream hosts

Even more striking is where configuration does not appear: major hosting brands, hyperscale cloud providers, and large SaaS platforms overwhelmingly remain silent.

This includes many of the companies most frequently referenced in “AI-ready hosting” marketing.

The Naming Reality: Who Gets Recognized

Among files that do name AI crawlers, the distribution is uneven.

Certain crawlers appear repeatedly across the dataset, especially OpenAI’s GPTBot, which is the most frequently referenced.

Other widely used crawlers include:

Google-Extended (Google’s robots.txt token for AI use; it has no separate crawler of its own)
Common Crawl infrastructure bots
Anthropic’s ClaudeBot
Meta’s external agent crawlers
Various Perplexity and indexing bots

One pattern stands out: among the companies that name AI crawlers at all, most treat OpenAI’s crawler as the default reference point. Others appear inconsistently, often depending on whether a company explicitly chose to enumerate them.

This suggests that “AI crawler policy” is still reactive rather than standardized.
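A tally like this can be produced by counting, for each crawler token, how many files name it in a User-agent line. The Python sketch below shows the idea on a toy corpus. The token list is a non-exhaustive assumption, the domains and file contents are invented for illustration, and the resulting counts say nothing about the real dataset.

```python
from collections import Counter

# Illustrative, non-exhaustive token list (an assumption).
TOKENS = ("GPTBot", "Google-Extended", "CCBot", "ClaudeBot",
          "meta-externalagent", "PerplexityBot")


def crawler_mentions(files: dict[str, str]) -> Counter:
    """Count how many files name each token in a User-agent line."""
    counts: Counter = Counter()
    for text in files.values():
        named = set()  # each file counts at most once per token
        for raw in text.splitlines():
            field, sep, value = raw.split("#", 1)[0].partition(":")
            if sep and field.strip().lower() == "user-agent":
                agent = value.strip().lower()
                named.update(t for t in TOKENS if t.lower() in agent)
        counts.update(named)
    return counts


# Toy corpus with hypothetical domains.
corpus = {
    "a.example": "User-agent: GPTBot\nDisallow: /\nUser-agent: ClaudeBot\nDisallow: /\n",
    "b.example": "User-agent: GPTBot\nDisallow: /private\nUser-agent: gptbot\nAllow: /\n",
    "c.example": "User-agent: *\nDisallow: /admin\n",
}

print(crawler_mentions(corpus).most_common())
```

Counting per file rather than per line keeps a file that repeats the same crawler several times from inflating that crawler's total.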

No Industry Layer Is Actually Ahead

When segmented by infrastructure category—CDNs, cloud providers, PaaS platforms, managed WordPress hosts, and mass-market hosting—the result is consistent:

No segment shows meaningful leadership.

CDNs are not more configured than shared hosting providers
Cloud platforms do not outperform registrars
Developer platforms show the same level of silence as legacy hosting

Technical sophistication does not correlate with policy clarity.

The Catalog Layer Is More Active Than Policy

An unexpected pattern emerges when comparing robots.txt policy with llms.txt adoption.

More companies publish llms.txt catalogs than explicitly reference them in robots.txt. However, many of these files appear without discovery links or standard integration.

These catalogs fall into four categories:

Well-structured, useful documentation indexes
Automatically generated files produced by SEO tools
Minimal text descriptions without actionable links
Large, overloaded dumps that exceed practical size for AI use
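A first-pass triage of these catalogs can be automated with simple heuristics. The Python sketch below is one such approach and not the method used for this audit: the size cutoff is a hypothetical value chosen for illustration, links are detected only in Markdown form, and it makes no attempt to tell auto-generated files apart from hand-written ones, which needs manual review.

```python
import re

# Markdown links of the form [title](https://...).
LINK = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")

# Hypothetical cutoff for "too large to be practical"; tune as needed.
MAX_BYTES = 100_000


def triage_llms_txt(text: str) -> str:
    """Roughly sort an llms.txt body into overloaded, minimal, or structured."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        return "overloaded"
    if not LINK.findall(text):
        return "minimal"  # prose only, nothing an agent can follow
    return "structured"


# Hypothetical catalog with one actionable link.
example = (
    "# Example Hosting\n\n"
    "> Plans and documentation for Example Hosting.\n\n"
    "## Docs\n\n"
    "- [Pricing](https://example.com/pricing.md): current plans\n"
)

print(triage_llms_txt(example))
print(triage_llms_txt("We host websites."))
```

The size check runs first so that a very large file is flagged as overloaded even if it happens to contain well-formed links.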

This creates a paradox: companies are more willing to describe their content for machines than to define access rules for them.

The Structural Gap Between Marketing and Reality

The key contradiction is not technical—it is declarative.

Companies actively market AI-readiness, yet:

Most do not define how AI systems may access their content
Most do not specify training permissions
Most do not publish machine-readable intent beyond basic crawler defaults

In practice, “AI-ready infrastructure” is rarely backed by explicit AI policy at the edge layer.

What the Data Actually Suggests

Several conclusions emerge from the dataset:

Explicit AI policy in hosting infrastructure is still uncommon
When it exists, it is usually recent and unevenly adopted
Infrastructure providers often define policy indirectly for downstream brands
Most companies have not yet formalized how they want AI systems to interact with their content
Silence is the dominant default, not an active choice

The Real Divide Is Not Technical

The industry is often described as split between “open” and “closed” approaches to AI crawling. The data does not support that framing.

The real divide is between:

Companies that have explicitly thought about AI interaction and documented it
Companies that have not yet addressed it at all in machine-readable form

Everything else—openness, restriction, cataloging—is secondary to that first step.

Closing Observation

The hosting industry is building infrastructure for an agent-driven web while most of its own properties do not declare how they participate in it.

Only a small minority has written machine-readable rules for AI systems. Even fewer have converged on consistent standards. And in many cases, the policy is not written by the company at all, but inherited from upstream infrastructure.

If the agentic web is arriving, it is doing so faster than the industry is documenting its boundaries.