We Analyzed robots.txt Across Cloudflare's Network: Which AI Crawlers Get Blocked Most and Why (Q1 2026)

Q1 2026 analysis of robots.txt across Cloudflare's network reveals which AI crawlers get blocked most, crawl-to-refer ratios by operator, blocking trends, and a data-driven framework for your AI bot policy.

After analyzing robots.txt directives across Cloudflare's global network in Q1 2026, we found that GPTBot is the most blocked AI crawler, Anthropic's ClaudeBot crawls 20,583 pages for every single referral it returns, and 89.4% of all AI crawler traffic serves training or mixed purposes rather than search. These findings should reshape how website owners think about their robots.txt policy for AI crawlers.

Key findings from our Q1 2026 robots.txt analysis:

  • GPTBot is the most blocked AI crawler, appearing in more DISALLOW rules than any other AI bot, followed by CCBot, ClaudeBot, and Google-Extended
  • Anthropic's crawl-to-refer ratio is 20,583:1 — ClaudeBot crawls 20,583 pages for every single referral it sends back to publishers. OpenAI is 1,255:1. Meta sends zero referrals.
  • 89.4% of AI crawler traffic is training or mixed-purpose — only 8% is search-related and just 2.2% responds to actual user queries
  • 30.6% of all web traffic is bots — AI crawlers now account for a growing share. Googlebot alone generates 35.2% of all AI-categorized bot traffic.
  • Retail absorbs 28.1% of AI crawling — more than double any other industry. Computer Software is second at 13.6%.
  • ClaudeBot blocking kept growing during Q1: its share of DISALLOW rules rose from 9.6% in January to 10.1% by March (+0.51pp), while GPTBot stayed flat. Amazonbot grew fastest of all at +0.87pp.
  • Some AI bots are explicitly welcomed — PerplexityBot and ChatGPT-User appear more in ALLOW rules than DISALLOW, likely because they return traffic

Every robots.txt file is a policy decision. Allow this bot. Block that one. Ignore the rest. But most website owners are making these decisions with almost no data. We decided to fix that.

As CTO of TechnologyChecker.io, I oversee the crawling infrastructure that scans 50 million domains monthly. We see robots.txt files at scale every single day. But to understand the full picture, not just what our crawler encounters but what the entire internet is doing, we pulled Q1 2026 data from Cloudflare Radar, which aggregates traffic patterns across 81 million HTTP requests per second and 67 million DNS queries per second through 330 cities in 125+ countries.

Here's what the data actually says.

Which AI crawlers are blocked most in robots.txt?

[Figure: Chart showing GPTBot as the most blocked AI crawler in robots.txt DISALLOW rules across Cloudflare's network, Q1 2026]

GPTBot leads all AI crawlers in DISALLOW rules, but it also leads in ALLOW rules. The internet is genuinely split on OpenAI's bot. We pulled robots.txt directive data from Cloudflare Radar for the full Q1 2026 period (January 1 through March 31). Among all domains with DISALLOW rules targeting AI crawlers, here's the share each bot accounts for:

| AI Crawler | Operator | DISALLOW Share | ALLOW Share | Net Sentiment |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | 5.52% | 5.65% | Mixed: blocked AND allowed frequently |
| CCBot | Common Crawl | 5.08% | | Mostly blocked |
| ClaudeBot | Anthropic | 4.88% | 4.24% | Slightly more blocked than allowed |
| Google-Extended | Google | 4.44% | 4.29% | Slightly more blocked than allowed |
| Bytespider | ByteDance | 4.23% | | Mostly blocked |
| meta-externalagent | Meta | 3.82% | | Blocked, never explicitly allowed |
| Amazonbot | Amazon | 3.80% | | Mostly blocked |
| Applebot-Extended | Apple | 3.67% | | Mostly blocked |
| Googlebot | Google | 2.92% | 9.40% | Overwhelmingly allowed |

On a single day (March 30, 2026), Cloudflare parsed 4,047 robots.txt files. Of those, 557 mentioned GPTBot (13.8%), 466 mentioned ClaudeBot (11.5%), 452 mentioned CCBot (11.2%), and 434 mentioned Google-Extended (10.7%).
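
Mention shares like these can be computed for any collection of robots.txt files. Here is a minimal Python sketch, assuming you already hold the file bodies as strings. The `mentioned_agents` and `mention_share` helpers and the three sample files are hypothetical, and this is not a description of how Cloudflare Radar computes its figures.

```python
from collections import Counter

def mentioned_agents(robots_txt: str) -> set:
    """Return the lowercased user-agent tokens named in one robots.txt body."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and value.strip():
            agents.add(value.strip().lower())
    return agents

def mention_share(files: list) -> dict:
    """Percentage of files that mention each user agent at least once."""
    counts = Counter(agent for body in files for agent in mentioned_agents(body))
    return {agent: round(100 * n / len(files), 1) for agent, n in counts.items()}

# Hypothetical sample of three robots.txt bodies
sample = [
    "User-agent: GPTBot\nDisallow: /\n",
    "User-agent: gptbot\nAllow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n",
    "User-agent: *\nDisallow: /private/\n",
]
shares = mention_share(sample)

# The same arithmetic applied to the March 30 counts quoted above
gptbot_share_march_30 = round(100 * 557 / 4047, 1)
```

A mention is not the same as a block: the second sample file names GPTBot in order to allow it, which is why the mention share and the DISALLOW share are reported separately.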

Three patterns stand out.

GPTBot leads both blocking AND allowing. It's the most frequently mentioned AI crawler in robots.txt, period. Some sites block it, others explicitly allow it. Part of the explanation is that OpenAI's crawling is really two things: GPTBot collects training data, while a separate bot, OAI-SearchBot, feeds ChatGPT search results. Sophisticated site owners are starting to differentiate, blocking GPTBot (training) while allowing OAI-SearchBot (search). We'll come back to this.

Meta-ExternalAgent never appears in ALLOW rules. Every other major AI crawler shows up in both ALLOW and DISALLOW directives. Meta's bot is exclusively blocked or ignored. Nobody is going out of their way to welcome it. Given that it sends zero referral traffic (more on that below), this makes sense.

Googlebot is overwhelmingly allowed. 9.4% of ALLOW rules mention Googlebot compared to just 2.9% of DISALLOW rules, a 3.2x ratio favoring access. Website owners understand that blocking Googlebot means disappearing from search results entirely. No other bot gets this kind of preferential treatment.

External context: According to a BuzzStream study of top publishers, 79% of top news sites now block AI training bots via robots.txt, while only 14% of publishers block all AI bots completely. Our Cloudflare Radar data shows a similar selective blocking pattern across all industries, not just publishers.

Why are websites blocking AI crawlers? The crawl-to-refer ratio tells the story

[Figure: Scatter plot of crawl-to-refer ratios per AI platform, from DuckDuckGo at 1.5:1 to Anthropic at 20,583:1]

Blocking decisions aren't random. There's a straightforward economic calculation behind them: how much does this bot take from my site (crawl volume) versus how much does it send back (referral traffic)? Cloudflare Radar tracks this as a crawl-to-refer ratio.

Here's the Q1 2026 data:

| Operator | Crawl-to-Refer Ratio | Translation | Trend vs Q4 2025 |
| --- | --- | --- | --- |
| Anthropic | 20,583:1 | Crawls 20,583 pages per 1 referral sent | Worsened |
| OpenAI | 1,255:1 | Crawls 1,255 pages per 1 referral sent | Improved slightly |
| Perplexity | 111:1 | Crawls 111 pages per 1 referral sent | Stable |
| Microsoft | 32:1 | Crawls 32 pages per 1 referral sent | Stable |
| Mistral | 24:1 | Crawls 24 pages per 1 referral sent | |
| Yandex | 21:1 | Crawls 21 pages per 1 referral sent | Stable |
| Baidu | 5.2:1 | Crawls 5.2 pages per 1 referral sent | Stable |
| Google | 5.0:1 | Crawls 5 pages per 1 referral sent | Stable |
| ByteDance | 3.5:1 | Crawls 3.5 pages per 1 referral sent | Stable |
| DuckDuckGo | 1.5:1 | Nearly 1-to-1 | Stable |

Read those numbers carefully.

Anthropic crawls 20,583 pages for every single referral it sends back. That's not a typo. ClaudeBot is the most aggressive pure consumer of web content relative to what it returns. For context, Google crawls 5 pages per referral. DuckDuckGo is nearly 1-to-1. Anthropic's ratio is 4,117x worse than Google's.

OpenAI is 1,255:1 — better than Anthropic but still extractive. The ratio has improved slightly since ChatGPT began sending referral traffic through its browse and search features, but it's still 251x worse than Google.

Perplexity is the best among dedicated AI companies at 111:1. That's still 22x worse than Google, but it's a fundamentally different model. Perplexity's entire product is a search engine that links to sources. The ratio reflects this.

Blocking broadly tracks the ratio. The bots with the worst crawl-to-refer ratios (Anthropic, OpenAI) are among the most blocked in robots.txt, while Googlebot, at 5:1, is overwhelmingly allowed. The fit isn't perfect: ByteDance has a 3.5:1 ratio, yet Bytespider still accounts for 4.23% of DISALLOW rules. On the whole, though, website owners behave as we'd expect: they tolerate crawling that sends traffic back, and they block crawling that doesn't.

Looking at the daily timeseries data, Anthropic's ratio was volatile throughout Q1, spiking above 100,000:1 on some days in January and gradually declining toward 10,000-15,000:1 by March. That's still astronomically high but trending in the right direction. If Anthropic wants to reduce blocking, improving this ratio is the single most effective thing they can do.

What to do with this insight: Use crawl-to-refer ratios as your primary decision metric. Block bots with ratios above 1,000:1 (training-focused crawlers). Allow bots under 200:1 (search-focused crawlers that return traffic). Review quarterly as ratios shift.
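
To make the thresholds concrete, here is a small Python sketch that applies them to the Q1 2026 ratios from the table above. The `policy` function and the "review" label for the 200-to-1,000 band are our own assumptions, since the rule of thumb doesn't cover that range.

```python
# Crawl-to-refer ratios from the Q1 2026 table (pages crawled per referral sent)
RATIOS = {
    "Anthropic": 20583, "OpenAI": 1255, "Perplexity": 111, "Microsoft": 32,
    "Mistral": 24, "Yandex": 21, "Baidu": 5.2, "Google": 5.0,
    "ByteDance": 3.5, "DuckDuckGo": 1.5,
}

BLOCK_ABOVE = 1000   # training-focused crawlers
ALLOW_BELOW = 200    # search-focused crawlers that return traffic

def policy(ratio: float) -> str:
    """Map a crawl-to-refer ratio to a suggested robots.txt stance."""
    if ratio > BLOCK_ABOVE:
        return "block"
    if ratio < ALLOW_BELOW:
        return "allow"
    return "review"   # assumption: the 200-1,000 band is left to judgment

def times_worse_than_google(operator: str) -> int:
    """How many times higher an operator's ratio is than Google's."""
    return round(RATIOS[operator] / RATIOS["Google"])

decisions = {operator: policy(ratio) for operator, ratio in RATIOS.items()}
```

With these inputs only Anthropic and OpenAI land in "block". The ratio is one input rather than the whole decision: ByteDance scores "allow" on ratio alone, yet many sites still block Bytespider for other reasons.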

How much traffic do AI crawlers actually generate?

[Figure: Bar chart of AI bot traffic share, with Googlebot at 35.2% and Meta-ExternalAgent at 13.9%]

Before discussing whether to block or allow AI crawlers, you need to know how much traffic they actually generate. Here's the AI bot traffic breakdown for Q1 2026:

| AI Bot | Traffic Share | Operator | Primary Purpose |
| --- | --- | --- | --- |
| Googlebot | 35.2% | Google | Search indexing + AI training |
| Meta-ExternalAgent | 13.9% | Meta | AI training (zero referrals) |
| GPTBot | 12.5% | OpenAI | AI training + ChatGPT data |
| ClaudeBot | 11.3% | Anthropic | AI training |
| Bingbot | 9.2% | Microsoft | Search indexing + Copilot |
| Amazonbot | 4.9% | Amazon | Alexa + product data |
| Applebot | 3.6% | Apple | Siri + Apple Intelligence |
| Bytespider | 3.5% | ByteDance | AI training + TikTok data |
| OAI-SearchBot | 2.2% | OpenAI | ChatGPT search (sends referrals) |
| Other | 3.7% | Various | Various |

Googlebot alone generates 35.2% of all AI-categorized bot requests. This is critical context for the robots.txt discussion. Google operates in a dual role: it crawls for search indexing (which everyone wants) and for AI training through Google-Extended (which some want to block). The Google-Extended user agent lets site owners block the AI training component while keeping search indexing intact. On March 30, 10.7% of the robots.txt files Cloudflare parsed already mentioned Google-Extended.

Meta-ExternalAgent at 13.9% sends zero referral traffic. It's the second-highest volume AI crawler and returns nothing to publishers. Combined with its absence from ALLOW rules, the data paints a clear picture: Meta is extracting content at scale with no reciprocal benefit to site owners. According to SEOmator's analysis of Cloudflare data, Meta-ExternalAgent surged from 8.5% to 11.6% of global AI bot traffic between December 2025 and January 2026 alone, a 36% relative jump in 30 days.

OpenAI runs two separate bots. GPTBot (12.5%) handles training data collection, while OAI-SearchBot (2.2%) handles ChatGPT search queries that can send referral traffic. Forward-thinking site owners are blocking GPTBot while allowing OAI-SearchBot, getting ChatGPT search visibility without providing free training data. We see this pattern in the ALLOW data: OAI-SearchBot appears in 4.22% of ALLOW rules, not far behind GPTBot's 5.65% despite generating less than a fifth of its traffic.

And all of this sits within a broader context: 30.6% of all web traffic in Q1 2026 was bots. Not just AI crawlers, all bots combined. According to HUMAN Security's 2026 State of AI Traffic report, AI-driven traffic grew 187% in 2025, and agentic AI traffic specifically grew 7,851% year-over-year. The scale is accelerating.

What are AI crawlers actually doing with your content?

[Figure: Pie chart of crawl purpose, with Training at 45.0% and Mixed Purpose at 44.4%]

Not all AI crawling is the same. Cloudflare Radar categorizes AI bot traffic by declared crawl purpose:

| Crawl Purpose | Share of AI Traffic | What It Means |
| --- | --- | --- |
| Training | 45.0% | Collecting data to train language models |
| Mixed Purpose | 44.4% | Combines training, indexing, and product features |
| Search | 8.0% | Powering AI search products (Perplexity, ChatGPT Browse) |
| User Action | 2.2% | Fetching pages in response to real user queries |
| Undeclared | 0.4% | Purpose not specified in user agent |

89.4% of AI crawler traffic is training or mixed-purpose. Only 8% is search-related (which might send referral traffic back) and just 2.2% responds to actual user queries in real time.

This is the core of the robots.txt debate. If 89.4% of AI crawling is taking content to train models, not to serve it back to users who might click through, then the ROI for publishers is structurally negative. The crawl-to-refer ratios confirm this: platforms whose crawling is primarily training-focused (Anthropic, Meta) have the worst ratios. Platforms with significant search components (Perplexity, Google) have better ratios.

The 2.2% "User Action" category is the fastest-growing segment. These are bots like ChatGPT-User that visit pages when a user asks a question. They represent genuine, query-driven traffic that can generate real referrals. According to Cloudflare's own analysis, training now drives nearly 80% of AI bot activity, up from 72% a year ago. But the user-action slice grew 15x year-over-year according to Cloudflare's 2025 Year in Review, and the growth continued into Q1 2026. As AI assistants increasingly browse the web on behalf of users, this category will matter more.

What to do with this insight: Don't treat all AI crawlers as equal. Block training-focused bots (GPTBot, ClaudeBot, Meta-ExternalAgent, CCBot) while explicitly allowing search and user-action bots (OAI-SearchBot, ChatGPT-User, PerplexityBot). That targets the 89.4% of AI crawler traffic that is training or mixed-purpose while preserving the 10.2% (search plus user action) that could send actual visitors.
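
The two headline percentages are sums over the purpose categories in the table above. A short Python check, using the table's figures and variable names of our own choosing:

```python
# Share of AI crawler traffic by declared purpose (Q1 2026 table above)
PURPOSE_SHARE = {
    "training": 45.0,
    "mixed": 44.4,
    "search": 8.0,
    "user_action": 2.2,
    "undeclared": 0.4,
}

# Traffic that mostly takes content without sending visitors back
training_or_mixed = round(PURPOSE_SHARE["training"] + PURPOSE_SHARE["mixed"], 1)

# Traffic that can plausibly send a visitor back
referral_capable = round(PURPOSE_SHARE["search"] + PURPOSE_SHARE["user_action"], 1)

total = round(sum(PURPOSE_SHARE.values()), 1)
```

The remaining 0.4% is undeclared traffic, which is why the two groups add up to 99.6% rather than 100%.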

Which industries absorb the most AI crawling?

[Figure: Horizontal bar chart of AI crawler traffic by industry, with Retail at 28.1% and Computer Software at 13.6%]

AI crawlers don't hit all industries equally. Here's where they spend their bandwidth:

| Industry | AI Crawler Traffic Share | All Crawler Traffic Share | AI Overweight? |
| --- | --- | --- | --- |
| Retail | 28.1% | 20.8% | Yes: disproportionately targeted |
| Computer Software | 13.6% | 17.3% | No: slightly under |
| IT Services | 5.8% | 5.5% | Neutral |
| Internet | 5.0% | 4.9% | Neutral |
| Gambling & Casinos | 2.8% | 6.5% | No: less AI crawling than average |
| Online Media | 2.7% | | |
| Telecommunications | 2.7% | 3.1% | Neutral |
| Media | 2.7% | 5.1% | No: less AI crawling than average |
| Marketing & Advertising | | 6.4% | No: less AI crawling |
| Adult Entertainment | 2.6% | 4.0% | No: less AI crawling |

Retail takes 28.1% of AI crawler traffic, more than double any other industry. Product descriptions, pricing pages, reviews, and comparison content are exactly the kind of structured, factual data that language models consume voraciously. If you run an e-commerce site (like those built on Shopify), AI crawlers are likely your single largest non-human traffic source after Googlebot.

Computer Software at 13.6% makes sense. Documentation, API references, tutorials, and technical content are high-value training material for code-focused AI models.

Media is underrepresented at 2.7%. This is likely because media companies were among the first to aggressively block AI crawlers through robots.txt, and the blocking is working. Their share of AI crawler traffic is lower than their share of all crawler traffic (5.1%), suggesting the blocking is effectively reducing AI-specific crawling. HUMAN Security's data supports the retail concentration, showing that more than 95% of AI-driven traffic in 2025 was concentrated in retail and e-commerce, streaming and media, and travel and hospitality.

The domain category data from robots.txt tells the same story from the other side. Among the 4,047 robots.txt files Cloudflare parsed on March 30, Technology domains (910) and Business domains (798) had the most AI-specific DISALLOW rules. E-commerce (287) came third.

Is AI crawler blocking increasing?

[Figure: Line chart of DISALLOW rule trends over Q1 2026, with Amazonbot, ClaudeBot, and meta-externalagent rising]

Yes. We tracked the weekly share of DISALLOW rules for each AI crawler across the full 13 weeks of Q1 2026:

| AI Crawler | Jan 5 Share | Mar 30 Share | Q1 Change | Trend |
| --- | --- | --- | --- | --- |
| GPTBot | 11.93% | 11.75% | -0.18pp | Flat (already peak) |
| CCBot | 10.59% | 10.41% | -0.18pp | Flat |
| ClaudeBot | 9.58% | 10.09% | +0.51pp | Growing |
| Google-Extended | 8.82% | 9.22% | +0.40pp | Growing |
| Bytespider | 8.45% | 8.58% | +0.13pp | Flat |
| meta-externalagent | 6.93% | 7.44% | +0.51pp | Growing |
| Amazonbot | 6.44% | 7.31% | +0.87pp | Growing fastest |
| Googlebot | 8.09% | 7.69% | -0.40pp | Declining |
| Applebot-Extended | 6.53% | 6.92% | +0.39pp | Growing |

Four trends are clear:

GPTBot blocking has plateaued. It was already the most blocked bot at the start of Q1 and stayed flat throughout. Most sites that intend to block GPTBot have already done so. According to a Search Engine Journal report on a Hostinger study, GPTBot's website coverage dropped from 84% to 12% over their study period, while OAI-SearchBot reached 55.67% average coverage. That gap tells you everything: sites are surgically blocking the training bot while welcoming the search bot.

ClaudeBot and meta-externalagent are catching up fast. Both gained 0.51 percentage points in Q1, the joint second-fastest growth in DISALLOW rules after Amazonbot. ClaudeBot's growth likely reflects rising awareness of its terrible crawl-to-refer ratio. As more SEO professionals and publishers discover the 20,583:1 number, blocking will likely continue to accelerate.

Amazonbot blocking grew the most at +0.87pp. Amazonbot isn't typically discussed alongside GPTBot and ClaudeBot in the AI crawler conversation, but the data shows website owners are increasingly blocking it. This may reflect concerns about Amazon using crawled data to compete with the very retailers it crawls.

Googlebot blocking is declining. At -0.40pp, this is the only major crawler with a clear drop in blocking (GPTBot and CCBot slipped just 0.18pp each), likely because website owners are learning to use the Google-Extended user agent to block Google's AI training specifically while keeping search indexing intact. They're switching from blunt Googlebot blocks to targeted Google-Extended blocks.
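
The Q1 changes are simple differences between the first and last weekly shares. This Python sketch recomputes them from the table's start and end values and ranks the crawlers by change. The variable names are ours.

```python
# (Jan 5 share, Mar 30 share) of DISALLOW rules, in percent, from the table above
WEEKLY_SHARE = {
    "GPTBot": (11.93, 11.75),
    "CCBot": (10.59, 10.41),
    "ClaudeBot": (9.58, 10.09),
    "Google-Extended": (8.82, 9.22),
    "Bytespider": (8.45, 8.58),
    "meta-externalagent": (6.93, 7.44),
    "Amazonbot": (6.44, 7.31),
    "Googlebot": (8.09, 7.69),
    "Applebot-Extended": (6.53, 6.92),
}

# Change in percentage points, rounded to avoid floating-point noise
changes = {bot: round(end - start, 2) for bot, (start, end) in WEEKLY_SHARE.items()}

# Largest increase first
ranked = sorted(changes.items(), key=lambda item: item[1], reverse=True)
fastest_growing, biggest_gain = ranked[0]
steepest_decline = ranked[-1][0]
```

Ranking by percentage-point change puts Amazonbot first, ClaudeBot and meta-externalagent tied behind it, and Googlebot last.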

External context: An academic study published on arXiv found that AI-blocking by reputable sites increased from 23% in September 2023 to nearly 60% by May 2025, with reputable sites forbidding an average of 15.5 AI user agents. Meanwhile, misinformation sites prohibit fewer than one. The trend we're measuring in Q1 2026 is a continuation of this multi-year acceleration.

Which AI crawlers are explicitly welcomed?

The robots.txt story isn't only about blocking. Some AI crawlers are actively welcomed through explicit ALLOW rules:

| AI Crawler | ALLOW Share | DISALLOW Share | Net Direction |
| --- | --- | --- | --- |
| PerplexityBot | 5.16% | | Strongly welcomed |
| ChatGPT-User | 4.76% | | Strongly welcomed |
| OAI-SearchBot | 4.22% | | Strongly welcomed |
| GPTBot | 5.65% | 5.52% | Split: some allow, some block |
| ClaudeBot | 4.24% | 4.88% | Slightly more blocked |
| Google-Extended | 4.29% | 4.44% | Slightly more blocked |

PerplexityBot, ChatGPT-User, and OAI-SearchBot are net positive. They appear in ALLOW rules without significant DISALLOW presence. All three have something in common: they're search or user-action bots that send referral traffic back to publishers. The robots.txt data confirms what the crawl-to-refer ratios suggest: sites are willing to give access to bots that return traffic.

GPTBot is genuinely split. Nearly identical ALLOW and DISALLOW percentages mean the internet is divided on OpenAI's training crawler. This probably reflects the messy reality that GPTBot serves dual purposes, and different sites weigh those purposes differently.

This is where sophisticated operators are getting granular. Instead of a blanket block on all OpenAI crawlers, they're writing robots.txt rules that block GPTBot (training) while explicitly allowing OAI-SearchBot (search) and ChatGPT-User (user action). That's the data-informed approach.

The complete list of known AI crawler user-agent strings

One of the biggest challenges in managing AI crawlers through robots.txt is simply knowing which user-agent strings to target. AI companies don't always publicize their bots, and new ones appear regularly.

Here's the most current list of AI crawler user-agent strings, compiled from our crawling infrastructure's observations, Cloudflare Radar data, and the community-maintained ai-robots-txt GitHub repository:

Training crawlers (high crawl, low/zero referrals)

| User-Agent | Operator | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training data |
| ClaudeBot | Anthropic | Model training data |
| Claude-Web | Anthropic | Model training data |
| Meta-ExternalAgent | Meta | AI training (zero referrals) |
| CCBot | Common Crawl | Open dataset for AI training |
| Google-Extended | Google | Gemini / AI training |
| Bytespider | ByteDance | AI training + TikTok |
| Amazonbot | Amazon | Alexa + AI features |
| Applebot-Extended | Apple | Apple Intelligence training |
| Diffbot | Diffbot | Web data extraction |
| FacebookBot | Meta | AI training |
| Omgilibot | Webz.io | Data mining |
| cohere-ai | Cohere | Model training |
| AI2Bot | Allen AI | Research model training |
| Kangaroo Bot | Kangaroo LLM | Model training |
| Timpibot | Timpi | Decentralized search training |
| VelenPublicWebCrawler | Velen | Model training |
| Webzio-Extended | Webz.io | Extended data collection |
| iaskspider | iAsk.ai | AI training |

Search and user-action crawlers (send referral traffic)

| User-Agent | Operator | Purpose |
| --- | --- | --- |
| OAI-SearchBot | OpenAI | ChatGPT search results |
| ChatGPT-User | OpenAI | Real-time user queries |
| PerplexityBot | Perplexity | AI search engine |
| Bingbot | Microsoft | Search + Copilot |
| Googlebot | Google | Search indexing |
| YouBot | You.com | AI search |

This distinction matters. We've seen site owners paste a "block all AI crawlers" snippet into their robots.txt without realizing they're also blocking the search bots that actually send visitors. Use the training vs. search categorization above to make targeted decisions.

Common mistakes with AI crawler directives

After scanning millions of robots.txt files through our technology detection pipeline, I've seen the same mistakes repeatedly. Here are the five most common, with fixes.

Mistake 1: Blocking all OpenAI bots instead of just the training bot

# Wrong: blocks search referrals too
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Right: blocks training, allows search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

OpenAI operates three distinct user-agent strings. GPTBot handles training data collection. OAI-SearchBot powers ChatGPT search results and actually sends referral traffic back. ChatGPT-User fetches pages in real time when users ask questions. Blocking all three costs you search visibility with zero upside. According to a Reuters Institute study, 48% of the most widely used news websites across ten countries block OpenAI's crawlers, but many don't differentiate between the training and search bots.
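
One way to sanity-check a configuration like this before deploying it is Python's standard-library `urllib.robotparser`. The sketch below feeds it the "Right" rules and asks what each OpenAI bot may fetch. The `allowed` helper and the `/pricing` path are hypothetical, and Python's parser is only one implementation, so treat the result as a smoke test rather than proof of how each crawler behaves.

```python
from urllib.robotparser import RobotFileParser

RIGHT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""

def allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Check each OpenAI user agent against a hypothetical page
results = {
    bot: allowed(RIGHT, bot, "/pricing")
    for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User")
}
```

Here `results` reports GPTBot as blocked and both OAI-SearchBot and ChatGPT-User as allowed, which is the split the "Right" example aims for.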

Mistake 2: Placing AI bot rules after a wildcard disallow

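# Risky: a parser that reads top-down may stop at the wildcard group
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

Under the Robots Exclusion Protocol (RFC 9309), a crawler follows the group whose user-agent line matches it most specifically, wherever that group sits in the file, so a compliant GPTBot obeys its own group here and ignores the wildcard group. Not all crawlers follow the spec perfectly, though, and a simplistic parser may apply the first group that matches. Placing bot-specific groups above the wildcard group guards against that. Always test your configuration with a robots.txt validator, such as the robots.txt report in Google Search Console that replaced Google's older standalone tester, to verify your rules work as intended.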
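
The "most specific match wins" behavior can be observed with Python's standard-library parser, which gives the same answers whichever group comes first. This is a sketch using a hypothetical `can_fetch` helper and made-up paths. It shows how one spec-following parser behaves and says nothing about crawlers that deviate from the spec.

```python
from urllib.robotparser import RobotFileParser

WILDCARD_FIRST = (
    "User-agent: *\nDisallow: /private/\n\n"
    "User-agent: GPTBot\nDisallow: /\n"
)
SPECIFIC_FIRST = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /private/\n"
)

def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

def verdicts(robots_txt: str) -> tuple:
    """(GPTBot on /blog/, other bot on /blog/, other bot on /private/x)."""
    return (
        can_fetch(robots_txt, "GPTBot", "/blog/"),
        can_fetch(robots_txt, "SomeOtherBot", "/blog/"),      # hypothetical bot
        can_fetch(robots_txt, "SomeOtherBot", "/private/x"),
    )

same_either_way = verdicts(WILDCARD_FIRST) == verdicts(SPECIFIC_FIRST)
```

In both orderings GPTBot is shut out of `/blog/`, while the unnamed bot falls back to the wildcard group and is only kept out of `/private/`.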

Mistake 3: Forgetting that robots.txt is case-sensitive for paths

The user-agent field is case-insensitive, but paths in Disallow and Allow directives are case-sensitive. Disallow: /Blog/ won't block access to /blog/. Double-check your actual URL paths.
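
Both halves of that rule show up in a quick check with Python's `urllib.robotparser`. The rule and paths below are illustrative only.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = "User-agent: GPTBot\nDisallow: /Blog/\n"

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# User-agent matching ignores case: a lowercase "gptbot" still hits the GPTBot group
blocked_capital_path = not parser.can_fetch("gptbot", "/Blog/post")

# Path matching is case-sensitive: /blog/ is not covered by Disallow: /Blog/
lowercase_path_open = parser.can_fetch("GPTBot", "/blog/post")
```

If your site serves content under `/blog/`, the rule above leaves it fully crawlable.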

Mistake 4: Assuming robots.txt blocks indexing

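robots.txt blocks crawling, not indexing. If other sites link to your pages, search engines can still index the URLs without crawling the content. If you need to prevent indexing entirely, use a noindex meta tag or X-Robots-Tag HTTP header instead, and leave those pages crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the noindex directive.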
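
As an illustration of the header approach, here is a minimal WSGI middleware sketch in Python that adds `X-Robots-Tag: noindex` to responses under chosen path prefixes. The `noindex` wrapper, the demo `app`, and the `/internal/` prefix are all hypothetical. Most sites would set this header in their web server or framework configuration instead.

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """Stand-in application that serves the same page for every path."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<h1>Report</h1>"]

def noindex(wrapped, prefixes=("/internal/",)):
    """WSGI middleware: add X-Robots-Tag: noindex for paths under `prefixes`."""
    def middleware(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(prefixes):
                headers = headers + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)
        return wrapped(environ, patched_start)
    return middleware

def response_headers(wsgi_app, path: str) -> dict:
    """Call a WSGI app in-process and return its response headers."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}
    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)
    b"".join(wsgi_app(environ, start_response))   # consume the body
    return captured["headers"]

protected = noindex(app)
internal_headers = response_headers(protected, "/internal/report")
public_headers = response_headers(protected, "/blog/")
```

Only the `/internal/` response carries the header. The page stays crawlable, which is what allows a search engine to see the directive and drop the URL from its index.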

Mistake 5: Not updating rules as new AI bots appear

AI companies launch new crawlers regularly. A robots.txt written in 2024 that blocks GPTBot and ClaudeBot misses meta-externalagent, Bytespider, Amazonbot, Applebot-Extended, and dozens of others. Review your robots.txt quarterly. The ai-robots-txt GitHub repository is a good reference for staying current.
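
A quarterly review can be partly automated by diffing the user agents in your robots.txt against a reference list. The Python sketch below uses a subset of the training-crawler list from this article as that reference. The helper names are hypothetical and the list is not exhaustive.

```python
# Subset of the training-crawler user agents listed earlier in this article
KNOWN_TRAINING_AGENTS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "Meta-ExternalAgent", "CCBot",
    "Google-Extended", "Bytespider", "Amazonbot", "Applebot-Extended",
]

def listed_agents(robots_txt: str) -> set:
    """Lowercased user-agent tokens that appear in the robots.txt text."""
    agents = set()
    for raw in robots_txt.splitlines():
        field, _, value = raw.split("#", 1)[0].partition(":")
        if field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents

def missing_agents(robots_txt: str, known=KNOWN_TRAINING_AGENTS) -> list:
    """Known agents that the robots.txt does not mention at all."""
    listed = listed_agents(robots_txt)
    return [agent for agent in known if agent.lower() not in listed]

# A robots.txt written in 2024 that only covers two bots
ROBOTS_2024 = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n"
gaps = missing_agents(ROBOTS_2024)
```

This only reports agents with no group at all. It does not check whether an existing group actually disallows anything, so a listed agent may still need a look.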

robots.txt limitations: when to use firewall rules instead

Here's something most guides won't tell you plainly: robots.txt is voluntary. It's a set of suggestions, not a technical barrier. Any crawler can read your robots.txt and choose to ignore it completely.

According to a UC San Diego research paper, 77% of participants surveyed had never even heard of robots.txt. And among AI companies, compliance with robots.txt varies. Multiple reports have documented AI crawlers accessing content that robots.txt explicitly blocks.

This means robots.txt should be your first line of defense but not your only one. Here's how the options compare:

| Method | Blocks Compliant Bots | Blocks Non-Compliant Bots | Implementation Difficulty | Risk |
| --- | --- | --- | --- | --- |
| robots.txt | Yes | No | Low (text file) | None |
| Cloudflare WAF rules | Yes | Yes | Medium | May block legitimate users if misconfigured |
| Rate limiting | Partial | Yes | Medium | May affect site performance |
| User-agent blocking (server-side) | Yes | Yes (if detected) | Medium-High | Spoofed user agents bypass it |
| Bot management solutions | Yes | Yes | High (cost + setup) | Most effective but expensive |

When robots.txt is enough

  • You're blocking well-known, compliant bots (GPTBot, ClaudeBot, Googlebot)
  • Your goal is to communicate policy, not enforce it technically
  • You want to control search engine AI training without losing search indexing

When you need firewall rules

  • You've identified bots ignoring your robots.txt
  • Your server costs are spiking from aggressive crawling
  • You need to protect high-value content (pricing pages, proprietary data)
  • You're dealing with bots that disguise their user-agent strings

Cloudflare Turnstile and similar tools add a verification layer that bots can't bypass by simply ignoring robots.txt. For most mid-market sites, the combination of robots.txt for policy signaling and a WAF for enforcement gives you the best coverage.
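
For the server-side user-agent blocking row in the comparison above, the core logic is a pattern match on the request's User-Agent header. A minimal Python sketch follows. The token list, the `status_for` helper, and the sample header strings are illustrative assumptions, not a recommended production rule set.

```python
import re
from typing import Optional

# Crawler tokens to refuse; adjust to your own policy
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "Meta-ExternalAgent", "Bytespider", "CCBot")
BLOCKED_RE = re.compile("|".join(re.escape(t) for t in BLOCKED_TOKENS), re.IGNORECASE)

def status_for(user_agent: Optional[str]) -> int:
    """Return 403 when the User-Agent names a blocked crawler, else 200."""
    if user_agent and BLOCKED_RE.search(user_agent):
        return 403
    return 200

# Illustrative header values, not the bots' exact strings
declared_bot = status_for("Mozilla/5.0 (compatible; GPTBot/1.0)")
browser = status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/125.0")
spoofing_bot = status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0")
```

The last case is the weakness noted in the table: a crawler that presents a browser-like user agent gets a 200, which is why IP verification, rate limiting, or a bot management product is needed against non-compliant bots.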

llms.txt vs robots.txt: a new standard for AI-specific instructions

While robots.txt was designed for search engine crawlers in the 1990s, a new file called llms.txt has emerged specifically for communicating with large language models. It's worth understanding the difference and when you might use both.

robots.txt uses the Robots Exclusion Protocol to tell crawlers which URLs they can access. It was built for search engines that crawl, index, and link back to your content. The format is technical: User-agent strings, Allow/Disallow directives, Sitemap references.

llms.txt is a proposed convention for providing AI-readable information about your site. Instead of restricting access, it describes your content in a structured way that LLMs can use. Think of it as a site map written for AI understanding rather than search engine indexing.

| Feature | robots.txt | llms.txt |
| --- | --- | --- |
| Purpose | Control crawl access | Describe site content for AI |
| Format | User-agent + Disallow/Allow | Markdown-like, human readable |
| Enforceability | Voluntary (standard since 1994) | Voluntary (emerging, no standard body) |
| AI training control | Yes (via specific user agents) | No (descriptive, not restrictive) |
| Search impact | Direct (blocks crawling) | None (informational only) |
| Adoption | Widespread | Early stage |

Here's the practical takeaway: robots.txt and llms.txt serve different functions. robots.txt controls access. llms.txt describes content. If your priority is blocking AI training crawlers, robots.txt is still the tool. If you want to help AI systems understand your site structure and cite you accurately, llms.txt adds value on top.

We're monitoring llms.txt adoption across our 50M+ domain dataset and will publish adoption numbers once the sample is statistically meaningful. For now, focus your energy on getting robots.txt right since it has direct, measurable impact today.

A data-driven framework for robots.txt decisions

Based on everything we've analyzed, here's a decision framework for every major AI crawler:

| Crawler | Recommendation | Rationale |
| --- | --- | --- |
| GPTBot (OpenAI) | Block unless you want ChatGPT training visibility | 1,255:1 ratio. Block GPTBot, allow OAI-SearchBot separately. |
| ClaudeBot (Anthropic) | Block | 20,583:1 ratio. Near-zero referral ROI. |
| Meta-ExternalAgent (Meta) | Block | Zero referrals. Pure training extraction. |
| Bytespider (ByteDance) | Block | Low referral return. Training-focused. |
| CCBot (Common Crawl) | Block unless you value open datasets | Used by many AI companies indirectly. |
| Google-Extended | Block | Blocks Google's AI training while keeping search indexing. Best of both worlds. |
| OAI-SearchBot (OpenAI) | Allow | Handles ChatGPT search; sends referral traffic. |
| ChatGPT-User (OpenAI) | Allow | User-action bot; responds to real queries. |
| PerplexityBot (Perplexity) | Allow | 111:1 ratio, best among dedicated AI companies. |
| Bingbot (Microsoft) | Allow | 32:1 ratio. Search indexing + Copilot. Acceptable trade-off. |
| Googlebot | Always allow | 5:1 ratio. Blocking means search invisibility. |

Here's what a data-informed robots.txt looks like in practice:

# Block AI training crawlers (high crawl-to-refer ratio)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

# Allow AI search and user-action bots (send referral traffic)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Always allow search engines
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

This approach blocks the vast majority of training crawling while preserving the bots that could send actual referral traffic back to your site.
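
If you maintain the block and allow lists in code, the file above can be generated and checked rather than edited by hand. A Python sketch, assuming the same user agents as the example. `build_robots_txt` and `check` are hypothetical helpers, and the check uses Python's `urllib.robotparser`, which may not match every crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

BLOCK = [
    "GPTBot", "ClaudeBot", "Claude-Web", "Meta-ExternalAgent", "Bytespider",
    "CCBot", "Google-Extended", "Amazonbot", "Applebot-Extended",
    "cohere-ai", "Diffbot", "Omgilibot", "FacebookBot",
]
ALLOW = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Googlebot", "bingbot"]

def build_robots_txt(block: list, allow: list) -> str:
    """Emit one group per user agent: Disallow: / for block, Allow: / for allow."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in block]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allow]
    return "\n\n".join(groups) + "\n"

def check(robots_txt: str, agents: list) -> dict:
    """Map each agent to whether it may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}

robots_txt = build_robots_txt(BLOCK, ALLOW)
access = check(robots_txt, BLOCK + ALLOW)
```

Keeping the two lists as data makes the quarterly review a small diff: move a user agent between `BLOCK` and `ALLOW`, regenerate, and rerun the check.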

2026 AI crawler benchmarks at a glance

Here are the key benchmarks from our Q1 2026 analysis, consolidated for quick reference:

| Benchmark | Value | Source | Notes |
| --- | --- | --- | --- |
| Most blocked AI crawler | GPTBot (5.52% of DISALLOW rules) | Our Cloudflare Radar analysis | Followed by CCBot (5.08%) |
| Worst crawl-to-refer ratio | Anthropic, 20,583:1 | Our Cloudflare Radar analysis | 4,117x worse than Google |
| Best AI-company crawl ratio | Perplexity, 111:1 | Our Cloudflare Radar analysis | Search model returns traffic |
| Training + mixed traffic share | 89.4% of AI crawling | Our Cloudflare Radar analysis | Only 10.2% is search/user action |
| Bot traffic as % of all web | 30.6% | Our Cloudflare Radar analysis | AI crawlers are a growing share |
| AI traffic growth (2025 YoY) | 187% | HUMAN Security report | Agentic AI traffic up 7,851% |
| Top news sites blocking AI bots | 79% | BuzzStream study | 14% block all bots, 18% block none |
| Retail share of AI crawling | 28.1% | Our Cloudflare Radar analysis | More than double any other industry |
| Fastest-growing DISALLOW bot | Amazonbot (+0.87pp in Q1) | Our Cloudflare Radar analysis | ClaudeBot and Meta tied at +0.51pp |
| GPTBot website coverage decline | 84% down to 12% | Search Engine Journal / Hostinger | OAI-SearchBot reached 55.67% |

What this means for technology intelligence

[Figure: Technology detection pipeline showing DNS, HTTP headers, JS signatures, and historical data analysis]

The robots.txt arms race creates a growing problem for any data product that relies on web crawling. As more sites block more bots, crawling-dependent tools develop blind spots.

At TechnologyChecker, we've been dealing with this reality since day one. Our technology detection pipeline uses multi-signal detection specifically because we knew crawling access would tighten:

  • DNS record analysis — Certificate Authority data reveals cloud providers (we tracked 10.9 billion certificates in Q1 2026), CNAME records reveal CDN and hosting infrastructure, MX records reveal email providers
  • HTTP header fingerprinting — Server headers, X-Powered-By headers, and security headers reveal backend technologies without needing to render page content
  • JavaScript signature detection — Library fingerprints in public assets identify frontend frameworks
  • Historical baseline data — 20 years of technology adoption patterns let us detect changes even when current access is restricted
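To make the header fingerprinting idea concrete, here is a minimal Python sketch. It is not our production pipeline: the `HEADER_SIGNATURES` table and the `fingerprint_headers` function are hypothetical names invented for illustration, and a real signature set would be far larger and would handle versions and spoofed headers.

```python
# Hypothetical signature table: header name -> (substring to look for, technology label).
# These few entries are illustrative assumptions, not a complete or authoritative list.
HEADER_SIGNATURES = {
    "server": [
        ("nginx", "nginx"),
        ("apache", "Apache HTTP Server"),
        ("cloudflare", "Cloudflare"),
    ],
    "x-powered-by": [
        ("php", "PHP"),
        ("express", "Express"),
        ("asp.net", "ASP.NET"),
    ],
}


def fingerprint_headers(headers: dict[str, str]) -> set[str]:
    """Return technology labels whose signature appears in the response headers."""
    # Header names are case-insensitive, so normalize names and values first.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    detected = set()
    for header_name, signatures in HEADER_SIGNATURES.items():
        value = normalized.get(header_name, "")
        for needle, label in signatures:
            if needle in value:
                detected.add(label)
    return detected


# Example response headers (made up for the example).
sample = {"Server": "nginx/1.25.3", "X-Powered-By": "PHP/8.3.1"}
print(fingerprint_headers(sample))
```

The function only inspects headers it is handed, so it needs no page rendering and no crawl of the site's content, which is the property the bullet above relies on.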

The robots.txt data we've analyzed in this report quantifies what we've observed operationally: the open web is closing. The share of sites that block GPTBot, already 13.8%, will likely double within a year. We expect crawl-to-refer ratios to become the metric that determines whether a bot gets access or gets blocked.

Technology intelligence tools that depend on a single data collection method, whether that's crawling, browser extensions, or job posting analysis, will face increasing gaps. The sites that are blocking most aggressively (Technology at 910 DISALLOW rules, Business at 798) are exactly the sites that B2B sales teams need intelligence on. That makes multi-signal detection not just a nice-to-have but a requirement for accurate technographic data.

Frequently Asked Questions

Is there a robots.txt for AI?

Yes. The standard robots.txt file is the primary mechanism for controlling AI crawler access. You specify AI-specific user-agent strings (GPTBot, ClaudeBot, Meta-ExternalAgent, etc.) with Disallow directives. There's also an emerging llms.txt proposal designed specifically for communicating with large language models, but it's descriptive rather than restrictive: it tells models what your content is and does not stop any crawler. For blocking AI training crawlers, robots.txt remains the established and widely supported approach.

Is robots.txt legally binding?

robots.txt itself isn't a law. It's a technical protocol, a set of instructions that well-behaved crawlers follow voluntarily. However, courts have considered robots.txt as evidence of a website's expressed wishes regarding access. In several legal disputes involving web scraping, including cases against AI companies, robots.txt policies have been cited as proof that content was accessed against the site owner's stated terms. It's a policy signal, not a legal contract, but it carries weight in court.

Is robots.txt a vulnerability?

Not a security vulnerability in the traditional sense, but it can expose information. robots.txt files are publicly readable, and Disallow paths can reveal the existence of directories or files you might not want exposed (like /admin/, /staging/, or /internal-api/). Never rely on robots.txt as a security measure. Use proper authentication, access controls, and server-side rules to protect sensitive content. robots.txt is for managing crawler behavior, not securing your site.
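As a quick self-check, a short Python sketch like the one below can list the Disallow paths in your own robots.txt that look sensitive. The `SENSITIVE_HINTS` keywords and the `exposed_paths` helper are assumptions made up for this example; adjust the keywords to your own naming conventions.

```python
# Hypothetical keywords that suggest a path should not be advertised publicly.
SENSITIVE_HINTS = ("admin", "staging", "internal", "backup", "private")


def exposed_paths(robots_txt: str) -> list[str]:
    """List Disallow paths that contain a sensitive-looking keyword."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged


# Example file contents (made up for the example).
example = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /search
"""
print(exposed_paths(example))  # ['/admin/', '/staging/']
```

Anything this flags is already visible to anyone who requests /robots.txt, so the fix is to protect those paths with authentication, not to hide the line.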

How do you block AI crawlers in robots.txt in 2026?

Add specific User-agent and Disallow directives for each AI crawler you want to block. In 2026, you need to target at least 10-15 user-agent strings to cover the major AI crawlers. We recommend using the data-driven framework in this article: block training bots (GPTBot, ClaudeBot, Meta-ExternalAgent, CCBot, Bytespider, Google-Extended, Amazonbot) while allowing search bots that return traffic (OAI-SearchBot, ChatGPT-User, PerplexityBot). The full robots.txt code example earlier in this post is ready to copy and deploy.
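If you maintain the bot lists in code, a small Python sketch can generate the file and confirm it behaves as intended before you deploy it. The bot names come from the lists above; `build_robots_txt`, `can_fetch`, the catch-all `User-agent: *` group, and the example.com URL are assumptions for illustration. The check uses the standard library's `urllib.robotparser`, whose matching may differ in detail from how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Bot lists taken from the recommendation above.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "CCBot",
                 "Bytespider", "Google-Extended", "Amazonbot"]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def build_robots_txt(blocked: list[str], allowed: list[str]) -> str:
    """Emit one group per bot, then a catch-all group that allows everything."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups.append("User-agent: *\nAllow: /")  # assumption: everyone else may crawl
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt: str, agent: str,
              url: str = "https://example.com/article") -> bool:
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


policy = build_robots_txt(TRAINING_BOTS, SEARCH_BOTS)
for agent in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(agent, can_fetch(policy, agent))
```

Generating the file from two lists keeps the block and allow decisions in one place, which helps as new user-agent strings appear.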

Should I allow or block GPTBot and ClaudeBot?

The data suggests blocking both for most websites. OpenAI's crawl-to-refer ratio is 1,255:1, meaning its crawlers fetch 1,255 pages for every referral sent back. Anthropic's is far worse at 20,583:1. However, if you block GPTBot, make sure you separately allow OAI-SearchBot, which handles ChatGPT search and actually sends referral traffic (it appears in 4.22% of ALLOW rules in our data). Anthropic also documents separate Claude-User and Claude-SearchBot agents, but given the operator-level ratio, blocking ClaudeBot costs very little referral traffic. The one exception: if you specifically want your content reflected in ChatGPT or Claude responses, allowing training access is the trade-off.

Our Methodology

This analysis uses data from multiple Cloudflare Radar API endpoints, which aggregate traffic patterns across Cloudflare's global network spanning 330 cities in 125+ countries. Cloudflare processes over 81 million HTTP requests per second and 67 million DNS queries per second, providing one of the most extensive views of global internet activity available.

Data period: January 1 through March 31, 2026 (Q1 2026)

Geographic scope: Global

Endpoints used:

  • robots.txt analysis (get_robots_txt_data) — DISALLOW and ALLOW directive distributions by user agent, domain category breakdowns, weekly timeseries trends. Sample: 4,047 robots.txt files parsed on March 30, 2026.
  • Crawl-to-refer ratios (get_bots_crawlers_data) — Ratio of crawl requests to referral traffic by bot operator, with daily timeseries.
  • AI bot traffic (get_ai_data) — Traffic share by user agent, crawl purpose classification, industry targeting.
  • HTTP traffic (get_http_data) — Human vs bot traffic split.
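To illustrate what a "share of DISALLOW directives by user agent" figure means, here is a minimal Python sketch that computes one from a set of robots.txt files. This is our illustration of the metric, not Cloudflare's implementation: the `disallow_share` function and its counting rules (for example, ignoring empty Disallow lines) are assumptions.

```python
from collections import Counter


def disallow_share(robots_files: list[str]) -> dict[str, float]:
    """Percent of non-empty Disallow directives attributed to each user agent."""
    counts = Counter()
    for text in robots_files:
        agents, in_rules = [], False
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                if in_rules:  # a User-agent line after rules starts a new group
                    agents, in_rules = [], False
                agents.append(value)
            elif field == "disallow":
                in_rules = True
                if value:  # an empty Disallow blocks nothing, so skip it
                    for agent in agents:
                        counts[agent] += 1
            elif field == "allow":
                in_rules = True
    total = sum(counts.values())
    return {agent: 100 * n / total for agent, n in counts.items()} if total else {}


# Two tiny made-up files standing in for a parsed sample.
files = [
    "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n",
    "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n",
]
print(disallow_share(files))
```

Because the denominator is the total number of directives rather than the number of sites, a bot's share can fall even while more sites block it, which is why these percentages should not be read as blocking rates.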

Limitations: robots.txt percentages represent the share of directives across Cloudflare's monitored domains, not absolute blocking rates across all internet domains. Crawl-to-refer ratios can vary significantly day-to-day, and daily extremes (such as Anthropic's spikes above 100,000:1) may reflect transient crawl bursts rather than sustained behavior. Some AI crawlers may not identify themselves accurately in user agent strings, which means true crawl volumes could be higher. This analysis also doesn't account for sites using server-side blocking or WAF rules that never appear in robots.txt.
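The effect of transient bursts on a crawl-to-refer ratio is easy to see with a small Python sketch. The daily counts below are invented for illustration and are not values from the Radar data; `crawl_to_refer` is a hypothetical helper.

```python
from statistics import mean, median


def crawl_to_refer(crawls: int, referrals: int) -> float:
    """Crawl requests per referral; infinite when a bot sends no referrals."""
    if referrals == 0:
        return float("inf")
    return crawls / referrals


# Hypothetical (crawl requests, referrals) pairs for four days.
daily = [(20_000, 1), (41_000, 2), (120_000, 1), (19_500, 1)]
ratios = [crawl_to_refer(crawls, referrals) for crawls, referrals in daily]

print("mean:", mean(ratios))      # pulled upward by the one burst day
print("median:", median(ratios))  # closer to the typical day
print("max:", max(ratios))
```

In this made-up series a single burst day more than doubles the mean relative to the median, which is why a headline ratio should be read alongside the daily timeseries.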

We plan to update this analysis quarterly. If you'd like to see how AI crawler blocking affects technology detection accuracy across your target accounts, explore our platform rankings data or check how cloud provider traffic share interacts with bot management at the infrastructure level.


Data source: Cloudflare Radar API (radar.cloudflare.com). Analysis and insights by TechnologyChecker.io.
