robots.txt for AI Crawlers: The Complete Guide with Copy-Paste Templates
Published 2026-04-20 · PROGEOLAB Research
An AI-aware robots.txt is a small file that makes a deliberate statement: here are the AI crawlers I know about, and here is what I want each of them to do. Without it, your website's interaction with AI answer engines is governed by wildcard rules designed for search engines twenty years ago and by WAF rules configured for a different threat model. Most Fortune 500 companies are in that default state.
This guide provides four copy-paste templates, a directory of the 24 AI crawlers worth naming in 2026, and Fortune 500 benchmarks for each decision. Pick the template that matches your business model. Review quarterly. Verify with curl.
The AI crawler directory
Twenty-four AI-specific crawlers, operated by 14 vendors, are worth knowing about:
| Crawler | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Retrieval (browsing tool) |
| OAI-SearchBot | OpenAI | ChatGPT Search index |
| ClaudeBot | Anthropic | Model training |
| Claude-User | Anthropic | Retrieval (Claude tool use) |
| Google-Extended | Google | Gemini training (opt-out of Googlebot) |
| PerplexityBot | Perplexity | Retrieval for answers |
| CCBot | Common Crawl | Public dataset for LLM training |
| Bytespider | ByteDance | Doubao / TikTok AI training |
| meta-externalagent | Meta | Llama training data |
| Applebot-Extended | Apple | Apple Intelligence training (opt-out of Applebot) |
| Amazonbot | Amazon | Alexa knowledge base |
| Cohere-ai | Cohere | Enterprise LLM training |
| AI2Bot | Allen Institute | Open-source AI research |
Plus 10 more niche bots (FacebookBot, Diffbot, YouBot, MistralAI-User, PanguBot, and others) documented in the full research.
Template 1 — Allow All (HP pattern)
Use when: the brand benefits from AI citation more than it loses to AI summarization. Typical fit: content publishers, B2B services, consumer durables.
```
# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Amazonbot
Allow: /

# Sitemap for AI systems
Sitemap: https://yourdomain.com/sitemap.xml
```
Fortune 500 benchmark: HP names 10 AI bots and allows each. Nvidia names 17 (the full list above plus variants).
Template 2 — Training-Retrieval Split (recommended for most enterprises)
Use when: current content should reach AI answers, but model training contributes to a competitor's advantage. Typical fit: pharma, financial services, legal publishers, paywalled research.
```
# Block training crawlers — content should not be baked into model weights
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers — live queries should see current content
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Fortune 500 benchmark: zero companies have this template deployed, so the first-mover advantage is fully available. Requires WAF rules that match — see Verification below.
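Before deploying the split, it can be sanity-checked offline. A minimal sketch using Python's standard-library robots.txt parser, with the rule set abbreviated to two bots of each kind (the full template parses the same way):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated training-retrieval split: training bots blocked,
# retrieval bots allowed.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Confirm each user agent gets the intended verdict.
for bot in ("GPTBot", "ClaudeBot", "ChatGPT-User", "Claude-User"):
    verdict = "allowed" if parser.can_fetch(bot, "https://yourdomain.com/") else "blocked"
    print(f"{bot}: {verdict}")
```

Note that `robotparser` matches user agents by the first applicable group, so `ChatGPT-User` does not accidentally fall under the `GPTBot` block.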
Template 3 — Block All (Amazon pattern)
Use when: the business model is transactional and AI citation cannibalizes revenue. Typical fit: e-commerce marketplaces, classifieds, pricing-intensive listings.
```
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# ... repeat for every crawler in the directory above
```
Fortune 500 benchmark: Amazon names 16 AI crawlers, all Disallow. Extend the list as new crawlers appear; the Amazon case study maintains an up-to-date version.
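Maintaining the block-all file by hand invites drift as the directory grows. A small sketch that generates it from a crawler list instead; the list here is the 14 named bots from the table above, and both the list and the function name are illustrative, not part of any published tooling:

```python
# The 14 named crawlers from the directory table; extend as new ones publish.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "CCBot", "Bytespider",
    "meta-externalagent", "Applebot-Extended", "Amazonbot", "Cohere-ai",
    "AI2Bot",
]

def block_all(bots: list[str]) -> str:
    """Render a robots.txt that disallows every listed crawler."""
    lines = ["# Block all known AI crawlers"]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)

print(block_all(AI_CRAWLERS))
```

Regenerating the file from one list keeps robots.txt and any matching WAF rule set fed from the same source of truth.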
Template 4 — Selective Per-Model (partnership pattern)
Use when: a bilateral licensing deal with one AI vendor makes that vendor preferred. Typical fit: publishers with explicit OpenAI or Anthropic partnerships.
```
# Partner: OpenAI has a licensing deal — allow all variants
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# All other AI crawlers blocked
User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

# ... etc

User-agent: *
Disallow: /private/
```
Verification: your WAF overrides your robots.txt
Publishing robots.txt is half the work. The other half is ensuring your WAF actually honors what robots.txt declares. Goldman Sachs allows ChatGPT-User in robots.txt and blocks it at the WAF; the declaration is meaningless.
Verify from a non-corporate IP with:
```
# Test ChatGPT-User access
curl -sI -H "User-Agent: ChatGPT-User/1.0 (+https://openai.com/bot)" https://yourdomain.com/

# Test GPTBot access
curl -sI -H "User-Agent: GPTBot" https://yourdomain.com/

# Test from a fresh datacenter IP (a cloud VM, not your corporate network):
# layer-2 datacenter blocks only reveal themselves from commercial IP ranges
```
A 200 OK response with content means the file-level policy and the WAF agree. A 403 Forbidden means the WAF is overriding robots.txt. A challenge page (a 200 OK whose body is a JavaScript challenge) means the WAF is serving an anti-bot interstitial that AI crawlers cannot execute.
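Once the probe responses are captured, the three outcomes can be classified automatically. A minimal sketch; the challenge-page markers are illustrative assumptions, since real WAF vendors serve their own, differently worded interstitials:

```python
# Strings that suggest a JavaScript anti-bot challenge page. Illustrative
# assumptions only — not an exhaustive or vendor-accurate list.
CHALLENGE_MARKERS = ("Checking your browser", "cf-chl", "Just a moment")

def classify(status: int, body: str) -> str:
    """Map a probe's HTTP status and body to one of the three outcomes."""
    if status == 403:
        return "waf-override"      # WAF blocks despite what robots.txt declares
    if status == 200:
        if any(marker in body for marker in CHALLENGE_MARKERS):
            return "js-challenge"  # 200 OK, but an anti-bot interstitial
        return "aligned"           # file-level policy and WAF agree
    return "other"                 # redirects, 5xx, etc. need manual review
```

Feeding each curl result through a classifier like this turns the quarterly review into a diffable report rather than an eyeball check.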
The 53-point checklist includes the full verification workflow. The pillar guide's 5-hour roadmap places robots.txt + WAF coordination at Hour 1.