robots.txt for AI Crawlers: The Complete Guide with Copy-Paste Templates
Published 2026-04-20 · PROGEOLAB Research
An AI-aware robots.txt is a small file that makes a deliberate statement: here are the AI crawlers I know about, and here is what I want each of them to do. Without it, your website's interaction with AI answer engines is governed by wildcard rules designed for search engines twenty years ago and by WAF rules configured for a different threat model. Most Fortune 500 companies are in that default state.
This guide provides four copy-paste templates, a directory of the 24 AI crawlers worth naming in 2026, and Fortune 500 benchmarks for each decision. Pick the template that matches your business model. Review quarterly. Verify with curl.
The AI crawler directory
Twenty-four AI-specific crawlers, operated by 14 vendors, are worth knowing about:
| Crawler | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Retrieval (browsing tool) |
| OAI-SearchBot | OpenAI | ChatGPT Search index |
| ClaudeBot | Anthropic | Model training |
| Claude-User | Anthropic | Retrieval (Claude tool use) |
| Google-Extended | Google | Gemini training (opt-out of Googlebot) |
| PerplexityBot | Perplexity | Retrieval for answers |
| CCBot | Common Crawl | Public dataset for LLM training |
| Bytespider | ByteDance | Doubao / TikTok AI training |
| meta-externalagent | Meta | Llama training data |
| Applebot-Extended | Apple | Apple Intelligence training (opt-out of Applebot) |
| Amazonbot | Amazon | Alexa knowledge base |
| Cohere-ai | Cohere | Enterprise LLM training |
| AI2Bot | Allen Institute | Open-source AI research |
Plus 10 more niche bots (FacebookBot, Diffbot, YouBot, MistralAI-User, PanguBot, and others) documented in the full research.
Template 1 — Allow All (HP pattern)
Use when: the brand benefits from AI citation more than it loses to AI summarization. Typical fit: content publishers, B2B services, consumer durables.
```
# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Amazonbot
Allow: /

# Sitemap for AI systems
Sitemap: https://yourdomain.com/sitemap.xml
```
Fortune 500 benchmark: HP names 10 AI bots and allows each. Nvidia names 17 (the full list above plus variants).
Template 2 — Training-Retrieval Split (recommended for most enterprises)
Use when: current content should reach AI answers, but model training contributes to a competitor's advantage. Typical fit: pharma, financial services, legal publishers, paywalled research.
```
# Block training crawlers — content should not be baked into model weights
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers — live queries should see current content
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Fortune 500 benchmark: zero companies have this template deployed, so the first-mover advantage is fully available. Requires WAF rules that match — see Verification below.
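Before deploying the split, it can be sanity-checked offline. A minimal sketch using Python's standard-library robots.txt parser, with the rule set abbreviated to two bots of each kind (the full template parses the same way):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated training-retrieval split: training bots blocked,
# retrieval bots allowed.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Confirm each user agent gets the intended verdict.
for bot in ("GPTBot", "ClaudeBot", "ChatGPT-User", "Claude-User"):
    verdict = "allowed" if parser.can_fetch(bot, "https://yourdomain.com/") else "blocked"
    print(f"{bot}: {verdict}")
```

Note that `robotparser` matches user agents by the first applicable group, so `ChatGPT-User` does not accidentally fall under the `GPTBot` block.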
Template 3 — Block All (Amazon pattern)
Use when: the business model is transactional and AI citation cannibalizes revenue. Typical fit: e-commerce marketplaces, classifieds, pricing-intensive listings.
```
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# ... repeat for every crawler in the directory above
```
Fortune 500 benchmark: Amazon names 16 AI crawlers, all Disallow. Extend the list as new crawlers appear; the Amazon case study maintains an up-to-date version.
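Maintaining the block-all file by hand invites drift as the directory grows. A small sketch that generates it from a crawler list instead; the list here is the 14 named bots from the table above, and both the list and the function name are illustrative, not part of any published tooling:

```python
# The 14 named crawlers from the directory table; extend as new ones publish.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "CCBot", "Bytespider",
    "meta-externalagent", "Applebot-Extended", "Amazonbot", "Cohere-ai",
    "AI2Bot",
]

def block_all(bots: list[str]) -> str:
    """Render a robots.txt that disallows every listed crawler."""
    lines = ["# Block all known AI crawlers"]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)

print(block_all(AI_CRAWLERS))
```

Regenerating the file from one list keeps robots.txt and any matching WAF rule set fed from the same source of truth.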
Template 4 — Selective Per-Model (partnership pattern)
Use when: a bilateral licensing deal with one AI vendor makes that vendor preferred. Typical fit: publishers with explicit OpenAI or Anthropic partnerships.
```
# Partner: OpenAI has a licensing deal — allow all variants
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# All other AI crawlers blocked
User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

# ... etc

User-agent: *
Disallow: /private/
```
Verification: your WAF overrides your robots.txt
Publishing robots.txt is half the work. The other half is ensuring your WAF actually honors what robots.txt declares. Goldman Sachs allows ChatGPT-User in robots.txt and blocks it at the WAF; the declaration is meaningless.
Verify from a non-corporate IP with:
```
# Test ChatGPT-User access
curl -sI -H "User-Agent: ChatGPT-User/1.0 (+https://openai.com/bot)" https://yourdomain.com/

# Test GPTBot access
curl -sI -H "User-Agent: GPTBot" https://yourdomain.com/

# Test from a fresh datacenter IP (a cloud VM, not your corporate network):
# layer-2 datacenter blocks only reveal themselves from commercial IP ranges
```
A 200 OK response with content means the file-level policy and the WAF agree. A 403 Forbidden means the WAF is overriding robots.txt. A challenge page (a 200 OK whose body is a JavaScript challenge) means the WAF is serving an anti-bot interstitial that AI crawlers cannot execute.
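Once the probe responses are captured, the three outcomes can be classified automatically. A minimal sketch; the challenge-page markers are illustrative assumptions, since real WAF vendors serve their own, differently worded interstitials:

```python
# Strings that suggest a JavaScript anti-bot challenge page. Illustrative
# assumptions only — not an exhaustive or vendor-accurate list.
CHALLENGE_MARKERS = ("Checking your browser", "cf-chl", "Just a moment")

def classify(status: int, body: str) -> str:
    """Map a probe's HTTP status and body to one of the three outcomes."""
    if status == 403:
        return "waf-override"      # WAF blocks despite what robots.txt declares
    if status == 200:
        if any(marker in body for marker in CHALLENGE_MARKERS):
            return "js-challenge"  # 200 OK, but an anti-bot interstitial
        return "aligned"           # file-level policy and WAF agree
    return "other"                 # redirects, 5xx, etc. need manual review
```

Feeding each curl result through a classifier like this turns the quarterly review into a diffable report rather than an eyeball check.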
The 53-point checklist includes the full verification workflow. The pillar guide's 5-hour roadmap places robots.txt + WAF coordination at Hour 1.