Guides & How-To

robots.txt for AI Crawlers: The Complete Guide with Copy-Paste Templates

Published 2026-04-20 · PROGEOLAB Research

An AI-aware robots.txt is a small file that makes a deliberate statement: here are the AI crawlers I know about, and here is what I want each of them to do. Without it, your website's interaction with AI answer engines is governed by wildcard rules designed for search engines twenty years ago and by WAF rules configured for a different threat model. Most Fortune 500 companies are in that default state.

This guide provides four copy-paste templates, a directory of the 24 AI crawlers worth naming in 2026, and Fortune 500 benchmarks for each decision. Pick the template that matches your business model, review it quarterly, and verify deployment with curl.

The AI crawler directory

24 AI-specific crawlers are worth knowing about, operated by 14 vendors:

| Crawler | Operator | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Retrieval (browsing tool) |
| OAI-SearchBot | OpenAI | ChatGPT Search index |
| ClaudeBot | Anthropic | Model training |
| Claude-User | Anthropic | Retrieval (Claude tool use) |
| Google-Extended | Google | Gemini training (opt-out of Googlebot) |
| PerplexityBot | Perplexity | Retrieval for answers |
| CCBot | Common Crawl | Public dataset for LLM training |
| Bytespider | ByteDance | Doubao / TikTok AI training |
| meta-externalagent | Meta | Llama training data |
| Applebot-Extended | Apple | Apple Intelligence training (opt-out of Applebot) |
| Amazonbot | Amazon | Alexa knowledge base |
| Cohere-ai | Cohere | Enterprise LLM training |
| AI2Bot | Allen Institute | Open-source AI research |

Plus 10 more niche bots (FacebookBot, Diffbot, YouBot, MistralAI-User, PanguBot, and others) documented in the full research.
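A quick way to sanity-check which of these crawlers a given policy admits is Python's standard `urllib.robotparser`. A minimal sketch, using a two-group policy in the style of Template 2 below (the crawler list is abbreviated from the directory above; extend it as needed):

```python
from urllib.robotparser import RobotFileParser

# Sample policy: one training bot blocked, one retrieval bot allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

# Abbreviated — fill in the rest of the directory above.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in AI_CRAWLERS:
    verdict = "allow" if parser.can_fetch(bot, "https://yourdomain.com/") else "block"
    print(f"{bot}: {verdict}")
```

Note that a crawler with no matching `User-agent` group and no `User-agent: *` fallback is allowed by default, which is exactly why naming each bot explicitly matters.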

Template 1 — Allow All (HP pattern)

Use when: the brand benefits from AI citation more than it loses to AI summarization. Typical fit: content publishers, B2B services, consumer durables.

# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Amazonbot
Allow: /

# Sitemap for AI systems
Sitemap: https://yourdomain.com/sitemap.xml

Fortune 500 benchmark: HP names 10 AI bots and allows each. Nvidia names 17 (the full list above plus variants).

Template 2 — Training-Retrieval Split (recommended for most enterprises)

Use when: current content should reach AI answers, but model training contributes to a competitor's advantage. Typical fit: pharma, financial services, legal publishers, paywalled research.

# Block training crawlers — content should not be baked into model weights
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers — live queries should see current content
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Fortune 500 benchmark: no Fortune 500 company has this template deployed; the first-mover advantage is fully available. It requires matching WAF rules; see Verification below.

Template 3 — Block All (Amazon pattern)

Use when: the business model is transactional and AI citation cannibalizes revenue. Typical fit: e-commerce marketplaces, classifieds, pricing-intensive listings.

# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# ... repeat for every crawler in the directory above

Fortune 500 benchmark: Amazon names 16 AI crawlers, all Disallow. Extend the list as new crawlers are published; the Amazon case study maintains a continuously updated version.
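Because the template repeats the same two-line stanza per crawler, the full block list is easy to generate from the directory rather than maintain by hand. A sketch (the crawler list is abbreviated; add all 24 names from the table above):

```python
# Generate a block-all robots.txt from a crawler list.
# Abbreviated — extend with every crawler in the directory above.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "Applebot-Extended", "CCBot",
    "Bytespider", "Amazonbot", "meta-externalagent", "Cohere-ai", "AI2Bot",
]

def block_all(crawlers):
    """Return a robots.txt body that disallows every named crawler."""
    stanzas = [f"User-agent: {bot}\nDisallow: /" for bot in crawlers]
    return "# Block all known AI crawlers\n" + "\n\n".join(stanzas) + "\n"

print(block_all(AI_CRAWLERS))
```

Regenerating the file from a single source list keeps robots.txt and any matching WAF rules in sync when a new crawler is added.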

Template 4 — Selective Per-Model (partnership pattern)

Use when: a bilateral licensing deal with one AI vendor makes that vendor preferred. Typical fit: publishers with explicit OpenAI or Anthropic partnerships.

# Partner: OpenAI has a licensing deal — allow all variants
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# All other AI crawlers blocked
User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

# ... etc
User-agent: *
Disallow: /private/

Verification: your WAF overrides your robots.txt

Publishing robots.txt is half the work. The other half is ensuring your WAF actually honors what robots.txt declares. Goldman Sachs allows ChatGPT-User in robots.txt and blocks it at the WAF; the declaration is meaningless.

Verify from a non-corporate IP with:

# Test ChatGPT-User access
curl -sI -H "User-Agent: ChatGPT-User/1.0 (+https://openai.com/bot)" https://yourdomain.com/

# Test GPTBot access
curl -sI -H "User-Agent: GPTBot" https://yourdomain.com/

# Test with a fresh datacenter IP (from a cloud VM, not your corporate network)
# Layer-2 datacenter blocks only reveal themselves from commercial IP ranges

A 200 OK response with content means the file-level policy and the WAF agree. A 403 Forbidden means the WAF is overriding robots.txt. A challenge page (200 OK but with a JavaScript challenge) means the WAF is serving an anti-bot intermediate that AI crawlers cannot execute.
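The curl checks can be scripted across every retrieval user agent. A sketch using only the standard library; the UA strings are illustrative (substitute the exact tokens each vendor publishes), and the classification follows the 200/403/challenge reading above:

```python
import urllib.request
from urllib.error import HTTPError

# Illustrative UA strings — use the vendors' published tokens in practice.
AI_USER_AGENTS = {
    "ChatGPT-User": "ChatGPT-User/1.0 (+https://openai.com/bot)",
    "GPTBot": "GPTBot",
    "PerplexityBot": "PerplexityBot",
}

def classify(status, body):
    """Interpret a response the way the verification section describes."""
    if status == 403:
        return "WAF override (403)"          # WAF contradicts robots.txt
    if status == 200 and b"challenge" in body.lower():
        return "anti-bot challenge page"     # JS challenge AI crawlers cannot run
    if status == 200:
        return "allowed"
    return f"other ({status})"

def check(url, ua):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return classify(resp.status, resp.read(2048))
    except HTTPError as e:
        return classify(e.code, b"")

# Run from a cloud VM, not the corporate network:
# for name, ua in AI_USER_AGENTS.items():
#     print(name, "->", check("https://yourdomain.com/", ua))
```

The challenge-page heuristic here is deliberately crude (a substring match on the first 2 KB); a production check would look for the specific markers of your WAF vendor's interstitial.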

The 53-point checklist includes the full verification workflow. The pillar guide's 5-hour roadmap places robots.txt + WAF coordination at Hour 1.