Amazon's robots.txt: 47 Bots Named, 16 AI Crawlers Blocked
Published 2026-04-20 · PROGEOLAB Research
The robots.txt at amazon.com is 5,888 bytes and contains 48 unique User-agent sections — the most comprehensive bot-management policy in the Fortune Global 500. While other companies block a handful of AI crawlers, Amazon names 47 individual user agents and blocks each with Disallow: /. The file is effectively a directory of every automated system that has probed amazon.com at scale, including AI bots that appear nowhere else in any Fortune 500 robots.txt.
For anyone working on AI visibility strategy, Amazon's robots.txt is the most important primary-source document in the field for three reasons: it catalogs AI bots nobody else has named, reveals the full taxonomy of AI crawler types (training, retrieval, agent), and demonstrates the most aggressive defensible blocking strategy available.
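The blocking mechanism itself is simple: each named bot gets its own User-agent section followed by a blanket Disallow. A representative excerpt of the pattern (the bot names appear in Amazon's file; the selection and ordering here are illustrative, not a verbatim copy):

```text
# Per-bot blanket blocks — one section per named crawler
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GoogleAgent-Mariner
Disallow: /
```

Repeated 47 times, this is the entire policy: no path-level carve-outs, no crawl-delay negotiation, just a full block per named agent.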
The 48 user agents, classified
| Category | Count | Examples |
|---|---|---|
| AI training crawlers | 7 | GPTBot, ClaudeBot, CCBot, Bytespider, AI2Bot |
| AI retrieval crawlers | 6 | ChatGPT-User, Claude-User, PerplexityBot, Amazonbot |
| AI agent browsers | 3 | GoogleAgent-Mariner, GoogleAgent-Shopping, Devin |
| Specialty / research | 6 | Google-NotebookLM, MistralAI-User, Cohere-ai, iaskspider, Thinkbot |
| Scrapers / SEO crawlers | 26 | SemrushBot-SWA, Scrapy, img2dataset, PanguBot, VelenPublicWebCrawler |
Six bots you've never heard of
Amazon's robots.txt names six bots that appear in no other Fortune 500 file:
- Google-NotebookLM — Google's research tool has its own crawler, suggesting NotebookLM fetches web content independently rather than relying on the main Google index. Amazon blocks it.
- GoogleAgent-Mariner — Google's Project Mariner is an experimental browser agent that navigates websites on a user's behalf. Amazon blocks it to prevent Mariner from executing shopping tasks on amazon.com.
- GoogleAgent-Shopping — Dedicated shopping-AI agent, separate from Googlebot. Amazon blocks Google's AI from automated product comparison.
- Devin — Cognition's software-engineering AI agent. Amazon blocks Devin from browsing amazon.com (potentially relevant for AWS documentation paths hosted on the main Amazon domain).
- iaskspider — the crawler for iAsk.AI, a Chinese AI answer engine.
- PanguBot — Huawei's foundation-model crawler. Not yet commercialized outside China.
Tracking changes in Amazon's robots.txt is effectively a low-cost intelligence signal for which AI agents and crawlers are in production — because Amazon's security team is paid to notice traffic patterns nobody else is looking for.
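That monitoring idea can be sketched in a few lines: extract the set of User-agent names from two snapshots of the file and diff them. A minimal sketch (the snapshot contents below are toy examples, not Amazon's actual file; in practice you would fetch https://www.amazon.com/robots.txt on a schedule and store each copy):

```python
import re

def user_agents(robots_txt: str) -> set[str]:
    """Extract the set of User-agent names from a robots.txt body."""
    return {
        m.group(1).strip()
        for m in re.finditer(r"(?im)^user-agent:\s*(.+?)\s*$", robots_txt)
    }

# Toy before/after snapshots (hypothetical contents for illustration)
before = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""
after = before + """\
User-agent: GoogleAgent-Mariner
Disallow: /
"""

# A bot newly named in the file is a signal that the agent is live at scale
new_bots = user_agents(after) - user_agents(before)
print(sorted(new_bots))  # ['GoogleAgent-Mariner']
```

Diffing the name sets rather than the raw bytes ignores reordering and whitespace churn, so alerts only fire when a bot is actually added or removed.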
The strategic logic
Amazon's blocking is consistent with its business model: its competitive advantage is the transaction, not the citation. Product information on amazon.com drives purchases on amazon.com. When AI extracts and redistributes pricing, reviews, and availability, consumers may get answers without clicking through — and every AI-mediated comparison that doesn't end at checkout is a potential lost transaction.
Amazon also has no JSON-LD on its homepage, no llms.txt, no sameAs links to Wikidata. The AI-blocking is comprehensive across every signal layer, not just robots.txt. Amazon scores 1 on the PROGEOLAB AI-Readiness Index — the only point comes from the fact that amazon.com is accessible to Chrome.
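Two of those signal layers — JSON-LD and sameAs links — are checkable directly from a page's HTML. A minimal sketch using only the standard library (the HTML snippets are hypothetical stand-ins, not fetched from amazon.com; the Wikidata ID is illustrative):

```python
from html.parser import HTMLParser

class SignalScan(HTMLParser):
    """Count JSON-LD script blocks and flag any sameAs property inside them."""
    def __init__(self):
        super().__init__()
        self.jsonld_blocks = 0
        self.sameas = False
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.jsonld_blocks += 1
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld and '"sameAs"' in data:
            self.sameas = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

# Hypothetical homepages: one bare, one with structured-data markup
bare = "<html><head><title>Store</title></head><body>Buy things</body></html>"
marked_up = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Organization", "sameAs": ["https://www.wikidata.org/wiki/Q3884"]}'
    "</script></head><body></body></html>"
)

for page in (bare, marked_up):
    scan = SignalScan()
    scan.feed(page)
    print(scan.jsonld_blocks, scan.sameas)
```

Run against a real homepage's HTML, a result of `0 False` matches the pattern described above: no machine-readable entity layer at all.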
When to copy Amazon. When not to.
Amazon's strategy works for Amazon because Amazon's revenue is transactional. It fails catastrophically for most enterprises. A pharma company copying Amazon's robots.txt would still have AI answering questions about its drugs — just using training data rather than current FDA-reviewed content. A bank copying Amazon's robots.txt would still have AI answering questions about its products — just using news coverage rather than the bank's own framing. The AI Content Paradox bites harder the further your business model is from pure transaction.
The Amazon pattern is appropriate when: (1) the product is the destination, not a reference; (2) third-party price comparison is a threat to margin; (3) brand trust is not a primary asset; (4) competing alternatives exist such that AI-mediated discovery is a wash rather than a lift. Most Fortune 500 companies fail at least two of these criteria and should implement the training-retrieval split rather than the Block-All-AI template.
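For companies that fail those criteria, the training-retrieval split mentioned above can be expressed in robots.txt directly, using the taxonomy from the table: block crawlers that collect training data, allow fetchers that retrieve content to answer live user queries. A minimal sketch (bot names per the classification in this article; adapt paths and the bot list to your own site):

```text
# Block model-training crawlers — content stays out of training corpora
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow on-demand retrieval fetchers — current content reaches AI answers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```

The effect is the inverse of Amazon's posture: the company's own current framing is what AI systems cite, while bulk ingestion for training is refused.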