
Your llms.txt Adoption Numbers Are Wrong: How Soft-404s Inflate AI Metrics by 25×

Published 2026-04-20 · PROGEOLAB Research

A soft-404 is a response in which a web server returns HTTP 200 for a URL path that doesn't exist. Instead of a 404, the server sends the homepage, a custom branded error page, or a catch-all template, all with a 200 status code. For human visitors this can be a usability improvement (no ugly error screen). For AI-standards measurement it silently poisons the data: every scanner that decides whether a file exists from the HTTP status code alone reads soft-404 responses as real implementations.

Figure 1 · Soft-404 inflation across AI standards (llms.txt, ai.txt, agents.json). Status-code reports on the left, body-validated reality on the right. Source: PROGEOLAB, April 2026.

The 41% problem

Of the 388 Fortune 500 sites that responded to our audit, 160 (41%) serve soft-404 pages for any requested path. We tested this by probing random nonsense paths (/xyzabc, /nonexistent-path-987) and recording whether each response was HTTP 404 or HTTP 200; those 160 sites returned 200.
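The nonsense-path test can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual tooling: the function names, the `/probe-` prefix, and the User-Agent string are all assumptions made for the example, and the status lookup is passed in as a parameter so the logic can be exercised without network access.

```python
import secrets
from typing import Callable, Iterable, Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def random_probe_path(prefix: str = "/probe-") -> str:
    """Return a path that is very unlikely to exist on a real site (hypothetical prefix)."""
    return prefix + secrets.token_hex(8)


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status of a GET request, or None on network failure.

    urlopen follows redirects, so a redirect to a homepage that answers 200
    is reported as 200 here.
    """
    request = Request(url, headers={"User-Agent": "soft404-sketch/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # urllib raises 4xx/5xx responses as exceptions
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def serves_soft_404(
    origin: str,
    paths: Iterable[str],
    get_status: Callable[[str], Optional[int]] = fetch_status,
) -> bool:
    """True if every nonsense path on `origin` answers with HTTP 200."""
    statuses = [get_status(origin.rstrip("/") + path) for path in paths]
    return bool(statuses) and all(status == 200 for status in statuses)
```

A call such as `serves_soft_404("https://example.com", [random_probe_path(), random_probe_path()])` would flag a site only when all probes come back 200. Requiring more than one probe is a design choice of this sketch, intended to reduce the chance that a randomly generated path happens to exist.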

Those 160 sites inflate every AI-standard metric. A researcher who probes them for /llms.txt gets HTTP 200 with an HTML homepage, and the same happens for /ai.txt and /agents.json. The sites have none of these files, but status-code-only scans record every one of them as an implementation.

The inflation numbers

Standard	Sites returning HTTP 200	Body-validated implementations	Inflation
llms.txt	353	14	25×
ai.txt	357	0	∞
agents.json	329	0	∞
mcp.json	298	0	∞

The llms.txt 25× inflation is the most consequential because llms.txt has real implementations. ai.txt, agents.json, and mcp.json have zero, so the entire "adoption" signal from status-code scanners is hallucinated. Any analyst claiming "60%+ of the Fortune 500 has adopted ai.txt" is reading soft-404 pages and counting them.
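The inflation column is simply the ratio of HTTP 200 responses to body-validated implementations. The short sketch below reproduces that arithmetic from the table's counts; the function name and the handling of a zero denominator (reported as infinity, matching the table's ∞) are choices made for this illustration.

```python
import math


def inflation_factor(http_200: int, validated: int) -> float:
    """Ratio of status-code positives to body-validated implementations."""
    if validated == 0:
        # No real implementations: the ratio is unbounded (the table shows it as ∞).
        return math.inf if http_200 > 0 else math.nan
    return http_200 / validated


# (sites returning HTTP 200, body-validated implementations), from the table above
counts = {
    "llms.txt": (353, 14),
    "ai.txt": (357, 0),
    "agents.json": (329, 0),
    "mcp.json": (298, 0),
}

for standard, (reported, real) in counts.items():
    factor = inflation_factor(reported, real)
    print(f"{standard}: {reported - real} false positives, inflation {factor:.1f}x")
```

For llms.txt this gives 353 / 14 ≈ 25.2, which the article rounds to 25×, and 353 − 14 = 339 false positives.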

The body-validation fix

The correct method for measuring AI-standard adoption:

1. Probe a known-nonexistent URL first, such as /.well-known/xyzabc-probe-978. Record the MD5 hash of the response body. This is the site's soft-404 signature.
2. Probe the target path and record the MD5 hash of its response body.
3. Compare the hashes. If they are identical, the site is serving a soft-404 and the target file doesn't exist. If they differ, inspect the body for standard-specific markers (e.g. a "# " prefix for llms.txt, JSON Schema validation for agents.json).
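The steps above can be sketched as a small classifier over two already-fetched response bodies. This is an illustration under stated assumptions rather than the audit's implementation: the function names and the three result labels are hypothetical, the llms.txt check only tests for a leading "# " line, and the JSON check is a much weaker stand-in for real JSON Schema validation (no schema is given in this article).

```python
import hashlib
import json
from typing import Callable


def body_digest(body: bytes) -> str:
    """MD5 hex digest of a response body, used only as a fingerprint."""
    return hashlib.md5(body).hexdigest()


def looks_like_llms_txt(body: bytes) -> bool:
    """Loose marker check: the first non-blank line starts with '# '."""
    text = body.decode("utf-8-sig", errors="replace")
    for line in text.splitlines():
        if line.strip():
            return line.startswith("# ")
    return False


def parses_as_json_object(body: bytes) -> bool:
    """Weak stand-in for schema validation: the body is a JSON object."""
    try:
        return isinstance(json.loads(body.decode("utf-8-sig")), dict)
    except ValueError:  # covers both decode errors and JSON syntax errors
        return False


def classify(
    baseline_body: bytes,
    target_body: bytes,
    has_markers: Callable[[bytes], bool],
) -> str:
    """Compare the target body against the soft-404 signature, then check markers."""
    if body_digest(baseline_body) == body_digest(target_body):
        return "soft-404"
    if has_markers(target_body):
        return "implemented"
    return "unverified"
```

Here `baseline_body` is the body returned for the known-nonexistent URL and `target_body` is the body returned for, say, /llms.txt. One caveat of exact hashing: a catch-all page that embeds per-request content (a timestamp or token) hashes differently on every request, so in this sketch the marker check is what keeps such a page from being counted as "implemented".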

This adds one probe per site and eliminates 10-25× over-reporting. Any serious measurement of AI-standard adoption should implement it. The llms.txt adoption data, the security.txt data, and the emerging-standards data in this corpus all use this methodology.

Key takeaways

41% of responding Fortune 500 sites serve soft-404s. 160 of 388 sites return HTTP 200 for any URL path, including nonexistent ones, so every AI-standard scan picks up phantom positives.
llms.txt inflation: 25× (353 → 14). Status-code scanners report that 353 Fortune 500 sites have llms.txt; only 14 actually do. The 339 false positives are all soft-404 pages.
ai.txt inflation: ∞ (357 → 0). 357 Fortune 500 sites return 200 for /ai.txt. Zero are real. The standard has no production adoption in this sample.
agents.json inflation: ∞ (329 → 0). Same pattern: 329 HTTP 200s, all soft-404. Zero Fortune 500 agents.json implementations.
The fix: body validation. MD5-hash the response body. If the hash matches the homepage response, or any generic 404 template, it's a soft-404. Simple, mandatory.