Platform Research Insights Glossary Tools Compare FAQ Request Demo
Data Snapshots

Your llms.txt Adoption Numbers Are Wrong: How Soft-404s Inflate AI Metrics by 25×

Published 2026-04-20 · PROGEOLAB Research

A soft-404 is a web server configuration that returns HTTP 200 for URL paths that don't exist. Instead of returning 404, the server returns the homepage, a custom branded error page, or a catch-all template — all with a 200 status code. For human visitors this is a usability improvement (no ugly error screen). For AI-standards measurement it's a silent data poisoner. Every scanner that checks whether a file exists by looking at the HTTP status code reads soft-404 responses as real implementations.

Soft-404 inflation across AI standards: llms.txt, ai.txt, agents.json
Figure 1 · Soft-404 inflation across AI standards. Status-code reports on the left, body-validated reality on the right. Source: PROGEOLAB, April 2026.

The 41% problem

Of 388 Fortune 500 sites that responded to our audit, 160 (41%) serve soft-404 pages for any requested path. We tested this by probing random nonsense paths (/xyzabc, /nonexistent-path-987) and checking whether the response was HTTP 404 or HTTP 200. 160 sites returned 200.

Those 160 sites inflate every AI-standard metric. When a researcher probes them for /llms.txt, they get HTTP 200 with an HTML homepage. When they probe for /ai.txt, same. /agents.json, same. The sites don't have any of these files — but status-code-only scans record them all as implementations.

The inflation numbers

Standard HTTP 200 Real Inflation
llms.txt3531425×
ai.txt3570
agents.json3290
mcp.json2980

The llms.txt 25× inflation is the most consequential because llms.txt has real implementations. ai.txt, agents.json, and mcp.json have zero, so the entire "adoption" signal from status-code scanners is hallucinated. Any analyst claiming "60%+ of the Fortune 500 has adopted ai.txt" is reading soft-404 pages and counting them.

The body-validation fix

The correct method for measuring AI-standard adoption:

  1. Probe a known-nonexistent URL first. /.well-known/xyzabc-probe-978 or similar. Record the response body's MD5 hash. This is the soft-404 signature.
  2. Probe the target path. Record the response body's MD5 hash.
  3. Compare hashes. If identical, the site is serving soft-404 — the target file doesn't exist. If different, inspect the body for standard-specific markers (e.g. "# " prefix for llms.txt, JSON Schema validation for agents.json).

This adds one probe per site and eliminates 10-25× over-reporting. Any serious measurement of AI-standard adoption should implement it. The llms.txt adoption data, the security.txt data, and the emerging-standards data in this corpus all use this methodology.