
AI Visibility Research FAQ: 28 Answers Backed by Fortune 500 Data

Published 2026-04-20 · PROGEOLAB Research

This FAQ collects the 28 most common questions we hear about AI visibility, AI crawlers, llms.txt, WAF rules, and related enterprise web topics. Every answer is drawn from the PROGEOLAB Fortune 500 AI Accessibility Audit — 134,000 HTTP probes across 500 companies with four user agents, conducted April 16–19, 2026 — combined with raw response-body analysis.

No speculation, no vendor claims, no AI-generated filler. Every number here is a primary-source measurement. Where questions touch on regulatory or governance tradeoffs (pharma blocking, training-vs-retrieval split), the answer frames the tradeoff without prescribing; the data alone doesn't settle those questions.

Looking for something more specific? The full pillar guide covers the broader thesis; the research library has the detailed reports behind each answer below.

AI Accessibility

What percentage of Fortune 500 companies are accessible to ChatGPT?
265 of 500 Fortune Global 500 companies (53%) are accessible to ChatGPT-User. 53 companies (10.6%) are in the GEO Visibility Gap — they serve content to Chrome browsers but block ChatGPT-User. 148 companies (29.6%) are unreachable by any automated client, including both AI crawlers and traditional search bots. Source: PROGEOLAB Fortune 500 AI Accessibility Report (April 2026).
What is the GEO Visibility Gap?
The GEO Visibility Gap is the difference between what a website shows to browsers and what it shows to AI crawlers. A company is in the gap when Chrome receives HTTP 200 but ChatGPT-User is blocked. 53 Fortune 500 companies fall in this gap. These companies' AI representations are built from stale training data rather than their current, authoritative content.
Which Fortune 500 companies have the best AI visibility?
Based on the PROGEOLAB 6-dimension AI Readiness Index (max ~12): NVIDIA leads at 10.5, followed by Dell Technologies (10.0), Volkswagen (9.0), Apple (8.0), Target (8.0), HP (8.0), and National Australia Bank (8.0). The top 3 span three industries: semiconductors, technology, and automotive — demonstrating that AI visibility is an organizational capability, not an industry characteristic.
Do AI crawlers actually respect robots.txt?
OpenAI, Anthropic, Google, and Perplexity all claim their crawlers respect robots.txt. However, our audit found that only 20 of 267 parseable Fortune 500 robots.txt files (7.5%) contain any AI-specific directive. The remaining 92.5% declare no AI policy at all, so on most sites there is nothing AI-specific to comply with. Additionally, WAF blocking operates independently of robots.txt and overrides it in practice.
How do AI crawlers get blocked?
AI blocking operates at three independent layers. Layer 1 (UA string): 53 companies block specific AI user agent strings while allowing browsers. Layer 2 (IP reputation): 24 companies block datacenter IP ranges, affecting all automated clients. Layer 3 (TLS fingerprint): 15 companies use JA3/JA4 fingerprint analysis to detect non-browser clients. Testing with only one method (e.g., curl) detects only Layer 1.
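A quick way to see whether you sit in the Layer 1 category is to compare what your homepage returns to a browser user agent versus an AI user agent. The following is a minimal sketch (Python with the `requests` library; the user-agent strings are illustrative placeholders, not the exact strings from our probe suite) — it detects Layer 1 only and says nothing about Layers 2 and 3.

```python
# Layer-1 (user-agent string) check only. Layer 2 (IP reputation) and Layer 3
# (TLS fingerprint) blocking are NOT detected here, because this script runs
# from your own IP address and Python's TLS stack, not from the networks and
# clients that real AI crawlers use.
import requests

URL = "https://example.com/"  # replace with your own homepage

# Illustrative user-agent strings; check each operator's docs for current values.
USER_AGENTS = {
    "chrome": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    ),
    "chatgpt-user": "ChatGPT-User/1.0 (+https://openai.com/bot)",
}


def probe(url: str) -> dict[str, int]:
    """Return the HTTP status per user agent; 0 means the request failed outright."""
    results = {}
    for name, ua in USER_AGENTS.items():
        try:
            resp = requests.get(url, headers={"User-Agent": ua}, timeout=15)
            results[name] = resp.status_code
        except requests.RequestException:
            results[name] = 0
    return results


if __name__ == "__main__":
    codes = probe(URL)
    print(codes)
    if codes["chrome"] == 200 and codes["chatgpt-user"] != 200:
        print("Layer-1 GEO gap: browsers get content, the AI user agent does not.")
```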
Does blocking AI crawlers protect my content?
No. Blocking AI crawlers creates the AI Content Paradox: it makes your AI representation worse, not absent. AI answer engines continue answering questions about your company — using stale training data and third-party sources instead of your authoritative content. Johnson & Johnson exemplifies this: Chrome succeeds on 64 of 64 probes, ChatGPT-User on 0 of 64. Patients asking AI about J&J medications receive answers from outdated training data rather than current FDA-reviewed prescribing information.
Why are so many Chinese companies unreachable?
78 Chinese Fortune 500 companies are unreachable from our European probing environment. This reflects Great Firewall infrastructure — inbound international traffic to domestically hosted Chinese domains is restricted — not corporate AI policy. Chinese companies on international CDNs (BYD, Lenovo, Alibaba) are accessible and show the same range of AI policies as other countries.
Which country has the highest AI blocking rate?
Britain has the highest GEO gap rate among major countries at 29% — 5 of 17 British Fortune 500 companies block AI while serving Chrome. The U.S. has the highest absolute number in the gap (23 companies) but a lower rate (20%). South Korea and the Netherlands show zero AI-specific blocking among their Fortune 500 companies.

Standards & Files

How many Fortune 500 companies have llms.txt?
Only 14 of 500 (2.8%). Status-code scanners report 353 (70.6%), but body validation reveals that 339 of those are soft-404 pages — HTML error pages served with HTTP 200. The 25x inflation demonstrates why body validation is mandatory for any AI standard adoption measurement. Volkswagen leads with 198 links, Dell with 131, Subaru with 100.
What is the best llms.txt implementation in the Fortune 500?
Volkswagen's llms.txt contains 198 curated links organized into 5 Markdown sections: Models (10 links), Shopping Tools (9), Owners (130), Financial Services (12), and Newsroom (25). 65% of links are owner resources — VW prioritizes helping existing owners over driving new sales. VW's full 4-UA accessibility (24/24/24/24) means AI crawlers can actually read the file.
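For readers who have never seen the format, a minimal llms.txt skeleton in the same style is shown below: an H1 title, a one-line blockquote summary, then H2 sections of Markdown link lists. All names and URLs here are placeholders, not excerpts from Volkswagen's file.

```markdown
# Example Corp

> One-paragraph description of what the company does and who the site serves.

## Products
- [Product overview](https://www.example.com/products): entry point for the product line
- [Pricing](https://www.example.com/pricing): current plans and tiers

## Support
- [Owner's manuals](https://www.example.com/support/manuals): documentation for every current model
- [Contact](https://www.example.com/contact): support channels and hours
```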
How many Fortune 500 companies have AI bot rules in robots.txt?
20 of 267 parseable robots.txt files (7.5%) contain AI-specific User-agent directives. NVIDIA allows 17 AI bots (the most comprehensive allow policy). Amazon blocks 47 bots by name (the most comprehensive block). Not one company distinguishes between training crawlers and retrieval crawlers, even though that split is the recommended approach for most enterprises.
What AI crawlers should I include in my robots.txt?
Our Fortune 500 audit catalogued 24 AI crawlers from 14 operators. Key ones: GPTBot and ChatGPT-User (OpenAI), ClaudeBot and Claude-User (Anthropic), PerplexityBot (Perplexity), Google-Extended (Gemini training), Applebot-Extended (Apple Intelligence). We recommend the training-retrieval split: block training crawlers while allowing retrieval crawlers. Copy-paste templates are in our robots.txt AI Guide.
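A minimal sketch of that training-retrieval split, using only the crawlers named above, looks like the following. Treat it as a starting point rather than the full template from the guide, and verify current crawler names against each operator's documentation before deploying.

```
# Block training crawlers (content used to train future models)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow retrieval crawlers (content fetched to answer live user questions)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /
```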
Do any Fortune 500 companies have ai.txt or agents.json?
No. Zero Fortune 500 companies have genuine ai.txt, agents.json, or MCP server implementations. Every HTTP 200 response on these paths is a soft-404. These emerging standards have zero enterprise adoption as of April 2026. We recommend monitoring specification development but not implementing until standards stabilize.
What is a soft-404 and why does it matter for GEO?
A soft-404 occurs when a web server returns HTTP 200 for a non-existent URL instead of a 404 — typically a styled error or landing page served with a success status. 160 of the 388 responding Fortune 500 sites (41%) are soft-404 sites. This inflates every AI standard adoption metric: llms.txt appears at 70.6% (real: 2.8%), ai.txt appears at 71.4% (real: 0%). Any tool measuring adoption by status code alone reports phantom implementations.
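In practice, that means auditing these files by response body, not by status code. A simplified body-validation sketch follows (Python with `requests`; our audit pipeline uses stricter heuristics) — it treats an HTTP 200 whose body looks like HTML as a probable soft-404 for plain-text files such as llms.txt.

```python
# Heuristic soft-404 check for plain-text AI files (llms.txt, ai.txt, security.txt).
# A genuine file is plain text; a soft-404 is typically an HTML error or landing
# page served with HTTP 200. These checks are deliberately simple.
import requests


def check_text_file(url: str) -> str:
    resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
    if resp.status_code != 200:
        return f"missing (HTTP {resp.status_code})"

    content_type = resp.headers.get("Content-Type", "").lower()
    body = resp.text[:2000].lstrip().lower()

    # HTML markers in what should be a plain-text file => probable soft-404
    looks_like_html = (
        "text/html" in content_type
        or body.startswith("<!doctype html")
        or body.startswith("<html")
    )
    return "probable soft-404" if looks_like_html else "present (body-validated)"


if __name__ == "__main__":
    for path in ("/llms.txt", "/ai.txt", "/.well-known/security.txt"):
        print(path, "->", check_text_file("https://example.com" + path))
```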
How many Fortune 500 companies have security.txt?
75 of 500 (15%) have a body-validated security.txt following RFC 9116. 9 include PGP signatures. security.txt correlates with AI readiness: 72% of the Top 25 AI-ready companies have it, versus 15% overall — a 4.8x higher rate. This suggests security.txt is a proxy for the organizational maturity that AI visibility also requires.
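For reference, RFC 9116 requires only two fields — Contact and Expires — in a file served at /.well-known/security.txt; everything else is optional. The values below are placeholders.

```
Contact: mailto:security@example.com
Expires: 2027-04-20T00:00:00.000Z
Encryption: https://www.example.com/pgp-key.txt
Preferred-Languages: en
Canonical: https://www.example.com/.well-known/security.txt
```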
How many Fortune 500 companies have JSON-LD on their homepage?
122 of the 388 responding companies have homepage JSON-LD — 31.4% of responders, 24.4% of the full 500. Of these, 78 use the generic Organization type and only 15 use the more specific Corporation type. 55 include sameAs links, but only 3 link to Wikidata (Apple, Comcast, Repsol). 91% are missing numberOfEmployees — the strongest size disambiguation signal for AI entity resolution.

Industry

How do banks perform on AI visibility?
Our audit covers 57 Fortune 500 banks. The GEO gap rate for banking is 5.3% — lower than telecom (27%) or pharma (14%). National Australia Bank leads the sector with an AI Readiness Score of 8/12. Most banks are accessible by default but have not invested in AI optimization signals like llms.txt or JSON-LD.
Why does Tesla score 0 on AI readiness?
Tesla blocks all four user agents on every probe endpoint — 0/64 across Research, Googlebot, Chrome, and ChatGPT-User. This is the most comprehensive blocking in the automotive sector. By contrast, Volkswagen scores 9/12 with a 198-link llms.txt, JSON-LD, security.txt, and full 4-UA accessibility. Same industry, opposite strategies.
Do technology companies lead on AI visibility?
No. Volkswagen (automotive, 9.0) outranks most tech companies. Target (retail, 8.0) and National Australia Bank (banking, 8.0) match Apple (8.0). Oracle and Intel — both major technology companies — score 0/12. Dell leads tech at 10.0 but is the exception. AI visibility is an organizational capability, not an industry characteristic.
Why does Johnson & Johnson block ChatGPT?
J&J demonstrates the cleanest AI block in the Fortune 500: Chrome 64/64, ChatGPT 0/64. The likely motivation is regulatory caution — FDA-regulated content carries off-label promotion and adverse event reporting risks. However, blocking replaces controlled, regulatory-reviewed content with stale AI training data. The recommended approach for pharma: allow retrieval crawlers (ChatGPT-User) while blocking training crawlers (GPTBot).
Which WAF blocks AI crawlers the most?
F5 BIG-IP is the most prevalent WAF among Fortune 500 companies (232 sites), followed by Cloudflare (64), Akamai (59), Imperva (23), and AWS WAF (8). WAF blocking is often an inherited default rather than a deliberate AI policy — Cisco and Goldman Sachs both allow AI bots in robots.txt but block them at the WAF, evidence of a disconnect between the team that writes robots.txt and the team that configures the WAF.

Implementation

How do I check if my site is accessible to AI crawlers?
Test with curl using AI user agent strings: `curl -H 'User-Agent: ChatGPT-User/1.0' https://yoursite.com/`. If you get a 200 response with content, ChatGPT can access your site. If you get 403, a WAF challenge page, or empty content, you're blocked. Note: this only tests Layer 1 (UA string) — IP and TLS fingerprint blocking require more sophisticated testing.
How long does it take to implement basic AI visibility?
Approximately 5 hours. Hour 1: WAF rules + robots.txt AI policy. Hour 2: Create llms.txt with 20+ links. Hour 3: Add JSON-LD with Wikidata sameAs + update title/meta. Hour 4: Publish security.txt + verify all files. Hour 5: Document policy + brief team. This moves most companies from Level 0 to Level 3 on the GEO Maturity Model — ahead of 97% of the Fortune 500.
What is the single highest-impact GEO action I can take?
Ensure ChatGPT-User is not blocked by your WAF. If AI crawlers cannot reach your content, no other optimization matters. This is a 5-minute WAF configuration change. After that, the next highest impact is adding a Wikidata sameAs link to your JSON-LD — 10 minutes, puts you ahead of 99.4% of the Fortune 500.
Should I block AI training crawlers?
It depends on your strategy. If your competitive advantage is information authority (B2B, pharma, financial services), blocking training means future AI models learn about you from third-party sources. If you compete on transactions (e-commerce), blocking training may be defensible. The recommended approach for most enterprises: block training crawlers (GPTBot, Google-Extended) while allowing retrieval crawlers (ChatGPT-User, PerplexityBot).
How do I add a Wikidata sameAs link to my JSON-LD?
Search wikidata.org for your company name and note the Q-number. Add to your JSON-LD: `"sameAs": ["https://www.wikidata.org/entity/QXXXXX", ...your other sameAs links]`. Place Wikidata first in the array. Validate with Google's Rich Results Test. Only 3 Fortune 500 companies do this (Apple Q312, Comcast Q1113804, Repsol Q174747).
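Put together, a homepage JSON-LD block with the Wikidata entity first in sameAs looks like the sketch below. All values are placeholders (QXXXXX stands in for your actual Q-number); numberOfEmployees is included because it is the disambiguation signal most often missing.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Corporation",
  "name": "Example Corp",
  "url": "https://www.example.com/",
  "logo": "https://www.example.com/assets/logo.png",
  "numberOfEmployees": {
    "@type": "QuantitativeValue",
    "value": 52000
  },
  "sameAs": [
    "https://www.wikidata.org/entity/QXXXXX",
    "https://www.linkedin.com/company/example-corp",
    "https://en.wikipedia.org/wiki/Example_Corp"
  ]
}
</script>
```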
How often should I update my AI visibility configuration?
Quarterly minimum. New AI crawlers appear regularly — Amazon's robots.txt grew from 16 to 47 named bots within one year. Review robots.txt AI directives, update llms.txt with new content, verify WAF rules haven't been changed by security updates, and test accessibility with curl. Also review when changing WAF vendors, deploying major site redesigns, or after CMS migrations.
What is the difference between llms.txt and robots.txt for AI?
robots.txt is an access control file — it declares which crawlers may access which paths. llms.txt is a content directory — it guides AI to your most important pages. They serve different functions: robots.txt determines whether AI can reach your content, llms.txt determines what AI finds when it arrives. Both are needed. 7.5% of the Fortune 500 have AI-specific robots.txt directives; 2.8% have llms.txt.