AI crawlers blocked: what it means for AI search visibility

Blocking AI crawlers in robots.txt is a legitimate choice, but it's worth understanding what you're actually opting out of, and how to verify that the block is doing what you intend.

What this usually means

A page is technically reachable on the web but disallowed in robots.txt for one or more AI crawlers, for example GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, or Google-Extended. Each of these user-agents has a distinct purpose: model training, AI search retrieval, or live answers in chat. Blocking them changes whether and how your content can show up in those products.
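For example, a robots.txt that blocks several of these crawlers wholesale might look like this (a hypothetical site policy, shown only to illustrate the pattern):

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Note that the second and third groups here block live-retrieval crawlers, not just training ones, which is often not what the site owner intended.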

Why it matters

AI search and answer engines increasingly compete with, and sit alongside, Google for high-intent queries. If your robots.txt blocks the user-agents that fetch pages for live answers (e.g. OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User), your content cannot be retrieved as a citation, even when it is the best answer. Training crawlers (GPTBot, ClaudeBot, Google-Extended) are a separate, longer-term decision about whether your content is used to train future models.

Common causes
  • A copy-pasted robots.txt template added Disallow lines for GPTBot, ClaudeBot, or Google-Extended without distinguishing training from live retrieval.
  • A staging-era "block all bots" policy was promoted to production unchanged.
  • A WAF or CDN bot-management rule blocks AI user-agents at the edge, so requests return 403 regardless of what robots.txt says.
  • A catch-all User-agent: * group disallows everything, and there is no specific Allow group for AI search crawlers.
  • The page is allowed in robots.txt but returns noindex, X-Robots-Tag: noindex, or empty/JS-only content, so it is technically retrievable but not understandable.
  • A canonical points to a different URL that is itself blocked or noindex.
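The catch-all case above is easy to reproduce. The sketch below uses Python's standard-library robots.txt parser to check several AI user-agents against a hypothetical robots.txt (the rules and crawler list are illustrative, not a real site's policy):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a catch-all block plus one explicit allow group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]

def crawler_access(robots_txt: str, user_agent: str, url: str = "/") -> bool:
    """Return True if this user-agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for ua in AI_CRAWLERS:
    print(ua, "allowed" if crawler_access(ROBOTS_TXT, ua) else "blocked")
```

Only OAI-SearchBot gets through: every crawler without its own group falls back to the `*` rules, which is exactly the accidental-block pattern described above.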
How to diagnose it
  1. Open AI Crawler Checker and paste the page URL.
  2. Look at the per-crawler matrix: which AI crawlers are allowed, blocked, or unknown.
  3. Note whether blocks come from a specific user-agent rule or from the catch-all * group.
  4. Check whether the page itself is noindex or has very little extractable text.
  5. Re-check from a different network if you suspect a CDN/WAF block rather than a robots.txt block.
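One way to tell an edge block from a robots.txt block: a robots.txt disallow still returns 200 to an ordinary fetch, while a WAF/CDN rule typically returns 403 to the crawler's user-agent. The decision logic can be sketched like this (a hypothetical helper, not IndexDoctor's implementation):

```python
def diagnose_block(status_code: int, robots_allowed: bool) -> str:
    """Classify why an AI crawler can't use a page, given the HTTP status
    the crawler's user-agent receives and whether robots.txt allows it."""
    if status_code == 403:
        # robots.txt is irrelevant if the edge refuses the request outright.
        return "edge block (WAF/CDN) - audit bot-management rules"
    if not robots_allowed:
        return "robots.txt block - adjust the user-agent group"
    if status_code >= 400:
        return f"server error {status_code} - fix before tuning robots.txt"
    return "reachable - check noindex and extractable content next"

print(diagnose_block(403, True))   # edge block even though robots.txt allows
print(diagnose_block(200, False))  # classic robots.txt disallow
```

The key ordering choice: check the 403 first, because an edge block makes whatever robots.txt says moot.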
How to fix it
  1. Decide training vs. live answers separately

    Training crawlers (GPTBot, ClaudeBot, Google-Extended) are a long-term policy choice. Live retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User) determine whether you can be cited in real-time AI answers. You can block or allow each group independently.

  2. Use specific User-agent groups in robots.txt

    Add explicit User-agent: GPTBot or User-agent: PerplexityBot groups instead of relying on the * catch-all. This makes intent obvious and prevents accidental blocks.

  3. Don't rely only on robots.txt at the edge

    If your CDN/WAF blocks AI bots at the edge, that overrides robots.txt and may return 403 for crawlers you actually want to allow. Audit your bot-management rules.

  4. Make allowed pages actually understandable

    Allowing a crawler is not enough. Remove noindex, ensure the page returns server-rendered HTML with real text, and use meaningful headings so the page can be parsed and cited.

  5. Re-check after every robots.txt deploy

    Run AI Crawler Checker against your most important pages after each robots.txt change. The matrix view tells you, per crawler, whether you're allowing what you intended.
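Putting the first two steps together, a robots.txt that opts out of model training but stays citable in live AI answers might look like this (one hypothetical policy; adjust it to your own training decision):

```
# Training crawlers: opted out
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Live-retrieval crawlers: allowed, so pages can be cited
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```

Because each crawler has its own group, none of them fall back to the `*` catch-all, and the intent behind each rule is visible at a glance.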

FAQ
If I block GPTBot, does ChatGPT stop citing my pages?

Not necessarily. GPTBot is OpenAI's training crawler. Live citations in ChatGPT search use different user-agents like OAI-SearchBot and ChatGPT-User. To control live citations specifically, you have to control those user-agents, not just GPTBot.

Is blocking AI crawlers good or bad for SEO?

It has no direct effect on Google's traditional search ranking, since Google's indexing crawler is Googlebot, distinct from Google-Extended, which is for AI training. Blocking AI crawlers reduces visibility inside AI search and answer engines, which is increasingly a separate traffic surface.

Why does the matrix show "unknown" for some crawlers?

If we couldn't fetch your robots.txt cleanly, or your robots.txt doesn't reference a given crawler at all and there is no catch-all group, we mark access as unknown rather than guess. Add explicit user-agent rules to make behavior unambiguous.

Ready to diagnose your URL?

AI Crawler Checker runs the exact checks discussed above.

Run AI Crawler Checker