AI crawlers explained: GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot and Google-Extended
A grounded guide to the user-agents behind ChatGPT, Claude, Perplexity, and Google's AI surfaces, what they actually do, how robots.txt affects them, and how to choose a sensible policy.
Search engines used to be a small, well-known list. AI products have changed that. There are now multiple bots from each major lab, with overlapping but distinct purposes. Allowing or blocking them is no longer a single decision; it is several.
This guide explains what the most important AI crawlers actually do, how they read robots.txt, and how to decide which ones to allow on your site without overcomplicating the policy.
What is an AI crawler?
An AI crawler is an automated client run by an AI company to fetch web pages. The page contents may then be used for one or more of three purposes:
- Training or improving a foundation model.
- Building a search-style index that the AI consults when answering questions.
- Retrieving a specific page on demand because a user asked a question that mentioned it.
These three purposes correspond to three classes of crawler: training crawlers, search/indexing crawlers, and user-triggered retrieval agents. The same provider often runs more than one. That is the part most blog posts skip.
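The three classes can be expressed as a simple lookup. The crawler names below are the real user-agent tokens covered in this guide; the mapping and the helper function are illustrative, not part of any provider's API:

```python
# Map known AI crawler tokens to the purpose class described in this guide.
# The tokens are real user-agents; this mapping is an illustrative summary.
CRAWLER_PURPOSES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "PerplexityBot": "search-index",
    "ChatGPT-User": "user-triggered",
    "Claude-User": "user-triggered",
}

def classify(user_agent: str) -> str:
    """Return the purpose class for a known AI crawler token."""
    return CRAWLER_PURPOSES.get(user_agent, "unknown")
```

Because the same provider runs crawlers in more than one class, a per-crawler policy has to start from this purpose distinction rather than from the company name.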
Training, search, and retrieval are different decisions
Blocking every AI user-agent is one way to handle the situation, but it has a real cost: AI search products that depend on those crawlers will not be able to cite or summarize your pages. As more people use AI to answer questions, that means less downstream traffic from AI surfaces.
Allowing every crawler is also a real choice with a real cost: your content can end up in training data, including data used to train competing models. There is no universally correct answer here. The honest framing is: training and search visibility are separate policy decisions, and you can make different calls on each.
OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User
- GPTBot: OpenAI's training crawler. Blocking GPTBot is an opt-out from having your pages used to improve future models. It does not affect ChatGPT search or browsing.
- OAI-SearchBot: OpenAI's search-index crawler. Blocking this removes you from the index ChatGPT consults during answers. If you want OpenAI products to cite your pages, this is the one to keep allowed.
- ChatGPT-User: a per-request browsing agent triggered when a user asks ChatGPT to fetch a specific URL. It is not running an offline crawl.
Anthropic: ClaudeBot, Claude-SearchBot, Claude-User
- ClaudeBot: Anthropic's training crawler. Blocking ClaudeBot is an opt-out from training, not from in-product features.
- Claude-SearchBot: Anthropic's search-style crawler used to power Claude's answers when grounding to web content. Allowing this is the equivalent of letting Claude cite you.
- Claude-User: a per-request user-triggered agent that fetches URLs the user references in a Claude conversation.
Perplexity: PerplexityBot
Perplexity uses PerplexityBot to build the search index it consults when answering questions. Perplexity is a citation-heavy product; blocking PerplexityBot directly removes you from its citation pool.
Google: Googlebot and Google-Extended
- Googlebot: classic Google Search crawler. Blocking Googlebot removes you from Google Search ranking. It is essentially never the right call for a public site.
- Google-Extended: a separate token that controls whether Google can use your content for AI products such as Gemini and AI Overviews. Blocking Google-Extended does not affect Google Search ranking, only AI use.
Bing: Bingbot
Bingbot is the search baseline for Microsoft and for many AI products downstream of the Bing index, including some Copilot surfaces. If you want broad AI search visibility, keep Bingbot allowed.
robots.txt examples
Below are three sensible policies. None of them is universally correct; pick the one that matches your goals.
1) Block AI training crawlers, allow AI search crawlers.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

2) Block everything AI-specific, keep classic search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

3) Allow everything (the implicit default).
User-agent: *
Allow: /

robots.txt is not a privacy mechanism
robots.txt is a politeness protocol. Well-behaved crawlers honor it, including all the AI crawlers above. It does not protect against bad actors or scraping tools that ignore it. If a piece of content needs to be private, it needs to be behind authentication or removed.
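If you do want to enforce a block rather than merely request one, enforcement has to happen server-side, for example by inspecting the User-Agent header. This is a minimal sketch; the blocklist and function name are illustrative, and the same caveat applies: User-Agent strings can be spoofed, so this is politeness-level filtering, not a security boundary.

```python
# Illustrative server-side filter: block requests whose User-Agent header
# contains a listed crawler token. UA strings are trivially spoofed, so
# this is not a substitute for authentication on genuinely private content.
BLOCKED_TOKENS = ["GPTBot", "ClaudeBot"]  # example blocklist

def should_block(user_agent_header: str) -> bool:
    """Return True if the User-Agent contains any blocked crawler token."""
    return any(token in user_agent_header for token in BLOCKED_TOKENS)
```

Real crawler requests identify themselves with their token inside a longer UA string, so substring matching on the token is the usual approach.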
How to verify your policy
Once you have a policy in place, the question is whether it is actually being honored. Test the page with the AI Crawler Checker to see how each known AI crawler resolves against your robots.txt and HTTP responses. If something looks wrong, the AI Crawlers Blocked fix page walks through the most common causes.
- Run the AI Crawler Checker to see per-crawler access for any URL.
- Run the Robots Tester to confirm a specific user-agent and path resolve as expected.
- Open the AI Crawlers Blocked fix page if the matrix shows blocks you did not intend.
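You can also check how a policy resolves locally with Python's standard-library robots.txt parser. This sketch feeds it a fragment of policy 1 from above and queries two user-agents; the path is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Fragment of the "block training, allow search" policy shown earlier.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "/article"))         # False: training blocked
print(parser.can_fetch("OAI-SearchBot", "/article"))  # True: search allowed
```

This is useful as a quick pre-deploy sanity check, since a typo in a user-agent token silently falls through to the default (or `*`) group.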
Is GPTBot the same as ChatGPT browsing?
No. GPTBot is OpenAI's training crawler. ChatGPT browsing is handled by ChatGPT-User (per-request fetches), and OpenAI's search index uses OAI-SearchBot. Blocking GPTBot only opts you out of training; ChatGPT can still browse and search if those agents are allowed.
Should I block AI crawlers?
It depends on whether you care more about being cited in AI answers or about your content not being used for training. The cleanest middle ground is to block training crawlers (GPTBot, ClaudeBot, Google-Extended) and allow search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot).
What is Google-Extended?
Google-Extended is a separate user-agent token that controls whether Google may use your content for AI products like Gemini and AI Overviews. Blocking Google-Extended does not affect Google Search ranking, only AI use.
Can robots.txt remove a page from AI answers?
Disallowing a crawler in robots.txt prevents future crawls but does not remove pages an AI product has already learned from. For removal, you usually need the provider's removal tools.
What is the difference between GPTBot and OAI-SearchBot?
GPTBot is for training; OAI-SearchBot is for the search index that ChatGPT consults at answer time. They are independent: you can allow one and disallow the other, depending on whether you want to be cited but not trained on.