If you want to protect your content from model training, block training bots like `GPTBot`, `ClaudeBot`, `CCBot`, `Bytespider`, and `Applebot-Extended`. If you want to be visible in AI responses, do not automatically block search and user bots, such as `PerplexityBot`, `OAI-SearchBot`, `ChatGPT-User`, `Claude-User`, or `Applebot`.
Template to copy: "If [the bot serves to train models], choose [blocking in robots.txt], because [it does not provide direct visibility in the AI search engine]. In practice, check [User-Agent, IP, robots.txt, number of requests, and pages it visits]."
The Client Asks AI, and the Logs Get Crowded
The store owner sees `GPTBot`, `PerplexityBot`, `ClaudeBot`, and `Bytespider` in the logs and is unsure whether that is an opportunity or a problem. They are right to pause: each of these bots may be doing something different.
One builds an AI search index. Another fetches a page because a user pasted its link into a chat. A third collects data that may be used for model training. If you lump them all together and block everything with a blanket `User-agent: *` rule, you can cut off your visibility exactly where customers are starting to look for products.
This guide is for the owner of a small store, salon, or service company. The goal is simple: check the logs, categorize the bots by purpose, and make decisions without panic.
Why This Matters in 2026
AI Search works differently than classic Google. ChatGPT, Perplexity, Gemini, and Claude do not always show ten blue links. They often build responses from multiple sources and only provide links next to a summary.
OpenAI distinguishes its bots by role: `OAI-SearchBot` provides visibility in ChatGPT search, `GPTBot` collects data for model training, and `ChatGPT-User` fetches pages during user actions. Perplexity describes a similar division: `PerplexityBot` builds visibility in results, while `Perplexity-User` acts on user queries. Anthropic likewise distinguishes `ClaudeBot`, `Claude-User`, and `Claude-SearchBot`.
For businesses, this changes decision-making. The question "should I block AI?" is too broad. A better question is: "am I blocking model training, indexing for AI answers, link previews, or one-time user visits?"
How This Differs from the Old robots.txt
Classic SEO asked: "Can Googlebot access the page?" AI discovery asks: "Can the correct agent read the appropriate part of the page for the right purpose?"
Example: you can block GPTBot to limit content use for OpenAI model training while allowing OAI-SearchBot to ensure the store appears in ChatGPT Search results. This is not a contradiction. It's about separating training from visibility.
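That split can be written directly in robots.txt. A minimal sketch of the idea described above, not a recommendation for every store:

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```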
Step by Step
- Divide Bots into Four Groups

  Bad: "We'll block all AI bots since they steal content."
  Better: "We'll block training bots, keep search bots, and monitor user fetchers for 14 days."
For a shoe store, `PerplexityBot` might help because it can assist in citing the guide "how to choose trekking shoes." For a beauty salon, `ChatGPT-User` may be more important, as a client could paste a price list and ask AI which treatment to choose.
| Group | Purpose | Examples | Initial Decision |
| --- | --- | --- | --- |
| Model Training | Data for models | `GPTBot`, `ClaudeBot`, `CCBot`, `Bytespider` | Block or limit |
| AI Search | Visibility in responses | `PerplexityBot`, `OAI-SearchBot`, `Claude-SearchBot` | Generally allow |
| User Fetch | Entry on user request | `ChatGPT-User`, `Claude-User`, `Perplexity-User` | Allow if content is public |
| Link Preview & Platforms | Previews, indexes, metadata | `FacebookBot`, `Amazonbot`, `Applebot` | Do not block without checking |

- Check if the Bot Is on Your List

  Bad: "There's some bot in the logs, so it must be AI."
  Better: "I compare the User-Agent against a list of 16 bots and note purpose, operator, and decision."
At Audit AI, we maintain a base list of 16 identifiers: `GPTBot`, `ClaudeBot`, `PerplexityBot`, `ChatGPT-User`, `Google-Extended`, `CCBot`, `Anthropic-AI`, `Claude-Web`, `Bytespider`, `Cohere-AI`, `Applebot-Extended`, `Amazonbot`, `Meta-ExternalAgent`, `FacebookBot`, `OmgiliBot`, `Diffbot`.
This is not the full list of the internet. It's a decision-making list for a small store. If you have Shoper, WooCommerce, or PrestaShop, start by searching for these names in the logs from the last 7-14 days.
- Access Hosting Logs; Don’t Guess from Analytics

  Bad: "I don't see this in Google Analytics, so there are no bots."
  Better: "I download the access log from hosting and search by User-Agent."
Bots typically do not execute JavaScript the way a browser does, so Analytics may never record them. Check the raw server logs instead.
In hosting, look for places labeled "Access logs," "Raw access," "Web logs," or "Server logs." In cPanel, raw logs are usually available. In DirectAdmin, look for domain statistics and Apache/Nginx logs. On managed hosting, request support for a log snippet for the domain from the last 7 days.
Example search on the server:
```shell
grep -Ei "GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot|Applebot-Extended" access.log
```
One log line typically shows the date, IP address, path, HTTP status, and User-Agent. For a cosmetics store, it's important to see if the bot reads the blog, products, cart, account, or filter parameters.
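To turn those log lines into numbers, you can count requests per bot. A minimal sketch: the log lines below are synthetic examples in the common "combined" format, and `sample_access.log` is a made-up file name; point `LOG` at your real access log instead.

```shell
#!/bin/sh
# Sketch: count requests per AI bot in a raw access log.
LOG=sample_access.log
cat > "$LOG" <<'EOF'
203.0.113.7 - - [10/Jan/2026:10:00:01 +0000] "GET /blog/trekking-shoes HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.0"
203.0.113.7 - - [10/Jan/2026:10:00:02 +0000] "GET /products?color=red HTTP/1.1" 200 4096 "-" "Mozilla/5.0; compatible; GPTBot/1.0"
198.51.100.4 - - [10/Jan/2026:10:05:00 +0000] "GET /faq HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
EOF

# grep -c counts matching lines; -i ignores case differences in the UA string.
gptbot_hits=$(grep -ci "GPTBot" "$LOG")
perplexity_hits=$(grep -ci "PerplexityBot" "$LOG")
echo "GPTBot: $gptbot_hits requests"
echo "PerplexityBot: $perplexity_hits requests"
```

Run this against a real 7-day log and you get the request counts the decision template later asks for.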
- Set a Default Policy: Visibility Yes, Training Not Always

  Bad: "We’ll let AI in because it’s the future."
  Better: "We allow bots that can provide citations or answers for customers, but we block training where content is valuable."
For a pet food store, a public guide "how to choose food for an allergy" can work towards visibility in AI Search. Product descriptions, images, and original comparisons should be more protected. For a physiotherapy clinic, keeping the public price list and FAQ accessible is worthwhile, but post-purchase materials should require a login, not just be excluded from robots.txt.
Example of a cautious robots.txt:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```
This is not a template for everyone. It’s a starting point if you want to limit model training but not close off all visibility in AI responses.
- Don’t Treat `Google-Extended` Like Any Ordinary Bot in the Logs

  Bad: "I don't see `Google-Extended` in the logs, so Google isn't using it."
  Better: "I check `Google-Extended` in robots.txt as a control token, not a separate HTTP User-Agent."
Google explains that `Google-Extended` does not have a separate User-Agent in HTTP requests. It’s a token in robots.txt that controls whether content crawled by Google can be utilized for future models like Gemini and grounding. Google also states that this token does not affect a page's presence in Google Search.
Example:
```
User-agent: Google-Extended
Disallow: /
```
For an online store, this is a strategic decision. If you block `Google-Extended`, you do not block regular Google Search. However, you can limit content use in certain AI Google features.
- Secure Private Content with a Password, Not robots.txt

  Bad: "We'll hide the `/wholesale/` directory in robots.txt, and that's enough."
  Better: "Wholesale prices, customer panels, and post-purchase files require login; robots.txt is just an additional instruction."
Robots.txt is public. Anyone can access `yourdomain.com/robots.txt` and see which directories you’re trying to hide. This is a good mechanism for controlling crawlers but a weak one for protecting sensitive data.
For a supplement store, the wholesale price list should be behind a B2B account. For a beauty salon, treatment documentation should be password-protected or included in a booking system, not in a public PDF hidden via `Disallow`.
- Measure Load, Not Just Bot Presence

  Bad: "Bytespider was here once, so we're blocking the whole world."
  Better: "We check the number of requests, HTTP statuses 200/404/429, and the most frequently visited URLs."
One bot on the homepage means nothing. The problem begins when a crawler queries thousands of variant filters, sorting, and parameters `?color=`, `?size=`, `?page=`. A small hosting plan for 50-100 PLN per month will feel this faster than a store on a separate VPS.
If you see many requests to filters, add blocks for parameters:
```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
```
This also helps regular SEO, as it reduces crawl waste.
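Before adding those rules, it is worth measuring how many requests actually hit parameter URLs. A self-contained sketch: the log lines and the parameter list are synthetic examples, so adjust the regex to the parameters your own shop uses.

```shell
#!/bin/sh
# Sketch: count logged requests that hit filter/sort parameters
# before deciding on Disallow rules.
LOG=params_sample.log
cat > "$LOG" <<'EOF'
"GET /shoes?color=red&size=42 HTTP/1.1" 200 "Bytespider"
"GET /shoes?sort=price HTTP/1.1" 200 "Bytespider"
"GET /blog/trekking-guide HTTP/1.1" 200 "PerplexityBot"
EOF

# Count requests whose URL carries one of the parameters we may block.
param_hits=$(grep -cE '\?(sort|filter|color|size|price)=' "$LOG")
echo "Requests to parameter URLs: $param_hits"
```

If that number dominates the log, the parameter blocks above are likely worth it; if it is a handful of requests, they may not be.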
- Make Decisions Per Bot, Not Per Emotion

  Bad: "AI steals content; we block everything."
  Better: "Each bot receives a decision: allow, disallow, monitor, or block at the firewall."
| Bot from Audit AI List | What It Usually Means | Initial Decision |
| --- | --- | --- |
| `GPTBot` | OpenAI model training | Block if you don't want training |
| `ChatGPT-User` | Entry on user request | Allow for public pages |
| `ClaudeBot` | Anthropic model training | Block or limit |
| `Anthropic-AI` | Older Anthropic identifier | Block along with `ClaudeBot` |
| `Claude-Web` | Older Claude identifier needing verification in logs | Monitor |
| `PerplexityBot` | Perplexity Search indexing | Allow if you want citations |
| `Google-Extended` | Control token for Gemini, not a separate UA request | Strategic decision |
| `CCBot` | Common Crawl, public web datasets | Block if you don't want datasets |
| `Bytespider` | ByteDance/TikTok/Doubao | Usually block or limit |
| `Cohere-AI` | Cohere-related crawler | Monitor or block |
| `Applebot-Extended` | Control for Apple AI data usage | Block if you don't want Apple training |
| `Amazonbot` | Amazon/Alexa/Search | Monitor, do not block right away |
| `Meta-ExternalAgent` | Meta AI crawler | Decision dependent on risk |
| `FacebookBot` | Previews and Meta systems | Do not block without testing previews |
| `OmgiliBot` | Webhose/Bright Data crawler | Usually block |
| `Diffbot` | Knowledge Graph and web crawl, not model training according to Diffbot | Monitor |
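Once each bot has a decision, the allow/disallow part can be generated mechanically. A minimal sketch: the two-column `decisions.txt` format and both file names are assumptions for illustration, not part of any standard tooling.

```shell
#!/bin/sh
# Sketch: turn a per-bot decision list into robots.txt stanzas.
cat > decisions.txt <<'EOF'
GPTBot disallow
ClaudeBot disallow
PerplexityBot allow
ChatGPT-User allow
EOF

: > robots_generated.txt
while read -r bot action; do
  echo "User-agent: $bot" >> robots_generated.txt
  if [ "$action" = "disallow" ]; then
    echo "Disallow: /" >> robots_generated.txt
  else
    echo "Allow: /" >> robots_generated.txt
  fi
  echo "" >> robots_generated.txt
done < decisions.txt

cat robots_generated.txt
```

Keeping decisions in a small file like this also gives you something to diff when you revisit them after 30 days.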
Ready Templates
Template for decisions per bot:
```
Bot:
Operator:
Purpose: training / AI search / user fetch / preview / unknown
Does it provide visibility to me:
Does it impact valuable content:
Number of requests in 7 days:
Most visited URLs:
Decision: allow / disallow / monitor / firewall
Date of next check:
```
Email template to hosting support:
```
Hello, please export the access logs for the domain example.com from the last
7 days. I want to check AI bot traffic by User-Agent, particularly: GPTBot,
ClaudeBot, PerplexityBot, ChatGPT-User, CCBot, Bytespider, Applebot-Extended,
Meta-ExternalAgent, and Diffbot. A .log or .gz file is sufficient.
```
Implementation Checklist
[ ] Download access logs from the last 7 days.
[ ] Search for 16 bot names from the Audit AI list.
[ ] Separate training bots from search bots.
[ ] Check whether the bot visits the blog, products, cart, account, or filter parameters.
[ ] Count the number of requests per bot.
[ ] Verify HTTP statuses: 200, 301, 403, 404, 429, 500.
[ ] Keep public FAQ and guides available for AI Search bots.
[ ] Block training bots if you don't want to use content in datasets.
[ ] Do not hide private data solely via robots.txt.
[ ] Add blocks for filter parameters if bots have crawled thousands of URLs.
[ ] Do not block FacebookBot if link previews are important.
[ ] Do not search for Google-Extended in logs as a separate User-Agent.
[ ] Check if WAF or Cloudflare isn’t blocking the bots you want to allow.
[ ] Set an alert if any bot makes more than 500 requests daily.
[ ] Revisit decisions after 30 days and compare traffic and logs.
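One checklist item suggests an alert when a bot exceeds 500 requests per day. A self-contained sketch of such a check: the log content here is generated synthetically so the script runs on its own, and `daily.log` is a made-up name for one day's worth of access-log lines.

```shell
#!/bin/sh
# Sketch: flag any bot exceeding a daily request threshold.
THRESHOLD=500
LOG=daily.log
: > "$LOG"
# Synthesize 600 Bytespider requests and one PerplexityBot request.
i=0
while [ "$i" -lt 600 ]; do
  echo '"GET /products?page=1 HTTP/1.1" 200 "Bytespider"' >> "$LOG"
  i=$((i + 1))
done
echo '"GET /faq HTTP/1.1" 200 "PerplexityBot"' >> "$LOG"

alerts=""
for bot in GPTBot Bytespider PerplexityBot; do
  n=$(grep -c "$bot" "$LOG")
  if [ "$n" -gt "$THRESHOLD" ]; then
    alerts="$alerts $bot:$n"
    echo "ALERT: $bot made $n requests today (limit $THRESHOLD)"
  fi
done
```

In practice you would run this from cron against the current day's log and send the `ALERT` lines by email.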
Common Mistakes
```
User-agent: *
Disallow: /
```
This closes the site not only to AI bots. It can also damage classic indexing, link previews, and diagnostics.
```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```
The first block may make sense. The second could cut visibility in ChatGPT searches. Before adding it, check the purpose.
```
User-agent: *
Disallow: /secret-b2b-offer/
Disallow: /wholesale-price-list-2026.pdf
```
This shows everyone where sensitive materials are located. Such content should require a login.
Measuring Effects
First signal: the number of bot requests decreases after implementing blocks, but public pages remain accessible.
Second signal: the number of 500 errors or timeouts does not increase during crawling.
Third signal: guides, FAQs, and category pages can still be fetched by AI Search bots.
Fourth signal: you see fewer requests to filters and parameters in the logs.
Fifth signal: customers still arrive from brand queries and AI Search, but private content is not publicly accessible.
FAQ
What is `GPTBot`, and does seeing it mean ChatGPT is currently reading my store?

No. `GPTBot` collects data for OpenAI model training. Pages fetched while a user is chatting show up as `ChatGPT-User`, and visibility in ChatGPT search runs through `OAI-SearchBot`.
Does Blocking `ClaudeBot` Remove My Old Content from Models?

No. A robots.txt block only stops future crawling. It does not retroactively remove content from datasets or models trained before the block.
Should I Block `PerplexityBot`?

Usually not. `PerplexityBot` builds visibility in Perplexity's results, so blocking it can cost you citations. Block it only if you do not want to appear in AI search at all.
Is robots.txt Sufficient for Data Protection?

No. Robots.txt is a public, voluntary instruction for crawlers. Anything genuinely private, such as wholesale price lists or customer panels, belongs behind a login.
Summary
Do not block "AI" as a single category. Block or allow for specific purposes: training, searching, user entry, or link previews. Start with logs, not emotions. If you want to check if your page is readable by AI agents and which signals to improve, run an audit at Audit AI.