What Is robots.txt?
robots.txt is a plain-text file at the root of a site that tells crawlers which paths they may fetch.
robots.txt is a plain-text file served from the root of a domain (for example, https://example.com/robots.txt) that tells web crawlers which paths they are allowed or disallowed to fetch. It is the oldest and most universally respected piece of crawler-control infrastructure on the web — and in the AI era, it is the primary lever for managing both training and retrieval crawlers.
How robots.txt works
The file uses a simple directive grammar. Each block starts with a User-agent declaration identifying the crawler and is followed by Allow or Disallow rules:
```text
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```
User-agent: * applies to every crawler that does not have a specific block. More specific blocks override the wildcard for that crawler.
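One way to check how these blocks resolve is to feed the same rules to Python's standard-library `urllib.robotparser`. The following is a minimal sketch, not a reference implementation: `load_rules`, `ExampleBot`, and the page URLs are hypothetical names chosen for illustration, and real crawlers may parse edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Googlebot has its own block, so only /private/ is off limits for it.
googlebot_private = rules.can_fetch("Googlebot", "https://example.com/private/report.html")
googlebot_blog = rules.can_fetch("Googlebot", "https://example.com/blog/")

# GPTBot's block disallows every path.
gptbot_blog = rules.can_fetch("GPTBot", "https://example.com/blog/")

# ExampleBot (hypothetical) has no block of its own and falls back to the wildcard.
other_blog = rules.can_fetch("ExampleBot", "https://example.com/blog/")

print(googlebot_private, googlebot_blog, gptbot_blog, other_blog)
```

Under this parser, the Googlebot and GPTBot answers come from their own blocks, and only the unlisted crawler is decided by the wildcard block.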
Common directives
| Directive | Purpose |
|---|---|
| User-agent | Which crawler the following rules apply to |
| Disallow | Path pattern the crawler must not fetch |
| Allow | Path pattern explicitly allowed (overrides a parent Disallow) |
| Sitemap | URL of an XML sitemap (multiple allowed) |
| Crawl-delay | Seconds between requests (honoured by some crawlers) |
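The directives in the table can be exercised together in a small sketch using Python's `urllib.robotparser` (Python 3.8 or later for `site_maps()`). The paths, the sitemap URL, and the `ExampleBot` name are assumptions made for illustration. One caveat: this module applies the first rule that matches, so the `Allow` line is placed before the broader `Disallow` here, whereas crawlers that follow RFC 9309 pick the longest matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one allowed subfolder inside a disallowed parent,
# plus a crawl delay and a sitemap declaration.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Allow carves an exception out of the parent Disallow.
press_allowed = parser.can_fetch("ExampleBot", "https://example.com/private/press/kit.html")
notes_allowed = parser.can_fetch("ExampleBot", "https://example.com/private/notes.html")

# Crawl-delay is reported in seconds; Sitemap URLs come back as a list.
delay = parser.crawl_delay("ExampleBot")
sitemaps = parser.site_maps()

print(press_allowed, notes_allowed, delay, sitemaps)
```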
What robots.txt does NOT do
- It does not enforce. Compliance is voluntary. Reputable crawlers obey; malicious scrapers ignore it.
- It does not prevent indexing. A blocked URL can still appear in search results if other sites link to it. Use `noindex` meta tags or HTTP headers to prevent indexing, and leave those pages crawlable so the crawler can actually see the directive.
- It is not a security control. Anyone can read your robots.txt and learn the paths you tried to hide. Use authentication for actual privacy.
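The HTTP-header route for `noindex` is usually the `X-Robots-Tag` response header. As a rough sketch of the idea, the WSGI application below attaches that header to a couple of path prefixes. The prefixes, the `app` and `response_headers` names, and the response body are all hypothetical, and a real site would more likely set the header in its web server or framework configuration.

```python
# Hypothetical path prefixes that should stay out of search indexes.
NOINDEX_PREFIXES = ("/drafts/", "/internal-search/")


def app(environ, start_response):
    """Minimal WSGI app that adds X-Robots-Tag: noindex on selected paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example page</title>"]


def response_headers(path):
    """Call the app once and return the headers it would send for `path`."""
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["headers"]


print(response_headers("/drafts/post-1"))
print(response_headers("/blog/"))
```

For local experimentation the app could be served with `wsgiref.simple_server.make_server`. The paths carrying the header should not also be disallowed in robots.txt, since a crawler that never fetches the page never sees the header.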
robots.txt for AI crawlers
Modern AI vendors run separate user agents for training and retrieval. The most useful AI block in 2026 looks something like:
```text
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers (so you can be cited)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /
```
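A split policy like this is easier to maintain when it is generated from two lists and then checked before publishing. The sketch below assumes a hypothetical helper, `build_ai_policy`, and uses only a subset of the user agents above as sample input. Vendors rename and add crawlers over time, so the lists should be taken from each vendor's current documentation rather than from this example.

```python
from urllib.robotparser import RobotFileParser


def build_ai_policy(training_agents, retrieval_agents):
    """Render robots.txt text that blocks one list of agents and allows another."""
    lines = ["# Block training crawlers"]
    for agent in training_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("# Allow retrieval crawlers (so you can be cited)")
    for agent in retrieval_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines)


# Sample input only; verify current names against vendor documentation.
TRAINING = ["GPTBot", "Google-Extended", "CCBot"]
RETRIEVAL = ["OAI-SearchBot", "PerplexityBot"]

policy = build_ai_policy(TRAINING, RETRIEVAL)

# Parse the generated text and confirm each agent gets the intended answer.
parser = RobotFileParser()
parser.parse(policy.splitlines())
blocked = [a for a in TRAINING if not parser.can_fetch(a, "https://example.com/")]
allowed = [a for a in RETRIEVAL if parser.can_fetch(a, "https://example.com/")]

print(policy)
print("blocked:", blocked)
print("allowed:", allowed)
```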
Blocking all AI crawlers indiscriminately is the riskiest configuration: it removes your pages from AI answers and citations without undoing any training use that has already happened, because older crawls and third-party datasets persist.
robots.txt vs. llms.txt vs. meta robots
| File / tag | Purpose |
|---|---|
| robots.txt | Controls whether a crawler can fetch a path |
| llms.txt | Curates a model's view of your most important content |
| `<meta name="robots">` | Controls whether a fetched page may be indexed and shown in results |
Common mistakes
- Blocking the entire site by accident: `Disallow: /` under a wildcard is a frequent outage cause
- Trying to hide sensitive URLs with robots.txt rather than authentication
- Using `Disallow` to remove pages from the index; use `noindex` instead
- Forgetting to update robots.txt when launching a new section or retiring an old one
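The first mistake in the list can be caught with a small pre-deploy check. The sketch below is one possible approach using Python's `urllib.robotparser`. `wildcard_blocks_site` and the probe user agent `zz-deploy-check` are hypothetical names, chosen so that no crawler-specific block is likely to match and the wildcard block decides the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent that no specific block should match,
# so the result reflects the wildcard (User-agent: *) rules.
PROBE_AGENT = "zz-deploy-check"


def wildcard_blocks_site(robots_txt: str) -> bool:
    """Return True if an unlisted crawler would be barred from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(PROBE_AGENT, "/")


accidental = "User-agent: *\nDisallow: /\n"
intended = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(wildcard_blocks_site(accidental))
print(wildcard_blocks_site(intended))
```

A check like this could run in CI and fail the build when the function returns True for the file about to be published.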
Frequently asked questions
Where does robots.txt live?
At the root of the host: https://example.com/robots.txt. Each subdomain needs its own file, and a robots.txt placed in a subdirectory is ignored.
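Because the file is looked up per host, the robots.txt location for any page can be derived from the page URL alone. A minimal sketch, assuming a hypothetical helper named `robots_url` and illustrative example.com URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://shop.example.com/cart"))
```

The second call resolves to a different file than the first, which is why a subdomain is not covered by the rules published on the main host.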
Do I need a robots.txt at all?
If you are happy for everything to be crawled, you can omit it — but it is good practice to publish one even if it just declares your sitemap.
Can I block a single bot but allow the rest?
Yes. Add a specific block for that user agent with Disallow: /; the wildcard block remains for everyone else.
Related Terms
llms.txt
llms.txt is a proposed plain-text file at the root of a site. It gives large language models a curated, machine-readable map of the pages that matter most.
What Are AI Crawlers?
AI crawlers are bots run by AI vendors. Some fetch pages to train models, others fetch pages live to power answers inside AI search products.
Answer Engine Optimization (AEO)
Answer Engine Optimization is the work of becoming the cited source inside AI answers from ChatGPT, Gemini, Claude, and Perplexity, not just a blue link on Google.
What Is Source Attribution?
Source attribution is the practice of an AI system naming and linking the sources it used to generate an answer.
