What Is robots.txt?

robots.txt is a plain-text file served from the root of a host (for example, https://example.com/robots.txt) that tells web crawlers which paths they may or may not fetch. It is the oldest and most widely respected piece of crawler-control infrastructure on the web, and in the AI era it is the primary lever for managing both training and retrieval crawlers.

How robots.txt works

The file uses a simple directive grammar. Each block starts with a User-agent declaration identifying the crawler and is followed by Allow or Disallow rules:

User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

The User-agent: * block applies to every crawler that does not have a block of its own. A crawler that does have a specific block follows only that block and ignores the wildcard rules.
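As a rough illustration of how a crawler resolves these blocks, the sketch below feeds the example file to Python's standard-library urllib.robotparser. The bot name ExampleBot and the example.com URLs are placeholders, not real crawlers or pages. Note that this parser applies the first matching rule inside a block, which can differ from the longest-match behaviour of the major search crawlers.

```python
from urllib.robotparser import RobotFileParser

# The example file from this section, verbatim.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

BASE = "https://example.com"  # placeholder host


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# (crawler, path) pairs to check; ExampleBot is a made-up unlisted crawler.
cases = [
    ("Googlebot", "/private/report.html"),
    ("Googlebot", "/blog/"),
    ("GPTBot", "/blog/"),
    ("ExampleBot", "/private/report.html"),
]
results = {(agent, path): parser.can_fetch(agent, BASE + path) for agent, path in cases}

for (agent, path), allowed in results.items():
    print(f"{agent:<11} {path:<22} {'allowed' if allowed else 'blocked'}")
```

ExampleBot is allowed to fetch /private/report.html because the /private/ rule lives in the Googlebot block only; the wildcard block it falls back to allows everything.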

Common directives

Directive	Purpose
`User-agent`	Which crawler the following rules apply to
`Disallow`	Path pattern the crawler must not fetch
`Allow`	Path pattern explicitly allowed (overrides a parent Disallow)
`Sitemap`	URL of an XML sitemap (multiple allowed)
`Crawl-delay`	Seconds between requests (non-standard; honoured by some crawlers, ignored by Google)
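The directives in the table can be read back programmatically. The following is a minimal sketch using Python's urllib.robotparser; the file content, the /private/press/ path, the sitemap URL, and the ExampleBot name are all assumptions made for illustration. Because this parser uses the first matching rule, the Allow line is placed before the broader Disallow; site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file exercising every directive from the table.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Allow carves an exception out of the broader Disallow.
press_ok = parser.can_fetch("ExampleBot", "https://example.com/private/press/launch.html")
notes_ok = parser.can_fetch("ExampleBot", "https://example.com/private/notes.html")

# Crawl-delay and Sitemap are exposed through dedicated accessors.
delay = parser.crawl_delay("ExampleBot")
sitemaps = parser.site_maps()

print("press page allowed:", press_ok)
print("notes page allowed:", notes_ok)
print("crawl delay (s):", delay)
print("sitemaps:", sitemaps)
```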

What robots.txt does NOT do

It does not enforce. Compliance is voluntary. Reputable crawlers obey; malicious scrapers ignore it.
It does not prevent indexing. A blocked URL can still appear in search results if other sites link to it. Use noindex meta tags or HTTP headers to prevent indexing.
It is not a security control. Anyone can read your robots.txt and learn the paths you tried to hide. Use authentication for actual privacy.
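To make the noindex HTTP header concrete, here is a small sketch of a WSGI application that adds X-Robots-Tag: noindex to responses under certain paths. The path prefixes and the response_headers helper are hypothetical and exist only for this example; a real site would set the header in its web server or framework configuration.

```python
# Hypothetical path prefixes that should stay out of search indexes.
NOINDEX_PREFIXES = ("/drafts/", "/internal-search/")


def app(environ, start_response):
    """Minimal WSGI app that marks selected paths as noindex."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # The HTTP-header equivalent of <meta name="robots" content="noindex">.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]


def response_headers(path):
    """Call the app directly and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    app({"PATH_INFO": path}, start_response)
    return captured


print(response_headers("/drafts/launch.html"))
print(response_headers("/blog/"))
```

A crawler only sees this header if it is allowed to fetch the page, so a path marked noindex should not also be disallowed in robots.txt.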

robots.txt for AI crawlers

Many AI vendors run separate user agents for training and retrieval. A typical configuration in 2026 blocks the training crawlers and allows the retrieval crawlers:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers (so you can be cited)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

Blocking every AI crawler indiscriminately is the riskiest configuration: it removes you from AI answers and citations without undoing training on content already collected, because older crawls and third-party datasets persist.
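A policy like the one above can be sanity-checked before it is deployed. The sketch below parses a shortened version of that configuration with Python's urllib.robotparser and reports which agents may fetch a page. The access_report helper, the /pricing URL, and the ExampleBot name are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the training/retrieval split described above.
AI_POLICY = """\
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow retrieval crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def access_report(robots_txt, agents, url="https://example.com/pricing"):
    """Map each user agent to True (may fetch url) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    AI_POLICY,
    ["GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "ExampleBot"],
)
for agent, allowed in report.items():
    print(f"{agent:<14} {'allowed' if allowed else 'blocked'}")
```

ExampleBot, which the file never mentions, comes back as allowed: without a wildcard block, any crawler that is not listed may fetch everything. The explicit Allow: / blocks therefore mainly document intent.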

robots.txt vs. llms.txt vs. meta robots

File / tag	Purpose
`robots.txt`	Controls whether a crawler can fetch a path
`llms.txt`	Curates a model's view of your most important content
`<meta name="robots">`	Controls whether a fetched page may be indexed and shown in results

Common mistakes

Blocking the entire site by accident: Disallow: / under the wildcard block is a frequent cause of outages
Trying to hide sensitive URLs with robots.txt rather than authentication
Using Disallow to remove pages from the index: use noindex instead, and leave the page crawlable so the tag can be read
Forgetting to update robots.txt when launching a new section or retiring an old one
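The first mistake in this list is easy to catch automatically. One possible pre-deployment check, sketched with Python's urllib.robotparser, asks whether a crawler with no block of its own could still fetch the homepage. The function name and the made-up agent string are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocks_unlisted_crawlers(robots_txt: str) -> bool:
    """Return True if a crawler without its own block cannot fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A made-up agent name that no real file should list explicitly.
    return not parser.can_fetch("hypothetical-unlisted-crawler", "https://example.com/")


ACCIDENTAL_BLOCK = """\
User-agent: *
Disallow: /
"""

TARGETED_BLOCK = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocks_unlisted_crawlers(ACCIDENTAL_BLOCK))
print(blocks_unlisted_crawlers(TARGETED_BLOCK))
```

A check like this could run in a deployment pipeline and fail the build when it returns True for the production file.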

Frequently asked questions

Where does robots.txt live?

At the root of the host: https://example.com/robots.txt. Crawlers do not look for the file in subdirectories, and each subdomain needs its own file.
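As a small sketch of that rule, the function below derives the robots.txt location that governs any page URL by keeping the scheme and host and discarding the rest. The function name and the sample URLs are illustrative only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://example.com/blog/post?id=7",
    "https://shop.example.com/cart",
    "http://example.com:8080/a/b",
):
    print(url, "->", robots_txt_url(url))
```

The first two URLs resolve to different files because shop.example.com is a separate host from example.com.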

Do I need a robots.txt at all?

If you are happy for everything to be crawled, you can omit it, but it is good practice to publish one even if it just declares your sitemap.

Can I block a single bot but allow the rest?

Yes. Add a specific block for that user agent with Disallow: /; the wildcard block remains for everyone else.

Related Terms

What Are AI Crawlers?

Answer Engine Optimization (AEO)

What Is Source Attribution?

llms.txt
