AEO Glossary

    What Is robots.txt?

    Updated May 19, 2026 · 3 min read

    robots.txt is a plain-text file at the root of a site that tells crawlers which paths they may fetch.

    robots.txt is a plain-text file served from the root of a domain (for example, https://example.com/robots.txt) that tells web crawlers which paths they are allowed or disallowed to fetch. It is the oldest and most universally respected piece of crawler-control infrastructure on the web — and in the AI era, it is the primary lever for managing both training and retrieval crawlers.

    How robots.txt works

    The file uses a simple directive grammar. Each block starts with a User-agent declaration identifying the crawler and is followed by Allow or Disallow rules:

    User-agent: Googlebot
    Disallow: /private/
    
    User-agent: GPTBot
    Disallow: /
    
    User-agent: *
    Allow: /
    

    User-agent: * applies to every crawler that does not have a specific block. A crawler that matches a named block follows only that block and ignores the wildcard rules; the two sets of rules are not merged.
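To see how the blocks above resolve for different crawlers, the file can be fed to Python's standard-library parser. This is a minimal sketch: `ExampleBot` is a hypothetical crawler name used to exercise the wildcard block, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot is a hypothetical crawler with no block of its own,
# so it falls through to the wildcard block.
checks = [
    ("Googlebot", "https://example.com/private/report"),
    ("Googlebot", "https://example.com/blog/"),
    ("GPTBot", "https://example.com/blog/"),
    ("ExampleBot", "https://example.com/private/report"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:<10} {url} -> {verdict}")
```

Googlebot is blocked only under /private/, GPTBot is blocked everywhere, and the unlisted crawler is allowed everywhere, including /private/, because the Googlebot rule does not apply to it.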

    Common directives

    Directive   | Purpose
    User-agent  | Which crawler the following rules apply to
    Disallow    | Path pattern the crawler must not fetch
    Allow       | Path pattern explicitly allowed (overrides a parent Disallow)
    Sitemap     | URL of an XML sitemap (multiple allowed)
    Crawl-delay | Seconds between requests (honoured by some crawlers)
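The directives in the table can be read back programmatically. The sketch below uses a hypothetical robots.txt and Python's `urllib.robotparser`. One assumption to flag: this parser applies rules in file order (first match wins), so the Allow line is placed before the broader Disallow. Crawlers that follow the longest-match rule would reach the same result regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file exercising each directive from the table.
DIRECTIVES_TXT = """\
User-agent: *
Allow: /docs/public/
Disallow: /docs/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

directive_parser = RobotFileParser()
directive_parser.parse(DIRECTIVES_TXT.splitlines())

agent = "ExampleBot"  # hypothetical crawler name

# Disallow blocks the parent path; Allow carves out an exception beneath it.
docs_allowed = directive_parser.can_fetch(agent, "https://example.com/docs/internal.html")
public_allowed = directive_parser.can_fetch(agent, "https://example.com/docs/public/intro.html")

# Crawl-delay and Sitemap are exposed through dedicated accessors.
delay = directive_parser.crawl_delay(agent)
sitemaps = directive_parser.site_maps()

print(docs_allowed, public_allowed, delay, sitemaps)
```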

    What robots.txt does NOT do

    • It does not enforce. Compliance is voluntary. Reputable crawlers obey; malicious scrapers ignore it.
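    • It does not prevent indexing. A blocked URL can still appear in search results if other sites link to it. To keep a page out of the index, use a noindex meta tag or an X-Robots-Tag HTTP header, and leave the page crawlable so the crawler can actually see the directive.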
    • It is not a security control. Anyone can read your robots.txt and learn the paths you tried to hide. Use authentication for actual privacy.
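For the indexing point above, the usual alternative is a noindex signal sent with the page itself, for example through the `X-Robots-Tag` response header. The following is a minimal sketch of a WSGI app that adds the header for selected paths. The path prefixes and the `response_headers` helper are hypothetical and exist only for illustration.

```python
# Hypothetical path prefixes that should stay out of search indexes.
NOINDEX_PREFIXES = ("/drafts/", "/internal-search/")


def app(environ, start_response):
    """Minimal WSGI app that marks selected paths as noindex."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # The page stays fetchable; the header asks crawlers not to index it.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]


def response_headers(path):
    """Call the app in-process and return the headers it would send."""
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    app({"PATH_INFO": path}, start_response)
    return captured


print(response_headers("/drafts/post"))
print(response_headers("/blog/post"))
```

To try it over HTTP, the app can be served locally with `wsgiref.simple_server.make_server`. These paths should not also be disallowed in robots.txt, or a compliant crawler will never fetch the page and see the header.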

    robots.txt for AI crawlers

    Many AI vendors now run separate user agents for training and retrieval. A reasonable starting point in 2026 looks something like:

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
    # Anthropic: ClaudeBot gathers training data; Claude-SearchBot serves search
    User-agent: ClaudeBot
    Disallow: /
    
    User-agent: anthropic-ai
    Disallow: /
    
    User-agent: CCBot
    Disallow: /
    
    User-agent: Bytespider
    Disallow: /
    
    # Allow retrieval crawlers (so you can be cited)
    User-agent: OAI-SearchBot
    Allow: /
    
    User-agent: PerplexityBot
    Allow: /
    
    User-agent: Claude-SearchBot
    Allow: /
    

    Blocking all AI crawlers indiscriminately is the riskiest configuration: it removes you from AI answers and citations without undoing any past use of your content for training, because older crawls and third-party datasets persist.
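A split policy like this is easier to maintain when it is generated from two lists and then checked before deployment. The sketch below is one possible approach, not a recommended tool: `render_ai_policy` is a hypothetical helper, and the agent lists are a small illustrative subset that should be confirmed against each vendor's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subsets only; verify names against vendor documentation.
TRAINING_AGENTS = ["GPTBot", "CCBot"]
RETRIEVAL_AGENTS = ["OAI-SearchBot", "PerplexityBot"]


def render_ai_policy(blocked, allowed):
    """Render one robots.txt block per agent: blocked first, then allowed."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines)


policy_text = render_ai_policy(TRAINING_AGENTS, RETRIEVAL_AGENTS)

# Parse the generated text and confirm each agent gets the intended verdict.
policy = RobotFileParser()
policy.parse(policy_text.splitlines())

page = "https://example.com/guide"
for agent in TRAINING_AGENTS + RETRIEVAL_AGENTS:
    verdict = "allowed" if policy.can_fetch(agent, page) else "blocked"
    print(f"{agent:<14} {verdict}")
```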

    robots.txt vs. llms.txt vs. meta robots

    File / tag           | Purpose
    robots.txt           | Controls whether a crawler can fetch a path
    llms.txt             | Curates a model's view of your most important content
    <meta name="robots"> | Controls whether a fetched page may be indexed and shown in results

    Common mistakes

    • Blocking the entire site by accident. A Disallow: / rule under the wildcard user agent is a frequent cause of outages
    • Trying to hide sensitive URLs with robots.txt rather than authentication
    • Using Disallow to remove pages from the index — use noindex instead
    • Forgetting to update robots.txt when launching a new section or retiring an old one
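The first mistake in the list is cheap to catch automatically. As a rough sketch, a pre-deploy check could parse the file and ask whether a crawler with no block of its own may fetch the homepage. `blocks_unlisted_crawlers` and the probe name `UnlistedCrawler` are hypothetical, and the check relies on Python's standard-library parser rather than any particular search engine's implementation.

```python
from urllib.robotparser import RobotFileParser


def blocks_unlisted_crawlers(robots_txt: str, probe_agent: str = "UnlistedCrawler") -> bool:
    """Return True if a crawler without its own block cannot fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(probe_agent, "/")


# A wildcard Disallow: / locks out every crawler that has no named block.
accidental = "User-agent: *\nDisallow: /\n"

# Blocking one named bot leaves everyone else unaffected.
targeted = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(blocks_unlisted_crawlers(accidental))
print(blocks_unlisted_crawlers(targeted))
```

Running a check like this in CI, and failing the build when it returns True for the production file, is one way to keep a staging configuration from shipping.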

    Frequently asked questions

    Where does robots.txt live?

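    At the root of the host: https://example.com/robots.txt. Each subdomain needs its own file, as does each distinct scheme or port. A robots.txt placed in a subdirectory is ignored, because crawlers only request the file from the root.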
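Because the file is looked up per host, its location can be derived from any page URL by keeping the scheme and host and replacing the path. A minimal sketch, where `robots_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/docs/page?x=1"))
print(robots_url("https://blog.example.com/2026/post"))
```

The two calls return different URLs, which is why a rule published on example.com has no effect on blog.example.com.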

    Do I need a robots.txt at all?

    If you are happy for everything to be crawled, you can omit it — but it is good practice to publish one even if it just declares your sitemap.

    Can I block a single bot but allow the rest?

    Yes. Add a specific block for that user agent with Disallow: /; the wildcard block remains for everyone else.
