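robots.txt Syntax and Examples Explained

The robots.txt file is the first file a search engine crawler checks when it visits your site. A correctly configured robots.txt keeps well-behaved crawlers away from private areas, saves crawl budget, and supports SEO. This guide covers every directive, the wildcard patterns, and the practical examples you need.

What is robots.txt?

robots.txt is a plain-text file placed at the root of a website (for example, https://example.com/robots.txt) that tells web crawlers which URLs they may or may not access. It follows the Robots Exclusion Protocol (REP), first proposed in 1994 and published as an IETF standard (RFC 9309) in 2022.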

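How crawlers use robots.txt:

When a crawler reaches your domain, it requests /robots.txt before crawling any other page.
If the file exists, the crawler parses the rules for its own User-agent.
The crawler follows the matching Disallow and Allow directives when deciding which URLs to fetch.
If no robots.txt is found (a 404 response), the crawler assumes everything may be crawled.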
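
The steps above can be sketched with Python's standard-library urllib.robotparser. This is an illustration under assumptions: the rules are parsed from an in-memory string instead of being fetched, ExampleBot is a hypothetical crawler name, and this parser uses simple first-match prefix rules without wildcard support, so it only approximates how the major search engines behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from memory instead of fetched over HTTP.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/
"""

def build_parser(text: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = build_parser(ROBOTS_TXT)

# A crawler follows the group that names it, not the * group.
googlebot_private = rules.can_fetch("Googlebot", "https://example.com/private/report")
googlebot_admin = rules.can_fetch("Googlebot", "https://example.com/admin/")

# A crawler without a group of its own falls back to the * group.
other_private = rules.can_fetch("ExampleBot", "https://example.com/private/report")

# An empty rule set stands in for the 404 case: everything is allowed.
anything = build_parser("").can_fetch("ExampleBot", "https://example.com/any/page")

print(googlebot_private, googlebot_admin, other_private, anything)
```

Note that Googlebot is not bound by the Disallow: /private/ rule in the * group, because a crawler uses only the group that matches it most specifically.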

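Important: robots.txt is advisory, not enforceable. Well-behaved bots (Googlebot, Bingbot) honor it, but malicious crawlers may ignore it entirely. For genuinely private content, use authentication or server-side access control.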

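Syntax rules and directives

A robots.txt file consists of one or more rule groups. Each group starts with a User-agent line, followed by directives:

User-agent - Names the crawler the rules apply to. Use * for all crawlers.
Disallow - Blocks a URL path. An empty value (Disallow:) blocks nothing.
Allow - Explicitly permits a URL path, overriding a broader Disallow. Supported by Googlebot and most modern crawlers.
Sitemap - Points to your XML sitemap. It can appear anywhere in the file and does not belong to a rule group.
Crawl-delay - Asks for a delay, in seconds, between successive requests. Bing and Yandex support it; Google does not.

Formatting rules:

One directive per line.
Lines starting with # are comments.
Blank lines conventionally separate rule groups. Under RFC 9309 a group begins at a User-agent line, but some older parsers treat a blank line as the end of a group.
Paths are case-sensitive (/Admin and /admin are different).
The file must be named robots.txt and placed at the root of the domain.

Basic syntax example: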

# This is a comment
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-page.html

# Slow down Bingbot
User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
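
To make the group structure concrete, here is a minimal sketch of a parser that splits a robots.txt file into groups and global Sitemap lines. The names Group and parse_robots are hypothetical, and the sketch follows the RFC 9309 convention that a group starts at a User-agent line; it is not a complete or validating parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Group:
    user_agents: List[str] = field(default_factory=list)
    rules: List[Tuple[str, str]] = field(default_factory=list)  # (directive, value)

def parse_robots(text: str) -> Tuple[List[Group], List[str]]:
    groups: List[Group] = []
    sitemaps: List[str] = []
    current: Optional[Group] = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if ":" not in line:
            continue                             # blank or malformed line
        name, value = line.split(":", 1)
        name, value = name.strip().lower(), value.strip()
        if name == "sitemap":
            sitemaps.append(value)               # global, not tied to a group
        elif name == "user-agent":
            # Consecutive User-agent lines share a group; otherwise start a new one.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
        elif current is not None:
            current.rules.append((name, value))
    return groups, sitemaps

EXAMPLE = """\
# This is a comment
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-page.html

# Slow down Bingbot
User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(EXAMPLE)
for group in groups:
    print(group.user_agents, group.rules)
print(sitemaps)
```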

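Wildcard patterns

Google, Bing, and most major crawlers support two wildcards in robots.txt paths:

* (asterisk) - Matches any sequence of characters, including the empty string.
$ (dollar sign) - Anchors the match to the end of the URL.

Wildcard pattern examples: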

User-agent: *

# Block all URLs containing "?sort="
Disallow: /*?sort=

# Block all .pdf files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?*

# Block all .json API responses
Disallow: /*.json$

# Block /temp/ under any parent directory (a top-level /temp/ is not matched)
Disallow: /*/temp/

# Allow specific .xml files (sitemaps)
Allow: /*.xml$

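Note: The original robots.txt specification did not include wildcards. They started as extensions supported by the major search engines and are now described in RFC 9309. Always test your patterns to make sure they match what you expect.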
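
One way to test a pattern is to translate it into a regular expression. The sketch below is a simplified model of the matching described above, not any crawler's actual implementation; the helper names pattern_to_regex and matches are hypothetical.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any character run.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)

def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, so an unanchored pattern is a prefix match.
    return pattern_to_regex(pattern).match(path) is not None

checks = [
    ("/*.pdf$", "/docs/guide.pdf"),
    ("/*.pdf$", "/docs/guide.pdf?download=1"),
    ("/*?sort=", "/shop?sort=price"),
    ("/*/temp/", "/a/temp/file"),
    ("/*/temp/", "/temp/file"),
]
for pattern, path in checks:
    print(pattern, path, matches(pattern, path))
```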

Common robots.txt examples

Block all crawlers from the entire site:

User-agent: *
Disallow: /

Allow all crawlers (explicit):

User-agent: *
Disallow:

Block specific crawlers:

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Block specific directories:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /staging/

Block a directory but allow a specific path inside it:

User-agent: *
Disallow: /api/
Allow: /api/public/
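
The Allow rule takes effect here because, under RFC 9309, the most specific (longest) matching rule wins, and Allow wins a tie. A minimal sketch of that precedence, using plain prefix matching without wildcards (is_allowed is a hypothetical helper):

```python
from typing import List, Tuple

def is_allowed(rules: List[Tuple[str, str]], path: str) -> bool:
    """rules holds (directive, prefix) pairs from one group: 'allow' or 'disallow'."""
    best_len = -1
    allowed = True                       # no matching rule means allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue                     # empty value or no match: rule does not apply
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed

api_rules = [("disallow", "/api/"), ("allow", "/api/public/")]
print(is_allowed(api_rules, "/api/public/status"))
print(is_allowed(api_rules, "/api/internal"))
print(is_allowed(api_rules, "/about"))
```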

Multiple rule groups:

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /private/

# Googlebot gets special access
User-agent: Googlebot
Disallow: /admin/
Allow: /private/google-partner/

# Block aggressive SEO bots entirely
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

WordPress robots.txt

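WordPress sites have specific directories and files that you usually want to keep crawlers out of, both to save crawl budget and to keep admin and utility pages out of the index: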

User-agent: *
# Block WordPress admin and login
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block WordPress includes
Disallow: /wp-includes/

# Block XML-RPC (security best practice)
Disallow: /xmlrpc.php

# Block trackbacks and pingbacks
Disallow: /trackback/
Disallow: /*/trackback/

# Block comment feeds
Disallow: /*/feed/
Disallow: /*/comments/

# Block search results pages
Disallow: /?s=
Disallow: /search/

# Block author archives (optional)
Disallow: /author/

# Block tag pages with thin content (optional)
Disallow: /tag/

# Allow all media uploads
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap_index.xml

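Tip: Do not block /wp-content/uploads/. That is where your media files live, and you want them indexed. Also, never block your CSS/JS files, because Google needs them to render pages. Note that /wp-includes/ also contains core scripts and styles, so consider dropping the Disallow: /wp-includes/ rule from the example above if your pages depend on those assets.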

Next.js / React SPA robots.txt

Modern JavaScript frameworks serve static assets and API routes that should not be indexed. Here is a recommended robots.txt for a Next.js application:

User-agent: *
# Block Next.js internal routes
Disallow: /_next/
Disallow: /api/

# Block internal utility pages
Disallow: /404
Disallow: /500

# Block query parameter variations
Disallow: /*?*

# Allow static assets that search engines need
Allow: /_next/static/
Allow: /_next/image/

# Allow public API endpoints (if any)
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml

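In Next.js 13+ (App Router), you can generate robots.txt programmatically by creating an app/robots.ts file whose default export is a function returning a MetadataRoute.Robots object. For a static file, place robots.txt in the public/ directory.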

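E-commerce robots.txt

E-commerce sites need a carefully configured robots.txt to keep crawlers out of user accounts, checkout flows, and the faceted navigation that generates duplicate content: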

User-agent: *
# Block user account pages
Disallow: /account/
Disallow: /my-account/
Disallow: /login/
Disallow: /register/
Disallow: /password-reset/

# Block checkout and cart
Disallow: /cart/
Disallow: /checkout/
Disallow: /order-confirmation/

# Block wishlist and compare
Disallow: /wishlist/
Disallow: /compare/

# Block internal search results
Disallow: /search/
Disallow: /*?q=
Disallow: /*?search=

# Block faceted navigation (duplicate content)
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?price=
Disallow: /*?page=

# Block review/rating sort variations
Disallow: /*?rating=

# Allow product and category pages
Allow: /products/
Allow: /category/
Allow: /collections/

Sitemap: https://example.com/sitemap.xml

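Warning: Do not block product or category pages. Those are your core pages. Block only utility paths, user-specific pages, and duplicate-content generators such as sort and filter parameters.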
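
One detail worth checking in the rules above: a pattern such as /*?sort= only matches when sort is the first query parameter. A URL like /shoes?color=red&sort=price contains &sort= rather than ?sort= and would not match. The sketch below generates both variants for a list of parameter names; the &-variant is an assumption about what a site may want, not part of the original example, and facet_rules is a hypothetical helper.

```python
from typing import List

def facet_rules(params: List[str]) -> List[str]:
    """Build Disallow lines covering a parameter in any query-string position."""
    lines = []
    for name in params:
        lines.append(f"Disallow: /*?{name}=")   # parameter comes first in the query
        lines.append(f"Disallow: /*&{name}=")   # parameter follows another one
    return lines

print("\n".join(facet_rules(["sort", "color"])))
```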

Blocking AI crawlers

With the rise of large language models, many site owners want to stop AI training crawlers from collecting their content. Below are the known AI crawler user agents and how to block them:

Known AI crawler user agents:

GPTBot - OpenAI's training data crawler
ChatGPT-User - OpenAI's agent for ChatGPT browsing on behalf of users
Google-Extended - Google's token for controlling AI training use (Gemini/Bard)
anthropic-ai - Anthropic's training data crawler for Claude
ClaudeBot - Anthropic's web crawler
CCBot - Common Crawl's bot (its data is used by many AI companies)
Bytespider - ByteDance's crawler
Amazonbot - Amazon's crawler

# Block all known AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# Still allow regular search engine crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
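
Because the list of AI crawlers keeps changing, it can help to generate this section from a list. The sketch below is one possible approach: it puts all blocked agents in a single group (several User-agent lines may share one set of rules) and then checks the result with Python's urllib.robotparser. The build_robots helper is hypothetical, and the agent list simply mirrors the one above.

```python
from typing import List
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
             "ClaudeBot", "CCBot", "Bytespider", "Amazonbot"]

def build_robots(blocked_agents: List[str]) -> str:
    # One group: every listed agent shares the same Disallow rule.
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"

robots_txt = build_robots(AI_AGENTS)

# Verify the generated file before deploying it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
gptbot_ok = parser.can_fetch("GPTBot", "https://example.com/article")
googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/article")

print(robots_txt)
print(gptbot_ok, googlebot_ok)
```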

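Note: Blocking AI crawlers in robots.txt does not retroactively remove content that has already been collected. It only prevents future crawling. New AI crawlers also appear regularly, so review your robots.txt periodically.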

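Sitemap directive

The Sitemap directive tells crawlers where to find your XML sitemap. It is especially convenient because it does not need a User-agent block; it applies globally: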

# Single sitemap
Sitemap: https://example.com/sitemap.xml

# Multiple sitemaps
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

# Sitemap index file
Sitemap: https://example.com/sitemap_index.xml

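Sitemap directive rules:

The URL must be a full absolute URL (including https://).
You can list multiple Sitemap directives, one per sitemap.
It can be placed anywhere in the file (it is not bound to a User-agent group).
Sitemap index files that reference other sitemaps are supported.

Benefit: Even if Google has already discovered your sitemap through Search Console, listing it in robots.txt lets every other crawler find it automatically.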
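
The absolute-URL rule is easy to check automatically. Here is a minimal sketch, assuming the robots.txt content is already available as a string; sitemap_urls is a hypothetical helper that separates usable Sitemap values from rejected ones.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

def sitemap_urls(robots_txt: str) -> Tuple[List[str], List[str]]:
    """Return (absolute_urls, rejected_values) collected from Sitemap lines."""
    valid, rejected = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "sitemap":
            continue
        value = value.strip()
        parts = urlsplit(value)
        # An absolute URL needs both a scheme and a host.
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(value)
        else:
            rejected.append(value)
    return valid, rejected

sample = """\
Sitemap: https://example.com/sitemap.xml
User-agent: *
Disallow: /tmp/
sitemap: /sitemap-posts.xml
"""
valid, rejected = sitemap_urls(sample)
print(valid, rejected)
```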

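Testing and validation

Always test your robots.txt before deploying it to production. A single typo can accidentally block search engines from crawling your entire site.

Google Search Console - The robots.txt report (the successor to the older robots.txt Tester) shows the robots.txt files Google fetched and flags parsing problems.
Online validators - For example, Merkle's robots.txt analyzer, which tests specific URLs against your rules.
curl - A quick way to check that your robots.txt is reachable: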

# Check if robots.txt is accessible
curl -I https://example.com/robots.txt

# View the full content
curl https://example.com/robots.txt

# Check what Googlebot sees (simulate Googlebot)
curl -A "Googlebot" https://example.com/robots.txt

# Test a specific URL against robots.txt (Python)
pip install robotexclusionrulesparser
python -c "
import robotexclusionrulesparser as rerp
rp = rerp.RobotExclusionRulesParser()
rp.fetch('https://example.com/robots.txt')
print(rp.is_allowed('Googlebot', '/private/page'))
"

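Common mistakes to avoid:

Placing robots.txt in a subdirectory instead of the domain root.
Using a relative URL in a Sitemap directive (it must be absolute).
Blocking the CSS/JS files that search engines need for rendering.
Forgetting that Disallow: / blocks everything, including your home page.
Not testing after changes. Always verify with Google Search Console.
Using incorrect line endings or a BOM character (use UTF-8 without a BOM).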
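
Several of these mistakes can be caught with a small pre-deployment check. The sketch below is a hypothetical linter, not a complete validator: it looks only for a UTF-8 BOM, misspelled directives, and a blanket Disallow: / in a group that applies to all crawlers. The set of known directives is limited to those covered in this guide.

```python
import codecs
from typing import List

KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(raw: bytes) -> List[str]:
    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    agents: List[str] = []       # user agents of the group being read
    in_rules = False             # True once the group has an Allow/Disallow line
    text = raw.decode("utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), 1):
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep or name not in KNOWN:
            problems.append(f"line {number}: unrecognized directive {name!r}")
        elif name == "user-agent":
            if in_rules:         # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif name in ("allow", "disallow"):
            in_rules = True
            if name == "disallow" and value == "/" and "*" in agents:
                problems.append(f"line {number}: Disallow: / blocks the whole site for all crawlers")
    return problems

broken = codecs.BOM_UTF8 + b"User-agent: *\nDissallow: /tmp/\nDisallow: /\n"
for problem in lint_robots(broken):
    print(problem)
```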

robots.txt vs meta robots vs X-Robots-Tag

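There are three main ways to control crawler behavior. Each serves a different purpose:

Method	Scope	Location	Blocks crawling	Blocks indexing	Granularity
robots.txt	Whole paths/directories	Domain root /robots.txt	Yes	No (the URL may still appear in search results)	URL path level
meta robots	A single page	HTML <head> tag	No (the page must be crawled for the tag to be seen)	Yes (noindex)	Per page
X-Robots-Tag	Any resource (PDFs, images, etc.)	HTTP response header	No (the resource must be fetched for the header to be seen)	Yes (noindex)	Per resource

Tip: To truly keep a page out of search results, use meta robots noindex or X-Robots-Tag noindex. robots.txt alone does not prevent indexing: even when Google cannot crawl a page, it can still discover the URL through links and index it. For the same reason, a noindex page must not be blocked in robots.txt, or the crawler will never see the noindex signal.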
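
To confirm that a page actually carries a noindex signal, you can inspect both the response headers and the HTML. The following is a simplified sketch that works on a headers dictionary and an HTML string you have already fetched; is_noindex is a hypothetical helper, and it ignores crawler-specific forms such as an X-Robots-Tag value prefixed with a user agent name.

```python
from html.parser import HTMLParser
from typing import Dict, List

class _MetaRobots(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""
    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

def is_noindex(headers: Dict[str, str], html: str = "") -> bool:
    # Header names are case-insensitive, so compare them in lowercase.
    header = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    if "noindex" in [d.strip().lower() for d in header.split(",")]:
        return True
    parser = _MetaRobots()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex({"X-Robots-Tag": "noindex, nofollow"}))
print(is_noindex({}, page))
print(is_noindex({"Content-Type": "text/html"}, "<html><head></head></html>"))
```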

FAQ

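Can robots.txt keep a page out of Google search results?

No. robots.txt blocks crawling, not indexing. If other sites link to your blocked page, Google may still show the URL in search results (without a snippet). To prevent indexing, use a meta robots noindex tag or an X-Robots-Tag: noindex HTTP header instead.

Where should I put my robots.txt file?

The robots.txt file must sit at the domain root: https://example.com/robots.txt. It has no effect in a subdirectory. For a subdomain (for example, blog.example.com), you need a separate robots.txt at the root of that subdomain.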
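
In code, the location rule amounts to keeping the scheme and host of a page URL and replacing everything else with /robots.txt. A minimal sketch (robots_url is a hypothetical helper):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain or port); drop path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://blog.example.com/2024/hello"))
```

The two calls return different files, which is why each subdomain needs its own robots.txt.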

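Can I use robots.txt to block AI crawlers such as GPTBot and ClaudeBot?

Yes. Add User-agent: GPTBot followed by Disallow: / to block OpenAI's crawler. Similarly, use User-agent: ClaudeBot with Disallow: / to block Anthropic's crawler. This only prevents future crawling; it does not remove data that was collected earlier.

What happens if my site has no robots.txt file?

If a crawler receives a 404 response for /robots.txt, it assumes there are no restrictions and will crawl every accessible page on your site. This is the default behavior defined in the Robots Exclusion Protocol.

What is the difference between Disallow: and Disallow: / in robots.txt?

Disallow: (empty value) disallows nothing, so crawlers can access everything. Disallow: / (with a slash) blocks the entire site. That one-character difference is critical and a common source of configuration errors.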