CCBot — Common Crawl's Web Crawler

CCBot collects the pages behind Common Crawl's open web dataset, which many AI models use as training data. Learn how to block it and what blocking cannot undo.

QUICK FACTS

USER-AGENT: CCBot
OPERATOR: Common Crawl
CATEGORY: Open Dataset
FIRST SEEN: 2013
ROBOTS.TXT: ✓ Respects directives
DOCUMENTATION: Official docs

What is CCBot?

CCBot is the crawler for Common Crawl, a non-profit that maintains a large open dataset of web pages. Many AI companies, including those building smaller or open-source models, use Common Crawl snapshots as training data. Blocking CCBot keeps your site out of future snapshots, but older snapshots are already publicly available and already in use by downstream models.

How to Block CCBot

Add the following to your robots.txt file (located at the root of your website):

User-agent: CCBot
Disallow: /
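
To sanity-check the rule before you deploy it, you can run it through a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URL are illustrative assumptions, and the result reflects how Python's parser reads the file, which may differ in edge cases from how CCBot itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper: it parses the text locally and makes no
    network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # CCBot is blocked from every path on the site.
    print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page"))
    # Crawlers with no matching rule are unaffected.
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

The first call should report that CCBot is not allowed, while the second shows that the rule leaves other crawlers alone.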

What Happens When You Block CCBot

Future Common Crawl snapshots will not include your content. Past snapshots are already public and cannot be retracted.
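
If you want to see whether past snapshots already contain your pages, Common Crawl's public URL index can be queried per crawl. The sketch below only builds the query URL and makes no request. The function name build_index_query is hypothetical, the crawl identifier "CC-MAIN-2024-10" is used purely as an example value, and the endpoint layout is an assumption based on the public index server at index.commoncrawl.org, so check Common Crawl's own documentation for current crawl identifiers and parameters.

```python
from urllib.parse import urlencode

# Assumed base address of Common Crawl's public URL index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every captured page under `domain`.

    `crawl_id` names one snapshot (example value: "CC-MAIN-2024-10").
    The trailing "/*" asks for all paths on the domain.
    """
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


if __name__ == "__main__":
    # Open the printed URL in a browser or fetch it with an HTTP client.
    print(build_index_query("CC-MAIN-2024-10", "example.com"))
```

Captures listed for an older crawl will stay in that snapshot even after you add the robots.txt rule.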

Should You Block CCBot?

Because CCBot feeds an open dataset that many AI companies draw on, blocking it is a broad opt-out: a single rule keeps your content out of future snapshots and, through them, out of many downstream models at once. It does nothing about past snapshots, which remain public.

CCBot vs Other Common Crawl Crawlers

Common Crawl currently operates CCBot as a standalone crawler. Unlike companies such as OpenAI and Anthropic, which split functionality across multiple user-agents, Common Crawl uses a single identifier, CCBot, for all of its crawling, so one robots.txt rule covers it.
