Diffbot — Knowledge Graph Extraction Crawler

Diffbot extracts structured data for its knowledge graph, which it licenses to AI companies. Learn how to block it and what that means for downstream models.

QUICK FACTS

USER-AGENT Diffbot
OPERATOR Diffbot
CATEGORY Open Dataset
FIRST SEEN 2015
ROBOTS.TXT ✓ Respects directives
DOCUMENTATION Official docs →

What is Diffbot?

Diffbot builds a structured knowledge graph of the web by extracting entities, facts, and relationships from web pages. This knowledge graph is licensed to AI companies, search engines, and enterprise clients. Diffbot's data has been used as training input by multiple large AI model developers.

How to Block Diffbot

Add the following to your robots.txt file (located at the root of your website):

User-agent: Diffbot
Disallow: /
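
To check the rule before deploying it, you can parse it locally with Python's standard-library urllib.robotparser. This is a minimal sketch: the example.com URL and the "SomeOtherBot" name are placeholders, and the parser only shows how a compliant crawler should interpret the file, not how Diffbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders for illustration only.
diffbot_allowed = is_allowed(ROBOTS_TXT, "Diffbot", "https://example.com/any/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/any/page")

print(diffbot_allowed)  # False: the Diffbot group disallows every path
print(other_allowed)    # True: no group applies to other crawlers
```

The rule is scoped to the Diffbot user-agent, so crawlers that do not match it remain unaffected.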

What Happens When You Block Diffbot

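Once Diffbot reads the updated robots.txt, it should stop extracting data from your pages for its knowledge graph. Downstream AI models that license Diffbot data should then no longer receive your content in future builds. The block affects future crawls only: content extracted before the rule took effect may remain in the knowledge graph and in models already built from it.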
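
One way to confirm the block is working is to scan your server access log for page requests from Diffbot after the rule goes live. The sketch below assumes the common "combined" log format and matches any user-agent string containing "diffbot" (case-insensitive). The sample log lines, including the user-agent text, are invented for illustration and are not Diffbot's actual user-agent string.

```python
import re

# Assumes the combined log format: request line, status, size, referer, user-agent.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def diffbot_page_requests(log_lines, token="diffbot"):
    """Return paths requested by a matching user-agent, ignoring /robots.txt."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Skip lines that are not in the assumed format.
        is_diffbot = token in match.group("agent").lower()
        if is_diffbot and match.group("path") != "/robots.txt":
            hits.append(match.group("path"))
    return hits


# Hypothetical log lines; the user-agent text is a stand-in.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleAgent (Diffbot)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:09 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "ExampleAgent (Diffbot)"',
    '198.51.100.7 - - [01/Jan/2025:00:01:00 +0000] "GET /articles/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = diffbot_page_requests(SAMPLE_LOG)
print(hits)  # ['/articles/1']
```

Requests for /robots.txt are expected even from a blocked crawler, so they are excluded. Any other path in the result after the rule is live would suggest the block is not being honored or has not yet been picked up.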

Should You Block Diffbot?

Diffbot builds a licensed knowledge graph that multiple downstream AI companies use. Blocking it keeps your content out of future extraction runs, but data extracted before the block may already have reached licensees. Because a single rule affects many downstream models at once, this is a broad opt-out.

Diffbot vs Other Crawlers

Diffbot operates a single crawler under the Diffbot user-agent. Unlike companies such as OpenAI and Anthropic, which split functionality across multiple user-agents, Diffbot uses one identifier for its AI crawling operations, so a single robots.txt rule covers it.

GENERATE YOUR ROBOTS.TXT

Use our visual generator to create a robots.txt file that blocks Diffbot and any other crawlers you want to opt out of.