Between AI’s environmental impact, its threat to jobs and art, and the fact that it is just not all that good, many people are not only skeptical of the tech industry’s latest fad; some are actively searching for ways to opt out.
Do you block AI crawlers from your personal website, either through robots.txt or by other means?
I wanted to know how many of the big sites are blocking AI crawlers, so I downloaded 734 robots.txt files from sites listed on Wikipedia’s List of most-visited websites and checked them for some common AI crawlers. I compiled the list below from various lists online; each entry links to darkvisitors.com, which has more information on that crawler.
- anthropic-ai
- Applebot-Extended
- Bytespider
- CCBot
- ChatGPT-User
- ClaudeBot
- cohere-ai
- Diffbot
- FacebookBot
- Google-Extended
- GPTBot
- ImagesiftBot
- Omgili
- Omgilibot
- PerplexityBot
And here are the results.
- 9 out of the 735 analyzed sites allow some AI crawlers on some of their content
- 227 sites block some of the AI crawlers from at least some of their content
- and 605 sites block all crawlers from accessing some of their content, which also covers AI crawlers
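For anyone who wants to run a similar tally, here is a rough Python sketch of how a single robots.txt file could be sorted into the three categories above. It is not the script behind these numbers; the function names and the exact category definitions (for example, treating any non-empty `Allow` rule for a listed crawler as "allows some AI crawlers") are assumptions made for illustration.

```python
# Crawler names from the list above, lowercased for case-insensitive matching.
AI_CRAWLERS = {
    "anthropic-ai", "applebot-extended", "bytespider", "ccbot",
    "chatgpt-user", "claudebot", "cohere-ai", "diffbot", "facebookbot",
    "google-extended", "gptbot", "imagesiftbot", "omgili", "omgilibot",
    "perplexitybot",
}


def parse_groups(text):
    """Return {user_agent: [(directive, path), ...]} with lowercased agent names."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are being read
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
                reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def classify(text):
    """Hypothetical categories; an empty path (e.g. 'Disallow:') is ignored."""
    groups = parse_groups(text)
    named = [rules for agent, rules in groups.items() if agent in AI_CRAWLERS]
    return {
        "allows_some_ai": any(d == "allow" and p for rules in named for d, p in rules),
        "blocks_some_ai": any(d == "disallow" and p for rules in named for d, p in rules),
        "blocks_all_crawlers": any(d == "disallow" and p for d, p in groups.get("*", [])),
    }


if __name__ == "__main__":
    sample = "user-agent: ChatGPT-User\nallow: /\n"
    print(classify(sample))
```

Note that the categories are not mutually exclusive, so one file can count toward more than one of them.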
All in all, most sites I looked at don’t care to have their content used to train AI. The online translation dictionary WordReference even has a little message in its robots.txt:
# now disallowing all other bots, No AI allowed, for permission, write us: https://forum.wordreference.com/misc/contact
User-agent: Google-Extended
User-agent: *
Allow: /ads.txt
Allow: /app-ads.txt
Disallow: /
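To see what these rules mean for a given crawler, the excerpt can be fed to Python’s standard `urllib.robotparser`. This is only a small sketch: the path `/definition/example` is a made-up placeholder, not a real WordReference URL.

```python
from urllib.robotparser import RobotFileParser

# The WordReference excerpt quoted above.
WORDREFERENCE_RULES = """\
# now disallowing all other bots, No AI allowed, for permission, write us: https://forum.wordreference.com/misc/contact
User-agent: Google-Extended
User-agent: *
Allow: /ads.txt
Allow: /app-ads.txt
Disallow: /
"""

parser = RobotFileParser()
parser.parse(WORDREFERENCE_RULES.splitlines())

# GPTBot is not named in the excerpt, so it falls under the "*" group.
for agent in ("Google-Extended", "GPTBot"):
    for path in ("/ads.txt", "/definition/example"):  # second path is hypothetical
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:16} {path:22} {verdict}")
```

Both crawlers may fetch the two ads files and nothing else, because the two `User-agent` lines share one group of rules.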
Interestingly, some sites actively encourage AI crawlers.
- The news website Mashable blocks GPTBot on all of its content, but has some very specific rules for Google-Extended, allowing it to crawl an archive of their most recent articles, but not beyond page 9.
User-agent: Google-Extended
Disallow: /
Disallow: /*?page=[0-9][0-9]
Disallow: /*?page=[0-9][0-9][0-9]
Disallow: /*?page=[0-9][0-9][0-9][0-9]
Allow: /*?page=[0-9]
- NBA.com also has some very specific rules for GPTBot
User-agent: GPTBot
Disallow: /
Allow: /standings
Allow: /schedule
Allow: /stats/help/glossary
Allow: /stats/draft/history
Allow: /stats/help/statminimums
Allow: /stats/history
Allow: /players
Allow: /player/*profile$
Allow: /team/*
Disallow: /team/*/schedule$
- The travel website Expedia is all in on ChatGPT
user-agent: ChatGPT-User
allow: /
- and their rival Kayak is not too far behind
User-agent: ChatGPT-User
Allow: /flights/
Allow: /hotels/
Allow: /cars/
Allow: /explore/
Allow: /sherlock/
Disallow: /api/
Disallow: /a/
Disallow: /i/
Disallow: /carreservation
Disallow: /hotelreservation
Disallow: /flightreservation
Disallow: /mscarreservation
Disallow: /SNflightreservation
Disallow: /msflightreservation
Disallow: /mshotelreservation
Disallow: /FDcarreservation
Disallow: /FDflightreservation
Disallow: /FDhotelreservation
Disallow: /in
Disallow: /h/
Disallow: /s/
Disallow: /k/
Disallow: /r/
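Rule sets like the NBA.com one above mix a blanket `Disallow: /` with narrower `Allow` lines, so the outcome depends on which rule wins. Under the Robots Exclusion Protocol (RFC 9309), the longest matching pattern takes precedence, `*` matches any run of characters, a trailing `$` anchors the end of the path, and a tie goes to `Allow`. The Python sketch below implements that logic by hand, since `urllib.robotparser` appears to do plain prefix matching without wildcard support. The function names and the sample paths (`lakers`, `123`, `/news`) are invented for illustration.

```python
import re

# The GPTBot rules from the NBA.com excerpt above.
NBA_GPTBOT_RULES = [
    ("disallow", "/"),
    ("allow", "/standings"),
    ("allow", "/schedule"),
    ("allow", "/stats/help/glossary"),
    ("allow", "/stats/draft/history"),
    ("allow", "/stats/help/statminimums"),
    ("allow", "/stats/history"),
    ("allow", "/players"),
    ("allow", "/player/*profile$"),
    ("allow", "/team/*"),
    ("disallow", "/team/*/schedule$"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Longest matching pattern wins; on a tie, 'allow' wins."""
    best_len, best_directive = -1, "allow"  # no matching rule means allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, best_directive = length, directive
    return best_directive == "allow"


if __name__ == "__main__":
    for path in ("/standings", "/team/lakers", "/team/lakers/schedule",
                 "/player/123/profile", "/news"):
        print(path, "allowed" if is_allowed(path, NBA_GPTBOT_RULES) else "blocked")
```

With these rules, team pages are open to GPTBot but each team’s schedule page is not, because `/team/*/schedule$` is longer than `/team/*`.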