
The AI Bots That ~140 Million Websites Block the Most

Patrick Stox
AI bots power some of the most advanced technologies we use today, from search engines to AI assistants. However, their increasing presence has led to a growing number of websites blocking them.

There’s a cost to bots crawling your website, and there’s a social contract between search engines and website owners: search engines add value by sending referral traffic back to the sites they crawl. This is what keeps most websites from blocking search engines like Google, even as Google seems intent on keeping more of that traffic for itself.

When we looked at the traffic makeup of ~35K websites in Ahrefs Analytics, we found that AI sends just 0.1% of total referral traffic—far behind that of search.

Traffic by channel for ~35K websites in Ahrefs Analytics: Search 43.8%, Direct 42.3%, Social 13.1%, Paid 0.5%, Email 0.2%, LLM 0.1%

I think many site owners want to let these bots learn about their brand, business, products, and offerings. But while many people are betting that these systems are the future, they currently run the risk of not adding enough value for website owners.

The first LLM to add more value to users by showing impressions and clicks to website owners will likely have a big advantage. Companies will report on the metrics from that LLM, which will likely increase adoption and prevent more websites from blocking their bot.

The bots are using resources, using the data to train their AIs, and creating potential privacy issues. As a result, many websites are choosing to block AI bots.

We looked at ~140 million websites and our data shows that block rates for AI bots have increased significantly over the past year. I want to give a huge thanks to our data scientist Xibeijia Guan for pulling this data.

  • The number of AI bots has more than doubled since August 2023, with 21 major AI bots now active on the web.
  • GPTBot (OpenAI) is the most blocked AI bot, with 5.89% of all websites blocking it.
  • ClaudeBot (Anthropic) saw the highest growth in block rate, up 32.67% over the past year.

The most blocked bots are also the most popular ones. Lesser-known bots are likely blocked less simply because fewer site owners have heard of them and they crawl less.

We looked at the total number of websites blocking the bots. There are many ways to block bots with robots.txt, and this accounts for all of them, including:

  • Explicit blocks, where the bot is named and disallowed
  • General blocks, where all bots may be blocked
  • Cases where a directive allowed a bot back in after a general block of all bots

Caveat: this doesn’t include other block types such as firewalls or IP blocks. The sketch below shows how these robots.txt rules interact for a given bot.
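If you want to check how a robots.txt treats a specific bot, including the allow-after-general-block case above, Python’s standard urllib.robotparser applies the directive precedence for you. This is a minimal illustration with a made-up robots.txt, not the parser we used for the study:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt covering the cases above: a general block of all
# bots, an explicit block of GPTBot, and an Allow directive that
# carves Googlebot back out of the general block.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("GPTBot", "ClaudeBot", "Googlebot"):
    status = "allowed" if parser.can_fetch(bot, "/") else "blocked"
    print(f"{bot}: {status}")
# GPTBot: blocked (explicitly named), ClaudeBot: blocked (falls under
# the * group), Googlebot: allowed (its own group overrides the block)
```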

As I mentioned earlier, the most blocked bot is GPTBot. It’s the most active AI bot according to Cloudflare Radar.

Bots that crawl the most according to Cloudflare Radar

There is a moderate positive correlation between request rate and block rate for these bots: bots that make more requests tend to be blocked more often. The nerdy numbers: a Pearson correlation coefficient of 0.512 and a p-value of 0.0149, statistically significant at the 5% level.

Bots that crawl more are typically blocked more
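For reference, here’s how that kind of test is computed. A minimal sketch using scipy’s pearsonr with placeholder numbers, not our dataset:

```python
from scipy.stats import pearsonr

# Placeholder data: per-bot request rates (arbitrary units) and block
# rates (% of sites blocking). Illustrative values only.
request_rates = [9.1, 7.4, 6.2, 3.8, 2.5, 1.1, 0.6]
block_rates = [5.9, 5.7, 5.9, 5.6, 5.6, 5.5, 5.4]

r, p = pearsonr(request_rates, block_rates)
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")
# The relationship is statistically significant at the 5% level
# whenever p < 0.05.
```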

Here’s the data for the overall blocks:

Block rate of AI bots

Here is the total number of websites blocking AI bots:

Total websites blocking AI bots

Here’s the data:

| Bot Name | Count | Percentage | Bot Operator |
|---|---|---|---|
| GPTBot | 8,245,987 | 5.89% | OpenAI |
| CCBot | 8,188,656 | 5.85% | Common Crawl |
| Amazonbot | 8,082,636 | 5.78% | Amazon |
| Bytespider | 8,024,980 | 5.74% | ByteDance |
| ClaudeBot | 8,023,055 | 5.74% | Anthropic |
| Google-Extended | 7,989,344 | 5.71% | Google |
| anthropic-ai | 7,963,740 | 5.69% | Anthropic |
| FacebookBot | 7,931,812 | 5.67% | Meta |
| omgili | 7,911,471 | 5.66% | Webz.io |
| Claude-Web | 7,909,953 | 5.65% | Anthropic |
| cohere-ai | 7,894,417 | 5.64% | Cohere |
| ChatGPT-User | 7,890,973 | 5.64% | OpenAI |
| Applebot-Extended | 7,888,105 | 5.64% | Apple |
| Meta-ExternalAgent | 7,886,636 | 5.64% | Meta |
| Diffbot | 7,855,329 | 5.62% | Diffbot |
| PerplexityBot | 7,844,977 | 5.61% | Perplexity |
| Timpibot | 7,818,696 | 5.59% | Timpi |
| Applebot | 7,768,055 | 5.55% | Apple |
| OAI-SearchBot | 7,753,426 | 5.54% | OpenAI |
| Webzio-Extended | 7,745,014 | 5.54% | Webz.io |
| Meta-ExternalFetcher | 7,744,251 | 5.54% | Meta |
| Kangaroo Bot | 7,739,707 | 5.53% | Kangaroo LLM |

It gets a little more complicated. For the above, we looked at the main robots.txt file for each website, but every subdomain can have its own set of instructions. If we look at all ~461M robots.txt files, the total block rate for GPTBot rises to 7.3%.
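That means a thorough audit has to fetch robots.txt per hostname, not just at the root domain. A rough sketch, with hypothetical hostnames:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical hostnames for one site. Each subdomain can publish its
# own robots.txt with different rules for AI bots.
hosts = ["example.com", "blog.example.com", "shop.example.com"]

for host in hosts:
    parser = RobotFileParser(f"https://{host}/robots.txt")
    parser.read()  # fetches and parses that host's robots.txt
    ok = parser.can_fetch("GPTBot", f"https://{host}/")
    print(f"{host}: GPTBot {'allowed' if ok else 'blocked'}")
```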

AI bot blocks over time

More top-trafficked sites began blocking AI bots in 2024, but the overall rate declined toward the end of the year. That decline appears to come mostly from generic blocks; blocks that target AI bots specifically are still increasing, as I’ll show in a minute.

AI bot block rate over time by traffic

Do certain types of sites block AI bots more?

Here’s how it breaks down for each individual bot across different categories of websites. I expected news sites to block these bots the most, given all the stories about news publishers blocking them, but arts & entertainment (45% blocked) and law & government (42% blocked) sites blocked them more.

AI block rate over time by domain category

The decision to block AI bots varies by industry. There can be a number of unique reasons for this. These are somewhat speculative:

  • Arts and Entertainment: ethical aversions and a reluctance to become training data.
  • Books and Literature: copyright concerns.
  • Law and Government: legal worries and compliance requirements.
  • News and Media: preventing articles from being used to train AI models that could compete with their journalism and cut into their revenue.
  • Shopping: preventing price scraping or inventory monitoring by competitors.
  • Sports: revenue fears similar to news and media.

For this measure, we’re looking only at cases where a particular bot is disallowed by name. It doesn’t include blanket disallow statements or setups where only certain bots are allowed. In these cases, website owners went out of their way to block specific bots. The sketch below shows that distinction.
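A simplified way to detect an explicit block is to check whether the bot is named in its own user-agent group with a disallow rule, rather than merely inheriting the wildcard rules. This is my own rough heuristic, not the study’s parser; real robots.txt files have more edge cases (Allow overrides, groups listing several user-agents, path wildcards) than it handles:

```python
def explicitly_blocks(robots_txt: str, bot: str) -> bool:
    """Return True if `bot` is named in a user-agent group with a
    non-empty Disallow rule. Rough heuristic only: ignores Allow
    overrides, multi-agent groups, and path wildcards.
    """
    in_group = False
    for line in robots_txt.splitlines():
        line = line.split("#")[0].strip()  # strip comments and whitespace
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip().lower()
            in_group = agent == bot.lower()  # named group, not "*"
        elif in_group and line.lower().startswith("disallow:"):
            if line.split(":", 1)[1].strip():  # empty Disallow allows all
                return True
    return False


sample = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
print(explicitly_blocks(sample, "GPTBot"))     # True: blocked by name
print(explicitly_blocks(sample, "ClaudeBot"))  # False: only the * group applies
```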

Again, GPTBot is the most targeted, followed closely by Common Crawl’s bot. Common Crawl data is likely used as a data source for most LLMs.

Here are the most blocked AI bots with websites specifically targeting them:

Explicit blocks of AI bots

Here’s the data for the number of websites blocking them:

Total number of sites explicitly blocking AI bots

Here’s the data:

| Bot Name | Count | Percentage | Bot Operator |
|---|---|---|---|
| GPTBot | 693,639 | 0.50% | OpenAI |
| CCBot | 682,861 | 0.49% | Common Crawl |
| Amazonbot | 469,086 | 0.34% | Amazon |
| Bytespider | 461,706 | 0.33% | ByteDance |
| Google-Extended | 415,821 | 0.30% | Google |
| ClaudeBot | 393,511 | 0.28% | Anthropic |
| anthropic-ai | 383,176 | 0.27% | Anthropic |
| FacebookBot | 361,803 | 0.26% | Meta |
| omgili | 322,502 | 0.23% | Webz.io |
| ChatGPT-User | 310,430 | 0.22% | OpenAI |
| cohere-ai | 306,385 | 0.22% | Cohere |
| Claude-Web | 276,411 | 0.20% | Anthropic |
| Applebot-Extended | 258,451 | 0.18% | Apple |
| Meta-ExternalAgent | 245,176 | 0.18% | Meta |
| PerplexityBot | 214,488 | 0.15% | Perplexity |
| Diffbot | 213,828 | 0.15% | Diffbot |
| Timpibot | 174,434 | 0.12% | Timpi |
| Applebot | 163,148 | 0.12% | Apple |
| OAI-SearchBot | 110,376 | 0.08% | OpenAI |
| Webzio-Extended | 100,572 | 0.07% | Webz.io |
| Meta-ExternalFetcher | 99,993 | 0.07% | Meta |
| Kangaroo Bot | 95,056 | 0.07% | Kangaroo LLM |

Explicit blocks of AI bots over time

As you can see, far more of the most-trafficked websites are starting to block AI bots.

Explicit blocks of AI bots on the top 1 million websites by traffic

The number of AI bots more than doubled in just over a year, from 10 in August 2023 to 21 in December 2024. More new entrants into the market mean more bots, all using resources to crawl websites.

ClaudeBot had the fastest growth of any crawler in the last year.

Total blocks of AI bots on the top 1 million websites by traffic

Here’s the data:

| Bot name | Growth % | Absolute growth (percentage points) |
|---|---|---|
| claudebot | 32.67% | 0.85 |
| anthropic-ai | 25.14% | 0.67 |
| claude-web | 20.66% | 0.54 |
| bytespider | 19.57% | 0.54 |
| chatgpt-user | 15.52% | 0.47 |
| perplexitybot | 15.37% | 0.40 |
| gptbot | 13.38% | 0.53 |
| cohere-ai | 12.45% | 0.32 |
| facebookbot | 11.71% | 0.32 |
| ccbot | 11.41% | 0.44 |
| amazonbot | 10.22% | 0.30 |
| google-extended | 10.07% | 0.30 |
| diffbot | 8.98% | 0.23 |
| omgili | 8.96% | 0.25 |
| applebot-extended | 7.11% | 0.18 |
| meta-externalagent | 5.90% | 0.15 |
| oai-searchbot | 2.17% | 0.06 |
| timpibot | 0.01% | 0.00 |
| webzio-extended | -1.69% | -0.04 |
| applebot | -3.32% | -0.09 |
| meta-externalfetcher | -4.32% | -0.11 |
| Kangaroo bot | -5.89% | -0.15 |
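To read the two columns together: growth % is the relative change in block rate, while absolute growth is the change in percentage points. A quick back-of-the-envelope check using ClaudeBot’s figures (the starting rate here is implied by the two published numbers, not pulled separately from the dataset):

```python
# ClaudeBot's figures from the table above. The starting block rate is
# implied by the two published numbers, not taken from the dataset.
absolute_growth_pp = 0.85  # change in block rate, percentage points
relative_growth = 0.3267   # 32.67% relative growth

start_rate = absolute_growth_pp / relative_growth  # ~2.60% of sites
end_rate = start_rate + absolute_growth_pp         # ~3.45% of sites
print(f"implied block rate: {start_rate:.2f}% -> {end_rate:.2f}%")
```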

Final thoughts

It will be interesting to see how the block rate evolves as more and more of these crawlers start to use an ever-increasing amount of resources. Will they be able to fulfill that social contract with website owners and send them more traffic, or will they choose to keep that traffic for themselves?

I think if they go for the walled-garden approach, more sites will end up blocking the bots, and these systems will have to pay websites for access to their data. Alternatively, the bots may end up breaking web standards and ignoring robots.txt. There have already been a few reports of AI bots ignoring robots.txt blocks, which sets a dangerous precedent.

What’s your take? Are you blocking them on your site, or do you see value in allowing them access? Let me know on X or LinkedIn.