What Is Googlebot & How Does It Work?

Patrick Stox
Googlebot is the web crawler used by Google to gather the information needed to build a searchable index of the web. Googlebot has mobile and desktop crawlers, as well as specialized crawlers for news, images, and videos.

There are more crawlers Google uses for specific tasks, and each crawler will identify itself with a different string of text called a “user agent.” Googlebot is evergreen, meaning it sees websites as users would in the latest Chrome browser.

Googlebot runs on thousands of machines. These machines determine how fast to crawl and what to crawl on websites, but they will slow down their crawling so as not to overwhelm websites.

Googlebot is the most active crawler on the web according to Cloudflare Radar, with Ahrefsbot being the 2nd most active.

Googlebot is the most active crawler on the web

If we look at that by the percentage of HTTP requests, Googlebot accounts for 23.7% of the overall requests from good bots. Ahrefsbot accounts for 14.27%, and for comparison, Bingbot is at 4.57% and Semrushbot at 0.6%.

percent of http requests for Googlebot from Cloudflare Radar

Let’s look at Google’s process for building an index of the web.

Google has shared a few versions of its pipeline in the past. The one below is the most recent.

Flowchart showing how Google builds its search index

Google processes the rendered page again and looks for any changes to the page or new links. The content of the mobile version of the rendered pages is what is stored and searchable in Google’s index. Any new links found go back into the bucket of URLs for Googlebot to crawl.
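
To make that loop a bit more concrete, here is a heavily simplified Python sketch of a crawl, render, and link-extraction cycle. The function names and the in-memory queue are purely illustrative assumptions, not how Google actually implements its pipeline, which is distributed across many machines.

```python
from collections import deque

def crawl_loop(seed_urls, fetch, render, extract_links, index):
    """Toy illustration of a crawl -> render -> index loop.

    fetch, render, extract_links, and index are placeholder callables;
    a real pipeline spreads this work across many machines.
    """
    frontier = deque(seed_urls)      # the "bucket" of URLs waiting to be crawled
    seen = set(frontier)

    while frontier:
        url = frontier.popleft()
        html = fetch(url)            # download the raw page
        rendered = render(html)      # execute JavaScript, like an evergreen browser
        index(url, rendered)         # store the rendered (mobile) content
        for link in extract_links(rendered):
            if link not in seen:     # new links go back into the bucket
                seen.add(link)
                frontier.append(link)
```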

We have more details on this process in our article on how search engines work. Or, if you’re interested in the rendering aspects, check out our article on JavaScript SEO.

Google gives you a few ways to control what gets crawled and indexed.

Ways to control crawling

  • Robots.txt – This file on your website allows you to control what is crawled (see the example after this list).
  • Nofollow – Nofollow is a link attribute or meta robots tag that suggests a link should not be followed. It is only considered a hint, so it may be ignored.
  • Change your crawl rate – This tool within Google Search Console allows you to slow down Google’s crawling. Some people believe you can use the message system there to ask Google to increase your crawl rate, but Google has said that doesn’t work.
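
As a quick illustration of how robots.txt rules are interpreted, here is a small sketch using the robots.txt parser in Python’s standard library. The example.com URLs and the Disallow rule are invented for the example; Googlebot’s own parser supports more directives and nuances than this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com, blocking one directory for Googlebot
robots_txt = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks the rules before fetching a URL
print(parser.can_fetch("Googlebot", "https://example.com/private/report"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))       # True
```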

Ways to control indexing

  • Delete your content – If you delete a page, then there’s nothing to index. The downside to this is no one else can access it either.
  • Restrict access to the content – Google doesn’t log in to websites, so any kind of password protection or authentication will prevent it from seeing the content.
  • Noindex – A noindex in the meta robots tag tells search engines not to index your page (see the sketch after this list).
  • URL removal tool – The name of this tool from Google is slightly misleading because it only temporarily hides the content. Google will still see and crawl this content, but the pages won’t appear in search results.
  • Robots.txt (Images only) – Blocking Googlebot Image from crawling means that your images will not be indexed.
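
For the noindex item above, one way to check whether a page is sending the signal is to look at both the X-Robots-Tag response header and the meta robots tag in the HTML. Below is a rough sketch using only Python’s standard library; the URL is a placeholder, and the regex is much cruder than how Google actually parses the tag.

```python
import re
import urllib.request

def noindex_signals(url: str) -> dict:
    """Return which noindex signals (if any) a page appears to be sending."""
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        html = resp.read(200_000).decode("utf-8", errors="replace")

    # Very rough check for <meta name="robots" content="... noindex ...">
    meta_noindex = bool(
        re.search(r'<meta[^>]+name=["\']robots["\'][^>]+noindex', html, re.I)
    )
    return {
        "header_noindex": "noindex" in header.lower(),
        "meta_noindex": meta_noindex,
    }

# Placeholder URL for illustration
print(noindex_signals("https://example.com/"))
```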

If you’re not sure which indexing control you should use, check out our flowchart in our post on removing URLs from Google search.

If you want more details about how Googlebot determines what to crawl and the speed of crawling, check out our post on crawl budget.

Here are a few details about Googlebot that can help you with troubleshooting various issues.

Location

Googlebot crawls mostly from Mountain View, CA, USA. However, Google does have some locale-specific crawling options it may use in situations such as websites blocking crawling from the US.

Max file size

For most file types, Google will grab the first 15 MB of each file. However, for robots.txt files the max file size is 500 kibibytes (KiB).
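
If you want to mimic that cutoff when auditing large pages or files, you can simply stop reading a response after the first 15 MB. A minimal sketch with Python’s standard library and a placeholder URL:

```python
import urllib.request

FIFTEEN_MB = 15 * 1024 * 1024  # Googlebot's documented limit for most file types

def fetch_first_15mb(url: str) -> bytes:
    """Read at most the first 15 MB of a response, mirroring Googlebot's cutoff."""
    with urllib.request.urlopen(url) as resp:
        return resp.read(FIFTEEN_MB)  # anything beyond this point would be ignored

data = fetch_first_15mb("https://example.com/")  # placeholder URL
print(len(data), "bytes fetched")
```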

Supported Transfer Protocols

Googlebot supports HTTP/1.1 and HTTP/2 and will choose whichever gives the best crawling performance for your site.

They can also crawl over FTP and FTPS, but this is rare.

Content encoding (compression)

Googlebot supports gzip, deflate, and Brotli (br).
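
Compression is negotiated through the Accept-Encoding request header. Here is a small sketch of what that negotiation looks like, using Python’s standard library and a placeholder URL; the exact headers Googlebot sends are, of course, up to Google.

```python
import urllib.request

# Advertise the same encodings Googlebot supports
req = urllib.request.Request(
    "https://example.com/",  # placeholder URL
    headers={"Accept-Encoding": "gzip, deflate, br"},
)

with urllib.request.urlopen(req) as resp:
    # The server reports which encoding it actually chose (if any)
    print("Content-Encoding:", resp.headers.get("Content-Encoding", "none"))
    # Note: urllib does not decompress automatically, so resp.read() would
    # return the still-compressed bytes.
```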

HTTP caching

Google supports caching standards such as the ETag and Last-Modified response headers, as well as the If-None-Match and If-Modified-Since request headers.
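
Here is roughly how those conditional requests work: the ETag and Last-Modified values from an earlier response are sent back as If-None-Match and If-Modified-Since, and if the page hasn’t changed, the server answers 304 Not Modified without re-sending the body. A small sketch with Python’s standard library and a placeholder URL:

```python
import urllib.error
import urllib.request

url = "https://example.com/"  # placeholder URL

# First request: note the validators the server sends back
with urllib.request.urlopen(url) as resp:
    etag = resp.headers.get("ETag")
    last_modified = resp.headers.get("Last-Modified")

# Revalidation: send the validators back as conditional request headers
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

req = urllib.request.Request(url, headers=headers)
try:
    with urllib.request.urlopen(req) as resp:
        print("Changed, status", resp.status)  # full response was re-sent
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("Not modified, the cached copy is still fresh")  # no body re-sent
    else:
        raise
```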

Many SEO tools and some malicious bots will pretend to be Googlebot. This may allow them to access websites that try to block them.

In the past, you needed to run a DNS lookup to verify Googlebot. But recently, Google made it even easier and provided a list of public IPs you can use to verify the requests are from Google. You can compare this to the data in your server logs.
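
If you want to script the classic DNS check yourself, the usual approach is a reverse DNS lookup on the requesting IP, a check that the hostname ends in googlebot.com or google.com, and a forward lookup to confirm it resolves back to the same IP. A minimal sketch; the sample IP is only an example, and in practice you’d feed in IPs from your server logs or compare them against Google’s published list instead:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Verify a crawler IP with reverse DNS plus a confirming forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup must resolve back to the same IP
        forward_ips = socket.gethostbyname_ex(hostname)[2]
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Sample Googlebot IP for illustration; in practice, pull IPs from your logs
print(is_verified_googlebot("66.249.66.1"))
```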

You also have access to a “Crawl stats” report in Google Search Console. If you go to Settings > Crawl Stats, the report contains a lot of information about how Google is crawling your website. You can see which Googlebot is crawling what files and when it accessed them.

Line graph showing crawl stats. Summary of key data is above

Final thoughts

The web is a big and messy place. Googlebot has to navigate all the different setups, along with downtimes and restrictions, to gather the data Google needs for its search engine to work.

A fun fact to wrap things up is that Googlebot is usually depicted as a robot and is aptly referred to as “Googlebot.” There’s also a spider mascot that is named “Crawley.”

Still have questions? Let me know on Twitter.