Crawl Budget

What is a crawl budget?

Crawl budget is simply the number of pages Google will crawl on your website within a given timeframe, often on any given day. Google might crawl 30, 600, or 20,000 pages on your website each day. This number varies; the size of your website, the number of links pointing to it, and its health (how many errors Google encounters) usually help Google determine how many pages to crawl. Some of these factors are things you can influence, and we'll get to that in a bit.

How does a crawler work?


A crawler (spider or bot) gets a list of URLs to crawl on your website and works through that list thoroughly. It also rechecks your robots.txt file occasionally to make sure it is still permitted to crawl each URL, then crawls the URLs one by one. Once a URL is crawled and its content parsed, the spider adds any new URLs it finds on that page back onto its to-do list. Quite a few events can prompt Google to crawl a URL: it might have found new links pointing at the content, people may have tweeted it, or it might have been updated in the XML sitemap. Basically, when Google determines a URL has to be crawled, it adds it to the to-do list.
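To make that loop concrete, here is a minimal sketch of such a crawler in Python, using only the standard library. It is not Googlebot's actual implementation; the start URL and the naive href regex are placeholder assumptions, but the to-do list, the robots.txt check, and the "parse, then enqueue new URLs" cycle mirror the description above.

```python
# Minimal crawl-loop sketch: fetch a URL, parse it, add new same-site URLs
# to the to-do list. "https://example.com/" is a placeholder start URL.
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

START = "https://example.com/"

robots = RobotFileParser()
robots.set_url(urljoin(START, "/robots.txt"))
robots.read()  # check permissions before crawling

todo, seen = deque([START]), {START}
while todo:
    url = todo.popleft()
    if not robots.can_fetch("*", url):          # still allowed to crawl this URL?
        continue
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    # Parse the content and push newly discovered same-site URLs onto the to-do list.
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if urlparse(link).netloc == urlparse(START).netloc and link not in seen:
            seen.add(link)
            todo.append(link)
```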

Crawl rate limit

Google doesn't want to overload your site by crawling it too heavily. For that reason, it uses a crawl rate limit to prevent crawlers from degrading the experience of users visiting the website. In other words, the crawl rate limit caps the maximum fetching rate for a given site. Google describes it, simply put, as the number of simultaneous parallel connections Googlebot may use to crawl the site, plus the time it waits between fetches. The crawl rate can go up and down based on a couple of factors:

  • Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down, and Googlebot crawls less.
  • Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.

If no one is using or visiting your website, it will respond to crawlers quickly, so crawlers will most likely crawl it more. The sketch below shows how such an adaptive limit might behave.
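This is not Google's published algorithm, just a rough sketch of the back-off behaviour described above: crawl faster while the server answers quickly, and slow down on sluggish responses or server errors. The `urls` parameter, the thresholds, and the use of the `requests` library are assumptions for illustration.

```python
# Adaptive politeness sketch: the delay between fetches grows when the site
# looks unhealthy and shrinks when it responds quickly.
import time
import requests

def crawl_with_rate_limit(urls, delay=1.0, min_delay=0.2, max_delay=10.0):
    """Fetch each URL, adjusting the wait between fetches as we go."""
    for url in urls:
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=10)
            slow = (time.monotonic() - start) > 2.0   # arbitrary "slow" threshold
            error = response.status_code >= 500       # server errors count against health
        except requests.RequestException:
            slow, error = True, True
        if slow or error:
            delay = min(delay * 2, max_delay)         # back off: crawl less
        else:
            delay = max(delay * 0.8, min_delay)       # healthy site: crawl a bit more
        time.sleep(delay)

# Example: crawl_with_rate_limit(["https://example.com/", "https://example.com/about/"])
```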

Crawl demand

Crawlers also consider the demand any specific URL is getting from the index itself to determine how actively they should crawl it. The two factors that play a weighty role in determining crawl demand are:

  • Popularity: Popular URLs tend to be crawled and indexed more frequently than the ones that aren't.
  • Staleness: Google's systems attempt to prevent URLs from becoming stale in the index, which favors up-to-date content.

Basically, Google uses crawl demand and the crawl rate limit to determine the crawl budget (the number of URLs Googlebot can and wants to crawl). Ideally, you want all of your pages to get crawled, and you want Googlebot to want to crawl your site.

How the index budget differs

The index budget is different from the crawl budget: it determines how many URLs can be indexed. The difference becomes clear when several pages on a site return a 404 error code. Each requested page counts against the crawl budget, but because those pages cannot be indexed due to the error, part of the index budget goes unused.

Why is the crawl budget essential for SEO?

It is simple! Google has to index a page before it can rank it for anything. That means if the number of pages on your site surpasses your website's crawl budget, you are going to have pages that aren't indexed. Fortunately, most websites don't need to worry about crawl budget, as Google is good at tracking down and indexing pages. However, there are a few instances where you need to pay attention to it:

  • You just added several pages: If you recently added a new section to your website with hundreds of pages, you want to ensure that you have the crawl budget to get them all indexed fast.
  • You run a big website: If you have a site with 10k+ pages, like an eCommerce website, Google can struggle to find them all.
  • Lots of redirects: Multiple redirects and redirect chains consume your crawl budget.

How do you improve your crawl budget?

There are a few things you can do to improve the number of pages Google can crawl on your site.

Allow crawling of your vital pages in robots.txt

You can do this by hand or with a website auditor tool. We prefer to use a tool whenever possible, as it simplifies the entire process. By simply loading your robots.txt into the tool of your choice, you should be able to allow or block crawling of any page of your domain in seconds. Once done, upload the edited file, and voila! If you want to double-check the result yourself, see the sketch after the tip below.

Tip! If you have a large site, it is much easier to use a tool than doing it by hand.
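As a sanity check after editing, a minimal sketch like the following, using Python's standard-library robots.txt parser, can confirm that your vital pages are still crawlable. The domain and page URLs are placeholders.

```python
# Verify that important pages are still allowed after a robots.txt edit.
from urllib.robotparser import RobotFileParser

VITAL_PAGES = [
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/blog/",
]

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

for url in VITAL_PAGES:
    allowed = robots.can_fetch("Googlebot", url)
    print(f"{url}: {'crawlable' if allowed else 'BLOCKED'}")
```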

Website maintenance: reduce errors

Strive to make sure all pages that are crawled return a 200 ("OK") or 301 ("Go here instead") status code. All other status codes are not OK, and you should fix them as soon as possible. To do this, you have to look at your website's server logs. Once you have your server logs, try to find the common errors and fix them.

The easiest way of doing that is by selecting all URLs that failed to return 200 or 301 and ordering them by how often they were accessed, so the most requested problems get fixed first; a sketch of that triage follows below. Fixing an error might mean redirecting a URL elsewhere or fixing the code behind it, and if you know exactly what caused the error, try to fix the source too. While you can use Google Analytics and other analytics packages, they will only track pages that served a 200, which is why a server log is more reliable.
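Here is a rough sketch of that triage in Python. It assumes an access log in the common/combined log format at the placeholder path `access.log`; the regex is deliberately simple and may need adjusting to your server's actual log format.

```python
# Collect URLs that did not return 200 or 301 and rank them by request count.
import re
from collections import Counter

LOG_LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

errors = Counter()
with open("access.log", encoding="utf-8") as log:
    for line in log:
        match = LOG_LINE.search(line)
        if match and match.group("status") not in ("200", "301"):
            errors[(match.group("path"), match.group("status"))] += 1

# Most frequently requested failing URLs first.
for (path, status), hits in errors.most_common(20):
    print(f"{hits:6d}  {status}  {path}")
```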

Block parts of your site

Yes, if there are certain sections of your website that don't need to be in Google, block them using robots.txt. Only do this when you are certain of what you are doing. One of the common issues we see on large eCommerce websites is having multiple ways to filter products; every filter can generate new URLs for Google. If you find yourself in such a situation, you really want to ensure you are allowing Googlebot only one or two of those filters and not all of them; a sketch of such rules follows below.
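As an illustration, here is a hedged sketch with hypothetical, path-based filter URLs (your site's filter URLs will look different, and query-parameter filters need wildcard rules that Python's standard parser does not understand). It blocks the filter section while leaving regular category pages crawlable, and verifies the rules with the standard-library parser.

```python
# Hypothetical robots.txt rules that keep endless filter combinations out of
# the crawl budget, checked with urllib.robotparser.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /products/filter/
Disallow: /search/
"""

robots = RobotFileParser()
robots.parse(RULES.splitlines())

# Regular category page stays crawlable ...
print(robots.can_fetch("Googlebot", "https://example.com/products/shoes/"))       # True
# ... while filtered views are blocked.
print(robots.can_fetch("Googlebot", "https://example.com/products/filter/red/"))  # False
```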

Avoid "orphan pages"

Orphan pages are website pages that have no internal or external links pointing to them. Google has a tough time finding orphan pages. So if you want to get the most out of your crawl budget, make sure that there's at least one internal or external link pointing to every page on your site.
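One practical way to spot them, sketched below, is to compare the URLs listed in your XML sitemap against the URLs an internal-link crawl actually reaches. The sitemap URL is a placeholder, and `crawled_urls` stands in for the set of URLs a crawl of your site would produce (for example, the `seen` set from the crawler sketch earlier).

```python
# Flag sitemap URLs that no internal-link crawl ever reached.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP) as response:
    tree = ET.parse(response)

sitemap_urls = {loc.text.strip() for loc in tree.findall(".//sm:loc", NS)}

crawled_urls = set()  # fill this from an internal-link crawl of the site

for url in sorted(sitemap_urls - crawled_urls):
    print("possible orphan:", url)
```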

Improve site speed

Improving your website's page speed is crucial, as it can lead to Googlebot crawling more of your website's URLs. Slow-loading pages eat up valuable Googlebot time. Even Google emphasizes this: "Making a site faster improves the users' experience while also increasing the crawl rate."
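A quick way to keep an eye on this, sketched below, is to spot-check response times on a sample of URLs. The URLs, the one-second threshold, and the use of the `requests` library are assumptions; a thorough audit would rely on your server logs or the Crawl Stats report instead.

```python
# Spot-check how quickly a handful of pages respond.
import requests

SAMPLE = [
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/blog/",
]

for url in SAMPLE:
    response = requests.get(url, timeout=10)
    seconds = response.elapsed.total_seconds()  # time until the response headers arrived
    flag = "SLOW" if seconds > 1.0 else "ok"
    print(f"{seconds:5.2f}s  {flag}  {url}")
```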

Reduce redirect chains

The moment you 301 redirect a URL, something odd occurs: Google sees the new URL and adds it to its to-do list. It doesn't always follow the redirect instantly; it simply notes the target and moves on. Now, when you chain redirects, for example when you redirect non-www to www and then http to https, you have two redirects everywhere, so everything takes longer to crawl.
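Here is a rough sketch of hunting down such chains with the `requests` library; the URLs are placeholders. `response.history` holds every redirect hop that was followed, so a length greater than one means a chain.

```python
# Detect redirect chains by counting the hops requests had to follow.
import requests

URLS = [
    "http://example.com/old-page",
    "http://example.com/non-www-page",
]

for url in URLS:
    response = requests.get(url, allow_redirects=True, timeout=10)
    hops = [r.url for r in response.history] + [response.url]
    if len(response.history) > 1:
        print(f"redirect chain ({len(response.history)} hops): " + " -> ".join(hops))
    elif response.history:
        print(f"single redirect: {hops[0]} -> {hops[-1]}")
```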

Get more links

Getting more links is not just about being awesome; it also ensures others know that you are awesome. In other words, it is a matter of good PR and good engagement on social media. Note that link building is a slow way to increase your crawl budget. However, if you intend to build a large site, then link building needs to be part of your process.

Limit duplicate content

Duplicate content can hurt your crawl budget. This is because Google doesn't like to waste resources by indexing multiple pages with the same content. For that reason, ensure your website contains unique, quality content. You can even hire someone to write quality and unique content if you find it challenging to do it yourself.
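If you want a quick, rough check for exact duplicates, the sketch below hashes each page's raw HTML; the URLs are placeholders, and near-duplicates (same text, different boilerplate) would need a fuzzier comparison than this.

```python
# Group URLs whose HTML is byte-for-byte identical.
import hashlib
import urllib.request
from collections import defaultdict

URLS = [
    "https://example.com/page-a",
    "https://example.com/page-b",
    "https://example.com/page-c",
]

by_hash = defaultdict(list)
for url in URLS:
    with urllib.request.urlopen(url) as response:
        digest = hashlib.sha256(response.read()).hexdigest()
    by_hash[digest].append(url)

for digest, urls in by_hash.items():
    if len(urls) > 1:
        print("identical content:", ", ".join(urls))
```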

Use HTML whenever possible

When it comes to Google, it has been said that its crawler got a bit better at crawling JavaScript specifically, and it has also improved at crawling and indexing XML and Flash. Other search engines, on the other hand, aren't quite there yet. Because of that, we recommend sticking to HTML whenever possible.

Update your sitemap

It pays to take care of your XML sitemap, as doing so makes it easier for bots to understand where the internal links lead. Use only canonical URLs in your sitemap, and ensure that it corresponds to the latest uploaded version of your robots.txt.
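The sketch below is one way to cross-check a sitemap against robots.txt: it lists every URL in the sitemap and flags any that the current robots.txt would block. The sitemap and robots.txt URLs are placeholders, and confirming that each URL is the canonical version still has to happen by inspection or with an auditing tool.

```python
# Flag sitemap URLs that the current robots.txt would block.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

with urllib.request.urlopen(SITEMAP) as response:
    tree = ET.parse(response)

for loc in tree.findall(".//sm:loc", NS):
    url = loc.text.strip()
    blocked = not robots.can_fetch("Googlebot", url)
    print(("BLOCKED  " if blocked else "ok       ") + url)
```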

Hreflang tags are vital

Crawlers use hreflang tags when analyzing localized pages, so you should tell Google about the localized versions of your pages as plainly as possible. Here is what to do: add <link rel="alternate" hreflang="lang_code" href="url_of_page" /> to your page's header, where "lang_code" is a code for a supported language. You can also indicate the localized versions of a page in your XML sitemap, where each URL is given in a <loc> element.
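As a small illustration, the sketch below prints the set of header tags for a page that exists in three languages; the domain, paths, and language codes are made-up examples.

```python
# Generate hreflang link tags for a page's <head>.
LOCALIZED_VERSIONS = {
    "en": "https://example.com/page/",
    "de": "https://example.com/de/page/",
    "fr": "https://example.com/fr/page/",
}

for lang_code, url in LOCALIZED_VERSIONS.items():
    print(f'<link rel="alternate" hreflang="{lang_code}" href="{url}" />')

# Every localized version should carry the full set of tags (including one
# pointing to itself), so the same block goes into the <head> of each variant.
```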

Why do search engines assign crawl budget to sites?

Because search engines don't have limitless resources, and they split their attention across millions of sites, they need a way to prioritize their crawling effort. Assigning a crawl budget to each site allows them to do this.

How do you check your crawl budget?

If your website is verified in Google Search Console, you can get some insight into your website's crawl budget for Google by doing the following:

  • Log in to Google Search Console and select your site.

  • Go to Settings > Crawl > Crawl Stats. You should see the number of pages that Google crawls per day.

Conclusion

Crawl budget was, is, and will likely remain crucial for your website, so it makes sense to leverage it fully. What have you done to ensure your website gets the most out of its crawl budget? It is your turn now to evaluate your site and make the necessary adjustments. All the best!