What is “Indexed, though blocked by robots.txt”?
If you have received an email from Google Search Console (GSC) that says “Indexed, though blocked by robots.txt,” here is a little help with what is happening and how to fix it.
Here is a screenshot of the notification:
This message means that Google indexed your URLs even though a rule in your robots.txt file tells it not to crawl them.
Because Google cannot read the content of these pages, they may appear in results with poor snippets, and this can hurt their ability to rank in SERPs (Search Engine Results Pages). In this piece, you will learn how to fix this issue and whether it is OK to simply ignore it. Below is what Google Search Console’s Index Coverage report is likely to show, along with the number of affected URLs. The snippets shown in search results may be suboptimal, similar to:
What is a robots.txt file?
The robots.txt file sits in your website’s root directory and is one of the first files bots read when they crawl your website. It gives bots, such as Googlebot, instructions about which files they should and should not crawl.
Why am I getting this notification?
“Indexed, though blocked by robots.txt” may display for several reasons.
Below are the common ones:
It is not a problem if you or a developer deliberately added directives to the robots.txt file to block duplicate or unnecessary pages, such as category pages.
Wrong URL format
This issue could also arise from a URL that is not really a page. For example, you need to know what the URL below resolves to.
https://www.siteguru.co/?s=seo+academy
If it is a page that contains vital information that you really want your users to see, you need to change the URL. This is possible on Content Management Systems (CMS) like WordPress, where you can modify a page’s slug. If the page is not important, or if the URL is just a search query from your blog, there is no need to fix this issue. You can also delete the page.
Pages that should be indexed
There are quite a few reasons why pages that should be indexed do not get indexed. Here are the most common ones:
A rule in the robots.txt file
There may be a directive in your robots.txt file that is preventing the indexing of pages that should actually be indexed—for example, categories and tags. Remember, categories and tags are real URLs on your website.
You are pointing Googlebot to a redirect chain
Bots such as Googlebot follow all the links they come across and do their best to read the pages for indexing. However, if you set up a long, multi-step redirect chain, or if the page is simply inaccessible, Googlebot will stop looking.
Implemented the canonical link correctly
A canonical tag is placed in the HTML head and tells Googlebot which page is the preferred, canonical version when content is duplicated. Bonus! Every page should have a canonical tag. If you have a page that is translated into Spanish, for example, you would self-canonicalize the Spanish URL, and you would canonicalize the page back to your default English version.
Pages that should not be indexed
Again, there are quite a few reasons why pages that should not be indexed get indexed. But why?
Noindex means a web page shouldn’t be indexed. A page with this directive will be crawled but won’t be indexed. In your robots.txt file, ensure that:
- There is no more than one ‘user-agent’ block.
- The ‘disallow’ line doesn’t instantly follow the ‘user-agent’ line.
- Invisible Unicode characters are removed. You can do that by opening your robots.txt file in a text editor that can convert encodings and saving it again.
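If you would rather script the invisible-character check than rely on a text editor, the following is a minimal sketch in Python. It assumes that "invisible" means Unicode format characters (category Cf), such as the byte order mark and zero-width spaces. The function names are made up for this illustration.

```python
import unicodedata


def find_invisible(text: str) -> list[tuple[int, str]]:
    """List (position, code point) for every Unicode format character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(text: str) -> str:
    """Return the text without format characters; newlines and spaces stay."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# A robots.txt body with a byte order mark and a zero-width space hidden in it.
dirty = "\ufeffUser-agent: *\nDisal\u200blow: /tags/\n"

print(find_invisible(dirty))
print(strip_invisible(dirty))
```

Running find_invisible first shows where the stray characters are before you overwrite the file with the cleaned text.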
Pages are linked to from other websites
Pages linked to from other sites can get indexed even if they are disallowed in robots.txt. When this happens, only the anchor text and URL display in search engine results.
Here is a screenshot of how these URLs appear on the SERP (image source: Webmasters StackExchange):
This issue (robots.txt blocking) can be resolved by:
- Password protecting the file(s) on your server.
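- Removing the pages from robots.txt, or adding the following meta tag to block them: <meta name="robots" content="noindex">. Google has to crawl a page to see this tag, so the page must not also be disallowed in robots.txt.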
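To confirm that a page really carries the noindex meta tag, you can parse its HTML. The sketch below uses Python's standard html.parser module. The class and function names are hypothetical, and it only looks at the generic "robots" meta name, not crawler-specific names or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```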
If you have created a new website or new content and included a ‘noindex’ rule in robots.txt to prevent indexing, or if you have recently signed up for GSC, there are ways to fix the blocked by robots.txt problem:
- Give Google time to drop the old URLs from its index. Google normally drops URLs if they keep returning 404 errors. It is not recommended to use plugins to redirect your 404s, as they can cause problems that may result in GSC sending you the ‘blocked by robots.txt’ notification.
- 301 redirect the old URLs to the current ones
Check to see if you have a robots.txt file
It is also possible for GSC to send you these notifications even if you do not have a robots.txt file. A CMS like WordPress might have already created a robots.txt file, and plugins may also create robots.txt files. Overwriting these virtual robots.txt files with your own robots.txt file might cause complications in GSC.
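Because a CMS or plugin can serve a virtual robots.txt, the most reliable check is to request the file your site actually serves. Here is a minimal sketch using Python's standard library. The function names and the example.com URL are placeholders, and fetch_robots_txt needs network access.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(page_url: str) -> Optional[str]:
    """Return the served robots.txt text, or None if the server answers 404."""
    try:
        with urlopen(robots_txt_url(page_url), timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


print(robots_txt_url("https://www.example.com/blog/post?page=2"))
```

If fetch_robots_txt returns text even though you never uploaded a file, the robots.txt is being generated by your CMS or a plugin.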
How do you fix this issue?
The directives in your robots.txt file are how search engine bots identify which URLs they may crawl and which to ignore.
Here is the directive that allows all bots to crawl your website:
User-agent: *
Disallow:
It means ‘disallow nothing’.
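An empty Disallow value and Disallow: / look almost identical but have opposite effects, so it is worth testing a rule set before you publish it. Here is a minimal sketch using Python's standard urllib.robotparser module. The helper name can_crawl and the example.com URL are placeholders, and Python's parser only approximates how Googlebot reads the file.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is disallowed
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # a single slash: everything is disallowed

url = "https://example.com/blog/post"  # hypothetical page

print(can_crawl(ALLOW_ALL, url))  # True
print(can_crawl(BLOCK_ALL, url))  # False
```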
Here are the steps to identify what pages you want to disallow:
1. You can either review all the pages manually or export the list of URLs from any SEO audit tool that lists all the pages of your site. In our case, we used SiteGuru Audit:
2. Identify the URLs that you do not want indexed and shown on the SERP, and add them to your robots.txt file:
3. Once you have disallowed those pages in robots.txt, rerun your SiteGuru audit. You should see "no-index" next to the pages:
4. If you are still receiving the notification, check which pages link to the disallowed pages and remove those links. Google Search Console does not show you every page that links to the no-indexed URL, but you can use an SEO tool like SiteGuru to identify which URLs link to the page that has been no-indexed:
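Before rerunning an audit, you can check an exported URL list against your draft rules yourself. The sketch below is one way to do that with Python's standard urllib.robotparser. The rules, the URLs, and the function name split_by_robots are invented for illustration and do not come from a real audit export.

```python
from urllib.robotparser import RobotFileParser


def split_by_robots(robots_txt, urls, agent="Googlebot"):
    """Split urls into (allowed, blocked) lists for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    blocked = [u for u in urls if not parser.can_fetch(agent, u)]
    return allowed, blocked


# Hypothetical draft rules and exported URLs.
draft_rules = "User-agent: *\nDisallow: /tags/\nDisallow: /search/\n"
exported_urls = [
    "https://example.com/",
    "https://example.com/blog/seo-basics",
    "https://example.com/tags/seo",
    "https://example.com/search/results",
]

allowed, blocked = split_by_robots(draft_rules, exported_urls)
print("Blocked:", blocked)
print("Allowed:", allowed)
```

Any URL you expected to keep crawlable that lands in the blocked list points to a rule that is too broad.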
What can robots.txt disallow?
- Disallow crawling of the entire website. Keep in mind that URLs from the website may still be indexed in some situations, even if they haven’t been crawled. Note that the wildcard user agent (*) doesn’t match the various AdsBot crawlers, which must be named explicitly.
- Disallow crawling of a directory and its contents by following the directory name with a forward slash. Remember that you shouldn’t use robots.txt to block access to private content - use proper authentication instead. This is because anyone can view the robots.txt file, and URLs disallowed by it might still be indexed without being crawled.
- Disallow crawling of the entire site but still show AdSense ads on those pages by disallowing all web crawlers other than Mediapartners-Google. This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site.
For example, the rule Disallow: /tags/ would disallow any pages following the path /tags/.
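The three patterns in the list above can be written out as rule sets and checked with Python's standard urllib.robotparser. This is a sketch under stated assumptions: the rule sets follow the common forms for these patterns, the example.com URLs are placeholders, and Python's parser is not guaranteed to match Googlebot in every edge case.

```python
from urllib.robotparser import RobotFileParser


def load(rules: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that is ready for can_fetch()."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# Pattern 1: block the entire site for every crawler.
ENTIRE_SITE = """\
User-agent: *
Disallow: /
"""

# Pattern 2: block one directory and everything under it.
DIRECTORY = """\
User-agent: *
Disallow: /tags/
"""

# Pattern 3: block everyone except the AdSense crawler.
ADSENSE_ONLY = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
"""

tag_page = "https://example.com/tags/seo"    # hypothetical
blog_page = "https://example.com/blog/post"  # hypothetical

print(load(ENTIRE_SITE).can_fetch("Googlebot", blog_page))
print(load(DIRECTORY).can_fetch("Googlebot", tag_page))
print(load(DIRECTORY).can_fetch("Googlebot", blog_page))
print(load(ADSENSE_ONLY).can_fetch("Googlebot", blog_page))
print(load(ADSENSE_ONLY).can_fetch("Mediapartners-Google", blog_page))
```

Note that the trailing slash in /tags/ matters: the rule covers everything under the directory, not a URL that ends in /tags without the slash.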