If you received a Google Search Console notification or noticed that some of your pages are “Indexed, though blocked by robots.txt,” I’m here to show you how to solve this common indexing error, plus what to do when pages that shouldn’t be indexed get indexed.
Let’s take a look!
When Google bots are done crawling your website, they’ll index it next. Usually, that’s the goal: you want your pages to rank for the right keywords on Google SERPs.
However, there are some pages you don’t want Google to index, for example:
- Your website backend
- Staging environments
- Private pages
- Thin or duplicate content pages
If you’ve received an email from Google Search Console (GSC) that says “Indexed, though blocked by robots.txt,” here’s what’s happening and how to fix it.
To make sure Google doesn’t crawl these pages, you use the robots.txt file. It contains instructions for search engines, including the pages you’d like them to skip.
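A minimal robots.txt covering pages like the ones listed above might look as follows (the paths are illustrative, not standard):

```
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /private/
```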
If you haven’t received a notification but would still like to check, you can use Google Search Console -> Indexing -> Pages. You’ll see a list of all the reasons your URLs may have issues being indexed:
When you click on a specific page indexing issue and the URL, you’ll get the option to “Inspect URL.” From there, you’ll be able to access more information and the report:
If you use SiteGuru for weekly SEO audits and to-do lists, you can use the indexation report. You’ll see when Google bots last crawled your pages, and if there are any indexing issues you should fix.
It’s not a problem if you or a developer added directives to the robots.txt file to block pages on purpose, but check to ensure:
- You’re not blocking pages that should be ranking for a keyword.
- You haven’t accidentally set up a general rule that affects pages that should be indexed.
If you intentionally no-indexed the page, then you’re good! Feel free to skip this article and brew yourself a cup of coffee.
If you haven’t intentionally no-indexed the page, it’s time to troubleshoot.
There may be a directive in your robots.txt file preventing the indexing of pages that should actually be indexed.
For example, you may have blocked certain pages in your help center from being indexed, but you may have set up a rule that blocks all of them - including the ones that could rank for a long-tail keyword.
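To illustrate the difference between a blanket rule and a targeted one (the directory names here are hypothetical):

```
# Too broad – hides every help-center article from search engines:
User-agent: *
Disallow: /help/

# Narrower – blocks only the internal pages:
User-agent: *
Disallow: /help/internal/
```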
Check the directives in your robots.txt file and ensure that:
- There is no more than one block per ‘user-agent’.
- Every ‘disallow’ line belongs to a ‘user-agent’ block rather than standing on its own.
- There are no invisible Unicode characters. (Paste your robots.txt file into a plain-text editor to reveal stray encodings.)
You can also use our free no-index checker to verify.
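If you’d rather script the check, Python’s standard library can evaluate robots.txt rules against a URL. A small sketch — the rules and URLs below are made up; in practice you’d call `set_url()` and `read()` to fetch your live robots.txt instead of `parse()`:

```python
# Check whether a URL is blocked by robots.txt rules,
# using Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Hypothetical robots.txt contents, passed in as a list of lines:
parser.parse([
    "User-agent: *",
    "Disallow: /staging/",
])

# can_fetch() returns True if the given user agent may crawl the URL.
print(parser.can_fetch("Googlebot", "https://example.com/staging/page"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))     # True
```

Because `Googlebot` matches the `User-agent: *` group, anything under `/staging/` is reported as blocked.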
If you want search bots to index all the pages on your website, this should be your robots.txt directive:
User-agent: *
Disallow:
The empty Disallow line means “disallow nothing,” so every page can be crawled. (Watch out: “Disallow: /” with a slash does the opposite and blocks your entire site.)
If you use a CMS like WordPress, it may automatically create your robots.txt file. SEO plugins do the same. If you also created your own, ensure you’re not duplicating or triplicating robots.txt files with different directives, confusing Google.
Bots use links to crawl and understand your website. They can use your redirects, but if you’ve set up so many redirects that you throw them for a loop, they’ll eventually give up.
For example, let’s say you run an international website with a Spanish original and an English translation. Translated pages are best connected with hreflang annotations, while a canonical tag belongs on true duplicates (for instance, the same page reachable at two URLs), pointing the duplicate back at the original version.
Make sure these tags are set up correctly, so you don’t accidentally signal Google to drop pages from its index.
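A canonical tag is a single line in the <head> of the duplicate page; a sketch with an illustrative URL:

```
<!-- On the duplicate page, pointing at the version you want indexed -->
<link rel="canonical" href="https://example.com/blog/original-post/">
```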
You could also see this issue if your URL isn’t really a page. For example, Google may have picked up a campaign UTM parameter or a variation of your page’s URL. If that’s the case, feel free to disregard the notification.
However, if it’s a page that contains information you want searchers to see, change the URL and validate the fix in Google Search Console.
Finally, when you’ve fixed the URLs, navigate to them in the Page Indexing section in Google Search Console, select the URL, and click “Validate fix.”
There are also cases where the pages you don’t want Google to pick up are indexed. In addition to checking the robots.txt rules for mistakes, check for the following culprits:
Pages linked to from other sites can get indexed even if disallowed in robots.txt. When this happens, only the anchor text and URL are displayed in search engine results.
You can fix this issue by:
- Password-protecting the file(s) on your server.
- Adding the following meta tag to the pages to block them: <meta name="robots" content="noindex">. (For Google to see this tag, the page must not be disallowed in robots.txt, so remove any robots.txt rule blocking it first.)
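For files you can’t add a meta tag to (PDFs, images), the same noindex signal can be sent as an HTTP response header instead. A hypothetical Apache snippet, assuming mod_headers is enabled and the filename is illustrative:

```
<Files "report.pdf">
  Header set X-Robots-Tag "noindex"
</Files>
```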
If you migrated your website recently and no-indexed the old URLs, it’ll take a while for Google to catch on.
You can fix this issue by:
- Implementing 301 redirects from old to new URLs (preferable for conserving link equity).
- Giving Google time to drop the old URLs from its index. (Typically, Google drops URLs that keep returning 404 errors.) Avoid plugins that blanket-redirect all your 404s.
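On an Apache server, a 301 redirect can be a one-liner in your .htaccess file (the paths here are illustrative):

```
Redirect 301 /old-page https://example.com/new-page
```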
Make a list of all your website URLs. You can do it manually or (if your website is bigger or you want to be thorough) use SiteGuru’s crawler.
Once you’ve identified the URLs that you don’t want Google to index, add them to your robots.txt file:
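For example, with hypothetical paths:

```
User-agent: *
Disallow: /old-campaign/
Disallow: /thank-you/
```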
Check which pages might have linked to the disallowed pages and remove the link.
Google Search Console does not provide this information, but you can use SiteGuru to see the linking URLs.
Finally, run a new website audit with SiteGuru to ensure the blocked pages can’t be indexed and the rest still can. You should see a “no-index” tag next to the page.
It’s normal to see different status codes in your Google Search Console, but know when to act.
When it comes to the “Indexed, though blocked by robots.txt” status, make sure you keep your robots.txt file updated with the proper exceptions.
Then, monitor the changes manually or through SiteGuru’s automated weekly audits. It’s the easiest way to focus on actionable SEO and swoop into the technicalities only when something requires your attention.
1. Can I disallow crawling for my entire website?
Yes, you can. However, URLs may still be indexed in some situations, even if they haven’t been crawled.
Note that a blanket “User-agent: *” rule doesn’t match Google’s various AdsBot crawlers, which must be named explicitly, so you can block your website for search engines but still show ads.
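Blocking the entire site takes just two lines:

```
User-agent: *
Disallow: /
```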
2. How do I disallow directory crawling?
Disallow the crawling of a directory and its contents by following the directory name with a forward slash:
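For instance, a directive blocking everything under the /tags/ directory:

```
User-agent: *
Disallow: /tags/
```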
The example above would disallow any pages under the path /tags/. For example, if /tags/ is my category page, this directive would also block all the coffee product pages underneath it.
Please remember that it’s better to use proper authentication to block access to private content instead of using robots.txt. Anyone can view the robots.txt file, so URLs might still be indexed without being crawled.
3. How do I edit my Shopify and eCommerce robots.txt files?
Even though you previously couldn’t edit your Shopify robots.txt file, you now can.
You’ll go to Online Store -> Themes -> Actions -> Edit code -> Add a new template -> select “robots” -> select “Create template.”
There, you’ll be able to make your own exceptions and rules.