Duplicate content: what is it and how to fix it

Duplicate content means that the same content is published on different URLs. This can hurt your search rankings, because Google doesn't know which page is the right one. In this article, we tell you all about duplicate content, show some of its most common causes and tell you how you can fix it.

What is duplicate content?

Duplicate content means that the same content is published on different URLs on the internet, either on the same website or on a different site. For example, www.example.com/t-shirts and www.example.com/t-shirts?sort=price have different URLs, but their content is exactly the same.

Why does duplicate content matter?

If you're searching on Google, you don't want to see the exact same result twice. Therefore, Google will only show one of several results with similar content on the results page. That forces Google to make a choice: which version should be included in the results? Its choice may not be the one you would have liked to rank.

Google showing the similar content notification

It gets worse: other websites may link to different versions of your content. Links are an important ranking signal for Google and other search engines, so when they point to different URLs, search engines won't know which version to prioritize. Google might distribute the PageRank (the authority of a page in Google's index) among the different URLs, so each of them ends up with a lower PageRank than a single URL would have had.

You may have heard of crawl budget: the time and resources Google is willing to spend on crawling your website. Duplicate content uses up part of that budget, which makes it harder for Google to crawl and index the rest of your website. As a result, some of your other pages may not rank at all.

The result: confused search engines and lower rankings for your content. That's a shame, because duplicate content is easy to avoid.

More on how Google handles duplicate content in this video by Matt Cutts.

When does Google see content as duplicate?

Google has described duplicate content as "substantive blocks of content within or across domains that either completely match other content or are appreciably similar". Note that the content doesn't have to be exactly the same to be considered duplicate.
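To get a feeling for what "appreciably similar" could mean, here is a small Python sketch that compares two texts by the overlapping word sequences (shingles) they share. Google does not publish how it measures similarity, so the function names, the shingle size of 3 and the example texts are our own assumptions, not Google's method.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        # Texts shorter than one shingle are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Share of shingles that two texts have in common, from 0.0 to 1.0."""
    shingles_a, shingles_b = shingles(a, size), shingles(b, size)
    if not shingles_a and not shingles_b:
        return 1.0  # two empty texts are identical
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)


# Hypothetical product descriptions that differ in a single word.
page_a = "Our apple trees are grown locally and shipped within two days"
page_b = "Our apple trees are grown locally and shipped within three days"
print(round(jaccard_similarity(page_a, page_b), 2))
```

A score of 1.0 means the two texts are identical, a score of 0.0 means they share no word sequences at all. Where exactly "appreciably similar" starts is up to Google.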

What is a duplicate content penalty?

The good news: there's no such thing as a duplicate content penalty. Google said so in a blog post from 2008 and it's still true, as Google's John Mueller said in a 2014 Hangout session.

Google may issue penalties when content is intentionally copied (plagiarism), but that's not the same as duplicate content. Duplicate content is often unintentional, and won't result in your website being removed from the search results.

It does, however, lead to duplicate pages being filtered from the results, and gives you less control over which pages are ranking. Enough reason to avoid duplicate content!

Causes and solutions

There are many reasons why the same content may live on different URLs. We'll talk you through the most common causes and tell you how to fix them.

URL parameters

Often you'll see parameters in URLs, used for sorting, filtering, pagination or recognizing where traffic comes from. For example: www.example.com/products?sort=price and www.example.com/products may be the exact same page, but have a different URL. The same is true for tracking parameters: www.example.com/blog-post?utm_source=email may not differ from www.example.com/blog-post.

You probably can't remove these parameters, because they are there for a reason. There's an easy fix for this: use canonical URLs. A canonical URL tells search engines that although various URLs may lead to the same content, only the canonical URL is the original one. Generally, Google will use that URL in its results.

In the head of your page, add:

<link rel="canonical" href="http://www.example.com/blogs/my-blog-post" />

That tells Google http://www.example.com/blogs/my-blog-post should be indexed, even when the URL shown is:
http://www.example.com/blogs/my-blog-post?utm_source=email or
http://www.example.com/blogs/my-blog-post?show-comments=true&page=5.

For search engines, the effect is very similar to a 301 redirect, but visitors stay on the URL they requested.
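If your pages are generated by code, the canonical tag can be generated too. The Python sketch below strips the parameters that don't change the content and builds the tag from what is left. The list of ignored parameters is an assumption based on the examples above; on your own site, only list parameters that really leave the content unchanged.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change the content of the page.
# "page" is included only because of the comment-paging example above.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                  "sort", "show-comments", "page"}


def canonical_url(url: str) -> str:
    """Return the URL without the ignored parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url: str) -> str:
    """Build the link element that goes in the head of the page."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'


print(canonical_tag("http://www.example.com/blogs/my-blog-post?utm_source=email"))
```

Parameters that are not on the list, such as a hypothetical color=red filter, are kept, because they may well lead to different content.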

Content in different categories

Some Content Management Systems allow you to place content like products or blog posts in different categories. A gardening webshop that sells apple trees may list them on www.example.com/trees/apple-tree and www.example.com/fruit/apple-tree.

As a result, the product page or blog post is available on two different URLs. There you have it: duplicate content!

There are two possible solutions to this:

  • Make sure that even when a product fits into two categories, the product item page always uses the category name of the most important category.
  • OR: use a canonical URL that always points Google to the most important URL, so that one will be found in the results.
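Both solutions come down to the same decision: which category is the most important one? A minimal Python sketch of that decision, assuming a priority list that you maintain yourself (the list, the function names and the https://www.example.com base URL are made up for this example):

```python
# Hypothetical priority order: categories listed first are more important.
CATEGORY_PRIORITY = ["trees", "fruit"]


def primary_category(categories: list[str]) -> str:
    """Pick the category that comes first in CATEGORY_PRIORITY."""
    ranked = [name for name in CATEGORY_PRIORITY if name in categories]
    # Categories without a priority fall back to alphabetical order,
    # so the outcome is still the same on every request.
    return ranked[0] if ranked else sorted(categories)[0]


def canonical_product_url(slug: str, categories: list[str]) -> str:
    """Return the one URL that should be used for a product."""
    return f"https://www.example.com/{primary_category(categories)}/{slug}"


print(canonical_product_url("apple-tree", ["fruit", "trees"]))
```

Whatever rule you choose, the important part is that it always gives the same answer for the same product.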

Unoriginal content

Say you're selling products, and you use the description provided by the manufacturer on the product item page. There's a good chance that many of your competitors are doing the same. As a result, the content on your product page is hard to distinguish from that of your competitor.

Ideally, you should write your own content or at least adjust the provided texts so that it speaks to your audience. That way you not only avoid duplicate content, you also make sure that your audience is targeted with text written just for them, instead of generic descriptions everyone uses.

Guest posts

Imagine you get the opportunity to write a guest post on a big blog in your industry. That's pretty cool! But what if you wanted to post that same article on your own blog? Now you have 2 different URLs, even on different domains, with the exact same content.

Canonical URLs are the solution. If you can, ask the owner of the blog to add a canonical URL to the guest post that points to the same post on your own website. That's a strong signal that yours is the original.

Country specific domains

Say you have www.example.com targeting the US, and www.example.co.uk targeting the UK. Both websites sell kitchen apparel, and have identical product descriptions. But because the pricing and delivery costs differ, you want to make sure you're sending the right people to the right website, and of course, you want to avoid duplicate content.

hreflang attributes are the answer here. They tell Google which page targets which country, so Google can display the .com website to US searchers, and the .co.uk website to people from the UK.

More about International SEO and how to use hreflang.
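As an illustration, this Python sketch prints the link elements for the two example websites. The mapping is an assumption: we use en-us and en-gb as language and country codes, and the homepages as URLs. In practice, every page lists its own URL plus the matching page on the other website, and the same set of tags goes in the head of both versions.

```python
from html import escape

# Assumed mapping from hreflang value to the matching page on each website.
ALTERNATES = {
    "en-us": "https://www.example.com/",
    "en-gb": "https://www.example.co.uk/",
}


def hreflang_tags(alternates: dict[str, str]) -> list[str]:
    """Build one link element per language and country version."""
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}" />'
        for code, url in sorted(alternates.items())
    ]


for tag in hreflang_tags(ALTERNATES):
    print(tag)
```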

www and non-www URLs

Some websites have www in front of their domain, like www.google.com. Others don't, like dribbble.com. If your website works both with and without www, you have two identical websites on different URLs. Google will consider that duplicate content.

There's an easy fix: redirect all your traffic to www. If you have an Apache server, add this to your .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

https and http URLs

So you secured your website with an SSL certificate? That's great! Just don't forget to redirect all traffic to that secure URL, otherwise your content will live on 2 URLs: one with, and one without SSL.

If you're on an Apache server, you can do this by adding the following lines to your .htaccess file:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

Trailing slashes

You see this often: www.example.com/products and www.example.com/products/ (note the trailing slash at the end) show the exact same page. Google is getting smart enough to realize that this is probably the same page, but most SEOs agree that it's not worth the risk. It's much better to redirect all traffic to the URL without the trailing slash.

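If your website is running on an Apache server, add the following lines to your .htaccess file (below the RewriteEngine On line) to redirect all traffic to the variant without the trailing slash. The condition skips real directories, because Apache itself adds a trailing slash to those and the two redirects would otherwise keep undoing each other:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?(.+)/$ /$1 [R=301,L]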
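Taken together, the www, https and trailing slash redirects should send every variant of a URL to one preferred version. The Python sketch below mirrors those three rules, so you can check a list of URLs (from your sitemap, for example) before and after changing your server configuration. It is our own helper, not something Apache provides, and it assumes you chose https, www and no trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str) -> str:
    """Return the URL that the three redirects are meant to end up at."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host  # rule 1: always use www
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"  # rule 3: no trailing slash, except the root
    # rule 2: always use https
    return urlunsplit(("https", host, path, parts.query, parts.fragment))


variants = [
    "http://example.com/products",
    "http://www.example.com/products/",
    "https://example.com/products/",
    "https://www.example.com/products",
]
# All four variants should collapse into a single preferred URL.
print({preferred_url(url) for url in variants})
```

If the set contains more than one URL, there is still a variant that your redirects don't cover.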

Boilerplate content

When we talk about content, we normally refer to the text in your blog post, news article or product description. But there's more content on your page: you have a menu, a header, a footer and maybe even a sidebar that you show on every page across your site. That's what we call boilerplate content.

If you have a lot of boilerplate content on your page compared to the specific content of that page, Google may view these pages as duplicates. The result is pretty serious: Google may not show your individual product pages in the search results. Therefore, Google recommends keeping your boilerplate content to a minimum.

Of course, you'll need a menu and a footer. Just don't include your entire privacy statement in the footer. Instead, add a link to a specific page.
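To get a rough idea of how much of a page is boilerplate, you can compare the text of a few pages and count the lines that appear on all of them. The Python sketch below does just that. It assumes you have already extracted the visible text of each page as a list of lines, and it is only an estimate: Google doesn't publish a threshold for how much boilerplate is too much.

```python
def boilerplate_share(pages: dict[str, list[str]]) -> dict[str, float]:
    """For each page, the share of its text lines that appear on every page."""
    if not pages:
        return {}
    # Lines that occur on all pages are treated as boilerplate.
    shared = set.intersection(*(set(lines) for lines in pages.values()))
    return {
        url: (sum(line in shared for line in lines) / len(lines) if lines else 0.0)
        for url, lines in pages.items()
    }


# Hypothetical pages, already reduced to lines of visible text.
pages = {
    "/apple-tree": ["Menu", "Footer", "Apple tree"],
    "/pear-tree": ["Menu", "Footer", "Pear tree", "Ships in two days"],
}
print(boilerplate_share(pages))
```

Compare at least two pages: with a single page, every line counts as shared.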

A note on canonical URLs

We mentioned canonical URLs as a great way of avoiding duplicate content. But you have to be careful. Keep in mind that if page A has a canonical URL pointing to page B, page A is probably not going to be indexed. Often that is what you want, but take good care when placing canonical URLs, because the effects can be serious.
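Because a misplaced canonical URL can keep a page out of the index, it's worth checking what your pages actually say. This Python sketch reads the canonical URL from a page's HTML using the standard library and reports whether it is missing, points to the page itself or points somewhere else. The class name, function names and status labels are our own, and fetching the HTML is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html: str) -> list[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


def canonical_status(page_url: str, html: str) -> str:
    """Classify a page as 'missing', 'multiple', 'self' or 'points elsewhere'."""
    canonicals = find_canonicals(html)
    if not canonicals:
        return "missing"
    if len(canonicals) > 1:
        return "multiple"
    return "self" if canonicals[0] == page_url else "points elsewhere"
```

A page that reports "points elsewhere" is the page A from the paragraph above: fine for a URL with tracking parameters, a problem for a page you want to rank.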

How to find duplicate content?

You've come to the right place: SiteGuru checks your website for duplicate page titles and meta descriptions, often a signal that the content itself is also duplicate.

Google Search Console also reports on duplicate content.
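If you'd like to run a quick check on a handful of pages yourself, grouping URLs by page title takes only a few lines of Python. This is a simplified sketch of the idea, not how SiteGuru or Google Search Console work internally, and it assumes you have already collected the title of each URL.

```python
from collections import defaultdict


def duplicate_titles(titles_by_url: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs that share a page title, ignoring case and extra spaces."""
    groups = defaultdict(list)
    for url, title in titles_by_url.items():
        normalized = " ".join(title.lower().split())
        groups[normalized].append(url)
    # Only titles used by more than one URL are interesting.
    return {title: sorted(urls) for title, urls in groups.items() if len(urls) > 1}


# Hypothetical crawl result: URL and page title.
titles = {
    "https://www.example.com/trees/apple-tree": "Apple tree | Example",
    "https://www.example.com/fruit/apple-tree": "Apple Tree | Example",
    "https://www.example.com/trees/pear-tree": "Pear tree | Example",
}
print(duplicate_titles(titles))
```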

Conclusion

Duplicate content can have a negative impact on how your website is indexed and ranked. It's easy to find and easy to fix, so there's no reason there should be duplicate content on your website.