According to Google Search Console, “Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
Strictly speaking, duplicate content may or may not be penalized, but it can still affect search engine rankings. When several pieces of what Google calls “appreciably similar” content exist in more than one location on the Internet, search engines have difficulty deciding which version is most relevant to a given search query.
Why does duplicate content matter to search engines? It creates three main problems for them:
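To make “appreciably similar” a little more concrete, here is a minimal sketch that scores two texts by the share of three-word sequences (shingles) they have in common. This is only an illustration: the function names and the shingle size are assumptions, and nothing here reflects how Google or any other search engine actually measures similarity.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Texts shorter than one shingle are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A score of 1.0 means the two texts share every shingle, and a score near 0.0 means they share almost none. Where to draw the line for “appreciably similar” is a judgment call this sketch does not settle.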
- They don’t know which version to include or exclude from their indices.
- They don’t know whether to direct the link metrics (trust, authority, anchor text, etc.) to one page or keep them split between multiple versions.
- They don’t know which version to rank for query results.
When duplicate content is present, website owners can suffer losses in rankings and traffic. These losses usually stem from two problems:
- To provide the best search query experience, search engines will rarely show multiple versions of the same content, and thus are forced to choose which version is most likely to be the best result. This dilutes the visibility of each of the duplicates.
- Link equity can be further diluted because other sites have to choose between the duplicates as well. Instead of all inbound links pointing to one piece of content, they link to multiple pieces, spreading the link equity among the duplicates. Because inbound links are a ranking factor, this can then impact the search visibility of a piece of content.
The eventual result is that a piece of content does not achieve the search visibility it otherwise would.
Scraped or copied content comes from content scrapers: websites that use software tools to steal your content for their own blogs. Content here includes not only blog posts and editorial content but also product information pages. Scrapers republishing your blog content on their own sites may be the more familiar source of duplicate content, but e-commerce sites face a common problem as well: product descriptions. If many different websites sell the same items and all use the manufacturer’s descriptions of those items, identical content winds up in multiple locations across the web. Such duplicate content is not penalized.
How do you fix duplicate content issues? It all comes down to the same central idea: specifying which of the duplicates is the “correct” one.
Whenever content on a site can be found at multiple URLs, it should be canonicalized for search engines. Let’s go over the three main ways to do this: using a 301 redirect to the correct URL, using the rel=canonical attribute, or using the parameter handling tool in Google Search Console (note that Google retired this tool in 2022).
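Before choosing a fix, it helps to find out which URLs on a site actually lead to the same content. The sketch below groups URLs that differ only in ways that usually do not change the page: host capitalization, a leading "www.", a trailing slash, or tracking parameters. The parameter list, the function names, and the choice to treat http and https as the same page are all assumptions for illustration; each site has to decide for itself which differences matter.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def normalize_url(url):
    """Reduce a URL to one comparable form (assumed rules, adjust per site)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www and non-www serve the same pages
    path = parts.path.rstrip("/") or "/"
    # Drop ignored parameters and sort the rest so their order does not matter.
    # Note: parse_qsl discards parameters with blank values by default.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    # Assumption: the https version is the preferred one.
    return urlunsplit(("https", host, path, urlencode(query), ""))


def group_duplicates(urls):
    """Map each normalized URL to the original URLs that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Every group that `group_duplicates` returns is a candidate for one of the fixes described here: pick the preferred URL, then redirect or canonicalize the others to it.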
301 redirect: In many cases, the best way to combat duplicate content is to set up a 301 redirect from the “duplicate” page to the original content page.
When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with one another; they also create a stronger relevancy and popularity signal overall. This will positively impact the “correct” page’s ability to rank well.
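In practice, 301 redirects are usually configured in the web server or the CMS, but the mechanics are simple enough to sketch with Python's standard library. The paths in `REDIRECTS`, the helper `resolve`, and the port number are made up for illustration; this is a minimal sketch of the idea, not a production setup.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of duplicate paths to the preferred ("correct") path.
REDIRECTS = {
    "/shoes/index.html": "/shoes",
    "/products/shoes": "/shoes",
}


def resolve(path):
    """Return (status, location) for a request path."""
    target = REDIRECTS.get(path)
    if target is not None:
        return 301, target  # moved permanently to the preferred URL
    return 200, None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path includes any query string; this sketch matches it as-is.
        status, location = resolve(self.path)
        if status == 301:
            self.send_response(301)
            self.send_header("Location", location)
            self.end_headers()
            return
        body = b"Preferred page\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` and requesting `/products/shoes` should answer with status 301 and a `Location: /shoes` header, which is the signal that tells both browsers and crawlers that the duplicate has permanently moved to the preferred page.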
Rel="canonical": Another option for dealing with duplicate content is to use the rel=canonical attribute, placed on a link element in the HTML head of the duplicate page. This tells search engines that the page should be treated as though it were a copy of a specified URL, and that all of the links, content metrics, and “ranking power” that search engines apply to this page should actually be credited to the specified URL.
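On the page itself, the canonical hint is a single line in the HTML head, for example `<link rel="canonical" href="https://example.com/shoes">`. To check that a duplicate page really declares the URL you expect, a small script can read it back. The following is a minimal sketch using Python's built-in HTML parser; the class and function names, and the example.com URL, are placeholders.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

For a page whose head contains the link element shown above, `find_canonical` returns the declared URL, and for a page without one it returns `None`. Running this against each known duplicate is a quick way to confirm that they all point at the same preferred page.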
Meta Robots Noindex: One meta tag that can be particularly useful in dealing with duplicate content is meta robots, used with the values “noindex, follow.” Commonly called Meta Noindex, Follow and written as content="noindex,follow", this meta robots tag can be added to the HTML head of each individual page that should be excluded from a search engine’s index.
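Written out in full, the tag is `<meta name="robots" content="noindex,follow">`. A quick way to audit which pages carry it is to parse each page and collect its robots directives. The sketch below does that with Python's built-in HTML parser; the class and function names are illustrative, and it only looks at the generic "robots" meta tag, not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; spacing and letter case vary.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


def is_excluded_from_index(html):
    """True if the page asks search engines not to index it."""
    return "noindex" in robots_directives(html)
```

A page with the tag above yields the directives `noindex` and `follow`, so `is_excluded_from_index` returns `True`, while a page with no robots meta tag yields an empty set.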