
Duplicate Content and Near Duplicate Content - Canonical URLs, Content Theft (Scraping)


The Duplicate Content Issue: What Is It?



This article addresses the problem of having identical or nearly identical versions of a web page available at more than one URL, on one site or across many sites. This can occur unintentionally (because of site architecture, for example) or it can be something the webmaster is well aware of (as with syndicated content, or content "scraped", i.e. illegally copied, from other websites).

Canonical URLs

The so-called canonical URL problem is not much of an issue anymore. Most webmasters are aware of it. Google even has a tool in its Webmaster Central application that lets webmasters specify which version is the primary one -- the one with "www." or the one without. The webmaster does not have to make any changes to the site for that. He should, though, because other search engines do not provide a mechanism like Google's.

Some sites are configured to return the website whether or not the "www" is entered, e.g. http://www.domain.com returns the site and so does http://domain.com. Those are two distinct URLs, and search engines treat them as different sites, because technically you could host completely different sites at each address, even on different servers. When a search engine notices that all pages are identical at both addresses, it is left to decide on its own how to handle the duplication. You should not leave that choice up to the search engines, though. Make the choice for them! Redirect one of the versions to the other with a permanent (301) redirect. (This will also prevent PageRank leakage caused by people linking to your site via both types of URLs.) By making this choice proactively, you decide which version becomes the "master", and the duplication never becomes an SEO problem.
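For example, on an Apache web server with mod_rewrite enabled, such a redirect can be set up in the site's .htaccess file. The following is a minimal sketch assuming the placeholder domain example.com; swap the two host names around if you prefer the non-www version as the master:

# .htaccess sketch: permanently (301) redirect all non-www requests
# to the www version. Assumes Apache with mod_rewrite enabled;
# replace example.com with your own domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]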

HTML Attribute Value "Canonical"

To solve the canonical issue of the www versus the non-www version of a website, Google provided a tool via its Webmaster Central site that lets webmasters specify which version is the primary one for the domains they have registered with the Google Webmaster Central service.

In February 2009, the three major search engines Google, Yahoo and Microsoft introduced a second HTML attribute value (after the attribute value NOFOLLOW, which was introduced by Google in 2005) that exists specifically for webmasters to use in their search engine optimization efforts. It is called "canonical" and is, like "nofollow", an HTML attribute value for the attribute "rel".

Unlike "nofollow", which is used for the "rel" attribute within the "a" (hyperlink) element or tag inside the body of a HTML document, "canonical" is used for the "rel" attribute for the "link" element or tag and inside the header section of the HTML document.

The "canonical" attribute value also serves a differnt purpose than "nofollow", which was designed to indicate to search engines that the link where the attribute value is used should not pass any link value (or "PageRank"), which is in one way or another the most important ranking factor in any of the major search engines today.

The new attribute value "canonical" provides webmasters the ability to indicate to search engines what the main or primary URL is for the current document and, by extension, that any other URL variations of the same (duplicate) document should not be used.

Example Usage of the "canonical" tag:

<html>
<head>
...
<link rel="canonical" href="http://www.example.com/product/german-beer" />
...
</head>
<body>
...
</body>
</html>

This indicates to search engines that the primary URL for the current document is "http://www.example.com/product/german-beer" and that the current version (if different) is a duplicate.

Some examples of typical URL variations that cause duplication:
http://www.example.com/product.php?item=german-beer
http://www.example.com/product.php?item=12345
http://www.example.com/product/german-beer/
http://www.example.com/product/german-beer/?session-id=ABCDE99991991ASDX
http://www.example.com/product.php?item=12345&aff=7654321
http://www.example.com/product.php?item=german-beer&trackingcode=ppc-adwords123
http://www.example.com/category/product/german-beer
http://www.example.com/germanbeerpromotion

Google engineer Matt Cutts elaborated, in an interview with WebProNews during the SMX West conference where the new "tag" was introduced, on how Google uses the information provided by webmasters who use "canonical" in their web pages.

The tag can only be used to refer to the primary URL of the document on the same domain name; it cannot refer to a version of the same document on another domain. Furthermore, if implemented correctly and indexed properly by the Google Bot, duplicate URLs on the same domain that point to the same document will be treated as if the webmaster had 301-redirected them to the primary URL version.
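For reference, this is roughly what such a permanent redirect looks like at the HTTP level (a simplified sketch; the actual headers vary by server and configuration):

GET /product.php?item=german-beer HTTP/1.1
Host: www.example.com

HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/product/german-beer

With the "canonical" tag, no such redirect takes place for the visitor; only the search engine consolidates the URLs internally.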

This means that webmasters who have duplicate URLs pointing to the same content on their website, with those URLs also being used by other websites to link to that content, should see an increase in PageRank and improved rankings in Google search results after implementing the new "canonical" tag, without having to implement any server-side 301 redirects for their duplicate URLs.

Scraper Sites

A bigger concern is pages that carry the same content but have more than one URL because of the site's architecture. And with the increased popularity of syndication, aggregation and "mash-ups", it has become much easier for "black hat SEO" types to create scraper sites.

"Scraper sites" are sites that are thrown together as quickly and with as much automation as possible. They are designed to rank well and get users to click on contextual ads, such as Google AdSense links, and to generate revenue. The chances for generating revenue are high, because the ads are the only text that makes sense (compared to the gibberish produced by the scraper). Another goal of a scraper site could be to indirectly boost the ranking of a less visible site. A site that is included in the search index and starts to rank for terms can pass on "link juice" like any other website on the Internet. That means that if you link from your scraper site to another website of yours, it will boost the other site's ranking. This is very risky business, because search engines will not only get rid of the scraper site if they detect it, but also penalize your real site, if it is clear what you did.

Landing on a scraper site is, in most cases, a bad experience for the user. Search engines will try to get rid of scraper sites as efficiently as possible. The preponderance of scraper sites contributes to an overall sense of mistrust (by search engines) of webmasters who may very well be promoting legitimate website content.

Further reading is available at my article Poor Search Engine Rankings Caused by Duplicate Content Issues.



Duplicate Content Resources and Opinions



Duplicate Content Filter - What it is and how it works





Duplicate Content Research, Patents, Shingles and Algorithms






Duplicate Content Check/Copyright Infringement Detection



Duplicate content detection tools for webmasters: not just for detecting duplicates on your own site, but also for finding content stolen from your site by scrapers and other webmasters.
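To illustrate the general technique behind such tools (and behind the "shingles" research mentioned above), here is a minimal sketch in Python that compares two texts by the Jaccard similarity of their word shingles (overlapping word n-grams). It illustrates the approach only and is not the implementation of any particular tool:

# Minimal sketch of shingle-based near-duplicate detection.
# Real tools add crawling, text normalization and hashing
# (e.g. MinHash) on top of this basic idea.

def shingles(text, n=4):
    """Return the set of overlapping n-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(text_a, text_b, n=4):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A score close to 1.0 suggests one page is a (near) copy of the other.
original = "This article addresses the problem of having identical versions of a web page."
suspect = "This article addresses the problem of having identical versions of a page."
print(jaccard_similarity(original, suspect))

A score above a chosen threshold (say, 0.8) flags the suspect page as a likely near-duplicate.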




