This article addresses the problem of identical or nearly identical versions of a web page being available at more than one URL, on one site or across many sites. This can occur unintentionally (because of site architecture, for example), or it can be something of which the webmaster is well aware (as with syndicated content, or content "scraped" or illegally copied from other websites).
The problem with so-called canonical URLs, where a site answers both with and without "www.", is not too much of an issue today, because most webmasters are aware of it. Google even has a tool in its Webmaster Central application that lets webmasters specify which version is the primary one: the one with "www." or the one without. Using that tool does not require any changes to the site. The webmaster should still make the change on the site itself, though, because other search engines do not provide a mechanism like Google's.

Some sites are configured to return the website whether or not the "www" is entered, e.g. http://www.domain.com returns the site and so does http://domain.com. Those are two distinct URLs, and search engines treat them as different sites, because technically you could have completely different sites at each address, even on different servers. When the search engine notes that all pages are identical at both addresses, how to handle this is left to the search engine. However, you should not leave that choice up to the search engines. Make the choice for them! You should redirect one of the versions to the other. (This will also prevent PageRank leakage caused by people linking to your site via both types of URLs.) If you make this choice proactively, you decide which version becomes the "master", and the duplicate addresses stop being an SEO problem.
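The redirect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a drop-in server configuration: the host names come from the domain.com example, the choice of the "www" version as master is arbitrary, the function name canonical_redirect is made up for this sketch, and a permanent (301) redirect is assumed as the redirect type.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the "www" version was chosen as the master version.
CANONICAL_HOST = "www.domain.com"
# Assumption: other host names that currently return the same site.
ALTERNATE_HOSTS = {"domain.com"}


def canonical_redirect(url):
    """Return (301, target_url) if url uses an alternate host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALTERNATE_HOSTS:
        return None  # already canonical, or not one of our hosts
    # Swap only the host name; keep scheme, port, path, query and fragment.
    netloc = CANONICAL_HOST
    if parts.port is not None:
        netloc = f"{CANONICAL_HOST}:{parts.port}"
    target = urlunsplit(
        (parts.scheme, netloc, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, target


if __name__ == "__main__":
    print(canonical_redirect("http://domain.com/page.html?id=1"))
    print(canonical_redirect("http://www.domain.com/page.html?id=1"))
```

A web server or application would call such a function for each incoming request and, when it returns a target, answer with that status code and a Location header instead of the page. In practice the same rule is usually expressed in the web server's own rewrite or redirect configuration.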
A bigger concern is pages that have the same content but more than one URL because of the site's architecture. In addition, with the increased popularity of syndication, aggregation, and "mash-ups", it has become much easier for "black hat SEO" types to create scraper sites.
"Scraper sites" are sites that are thrown together as quickly and with as much automation as possible. They are designed to rank well, get users to click on contextual ads such as Google AdSense links, and so generate revenue. The chances of generating revenue are high, because the ads are the only text that makes sense (compared to the gibberish produced by the scraper). Another goal of a scraper site could be to indirectly boost the ranking of a less visible site. A site that is included in the search index and starts to rank for terms can pass on "link juice" like any other website on the Internet. That means that if you link from your scraper site to another website of yours, it will boost the other site's ranking. This is a very risky business, because search engines will not only get rid of the scraper site if they detect it, but also penalize your real site if it is clear what you did.
Landing on a scraper site is, in most cases, a bad experience for the user, so search engines will try to get rid of scraper sites as efficiently as possible. The preponderance of scraper sites also contributes to an overall sense of mistrust (on the part of search engines) toward webmasters who may very well be promoting legitimate website content.
Further reading is available in my article "Poor Search Engine Rankings Caused by Duplicate Content Issues".

Mirror, Mirror on the Web - A Study of Host Pairs with Replicated Content by Krishna Bharat and Andrei Broder
Detecting query-specific duplicate documents; Google, Inc.; Patent No. 6,615,209 (2003)
Detecting duplicate and near-duplicate files; Google, Inc.; Patent No. 6,658,423 (2003), by William Pugh
Method for determining the resemblance of documents; Digital Equipment Corp., Inc.; Patent No. 5,909,677 (1999)
Method for determining the resemblance of documents; Altavista, Inc.; Patent No. 6,230,155 (2001)
Phrase identification in an information retrieval system - Patent Application: 20060018551 by Anna Lynn 1/26/2006
System and method for optimizing search results through equivalent results collapsing Patent Application: 20060248066 by Brewer; Brett D. 11/2/2006