Internet Marketing and Web Development Resources

Duplicate Content and Near Duplicate Content Issues - Resources for SEO

Duplicate Content Issue. What is the Issue?


This article addresses the problem of having identical or nearly identical versions of a web page available at more than one URL, on one site or on many sites. This can occur unintentionally (because of site architecture, for example), or it can be something of which the webmaster is well aware (as with syndicated content, or with content "scraped" or illegally copied from other websites).

The problem with so-called canonical URLs is not much of an issue today, because most webmasters are aware of it. Google even has a tool in its Webmaster Central application that lets webmasters specify which version is the primary one: the one with "www." or the one without. Using that tool does not require any changes to the site. The webmaster should still fix the problem on the site itself, though, because other search engines do not provide a mechanism like Google's.

Some sites are configured to return the website whether or not the "www" is entered, e.g. http://www.domain.com returns the site and so does http://domain.com. Those are two distinct URLs, and search engines treat them as different sites, because technically you could have completely different sites at each address, even on different servers. When a search engine notes that all pages are identical at both addresses, how to handle this is left to the search engine. You should not leave that choice up to the search engines. Make the choice for them! Redirect one of the versions to the other with a permanent (301) redirect. (This will also prevent PageRank leakage caused by people linking to your site via both types of URLs.) By making this choice proactively, you decide which version becomes the "master", and the duplication stops being an SEO problem.
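
As a rough illustration of the redirect decision described above, here is a minimal Python sketch. It assumes the "www." version was chosen as the primary one. The names CANONICAL_HOST, ALTERNATE_HOSTS and canonical_redirect are hypothetical and exist only for this example. On a real site the same rule would normally live in the web server configuration, and the response would be sent with HTTP status 301.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this example: the "www." host is the primary version.
CANONICAL_HOST = "www.domain.com"
ALTERNATE_HOSTS = {"domain.com"}


def canonical_redirect(url):
    """Return the URL to redirect to (with a 301), or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALTERNATE_HOSTS:
        return None
    # Keep an explicit port if the request URL carried one.
    netloc = CANONICAL_HOST if parts.port is None else f"{CANONICAL_HOST}:{parts.port}"
    # Preserve path and query so every page maps to its exact counterpart.
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, parts.fragment))
```

Keeping the path and query string intact matters here: redirecting every alternate URL to the home page would lose the one-to-one mapping between the two versions of each page.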

A bigger concern is pages with the same content but more than one URL because of the site's architecture. Deliberate copying is a concern as well: with the increased popularity of syndication, aggregation, and "mash-ups", it has become much easier for "black hat SEO" types to create scraper sites.

"Scraper sites" are sites that are thrown together as quickly and with as much automation as possible. They are designed to rank well and get users to click on contextual ads, such as Google AdSense links, and to generate revenue. The chances for generating revenue are high, because the ads are the only text that makes sense (compared to the gibberish produced by the scraper). Another goal of a scraper site could be to indirectly boost the ranking of a less visible site. A site that is included in the search index and starts to rank for terms can pass on "link juice" like any other website on the Internet. That means that if you link from your scraper site to another website of yours, it will boost the other site's ranking. This is very risky business, because search engines will not only get rid of the scraper site if they detect it, but also penalize your real site, if it is clear what you did.

Landing on a scraper site is, in most cases, a bad experience for the user. Search engines will try to get rid of scraper sites as efficiently as possible. The preponderance of scraper sites contributes to an overall sense of mistrust (by search engines) of webmasters who may very well be promoting legitimate website content.

Further reading is available in my article Poor Search Engine Rankings Caused by Duplicate Content Issues.


Duplicate Content Resources and Opinions


Duplicate Content Filter - What it is and how it works

How to Remedy Duplicate Content and Magical % Thinking by Todd Malicoat aka stuntdubl from June 2006
Duplicate Content - Revisited - Blog post at SEOmoz.org from 11/3/2006
Microsoft Explains Duplicate Content Results Filtering by Bill Slawski, SEO by the SEA 11/4/2006
Duplicate Content Issues and Search Engines by Bill Slawski from SEObyTheSea.com from June 11th, 2006
Duplicated Content and Re-Ranking Methods Patents owned by Search Engines. A written summary of the Session at SES 2005 Conference & Expo, SJ titled: "Patents On Duplicated Content and Re-Ranking Methods"
Duplicate Content Round-up: Diagnosis and Correction with Free Tools on 11/07/2006
Google Engineer Matt Cutts explains Duplicate Content Detection by Google - video where he talks about the process of duplicate content detection
Duplicate Content & Multiple Site Issues from SES Chicago 2006 by Lisa Barone from Bruce Clay, Inc.
Deftly dealing with duplicate content post by Adam Lasnik at the official Google Webmaster Central Blog
The Illustrated Guide to Duplicate Content in the Search Engines by Rand Fishkin of SEOMoz.org



Duplicate Content Research, Patents, Shingles and Algorithms


New Duplicate Content and Mapping Patents from Google - 01/02/2007 by Bill Slawski, Correspondent at SEL

Mirror, Mirror on the Web - A Study of Host Pairs with Replicated Content by Krishna Bharat and Andrei Broder

Detecting query-specific duplicate documents; Google, Inc.; Patent No. 6,615,209 (2003)
Detecting duplicate and near-duplicate files; Google, Inc.; Patent No. 6,658,423 (2003)
Patent No. 6,658,423 explained
Patent No. 6,658,423 by William Pugh demonstrated
Method for determining the resemblance of documents; Digital Equipment Corp., Inc.; Patent No. 5,909,677 (1999)
Method for determining the resemblance of documents; Altavista, Inc.; Patent No. 6,230,155 (2001)
Phrase identification in an information retrieval system - Patent Application: 20060018551 by Anna Lynn 1/26/2006
System and method for optimizing search results through equivalent results collapsing Patent Application: 20060248066 by Brewer; Brett D. 11/2/2006
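
To give a feel for what "shingles" and "resemblance" mean in this context, here is a minimal Python sketch of the general idea: break each document into overlapping sequences of w words (shingles) and compare the two sets with the Jaccard coefficient. This is only an illustration under simplifying assumptions, not a reproduction of any patent listed above. The function names and the default w=4 are hypothetical choices, and systems working at web scale compare compact sketches of the shingle sets rather than the full sets.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word sequences) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A value close to 1.0 indicates near duplicates, and a value close to 0.0 indicates unrelated text. Where to draw the line between the two is a judgment call that this sketch does not answer.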



Duplicate Content Check/Copyright Infringement Detection


Duplicate content detection tools for webmasters. Use them not just to detect duplicates on your own site, but also to find content that scrapers and other webmasters have stolen from your site.

Copyscape.com - Search for copies of your page on the Web.
Similar Page Checker - Determine Percentage of Similarity of 2 Web Pages (URLs)
Site Wide Duplicate Content Analyzer by SEOJunkie.com
Duplicate Content Tool (Online Tool) by Virante.com
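
For a quick check of exact duplicates on your own site, without any of the tools above, a few lines of Python are enough. The sketch below is a hypothetical helper and says nothing about how the listed tools work internally. It assumes you have already fetched the pages and stripped the markup, so that pages maps each URL to its plain text. It only catches pages that are identical after case and whitespace normalization. Near duplicates need a similarity measure instead.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(pages):
    """Group URLs whose text is identical after case and whitespace normalization.

    pages: dict mapping URL -> page text (already fetched, markup removed).
    Returns a list of URL groups; each group shares the same content.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Each group returned is a set of URLs that serve the same content, which makes it a candidate list for redirects.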

You had a duplicate content issue, and it got your site banned. You found the problem and spent the necessary time to fix it. Now what? Send reinclusion request(s) to the search engines!
