Search Engine Spiders, META Tags and Robots.txt Exclusion Protocol

Introduction
Search Engine Crawlers/Bots/Spiders
Search Engine Spider Simulators/Testers
HTML Meta Tags for Search Engines
Robots.txt Exclusion Protocol
Crawl Tests

Introduction

Return to Search Engine Optimization Resources Home

The terms "search engine bot", "spider" and "crawler" all refer to the same thing: an automated tool that somebody (mostly general web search engines like Google or Yahoo) uses to access and download the content of your web site.

Depending on its purpose, a crawler may access only HTML documents, or also other elements of your web pages (CSS style sheets, images, videos, linked documents, etc.). Every request uses bandwidth and server resources and adds entries to your web server logs.

A crawler can act like a regular user who visits the web site, but it usually requests many pages at once or in quick succession, because it is a program and not a human, who needs far more time to process the content of your web pages.

Legitimate crawlers identify themselves via the User-Agent header that is sent to the web server with every request. It contains the "name" of the crawler and a link to a web page where you can learn more about the purpose of the crawler and who operates it. That page should also contain instructions on how to prevent the crawler from accessing your web site, in case you decide that the crawling does not provide benefits that justify the overhead described above.
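
As a rough illustration of the name-plus-link convention described above, the following Python sketch pulls the crawler name, version and information URL out of a user agent string. The regular expression and the function name `describe_crawler` are assumptions made for this example; not every crawler formats its user agent this way, so a failed match only means "unknown", not "illegitimate".

```python
import re

# Assumed pattern for the common "compatible; Name/version; +URL" layout.
# Crawlers are free to use other layouts, so this is only a heuristic.
CRAWLER_PATTERN = re.compile(
    r"compatible;\s*(?P<name>[^/;\s]+)(?:/(?P<version>[^;\s]+))?;\s*"
    r"\+(?P<url>https?://[^\s;)]+)"
)


def describe_crawler(user_agent):
    """Return (name, version, info_url), or None if the layout is not recognized."""
    match = CRAWLER_PATTERN.search(user_agent)
    if match is None:
        return None
    return match.group("name"), match.group("version"), match.group("url")


example = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(describe_crawler(example))
# ('Googlebot', '2.1', 'http://www.google.com/bot.html')
```

Because the user agent is supplied by the client, a match only tells you what the visitor claims to be.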

The most common way to block unwanted crawlers is via the Robots.txt exclusion protocol.

Don't be too quick to block a crawler, because doing so might be to your own disadvantage. For example, if you blocked the "Googlebot" spider, which is operated by Google, Google would be unable to index your web pages and thus unable to include those pages in its vast search engine index. As a consequence, your site would not appear in the search results of users who are looking for relevant information, products or services that you are able to provide.
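
To see the effect of a robots.txt rule before publishing it, you can check it locally. The sketch below uses Python's standard `urllib.robotparser` module; the crawler name `ExampleBot` and the URL are hypothetical placeholders. The file blocks that one crawler from the whole site and leaves every other crawler, including Googlebot, unrestricted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one unwanted crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.yoursite.com/page.html"  # placeholder URL
print(parser.can_fetch("ExampleBot", page))  # False
print(parser.can_fetch("Googlebot", page))   # True
```

An empty `Disallow:` line means "nothing is disallowed", which is why the second check succeeds.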

Most web sites get a significant percentage of their web traffic from search engines such as Google, so it is in your interest that they spider your site. Only then can they do their job and send you the visitors for whom you created your web site in the first place (find out more about how to increase the traffic from general search engines to your web site). If a legitimate search engine spider hits your web servers too "hard", there are other ways to fix this issue than blocking it. One way is to use the webmaster tools that all major search engines provide.

Search Engine Crawlers/Bots/Spiders

Frequently Updated Bots vs. Browsers is a database of 35,000+ (and growing) user agents. It contains verified user agents used by search engine crawlers, but also lists user agents that are not yet verified and may or may not be real, but look real (e.g., contain a known spider name or search engine URL). There are of course more fake ones than real ones, because spoofing a user agent is easy and can be done by almost anyone, for whatever reason.
Psychedelix.com is a free user agent/crawler database which is updated constantly.
List of User Agents of Important SE Crawlers by User Agent String.Com (still being updated).
The Web Robots Database at RobotsTXT.org is a database of registered web robots. You can also download their Robots Database File to use it within your own application (see RobotsDB Database Schema).
Psychedelix.com RSS Feed; subscribe to get notified quickly and automatically about newly discovered user agents used by search engine crawlers.
Old Search Engine Bots List by Michael Horowitz (last updated: August 30, 2004).

Search Engine Spider Simulators/Testers

Spider Simulator displays the contents of a webpage exactly how most search engine bots would see it.
WannaBrowser.com is an online tool for testing HTML output for various user agents and specific referrers.
Summit Media Spider is a tool that checks a single provided web page for potential issues for search engine spiders and other issues, including page titles, description, keywords, h1 tags, images and image alt tags, HTML size, total page size, retrieval time, links on the page and, optionally, broken links.
User Agent Test Track is an online tool created by Botsvsbrowsers.com which allows you to pick any user agent and access any URL using that user agent to detect cloaking based on user agent. It will not, however, detect cloaking based on IP address. You can call the tool with specific SE user agent information via links from its homepage, which lists user agents of known search engine spiders.
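
The idea behind user-agent-based cloaking checks like the ones listed above can be sketched with Python's standard `urllib.request` module: request the same URL once with a browser user agent and once with a crawler user agent, then compare the responses. The function names are hypothetical, and the byte-for-byte comparison is a deliberately naive assumption; pages with dynamic content (timestamps, rotating ads) differ between any two requests, so a difference is only a hint. As noted above, this approach cannot detect cloaking based on IP address.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that presents the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_as(url, user_agent, timeout=10):
    """Download the page body while identifying as `user_agent`."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def looks_cloaked(browser_body, crawler_body):
    """Naive check: any difference between the two bodies is flagged."""
    return browser_body != crawler_body


# Example use (performs real network requests, so it is left commented out):
# url = "http://www.yoursite.com/"
# as_browser = fetch_as(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
# as_crawler = fetch_as(
#     url, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# )
# print(looks_cloaked(as_browser, as_crawler))
```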

HTML Meta Tags for Search Engines

SEO Technical Tips, provided by SEO veteran Bruce Clay, offers technical information, tips and advice on how to adjust for server and design issues that may be negatively impacting your search engine optimization efforts.
Meta Tag Generator checks that meta tags are correctly formed.
Meta Analyzer analyzes a (competitor's) website's meta tags.
Meta Medic's slogan is "Perfect Meta Tags, Easy!" It's a tool created by Aaron Wall, author of SEO Book.
Meta Robots Tag 101: Blocking Spiders, Cached Pages & More is an article by Danny Sullivan at SearchEngineLand.com that explains how you, as a webmaster, can control what search engines index and what they do not. It also covers how to tell the search engines how your pages should be shown in the search results.
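
To check which instructions a page actually gives to search engines through its robots meta tag, you can extract the tag yourself. The following Python sketch uses the standard `html.parser` module; the class name `RobotsMetaParser` and the sample page are made up for this example, and `noindex` and `nofollow` are used as typical directive values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Made-up sample page that asks not to be indexed and not to have links followed.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body></body></html>"
)
meta_parser = RobotsMetaParser()
meta_parser.feed(page)
print(sorted(meta_parser.directives))  # ['nofollow', 'noindex']
```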

Robots.txt Exclusion Protocol

The web robots pages at RobotsTxt.org provide information about the Robots Exclusion Standard and articles about writing well-behaved web robots.
The Robots Exclusion Protocol page at RobotsTxt.org describes robot exclusion via robots.txt or robots meta tags.
The Robots.txt Builder Tool allows you to create a valid robots.txt file. It has nice features such as selection of bots by type you would like to exclude, such as image search, contextual ads, web archivers and web search robots as well as a list of known "bad robots" that are useless and only use up your bandwidth. Most bad bots do not obey the Robots Exclusion Standard but a surprising number do, so it might be worth excluding some.
Robots.txt Generator tool helps you generate a valid Robots.txt file; a list of common crawlers is already provided.
Robots.txt Tester is a simple web tool that allows you to test the syntax of your robots.txt file. Simply enter the URL to your robots.txt file, e.g., http://www.yoursite.com/robots.txt and the tool will check it for possible problems. It looks only for syntax and structural errors.
How to use Google's Robots.txt Tool includes a detailed FAQ section and documentation. The tool itself requires a Google account for Google Webmaster Central (formerly known as Google Sitemaps).
An explanation of the analysis results provided by Google's robots.txt tool is available at Google Webmaster Central.
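
The kind of syntax and structural check that a robots.txt tester performs can be approximated in a few lines of Python. This is a minimal sketch and not the logic of any of the tools listed above: the function name `check_robots_txt`, the set of accepted field names and the error messages are all assumptions made for this example.

```python
# Assumed list of accepted fields; real testers may accept more or fewer.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def check_robots_txt(text):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in {"disallow", "allow"} and not seen_user_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems


sample = "Disallow: /private/\nUser-agent: *\nDisalow: /tmp/\nDisallow /cgi-bin/\n"
for problem in check_robots_txt(sample):
    print(problem)
# (1, 'rule appears before any User-agent line')
# (3, "unknown field 'disalow'")
# (4, "missing ':' separator")
```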

Crawl Tests

Spider Map Creator creates a page that lists all your pages with links to them, the page title in the anchor and the meta description text under the page links. Besides generating a nice site map, this also shows you if pages are missing or if you might have problems like duplicate page titles or duplicate meta descriptions.
Crawl Test Tool by SEOMoz.org will test how accessible your site is to search engines. The public version is limited, with the full version only being available to SEOMoz premium members.
Crawl Score at CrawlScore.com is a web-based service that crawls your website and diagnoses any crawling-related issues. It also generates sitemaps in all widely used formats, including Google and Yahoo.
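
Once a crawl has collected the title and meta description of each page, spotting the duplicate titles or descriptions mentioned for Spider Map Creator is a simple grouping exercise. The Python sketch below assumes the crawl results are already available as a dictionary; the data, the field names and the function name `find_duplicates` are invented for illustration.

```python
from collections import defaultdict


def find_duplicates(pages, field):
    """Group URLs by the value of `field`; keep values shared by several URLs."""
    groups = defaultdict(list)
    for url, info in pages.items():
        # Compare case-insensitively and ignore surrounding whitespace.
        groups[info[field].strip().lower()].append(url)
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


# Invented crawl results: URL path -> title and meta description.
pages = {
    "/": {"title": "Home", "description": "Welcome"},
    "/about": {"title": "About us", "description": "Welcome"},
    "/contact": {"title": "Home", "description": "Get in touch"},
}

print(find_duplicates(pages, "title"))        # {'home': ['/', '/contact']}
print(find_duplicates(pages, "description"))  # {'welcome': ['/', '/about']}
```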