Robocops

By: Philip Nicosia

The robots.txt protocol, also called the “robots exclusion standard,” is designed to keep web spiders out of parts of a website. It is meant as a privacy measure, the equivalent of hanging a “Keep Out” sign on your door.

This protocol is used by website administrators when there are sections or files that they would rather not have accessed by the rest of the world. These could include employee lists or files that are being circulated internally. For example, the White House website has used robots.txt to keep spiders away from speeches by the Vice President, a photo essay of the First Lady, and profiles of the 9/11 victims.

How does the protocol work? The administrator lists the files and directories that shouldn’t be scanned in a plain-text file named robots.txt and places it in the top-level directory of the website. The robots.txt protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). For most of its history there was no official standards body or RFC behind the protocol (it was not published as an RFC until RFC 9309 in 2022), so it is difficult to legislate or mandate that the protocol be followed. In fact, the file is treated as strictly advisory, and there is no absolute guarantee that the listed contents won’t be read.
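As a rough illustration of how a cooperative spider consults the file, here is a minimal sketch using Python's standard `urllib.robotparser` module. The file contents, the paths, the `ExampleBot` user agent and the `example.com` URLs are all invented for the example; a real crawler would download the file from the site's top-level directory instead of reading it from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
Disallow: /staff-list.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A cooperative crawler asks before fetching each URL.
memo_ok = parser.can_fetch("ExampleBot", "https://www.example.com/internal/memo.html")
home_ok = parser.can_fetch("ExampleBot", "https://www.example.com/index.html")

print("internal memo may be fetched:", memo_ok)
print("home page may be fetched:", home_ok)
```

Nothing in this check stops a crawler that never performs it, which is why the file can only ever be advisory.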

In effect, robots.txt requires cooperation from the web spider and even the reader, since anything uploaded to a public website is publicly available. You aren’t locking anyone out of those pages; you are just asking them not to come in, and it takes very little for them to ignore these instructions. Computer hackers can also easily get at the files and retrieve information. So the rule of thumb is: if it’s that sensitive, it shouldn’t be on your website to begin with.

Care should be taken, however, to ensure that robots.txt doesn’t block search engine robots from areas of the website that you do want indexed. A mistake here will dramatically affect your search engine ranking, because search engines rely on their robots to count keywords, review meta tags, titles and crossheads, and register hyperlinks.

One misplaced character can have catastrophic effects. For example, robots.txt patterns are matched by simple prefix comparison, so make sure that patterns meant to match directories end with the final ‘/’ character. Otherwise, all files with names starting with that string will match, rather than just those in the intended directory.
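The effect of the missing ‘/’ can be seen in a small sketch, again using Python's standard `urllib.robotparser`, which compares each pattern against the start of the requested path. The `/private` paths, the `is_allowed` helper and the `ExampleBot` user agent are hypothetical and exist only for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_pattern: str, path: str) -> bool:
    """Check one path against a robots.txt holding a single Disallow rule."""
    rules = ["User-agent: *", f"Disallow: {disallow_pattern}"]
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch("ExampleBot", path)


# Pattern WITHOUT the trailing slash: blocks the directory, but also any
# other file whose name merely starts with "/private".
no_slash_directory = is_allowed("/private", "/private/report.html")
no_slash_lookalike = is_allowed("/private", "/private-events.html")

# Pattern WITH the trailing slash: blocks only what is inside the directory.
slash_directory = is_allowed("/private/", "/private/report.html")
slash_lookalike = is_allowed("/private/", "/private-events.html")

print("without slash:", no_slash_directory, no_slash_lookalike)
print("with slash:   ", slash_directory, slash_lookalike)
```

In this sketch the public page `/private-events.html` is hidden from spiders by the first pattern and visible again once the slash is added.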

To avoid these problems, consider running your site through a search engine spider simulator, also called a search engine robot simulator. These simulators, which can be bought or downloaded from the internet, use the same processes and strategies as different search engines and give you a “dry run” of how they will read your site. They will tell you which pages are skipped, which links are ignored, and which errors are encountered. Since the simulators also reenact how the bots follow your hyperlinks, you’ll see whether your robots.txt file is interfering with the search engine’s ability to read all the necessary pages.
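To give a feel for what such a dry run involves, the sketch below walks a tiny, made-up link graph the way a well-behaved spider might, and reports which pages were read and which were skipped. It is not how any particular simulator or search engine works: the `SITE_LINKS` map, the rules, the `dry_run` function and the `ExampleBot` user agent are all assumptions made for the example, and a real tool would fetch and parse live pages.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each path maps to the paths it links to.
SITE_LINKS = {
    "/": ["/products/", "/about.html", "/internal/"],
    "/products/": ["/products/widget.html"],
    "/products/widget.html": [],
    "/about.html": ["/"],
    "/internal/": ["/internal/memo.html"],
    "/internal/memo.html": [],
}

ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def dry_run(site_links, robots_txt, start="/", agent="ExampleBot"):
    """Walk the link graph breadth-first, honoring robots.txt.

    Returns (visited, skipped): pages the simulated spider read, and pages
    it found links to but was told not to fetch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    visited, skipped = [], []
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        if not parser.can_fetch(agent, path):
            skipped.append(path)
            continue  # links on a blocked page are never discovered
        visited.append(path)
        for link in site_links.get(path, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, skipped


visited, skipped = dry_run(SITE_LINKS, ROBOTS_TXT)
print("read:   ", visited)
print("skipped:", skipped)
```

Note that `/internal/memo.html` appears in neither list: because the page linking to it is blocked, the simulated spider never learns that it exists. The same thing happens to pages you do want indexed if they can only be reached through a blocked section.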

It’s also important to review your robots.txt files, so that you can spot and correct any problems before real search engines read them.

Author Bio

XML-Sitemaps.com provides free online tools for webmasters including a search engine spider simulator and a Google sitemaps XML validator.

Article Source: http://www.ArticleGeek.com - Free Website Content


The World of Duplicate Content - Use of a Filter

By: Aaron Brooks

The World Wide Web is like a marathon in which websites compete to reach the finish line first. In this case the finish line is a higher ranking, and in this race for supremacy it is important to avoid duplicate content and its penalties.

To keep their directories functioning efficiently, search engines have been armed with content filters. These filters remove duplicate content from the pages being indexed, and the most hurtful penalty for a filtered page is a lower ranking.

Unfortunately, these filters catch not only rogues but genuine web pages too. Webmasters need to understand how the filters function and know what action to take to avoid being filtered out.

When a search engine sends out spiders, the filters leave out or sieve:

• Websites that feature identical content, including sites where the webmaster has included many copies or versions of the same pages to cheat the search engines. Filters are also extremely sensitive to “doorway” pages.

• Content masked by different packaging. Known as “scraped content,” this duplication of pages with little or no relevant change falls prey to filters.

• Product descriptions featured by e-commerce sites. Most e-commerce sites publish the manufacturer’s description alongside a product, so the same content appears on countless e-commerce sites and falls victim to filters.

• Articles distributed widely over the net. While some engines are programmed to find the origin of an article, others may not be able to trace it.

• Pages that are not duplicates but contain the same core material written by different people.
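Search engines do not publish how their filters work, so the following is only a sketch of one common way to measure how alike two pieces of text are: split each text into overlapping three-word sequences ("shingles") and compute the Jaccard similarity of the two sets. The function names and the sample product descriptions are invented for the example, and nothing here should be read as the method of any real search engine.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical product descriptions.
original = "This sturdy steel widget resists rust and lasts for many years"
scraped = "This sturdy steel widget resists rust and lasts for many seasons"
rewritten = "Our reviewers found the widget held up well after a wet winter outdoors"

print("original vs scraped:  ", similarity(original, scraped))
print("original vs rewritten:", similarity(original, rewritten))
```

Changing a single word leaves most shingles intact, so the scraped copy still scores high, while a description written from scratch shares almost nothing with the original. That is the intuition behind writing content that is distinctively your own.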

To get the better of filters you need to:

• Use a tool like the Similar Page Checker (http://www.webconfs.com/similar-page-checker.php) to ensure that the pages on your site are not mirroring content from elsewhere. If other URLs carry similar or identical content, the tool will reveal them so that you can change your pages.

• Be vigilant and know who has “helped” themselves to your content. By using www.copyscape.com you can determine which websites have stolen or copied your work.

• Even if you do use distributed content, add commentary or make changes to the page that focus on its relevance to your site. By making the content your own you make it unique, which helps ensure that the pages are not filtered by search engines.

• If you are running an e-commerce site, include product descriptions that are distinctively yours and not run of the mill.

Learn as much as you can about duplicate content and its dangers. Read up on the issues discussed at the SES 2006 New York session and in other forums. Remember that search engines and directories such as Google, Yahoo, and the Open Directory Project do not want to be flooded with duplicate content and web pages.

Jake Baillie, President of TrueLocal, listed the duplicate content mistakes as: circular navigation; printer-friendly pages; inconsistent linking; product-only pages; transparent serving domains; and bad cloaking.

It is important for sites to get high ranking through fair and not foul means.

Author Bio
Aaron Brooks is a freelance writer for http://www.1888seoservices.com, the premier website to find SEO consulting, link building, professional SEO training, online marketing tips, SEO tools and more. He also freelances for the premier revenue-sharing discussion forum site http://www.1888discuss.com

Article Source: http://www.ArticleGeek.com - Free Website Content
