Robot Exclusion Standard
The robot exclusion standard is implemented through a simple text file called robots.txt. The file is your chance to ask search engines to treat certain areas of your website according to your directions.
Notice that I used the word ask. Obeying the robot exclusion standard is something that search engines do out of courtesy. There is no way to require all search engines to obey the information you put in your robots.txt file. That said, most reputable search engine spiders do obey the robot exclusion standard, so every website should have a robots.txt file in its root directory (the same place as index.html).
The robot exclusion standard also includes the use of a meta tag. We'll discuss that more in a bit.
The Robots Exclusion Standard - robots.txt Text File
The robot exclusion standard is implemented through the use of a text file.
The file must be named robots.txt and must be located in the root directory of your website, where your index.html file resides.
The following section explains what should be in your robots.txt file.
Robot Exclusion Standard - Allowing Access
Allowing All Robots Full Access
Let's say we want to allow all robots to visit all files on our website. Our robots.txt file would only contain the following two lines:
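A minimal sketch of that two-line file. The asterisk addresses every robot, and the empty Disallow value means nothing is off limits:

```text
User-agent: *
Disallow:
```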
Giving No Access to All Robots
To keep all robots out of all areas of a website, the robots.txt file should contain only the following two lines:
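A sketch of the two lines in question. The single slash stands for the root of the site, so everything beneath it is off limits:

```text
User-agent: *
Disallow: /
```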
If you do this, Google, MSN, Yahoo! and any other search engine that honors the standard will not spider your site, so your pages should not show up in their results. The only time I have used this is when I have a prototype of a website online and don't want the unfinished pages indexed.
Preventing a Directory From Being Spidered
I don't know about you, but I don't want my pictures, scripts or other private areas indexed by search engines. Let's say we don't want our images or cgi-bin directories indexed; the robots.txt file should contain:
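A sketch of such a file, assuming the two directories sit directly under the site root as /images/ and /cgi-bin/ (adjust the paths to match your own layout):

```text
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
```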
Because only these two directories are disallowed, spiders remain free to access and index every other directory.
Banning a Specific Crawler
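Perhaps you don't want a specific crawler to access part of your site. For readability, order the robots.txt file from the generic to the specific. Keep in mind that a crawler with its own User-agent group follows that group and ignores the generic * group. To ban Googlebot from our images directory, the robots.txt file would appear as follows: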
User-agent: * # First tell everyone what to do
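Only the opening line of the listing appears above. A sketch of how a complete file along these lines might read, assuming (hypothetically) that the general rule is the /cgi-bin/ block from the previous example and that Googlebot alone is kept out of /images/:

```text
User-agent: *            # First tell everyone what to do
Disallow: /cgi-bin/

User-agent: Googlebot    # Then give Googlebot its own rules
Disallow: /cgi-bin/      # Repeated, because Googlebot reads only its own group
Disallow: /images/
```

The general rule is repeated in the Googlebot group because a crawler that finds a group naming it does not also apply the rules in the * group.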
Banning one or more of the search engine spiders from your images directory is a good idea. Several search engines let users search for pictures taken from websites. The way the pictures are categorized and displayed can suggest that the images are freely available for use. Many people do not realize that these images are copyrighted and will unknowingly use them on their own websites.
This example also introduces a comment. The # (pound sign) marks the start of a comment. Once you start adding a number of exclusions for directories or specific bots, it is a good idea to comment the robots.txt file so you can remember what you intended each rule to do.
Protecting Files by Type with Robots.txt
Disallowing by file type relies on wildcards, which go beyond the original robot exclusion standard; at the time of writing, only Google and MSN supported them. The disallow command by file type uses a wildcard as a placeholder for the file name.
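A sketch of such a rule, using .jpg purely as an illustrative file type. In engines that support these extensions, the asterisk matches any run of characters and the dollar sign marks the end of the URL:

```text
User-agent: *
Disallow: /*.jpg$
```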
I realize that the command does not specifically mention either Google or MSN. By not naming specific engines, you let any search engine that recognizes wildcards apply the rule and skip files of the designated type.
Controlling Robots via a Robots Meta Tag
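Along with the robots.txt text file, webmasters can use a robots meta tag, placed in the head section of a page, to ask robots and crawlers to treat individual pages of a website differently from each other.
The two basic choices, index or noindex and follow or nofollow, combine into the four options below.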
Dear Robot, Please Index This Page
To ask a visiting robot to index the current webpage and follow all of its links:
<meta name="robots" content="index,follow" />
To ask a visiting robot to index a page, but not follow all of the links:
<meta name="robots" content="index,nofollow" />
Dear Robot, Please Don't Index This Page
To ask a visiting robot to not index a page, but follow the links on that page:
<meta name="robots" content="noindex,follow" />
To ask a visiting robot to essentially ignore a page by neither indexing it nor following any of its links:
<meta name="robots" content="noindex,nofollow" />
A Note on Revisit-After Meta Tag
The revisit-after meta tag is intended to ask a search engine to revisit a page after a defined number of days. This meta tag is not widely supported by search engines. A search engine from Canada has been credited with inventing this tag, and some SEO experts emphatically state that it is not supported by any other search engine. I mention it here for completeness only.
Unless you have bandwidth issues, I don't understand why you would want to tell a visiting search engine, "Go away, please don't index this webpage in your search results unless some arbitrary number of days has passed since your last visit."
Don't miss Part III on HTML Meta Tags!
Keyword Meta - learn the HTML code and some tips on using this often misused and misunderstood meta tag.
Did you miss Part I on HTML Meta Tags?
HTML Meta Tags - Learn about the HTML Meta Tags Content Type, Description and Author
Additional Information on Meta Tags
Several visitors to Help For Web Beginners have submitted follow-up questions on this article.
Be sure to visit Meta Tag FAQs to see their questions and our answers.
HelpForWebBeginners.com is a website dedicated to providing free, easy-to-understand, online How To's for true web beginners. While the materials are not free or available for reprints, they are offered freely for individual use. Please use the contact page to let Michele know if this tutorial has been helpful or if there are any other beginner web programming or MySpace-related tutorials you would like to see.