
Creating & Using robots.txt

What is robots.txt?

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. In this tutorial, we use the terms “robot”, “spider” and “crawler” interchangeably. Robots are often used by search engines to categorize and archive web sites. Robots and spiders are automated programs. Before they access the pages in your website, they determine if a robots.txt file exists that restricts them from accessing certain directories or pages.

The rules you set in the robots.txt file are requests to robots, not strictly enforceable directives. While reputable robots will obey robots.txt instructions, robots used by spammers and other disreputable people may not. Also, not all robots interpret the instructions the same way. For these reasons, do not use robots.txt as a method of protecting sensitive information from web robots; use password protected directories instead.
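
To see this convention from the crawler’s side, here is a minimal sketch of how a cooperating robot might consult robots.txt before requesting a page. It uses Python’s standard urllib.robotparser module; the site URL and the “ExampleBot” user-agent name are only placeholders for illustration.

# A minimal sketch (not a full crawler) of checking robots.txt before a fetch.
# "www.yoursite.com" and "ExampleBot" are placeholder values.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

# Ask whether this user-agent is allowed to fetch a given URL before requesting it.
if rp.can_fetch("ExampleBot", "http://www.yoursite.com/pictures/hawaii.html"):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt asks us not to crawl this page")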

Why Use robots.txt?

A robots.txt file should be used if you have pages or directories on your website that you don’t want search engines to crawl and index. These might include password protected directories or members-only areas, folders containing PHP or CGI scripts or databases, and folders containing administrative or redirect pages. If you want the search engine spiders to index everything on your website, you don’t need to create a robots.txt file at all.

Robots that obey robots.txt will not crawl the pages you restrict, but those pages may still end up indexed if other websites link to them. If you really don’t want the pages indexed anywhere, make sure no other website links to them.

Where To Put robots.txt

When a robot looks for the robots.txt file for a URL, it removes the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place. For example, with “http://www.yoursite.com/pictures/hawaii.html”, it removes the “/pictures/hawaii.html” path and replaces it with “/robots.txt”. The resulting URL is “http://www.yoursite.com/robots.txt”. You need to put the file in the right place on your web server so that the resulting URL works. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web host. Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.
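
If you want to check that rewrite yourself, the small sketch below (using Python’s standard urllib.parse module, with the example URL from above) performs the same transformation:

# Derive the robots.txt URL from any page URL by dropping the path,
# query and fragment and substituting "/robots.txt".
from urllib.parse import urlsplit, urlunsplit

page = "http://www.yoursite.com/pictures/hawaii.html"
parts = urlsplit(page)
robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
print(robots_url)  # prints: http://www.yoursite.com/robots.txt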

How To Create robots.txt

The robots.txt file is a simple text file and is not at all difficult to create. All you need is Notepad, TextEdit, emacs or another free text editor. Don’t use a word processing program; the file needs to be plain text. Also, be careful of typos. If the command is incorrect, the robots will ignore it. The following guide will help you create a robots.txt file.

There are two main commands in a robots.txt file: User-agent and Disallow. Keep in mind that the file and directory paths you list are case sensitive. We will go over other commands later on.

  • User-agent: robot to which the following rule applies
  • Disallow: URL or file you want to block

To block robot1 from accessing your images folder, you would enter the following.

User-agent: robot1
Disallow: /images/

To disallow robot1 from your images folder, your cgi-bin folder, a specific page in your main directory, and a page in your temp directory:

User-agent: robot1
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /personal_page.html
Disallow: /temp/junk.html

Now that you know how to disallow a robot from specific areas of your site, restricting multiple robots is just as easy. All you need to do is add another User-agent line, followed by the specific files and folders you want to exclude. Let’s expand on our previous example. Note that you need a separate “Disallow” line for every URL prefix you want to exclude; you cannot use “Disallow: /cgi-bin/ /temp/” on a single line. Also, you may not have blank lines within a record, as blank lines are used to delimit multiple records. You can have a blank line when moving to the next group of exclusions.

User-agent: robot1
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /personal_page.html
Disallow: /temp/junk.html

User-agent: robot2
Disallow: /cgi-bin/
Disallow: /personal_page.html
Disallow: /downloads/theme_song.mp3

So how do we apply restrictions globally? The following examples will show you how. First, we’ll start with applying restrictions to all robots. For that capability, the “*” is used. The following robots.txt will restrict all robots from accessing the specified files and folders. Remember that all files and directories are case sensitive!

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /personal_page.html

To block the whole site, use a forward slash. You may want to do this for specific bots,

User-agent: NastyBot
Disallow: /

or for a site you don’t want indexed by any search engines.

User-agent: *
Disallow: /

The usefulness of restricting a single robot is debatable. If the robot was annoying or malicious enough to get your attention, chances are it won’t obey the rules anyway. However, some programmers might make it follow the rules just to see if you are paying attention. As we mentioned above, if you have sensitive information you don’t want indexed, put it in a password protected directory and don’t link to it from anywhere.

Other robots.txt Commands

Unless otherwise specified below, wildcards and pattern matching are not supported in either the User-agent or Disallow lines. The “*” in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”. Some crawlers, like Googlebot, Yahoo! Slurp and MSNbot, do recognize strings containing “*”, while Teoma (Ask.com) and other robots interpret it in different ways.

Googlebot, Yahoo! Slurp and MSNbot
The “*” wildcard commands in robots.txt will be recognized by the Google, Yahoo! and MSN robots. The following example shows how you can allow all directories that begin with “products” to be crawled, while disallowing any HTML files whose URLs contain “admin” and any URLs containing “?usercart” (the last rule is useful for blocking scripted pages such as shopping carts or redirects).

User-Agent: *
Allow: /products*/
Disallow: /*admin*.html
Disallow: /*?usercart

Specifically for Yahoo! Slurp, using a * at the end of a directory is redundant since the robot already behaves that way. Therefore, “Disallow: /admin*” and “Disallow: /admin” are seen as equivalent to the Slurp robot.

The pattern matching wildcard “$” can be used to further refine exclusions in your robots.txt file. The following examples show you the different ways this can be used. First, to disallow all files of a particular extension or containing a specific character, you would do this:

User-Agent: *
Disallow: /*.gif$
Disallow: /*?$

The first command disallows all files ending in “.gif”; you can use this with any file extension you want to exclude from indexing. Without the “$”, you would be disallowing all files containing “.gif”, which could lead to unintentional blocking. The second command excludes all URLs ending in “?”. To exclude all URLs that contain “?” anywhere in the string, leave the “$” off the end. This can be useful to keep robots from indexing pages generated by CGI or PHP scripts, such as shopping cart pages and redirects. All of these wildcards work with both the “Allow:” and “Disallow:” directives.

Crawl-delay directive: Yahoo! Slurp and MSNbot support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server. Google does not support this parameter but does allow you to control crawl speed via Google Webmaster Tools. For example:

User-agent: *
Crawl-delay: 10
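
From the robot’s side, honoring this directive simply means pausing that many seconds between requests. Here is a hedged sketch assuming Python’s urllib.robotparser (its crawl_delay() method requires Python 3.6 or later); the site URL and page list are placeholders:

# Sketch of a polite fetch loop that respects a Crawl-delay, if one is set.
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")
rp.read()

delay = rp.crawl_delay("*")  # returns None if no Crawl-delay applies
for url in ["http://www.yoursite.com/page1.html",
            "http://www.yoursite.com/page2.html"]:
    if rp.can_fetch("*", url):
        pass  # the actual page request would go here
    if delay:
        time.sleep(delay)  # wait between requests, as the site asked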

Allow directive: Some major crawlers support an Allow directive, which can counteract a following Disallow directive. This is useful when you disallow an entire directory but still want some HTML documents in that directory crawled and indexed. By the standard implementation the first matching robots.txt pattern always wins, but Google’s implementation differs in that it first evaluates all Allow patterns and only then all Disallow patterns. However, in order to be compatible with all robots, if you want to allow single files inside an otherwise disallowed directory, you need to place the Allow directive(s) first, followed by the Disallow, for example:

User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/

This example will disallow anything in /folder1/ except /folder1/myfile.html, since the latter will match first. In Google’s case, though, the order is not important.
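
One way to sanity-check how a parser resolves this ordering is to feed the example to Python’s urllib.robotparser, as in the sketch below. This only shows how that one parser interprets the rules; as noted above, real crawlers may differ.

# Parse the example rules and test which paths are allowed for "*".
from urllib import robotparser

rules = """\
User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/folder1/myfile.html"))  # True: the Allow line matches first
print(rp.can_fetch("*", "/folder1/other.html"))   # False: blocked by Disallow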

Sitemap: Some robots, including Googlebot, Yahoo! Slurp, and MSNbot, support a Sitemap directive, allowing a single sitemap reference (for smaller sites) or multiple sitemaps in the same robots.txt.

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/personal_page.html
Sitemap: http://www.yoursite.com/sitemap_index.xml
Sitemap: http://www.yoursite.com/hugeFolder/sitemap_big.xml
Sitemap: http://www.yoursite.com/otherFolder/sitemap_other.xml

Google-Specific robots.txt Commands

The following is quoted directly from Google’s webmaster help.

“To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn’t share pages with the other Google user-agents. For example:”

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

Google also allows you to exclude specific images or all of your images from being indexed in Google Images. The following examples will show each, respectively.

User-agent: Googlebot-Image
Disallow: /images/monkey.jpg

User-agent: Googlebot-Image
Disallow: /

To create your final robots.txt file, use the commands above as your needs dictate. Remember, since Google is the granddaddy of all search engines, and Yahoo and MSN are right behind them, it may be a good idea to make lines specific to each of them based on the rules defined above. If your exclusions are simple (a file, a couple directories, etc.) and don’t need wildcards, you don’t have to make separate entries for each search engine. Those exclusion rules fall within the standard robots.txt definitions and should be interpreted the same way by all reputable crawlers.

Robots META Tag

Like robots.txt, the robots META tag has become a de facto standard. The META tag is also described in the HTML 4.01 specification, Appendix B.4.1. You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not to scan it for links to follow. For example:

<html>
<head>
<title>…</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>

There are three important considerations when using the robots META tag:

  • Robots can ignore your META tag, especially malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers.
  • The NOFOLLOW directive only applies to links on the page where you use the META tag. It’s entirely possible that a robot will find the same links on another page without a NOFOLLOW tag (perhaps on some other site) and still arrive at your undesired page.
  • The NOFOLLOW META tag is not the same as the rel="nofollow" link attribute. Be careful not to confuse the two.

Like any META tag, this should be placed in the HEAD section of an HTML page, as in the example above. You should put it in every page on your site, because a robot can encounter a deep link to any page on your site. The “NAME” attribute must be “ROBOTS”. Valid values for the “CONTENT” attribute are: “INDEX”, “NOINDEX”, “FOLLOW”, “NOFOLLOW”. Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots META tag, the default is “INDEX,FOLLOW”, so there’s no need to spell that out. That leaves:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Which of the above directives you use in the robots META tag is based entirely on how you want the robots to behave when they arrive at that page. The directives do not have to be in all caps; they can be all lowercase as well.

“nofollow” Link Attribute

This link attribute provides a method of telling search engine robots not to follow a specific link. It was introduced by Google, Yahoo! and MSN and is widely followed. While these crawlers will follow the command, other search engines and malicious spiders may not. Below is an example of this attribute in use.

<a href="admin.php" rel="nofollow">Administrator Login</a>

Many blogs (like Blogger, for example) automatically insert this attribute after any link posted in a comment. This prevents spammers from using your blog to increase their Google PageRank. For message boards and guestbooks, you can disable HTML in user posts. While this won’t prevent spammers from posting, it will prevent them from placing their links on your forum, guestbook or blog.

To quote (again) directly from Google Webmaster Help:

How does Google handle nofollowed links?

We don’t follow them. This means that Google does not transfer PageRank or anchor text across these links. Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap. Also, it’s important to note that other search engines may handle nofollow in slightly different ways.

What are Google’s policies and some specific examples of nofollow usage?

Here are some cases in which you might want to consider using nofollow:

Untrusted content: If you can’t or don’t want to vouch for the content of pages you link to from your site — for example, untrusted user comments or guestbook entries — you should nofollow those links. This can discourage spammers from targeting your site, and will help keep your site from inadvertently passing PageRank to bad neighborhoods on the web. In particular, comment spammers may decide not to target a specific content management system or blog service if they can see that untrusted links in that service are nofollowed. If you want to recognize and reward trustworthy contributors, you could decide to automatically or manually remove the nofollow attribute on links posted by members or users who have consistently made high-quality contributions over time.

Paid links: A site’s ranking in Google search results is partly based on analysis of those sites that link to it. In order to prevent paid links from influencing search results and negatively impacting users, we urge webmasters to use nofollow on such links. Search engine guidelines require machine-readable disclosure of paid links in the same way that consumers online and offline appreciate disclosure of paid relationships (for example, a full-page newspaper ad may be headed by the word “Advertisement”). More information on Google’s stance on paid links.

Crawl prioritization: Search engine robots can’t sign in or register as a member on your forum, so there’s no reason to invite Googlebot to follow “register here” or “sign in” links. Using nofollow on these links enables Googlebot to crawl other pages you’d prefer to see in Google’s index. However, a solid information architecture — intuitive navigation, user- and search-engine-friendly URLs, and so on — is likely to be a far more productive use of resources than focusing on crawl prioritization via nofollowed links.

 
