

Robots.txt Reference

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention used to keep cooperating search engine 'bots' (robots) and 'spiders' out of all or part of a website, depending upon the contents of a robots.txt file.

These robots are used by search engines such as Google, Yahoo! and Bing to index web sites, or by webmasters to proofread source code. The standard is unrelated to, but can be used in conjunction with, sitemaps, a robot inclusion standard for websites.

A sitemap and robots.txt file are essentially opposites:
  • A sitemap tells a search engine where to go to index a site
  • A robots.txt file tells a search engine bot where NOT to go!

How Robots.txt files are used
The webmaster simply creates a text file called robots.txt and puts it in the site's root directory. For example, the URL for the robots.txt file for this site is:

http://www.RobotsGenerator.com/robots.txt

This text file should contain the instructions in a specific format - precisely the way our generator creates the file for you. Obedient robots will parse this file and read the instructions before fetching any other file from the web site. If there is no robots.txt file, the bot will assume that anything goes and crawl as it pleases.

 

Spambots
Spambots and other malevolent spiders will simply ignore this file, so it offers no protection from spam-harvesting bots.

A robots.txt file on a website tells the specified robots not to crawl or index the specified files or directories when they visit the site. Some of the reasons a website owner might do this include:

  • keeping certain pages out of search engine results for privacy,
  • excluding content that is misleading or irrelevant to the site as a whole,
  • ensuring an application operates only on certain data,
  • keeping images or media files out of search results,
  • keeping login pages from being crawled

 

Subdomains
If your site has subdomains, each subdomain must have its own robots.txt file; the file on the main domain does not apply to its subdomains. Of course, a subdomain does not need one if you want robots to crawl the entire subdomain.

 

Advisory nature of the protocol
The robots.txt protocol is purely advisory: it works only for robots that cooperate and pay attention to its instructions. Creating a robots.txt file and telling robots not to go somewhere does not give you any kind of privacy! Everything you put on the web is publicly available unless you use password protection - which is beyond the scope of this site.

 

Official Standard for the robots.txt protocol:
There is none!

It was created by consensus in 1994 by members of the robots mailing list (robots-request@nexor.co.uk). The parts of the site that should not be accessed are specified in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by simple substring comparisons, so care should be taken to ensure that patterns intended to match directories have the final '/' character appended; otherwise all files with names starting with that substring will match, rather than just those in the intended directory.
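
To illustrate (the directory name is only a placeholder):

Disallow: /private     # matches /private/, /private.html and /private-files/
Disallow: /private/    # matches only URLs inside the /private/ directory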

 

Robots.txt file examples

The wildcard "*" specifies all robots, so the following file allows all robots to visit all files:
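
User-agent: *
Disallow: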

 

This file keeps all robots out:
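
User-agent: *
Disallow: /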

 

This file tells all crawlers not to enter four directories of a website:
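
(The directory names below are only placeholders.)

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/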

 

Example that tells a specific crawler not to enter one specific directory (the directory shown below is just a placeholder):


# replace the 'BadBot' with the actual user-agent of the bot
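User-agent: BadBot
Disallow: /private/    # '/private/' here is just a placeholder directory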

 

Example that tells all crawlers not to enter one specific file:
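
(The file path below is only a placeholder.)

User-agent: *
Disallow: /directory/file.html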


(Other files in the specified directory will be indexed)

 

Comments are designated with the hash symbol "#".
# Comments appear after the "#" symbol at the start of a line, or after a directive
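User-agent: *    # match all bots
Disallow: /      # keep them out of the entire site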

 

Non-Standard Instructions

Crawl-delay directive

Several major robots support the Crawl-delay parameter. This is set to the number of seconds to wait between successive requests to the server.
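
For example (the 10-second value is arbitrary):

User-agent: *
Crawl-delay: 10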

 

The Allow Directive

Some major crawlers support the Allow directive, which can counteract a following Disallow directive. You can use this when you want a whole directory disallowed except for one or two HTML documents. Googlebot behaves differently from most other bots when it encounters this directive: it first evaluates all Allow patterns and only then all Disallow patterns. Bing obeys whichever Allow or Disallow directive is the most specific.

The right way to do this so that it works for all robots is to place the Allow directive(s) first, followed by the Disallow, as shown below:
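
(The directory and file names below are only placeholders.)

User-agent: *
Allow: /directory1/myfile.html
Disallow: /directory1/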

 

Sitemap

Most major robots support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt as shown below. Our generator will create this in the proper form for you.
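
For example (the URLs below are placeholders):

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap-news.xml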

 


Extended Standard for Robot Exclusion
An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:
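
(The directory and the values below are only placeholders.)

User-agent: *
Disallow: /downloads/
Request-rate: 1/5         # maximum rate: one page every 5 seconds
Visit-time: 0600-0845     # only visit between 06:00 and 08:45 UTC (GMT)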

 

Other Alternatives

Meta tags can also give instructions to web crawlers.
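
For instance, a robots meta tag placed in a page's <head> section asks crawlers not to index that page or follow its links:

<meta name="robots" content="noindex, nofollow" />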

Visit the meta tag generator to generate meta tags for your site.

      


