The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for asking search engine 'bots' (robots) and 'spiders' not to access all or part of a website, depending on the contents of a robots.txt file.
These robots are used by search engines such as Google, Yahoo! and Bing to index web sites, or by webmasters to proofread source code. The standard is unrelated to, but can be used in conjunction with, sitemaps, a robot inclusion standard for websites.
A sitemap and robots.txt file are essentially opposites:
- A sitemap tells a search engine where to go to index a site
- A robots.txt file tells a search engine bot where NOT to go!
How Robots.txt files are used
The webmaster simply creates a text file called robots.txt and places it in the site's root directory. For example, the URL for the robots.txt file for this site is:
This text file should contain the instructions in a specific format - precisely the way our generator creates this file for you. Obedient robots will parse this file and read the instructions before fetching any other file from the web site. If there is no robots file, then the bot will assume that anything goes and do as it pleases.
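The check that an obedient robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the robot name ExampleBot, the host www.example.com, the rules, and the helper may_fetch are assumptions, and a real crawler would download the file with set_url() and read() instead of passing the text in directly.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; an empty file yields no rules.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: keep every robot out of /private/.
rules = "User-agent: *\nDisallow: /private/\n"

print(may_fetch(rules, "ExampleBot", "https://www.example.com/private/a.html"))  # False
print(may_fetch(rules, "ExampleBot", "https://www.example.com/index.html"))      # True
# With no rules at all, the robot assumes that anything goes.
print(may_fetch("", "ExampleBot", "https://www.example.com/private/a.html"))     # True
```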
Spambots and other malevolent spiders will simply ignore this file, so it offers no protection from spam-harvesting bots.
A robots.txt file on a website tells the specified robots to ignore, and not index, the specified files or directories when they crawl the site. Reasons a website owner might do this include:
If your site has subdomains, each subdomain must have its own robots.txt file. Of course, a subdomain does not need one if you are happy for robots to crawl all of it.
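Because the file is looked up per host, a robot derives its location from the host name of the page it wants to fetch. A minimal sketch follows; the helper robots_url and the example.com host names are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/shop/item?id=7"))
print(robots_url("https://blog.example.com/posts/1"))
```

The two pages live on different hosts, so each is governed by a different robots.txt file.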
Advisory nature of the protocol
The robots.txt protocol is purely advisory. It works only for robots that cooperate and pay attention to instructions. Creating a robots.txt file and telling robots not to go somewhere does not give you any kind of privacy! Everything you put on the web is publicly available unless you use password protection - which is beyond the scope of this site.
Official Standard for the robots.txt protocol:
For most of its history there was none! The protocol was only formalized by the IETF in 2022, as RFC 9309.
It was created by consensus in 1994 by members of the robots mailing list (email@example.com). The parts of a site that should not be accessed are specified in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by simple prefix comparison against the start of the URL path, so take care that patterns meant to match directories end with the final '/' character; otherwise all files with names starting with that string will match, rather than just those in the intended directory.
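The effect of the trailing '/' can be seen in a small sketch using Python's standard urllib.robotparser, which compares each pattern against the start of the path. The helper is_blocked, the robot name ExampleBot and the /tmp paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(disallow_pattern: str, path: str) -> bool:
    """Return True if a single Disallow pattern blocks the given path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_pattern}"])
    return not parser.can_fetch("ExampleBot", path)

# Without the trailing slash the pattern also catches /tmp.html.
print(is_blocked("/tmp", "/tmp/file.html"))   # True
print(is_blocked("/tmp", "/tmp.html"))        # True
# With the trailing slash only the directory is affected.
print(is_blocked("/tmp/", "/tmp/file.html"))  # True
print(is_blocked("/tmp/", "/tmp.html"))       # False
```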
The wildcard "*" specifies all robots, so this file would allow all robots to visit all files.
This file keeps all robots out:
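The example files for these two cases are not shown in this copy of the page. The sketch below assumes the conventional forms (an empty Disallow value to allow everything, a bare "/" to block everything) and checks them with Python's standard urllib.robotparser. The helper load and the robot name ExampleBot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def load(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of a live URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# An empty Disallow value blocks nothing, so every robot may visit every file.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# A bare "/" matches every path, so every robot is kept out.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

allow_all_result = load(ALLOW_ALL).can_fetch("ExampleBot", "/any/page.html")
deny_all_result = load(DENY_ALL).can_fetch("ExampleBot", "/any/page.html")
print(allow_all_result, deny_all_result)  # True False
```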
This file tells all crawlers not to enter four directories of a website:
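The original four directory names are not preserved here, so the sketch below uses hypothetical ones (/cgi-bin/, /images/, /tmp/ and /private/) to show the shape of such a file and how Python's standard urllib.robotparser reads it. The robot name ExampleBot is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# One Disallow line per directory; the names are placeholders.
FOUR_DIRS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(FOUR_DIRS.splitlines())

paths = [
    "/cgi-bin/form.pl",
    "/images/logo.png",
    "/tmp/cache.txt",
    "/private/notes.html",
    "/index.html",
]
# Collect the paths a cooperating robot would skip.
blocked = [p for p in paths if not parser.can_fetch("ExampleBot", p)]
print(blocked)
```

Only /index.html falls outside the four listed directories, so it is the only path left out of the blocked list.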
Example that tells a specific crawler not to enter one specific directory:
# replace the 'BadBot' with the actual user-agent of the bot
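A sketch of such a file, checked with Python's standard urllib.robotparser. BadBot is the placeholder name used above; the directory /private/ and the second robot name OtherBot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules that apply to one named crawler only.
ONE_BOT = """\
User-agent: BadBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ONE_BOT.splitlines())

badbot_allowed = parser.can_fetch("BadBot", "/private/page.html")
otherbot_allowed = parser.can_fetch("OtherBot", "/private/page.html")
print(badbot_allowed, otherbot_allowed)  # False True
```

Robots that are not named, and that have no "*" entry to fall back on, are not restricted at all.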
Example that tells all crawlers not to enter one specific file:
(Other files in the specified directory will be indexed)
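A sketch of the single-file case, assuming the hypothetical path /directory/file.html and the robot name ExampleBot, checked with Python's standard urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# Block exactly one file; the path is a placeholder.
ONE_FILE = """\
User-agent: *
Disallow: /directory/file.html
"""

parser = RobotFileParser()
parser.parse(ONE_FILE.splitlines())

file_allowed = parser.can_fetch("ExampleBot", "/directory/file.html")
sibling_allowed = parser.can_fetch("ExampleBot", "/directory/other.html")
print(file_allowed, sibling_allowed)  # False True
```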
Comments are designated with the hash symbol "#".
# Comments appear after the "#" symbol at the start of a line, or after a directive
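Both comment positions can be tried out with Python's standard urllib.robotparser, which discards everything from "#" to the end of the line. The rules and the robot name ExampleBot below are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

WITH_COMMENTS = """\
# A comment on a line of its own
User-agent: *   # a comment after a directive
Disallow: /     # keep all robots out
"""

parser = RobotFileParser()
parser.parse(WITH_COMMENTS.splitlines())

# The comments are ignored, so the file behaves like a plain "Disallow: /".
comment_result = parser.can_fetch("ExampleBot", "/index.html")
print(comment_result)  # False
```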
Several major robots support the Crawl-delay parameter. This is set to the number of seconds to wait between successive requests to the server.
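A sketch of a file with this parameter, read with Python's standard urllib.robotparser (crawl_delay() is available from Python 3.6). The ten-second value and the robot name ExampleBot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Ask all robots to wait 10 seconds between requests.
WITH_DELAY = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(WITH_DELAY.splitlines())

delay = parser.crawl_delay("ExampleBot")
print(delay)  # 10
```

A polite crawler would sleep for that many seconds between successive requests to the same server.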
Some major crawlers support the Allow directive, which can counteract a following Disallow directive. You can use this in a situation where you want a whole directory disallowed except for one or two HTML documents. Googlebot behaves differently when it encounters this directive: unlike most other bots, it first evaluates all Allow patterns and only then all Disallow patterns. Bing obeys whichever Allow or Disallow directive is the most specific.
The right way to do this so that it works for all robots is to place the Allow directive(s) first, followed by the Disallow as shown below:
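Why the order matters can be sketched with Python's standard urllib.robotparser, which applies the first rule that matches. The paths under /folder/, the helper allowed and the robot name ExampleBot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, path: str) -> bool:
    """Return True if the rules let ExampleBot fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)

# Recommended order: the exception first, then the general rule.
ALLOW_FIRST = """\
User-agent: *
Allow: /folder/page.html
Disallow: /folder/
"""

# Reversed order: a first-match parser never reaches the Allow line.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
"""

print(allowed(ALLOW_FIRST, "/folder/page.html"))     # True
print(allowed(ALLOW_FIRST, "/folder/other.html"))    # False
print(allowed(DISALLOW_FIRST, "/folder/page.html"))  # False
```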
Most major robots support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt as shown below. Our generator will create this in the proper form for you.
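A sketch of a file with two Sitemap lines, read with Python's standard urllib.robotparser (site_maps() is available from Python 3.8). The sitemap URLs on www.example.com are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sitemap lines are independent of any User-agent group.
WITH_SITEMAPS = """\
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(WITH_SITEMAPS.splitlines())

sitemaps = parser.site_maps()
print(sitemaps)
```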
Extended Standard for Robot Exclusion
An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:
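A sketch of a file using these directives. The values 1/5 and 0600-0845 are illustrative assumptions, usually read as one request per five seconds and a visiting window of 06:00 to 08:45. Python's standard urllib.robotparser understands Request-rate (from Python 3.6) and silently skips Visit-time, which is a reminder that support for the extended directives varies between robots.

```python
from urllib.robotparser import RobotFileParser

EXTENDED = """\
User-agent: *
Disallow: /private/
Request-rate: 1/5
Visit-time: 0600-0845
"""

parser = RobotFileParser()
parser.parse(EXTENDED.splitlines())

# request_rate() returns a named tuple with 'requests' and 'seconds' fields.
rate = parser.request_rate("ExampleBot")
print(rate.requests, rate.seconds)  # 1 5
```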
Meta tags can also give instructions to web crawlers.
Visit the meta tag generator to generate meta tags for your site.