Robots.txt Tutorial
Search engines use spider programs, also known as robots or crawlers, to create the indexes for their search databases.
These crawlers follow links to find new and updated pages for the search
engine. Before a website is indexed by a search engine's spider, the
special file named "robots.txt" is first retrieved from the website's
document root. So for example, if a search engine's robot is about to
index http://example.com, it will first fetch http://example.com/robots.txt.
The format is defined by the Robot Exclusion Standard, and RoboGen is an
editor for these files.
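As a rough illustration of where a crawler looks for the file, the following Python sketch derives the robots.txt address from any page address on the same site. The function name robots_url is hypothetical and chosen only for this example; it assumes the file always lives at the document root of the page's scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path, query, and fragment are
    # dropped because robots.txt is fetched from the document root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://example.com"))
print(robots_url("http://example.com/projects/index.html?page=2"))
```

Both calls print http://example.com/robots.txt, since only the scheme and host of the page address are kept.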
Simple Example
User-agent: gooblebot
Disallow: /images/
Disallow: /projects/
Disallow: /contact.html

# This is a comment
User-agent: *
Disallow: /support/
Disallow: /contact.html
LDIF Tag Format and Comments
All tags are specified in an LDIF-style format: a name, followed by a colon
(:), followed by the value. Only one tag can appear per line. Lines
beginning with a pound sign (#) are comments and are ignored.
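The line format described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete parser: the function name parse_tag_line is hypothetical, and lowercasing the tag name is an assumption made here so that "User-agent" and "user-agent" compare equal.

```python
def parse_tag_line(line: str):
    """Return (name, value) for a tag line, or None for comments and blanks."""
    text = line.strip()
    # Lines beginning with a pound sign are comments; blank lines carry no tag.
    if not text or text.startswith("#"):
        return None
    # A tag needs a colon separating the name from the value.
    if ":" not in text:
        return None
    # Split on the first colon only, so values may contain colons themselves.
    name, value = text.split(":", 1)
    # Assumption: tag names are compared case-insensitively.
    return name.strip().lower(), value.strip()


for sample in ["User-agent: *", "Disallow: /images/", "# This is a comment"]:
    print(repr(sample), "->", parse_tag_line(sample))
```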
User-agent
The User-agent tag starts a rule section for a particular spider
program. The special * user-agent applies to all spiders, except the
spiders for which specific sections exist.
Almost all programs that access web pages, including Internet Explorer and
Mozilla, have user-agent names, which can often be seen in the web access
log. Defining sections in a robot exclusion file for web browsers and other
user-agents that are not automated spider programs has no effect.
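To see how a named section takes precedence over the * section, here is a small sketch using the robots.txt parser in Python's standard library (urllib.robotparser). The rules are a shortened version of the example file in this tutorial, and "otherbot" is a made-up spider name used only to exercise the * section.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the tutorial's example file.
rules = """\
User-agent: gooblebot
Disallow: /images/

User-agent: *
Disallow: /support/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# gooblebot has its own section, so only that section's rules apply to it.
goo_images = parser.can_fetch("gooblebot", "http://example.com/images/logo.png")
goo_support = parser.can_fetch("gooblebot", "http://example.com/support/faq.html")

# otherbot has no section of its own, so it falls back to the * section.
other_images = parser.can_fetch("otherbot", "http://example.com/images/logo.png")
other_support = parser.can_fetch("otherbot", "http://example.com/support/faq.html")

print(goo_images, goo_support)      # False True
print(other_images, other_support)  # True False
```

Note that gooblebot may fetch /support/ here: because a specific section exists for it, the * section is not consulted at all.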
Disallow
A user-agent section contains one or more disallow lines. Each line
specifies a file or directory which is not to be indexed by the specified
crawler program.
For example, to block the contact.html page in the document root from being
included in an index, use the following. Note that the path begins with a
slash:
Disallow: /contact.html
To block the contents of the images directory from being indexed, use the
following:
Disallow: /images/
According to the robot exclusion standard, disallow rules are treated as path
prefixes. Disallowing /images therefore blocks every path that starts with
/images, so neither /images.html nor /images/something.jpg would be indexed
by a spider that honors the file.
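The prefix rule can be illustrated with a short Python sketch. The function name is_disallowed is hypothetical, and the check is deliberately simplified to a plain string-prefix comparison. An empty rule is skipped here, on the understanding that an empty Disallow value blocks nothing.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if path starts with any of the disallow prefixes."""
    # Empty rules are skipped: an empty Disallow value blocks nothing.
    return any(path.startswith(rule) for rule in disallow_rules if rule)


# "/images" (no trailing slash) matches both a file and a directory.
print(is_disallowed("/images.html", ["/images"]))            # True
print(is_disallowed("/images/something.jpg", ["/images"]))   # True

# "/images/" (with trailing slash) matches only the directory contents.
print(is_disallowed("/images.html", ["/images/"]))           # False
print(is_disallowed("/images/something.jpg", ["/images/"]))  # True
```

The trailing slash is therefore significant: use /images/ when only the directory should be blocked.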
Not a Security Mechanism
It is very important to remember that robots.txt files provide absolutely no
security for a web site. Spiders operated by some groups, such as
spammers, will simply ignore the contents of the file. Perhaps worse,
spammers and hackers can read the robots.txt file to find addresses that
might otherwise have remained hidden. If you have protected content, you
must use actual security mechanisms such as, but not limited to,
password-protected directories. Such measures are outside the scope of the
robot exclusion standard and this tutorial. Consider yourself reminded!