The "robots.txt" File

As a website owner you can use a small file called "/robots.txt" to tell web crawlers which pages on your site you want them to visit or, more importantly, which ones you don't want visited!

About Web Robots

Web Robots, often called Web Wanderers, Crawlers, or Spiders, are software programs that traverse the Web automatically. Search engines such as Google use them to index the content of websites, so you generally want them to visit your site and record details of all its pages for their indexes.

Unfortunately, spammers and other abusers also use robots to scan sites. "Bad" crawlers can simply ignore your "robots.txt"; in particular, malware crawlers that scan the web for security vulnerabilities and the email address harvesters used by spammers will pay no attention to it.

The "robots.txt" file is publicly available, so anyone can see which sections of your server you don't want robots to use. What you want to exclude (from the crawlers' view) depends on your site and server. Everything not explicitly disallowed is considered fair game to retrieve.

If you do not want your web site, or specific pages or directories on it, to be indexed, use a "/robots.txt" file.

If you have no problem with every page on your site being indexed, then you do not really need a "/robots.txt" file.

What to put in the "robots.txt" file

The "/robots.txt" file is a text file containing one or more records. It usually contains a single record that looks like this:

  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /~joe/

In this example, three directories are excluded: cgi-bin, tmp and ~joe. The "User-agent: *" line means this record applies to all robots. You need a separate "Disallow" line for every URL prefix you want to exclude; you cannot write "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you cannot have blank lines within a record, as they are used to delimit multiple records.
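To see how a well-behaved robot might interpret this record, here is a minimal sketch using Python's standard urllib.robotparser module. The host name www.example.com, the robot name "ExampleBot" and the helper allowed() are assumptions made up for the illustration; real crawlers may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# The single record from the example, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` on the (hypothetical) example host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/tmp/scratch.html", "/~joe/index.html"):
        print(path, "allowed" if allowed(path) else "excluded")
```

Because the record uses "User-agent: *", the answer is the same whatever robot name is passed in.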

What NOT to put in the "robots.txt" file

Globbing and regular expressions are not part of the original robots.txt standard, in either the User-agent or Disallow lines. The "*" in the User-agent field is a special value meaning "any robot". Some major search engines have since added limited wildcard support, but you cannot rely on every robot to understand it, so to be safe avoid lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
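The reason such lines fail is that a robot following the original standard treats each Disallow value as a literal URL prefix. The sketch below illustrates that rule only; is_excluded() is a hypothetical helper, not the code of any particular crawler.

```python
def is_excluded(path, disallow_prefixes):
    """Return True if `path` starts with any non-empty Disallow prefix.

    Each prefix is compared literally, so a "*" in a prefix is just an
    ordinary character and does not act as a wildcard.
    """
    return any(
        bool(prefix) and path.startswith(prefix)
        for prefix in disallow_prefixes
    )


if __name__ == "__main__":
    print(is_excluded("/tmp/report.html", ["/tmp/"]))    # plain prefix
    print(is_excluded("/tmp/report.html", ["/tmp/*"]))   # "*" taken literally
    print(is_excluded("/images/logo.gif", ["*.gif"]))    # "*" taken literally
```

Under this rule "Disallow: /tmp/" already covers everything below /tmp/, so the trailing "*" adds nothing and can stop the line from matching at all.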

Where to put the "robots.txt" file

As a web site owner you need to put the file in the right place on your web server so that it can be fetched from the top level of your site, for example http://www.example.com/robots.txt. Usually that is the same place where you put your web site's main "index.html" welcome page.

Where exactly that is, and how to put the file there, depends on your web server software. If in doubt, ask your web hosting service provider.

How to create the "robots.txt" file

You can use anything that produces a text file. On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text).

On the Macintosh, use TextEdit (Format->Make Plain Text, then Save as Western).
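If you would rather generate the file with a script, something like the following would do. It is only a sketch: write_robots_txt() is a hypothetical helper, the record is the one from the earlier example, and the folder you pass in is assumed to be the one your web server serves as the top level of your site.

```python
from pathlib import Path

# The example record; edit these lines to suit your own site.
RECORD_LINES = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~joe/",
]


def write_robots_txt(web_root):
    """Write a plain-text robots.txt into `web_root` and return its path."""
    target = Path(web_root) / "robots.txt"
    # Plain ASCII text with simple line endings, and no blank lines
    # inside the record.
    with open(target, "w", encoding="ascii", newline="\n") as handle:
        handle.write("\n".join(RECORD_LINES) + "\n")
    return target
```

Calling write_robots_txt(".") writes the file into the current folder, from where it can be uploaded to the server.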

Important things to note about the "robots.txt" file

If you share a host with other people, and you have a URL like http://www.example.com/~username/ or http://www.example.com/username, then you can't have your own /robots.txt file. If you want to use /robots.txt you'll have to ask the host administrator to help you.
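The reason is that robots look for the file only at the top level of the host, whichever page they started from. A small sketch with Python's standard urllib.parse module shows this; robots_url() is a hypothetical helper written for this illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt URL that applies to `page_url`.

    The leading "/" makes urljoin discard the page's own path, so the
    result always sits at the top level of the host.
    """
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    print(robots_url("http://www.example.com/~username/"))
    print(robots_url("http://www.example.com/username"))
```

Both calls give http://www.example.com/robots.txt, a file that belongs to the host administrator rather than to the individual user.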

Listing pages or directories in the /robots.txt file may invite unintended access (from spammers etc.). You could devise a workaround, but in practice this is not a wise thing to do.

The /robots.txt file is not intended for access control, so don't try to use it as such. You should configure your server to do authentication, and configure appropriate authorization. Modern content management systems support access controls on individual pages and collections of resources.

Sometimes a crawler appears to ignore "/robots.txt" because the file has been written incorrectly. More often, however, the robot has been explicitly written to scan your site for information to abuse. It might be collecting email addresses to send email spam, looking for forms to post links ("spamdexing"), or searching for security holes to exploit.

Use a checker program to validate your "robots.txt" file

We recommend always using a "/robots.txt" checker program to confirm that your site's "/robots.txt" file has been written correctly.
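To give an idea of what such a checker looks for, here is a minimal sketch that tests only the rules described on this page. check_robots() is a hypothetical function, not a substitute for a full checker; it also assumes that "#" starts a comment and it silently ignores any field other than User-agent and Disallow.

```python
def check_robots(text):
    """Return a list of (line_number, message) problems found in `text`."""
    problems = []
    in_record = False  # True once the current record has a User-agent line
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue  # comment-only line
        if ":" not in line:
            problems.append((number, "missing ':' between field and value"))
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_record = True
            if "*" in value and value != "*":
                problems.append((number, "wildcard inside User-agent value"))
        elif field == "disallow":
            if not in_record:
                problems.append((number, "Disallow without a User-agent line"))
            if len(value.split()) > 1:
                problems.append((number, "more than one path on a Disallow line"))
            if "*" in value:
                problems.append((number, "wildcard in Disallow value"))
    return problems


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\n"
    for number, message in check_robots(sample):
        print(f"line {number}: {message}")
```

An empty result means only that none of these particular mistakes were found.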

Webcritique.org is a Priday Design Studio service.