As a website owner you can use a small file called "/robots.txt" to tell web crawlers which pages you want them to visit on your site or, more importantly, which ones you don't want visited!
Web Robots, often called Web Wanderers, Crawlers, or Spiders, are software programs that travel across the Internet automatically. Search engines such as Google use them to index the content of websites, and you generally want them to visit your site and record details of all its pages for their indexes.
Unfortunately, spammers also use them to scan sites for less welcome purposes. "Bad" crawlers can ignore your "robots.txt". In particular, malware crawlers that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
The "robots.txt" file is a publicly available file. Anyone can see which sections of your server you don't want robots to use. What you want to exclude (from the crawlers' view) depends on your site and server. Everything not explicitly disallowed is considered fair game to retrieve.
If you do not want your web site, or specific pages or directories on it, to be indexed, use a "/robots.txt" file.
If you have no problem with every page on your site being indexed, then you do not really need a "/robots.txt" file.
The "/robots.txt" file is a text file with one or more records. It usually contains a single record that looks like this:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/
In this example, three directories are excluded - the cgi-bin, the tmp and the ~joe directories. The "User-agent: *" line means this section applies to all robots. You need a separate "Disallow" line for every URL prefix you want to exclude - you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you cannot have blank lines within a record, as they are used to delimit multiple records.
Globbing and regular expressions are not supported in either the User-agent or Disallow lines. The "*" in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
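As a rough illustration of the prefix matching described above, the sketch below feeds the three-directory record to the "urllib.robotparser" module from Python's standard library and asks whether a few URLs may be fetched. The robot name "ExampleBot", the host www.example.com and the page paths are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The single-record example discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

# Hypothetical host, used only to build full URLs for the checks.
SITE = "http://www.example.com"


def build_parser(robots_txt):
    """Parse robots.txt text (rather than fetching it) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "/tmpfile.html" is included to show that "/tmp/" is matched as a plain
# prefix: it does not start with "/tmp/", so it is not excluded.
for path in ("/index.html", "/cgi-bin/search", "/tmp/cache.html",
             "/~joe/notes.html", "/tmpfile.html"):
    allowed = parser.can_fetch("ExampleBot", SITE + path)
    print(path, "allowed" if allowed else "disallowed")
```

Because the record uses "User-agent: *", the answer is the same whatever robot name is passed to can_fetch.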
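A robot looks for the file at the top level of your site, at a URL such as http://www.example.com/robots.txt. As a web site owner you need to put the file in the right place on your web server for that URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page.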
Where exactly that is, and how to put the file there, depends on your web server software. If in doubt, ask your web hosting service provider.
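To make the location rule concrete, here is a minimal sketch of how a robot can work out where to look: it keeps the scheme and host of whatever page it is visiting and replaces the rest with "/robots.txt". The function name robots_txt_url and the sample page URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; discard the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs on the same host all lead to the same file.
for page in ("http://www.example.com/",
             "http://www.example.com/shop/index.html",
             "http://www.example.com/~username/photos.html"):
    print(page, "->", robots_txt_url(page))
```

Every page on a host maps to the same single file, which is why a copy placed in a subdirectory is never consulted.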
You can use anything that produces a text file. On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text).
On the Macintosh, use TextEdit (Format > Make Plain Text, then Save as Western).
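Since anything that produces plain text will do, a short script works too. The sketch below builds a record with one "Disallow" line per URL prefix and writes records separated by a blank line. The helper names format_record and write_robots_txt are hypothetical, and the prefixes are simply the ones from the example record.

```python
from pathlib import Path


def format_record(user_agent, disallowed_prefixes):
    """Return one record: a User-agent line, then one Disallow line per prefix."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {prefix}" for prefix in disallowed_prefixes]
    return "\n".join(lines) + "\n"


def write_robots_txt(path, records):
    """Write the records as plain ASCII text, with a blank line between records."""
    Path(path).write_text("\n".join(records), encoding="ascii")


record = format_record("*", ["/cgi-bin/", "/tmp/", "/~joe/"])
print(record, end="")
```

Calling write_robots_txt("robots.txt", [record]) would then save the file, ready to be uploaded next to your "index.html" page.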
If you share a host with other people, and you have a URL like http://www.example.com/~username/ or http://www.example.com/username, then you can't have your own /robots.txt file. If you want to use /robots.txt you'll have to ask the host administrator to help you.
Listing pages or directories in the /robots.txt file may invite unintended access (from spammers etc). To resolve this you could create a workaround, but in practice this is not a wise thing to do.
The /robots.txt file is not intended for access control, so don't try to use it as such.
You should configure your server to do authentication, and configure appropriate authorization. Modern content management systems support access controls on individual pages and collections of resources.
Sometimes crawlers appear to ignore "/robots.txt", quite often because the file has been written incorrectly.
However, it's more likely that the robot has been explicitly written to scan your site for information to abuse.
It might be collecting email addresses to send spam, looking for forms to post links ("spamdexing"), or probing for security holes to exploit.
We recommend always using a "/robots.txt" checker program to confirm that your site's "/robots.txt" file has been written correctly.
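To give an idea of what such a checker looks for, here is a minimal sketch that tests only the rules mentioned in this article: one URL prefix per "Disallow" line, no globbing, and no "Disallow" line cut off from its "User-agent" line by a blank line. The function name check_robots_txt and the message wording are assumptions. It is not a substitute for a full checker and it silently skips any field other than User-agent and Disallow.

```python
def check_robots_txt(text):
    """Return a list of (line_number, message) problems found in robots.txt text."""
    problems = []
    in_record = False  # True once a User-agent line has opened the current record
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop "#" comments and outer spaces
        if not line:
            if not raw.strip():
                in_record = False  # a truly blank line ends the current record
            continue
        field, separator, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not separator:
            problems.append((number, "expected 'Field: value'"))
        elif field == "user-agent":
            in_record = True
            if "*" in value and value != "*":
                problems.append((number, "wildcard inside a robot name"))
        elif field == "disallow":
            if not in_record:
                problems.append((number, "Disallow line outside a record"))
            if len(value.split()) > 1:
                problems.append((number, "more than one URL prefix on a line"))
            if "*" in value:
                problems.append((number, "globbing in a Disallow line"))
    return problems


# A deliberately faulty file: two prefixes on one line, then a blank line
# that separates a wildcard Disallow line from its User-agent line.
faulty = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\n\nDisallow: *.gif\n"
for number, message in check_robots_txt(faulty):
    print(f"line {number}: {message}")
```

An empty result only means that none of these particular mistakes were found.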
Webcritique.org is a Priday Design Studio service.