
How To Protect Your Files From Robots

By Erika Lawal
Posted Friday, February 6, 2004

Optimizing website pages for the search engines without running into trouble keeps most of us webmasters' brain cells finely honed at best - and induces massive migraines at worst!
One of the most common challenges for us all is how to present "clean", relevant and original content to a wide range of visitors.

You may find that you want to exclude search engine and other robots from all or part of your website for a number of reasons, including:

- you want to write similar pages for different types of visitor, but don't want to be penalized for duplication;

- you have pages or files that you don't want viewed.

It's very easy to achieve this by one of two means: a robots.txt file or a meta tag.

Let's de-mystify the process of writing these files and tags!


A robots.txt file is an instruction to the robots that travel the web, spidering the pages they find there. It tells a robot which parts of your site it may visit, if any.

The robots.txt file we're considering here is an exclusion instruction - think of it as a "no entry" sign to robots.

You can write a file to exclude ("disallow") robots from all, or just part of your site.

Before you begin, you need to know how to write the .txt file.

Prepare it in a text editor such as Notepad. Don't attempt it in Word or an HTML editor such as FrontPage. When you're finished, save it as "robots.txt".


If you want to disallow all robots, you'd write:

User-agent: *
Disallow: /

And that's all. Nothing else.
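If you'd like to check a file like this before uploading it, Python's standard library includes a parser (`urllib.robotparser`) that applies the same exclusion rules a well-behaved robot does. A minimal sketch - the www.example.com URLs are placeholders for your own site:

```python
from urllib.robotparser import RobotFileParser

# The two-line "no entry" file from above.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# With everything disallowed, no robot may fetch any page.
print(parser.can_fetch("AnyBot", "http://www.example.com/"))            # False
print(parser.can_fetch("AnyBot", "http://www.example.com/index.html"))  # False
```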

What about if you only want to exclude part of your site?

Let's pretend you're running a website which advises on raising children. Your material will be relevant to surfers who live in many countries, but if you want them to really sit up and look, especially if you want them to buy from you, you'll need to make sure that your content is region-specific, including references, idiom and spelling.

This situation is an ideal candidate for a robots exclusion .txt file.

You've written all the pages you want to show to surfers in Canada, UK, and Australia in 3 separate directories which surfers will access by clicking on an appropriate link on your main pages.

The directories are /ca/, /uk/ and /au/.

To disallow robots from these directories, write the following robots.txt file:

User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/

It may be that you want to allow some robots and disallow others.

In our example, it may be that you want to disallow just one robot from one directory, in which case you'd write:

User-agent: NastyBot
Disallow: /ca/

Or you may want one trusted robot to traverse all of your site while the rest are excluded from part of it:

User-agent: NiceBot
Disallow:

User-agent: *
Disallow: /ca/

Note that an empty Disallow line (no slash or path after the colon) means the robot is permitted to read the whole site - every record needs at least one Disallow field. " * " means all known robots. So in the last .txt file example, all robots are excluded from your Canadian directory, except NiceBot, which can read the whole site.
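You can confirm that mixed file behaves as intended with the same standard-library parser. A small sketch, again with www.example.com standing in for your own domain:

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: NiceBot",
    "Disallow:",              # empty Disallow: NiceBot may read everything
    "",
    "User-agent: *",
    "Disallow: /ca/",
]

parser = RobotFileParser()
parser.parse(rules)

# NiceBot may enter the Canadian directory; everyone else may not,
# but other robots can still read the rest of the site.
print(parser.can_fetch("NiceBot",  "http://www.example.com/ca/index.html"))  # True
print(parser.can_fetch("NastyBot", "http://www.example.com/ca/index.html"))  # False
print(parser.can_fetch("NastyBot", "http://www.example.com/uk/index.html"))  # True
```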

Easy, isn't it?


Once created, your file needs to go into your root directory. This is the same directory which contains your home page. Don't put it anywhere else, because the robots won't see it.
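To see why the root directory is the only place that works: whatever page a robot arrives at, it derives one fixed robots.txt location from your domain alone. A short sketch (the URL is a placeholder):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the single location where robots look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/articles/parenting.html"))
# http://www.example.com/robots.txt
```

A file saved anywhere deeper, such as /articles/robots.txt, is simply never requested.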

Note that you can only have ONE robots.txt file per site, so any modifications will need to be integrated into your original file.

Note also that a robots.txt exclusion means these pages won't be indexed, but that won't matter if you've optimized your indexed pages properly.

In our Ca/UK/Au example above, your traffic will find your indexed global/US pages via the search engines, and will follow the link to their "nationality" page from their point of entry to your site. We've all seen the little flag links on other sites: just put up a flag graphic and say, for example, "UK Visitors Click Here".

If you want to learn more about exclusion robots.txt files, visit:

If you prefer or need to exclude individual pages from being viewed by robots, you can do this with a robots.txt file, but you can also achieve it with a meta tag placed between the <head> tags of your web page. The universal exclusion is as follows:

<meta name="robots" content="noindex,nofollow">

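You can also verify that a page actually carries the robots meta tag. A minimal sketch using Python's standard-library html.parser (the sample page string is a placeholder for your own HTML):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

page = ('<html><head>'
        '<meta name="robots" content="noindex,nofollow">'
        '</head><body></body></html>')

meta_parser = RobotsMetaParser()
meta_parser.feed(page)
print(meta_parser.directives)  # ['noindex,nofollow']
```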
It may be that you want robots to index your pages, but not to archive them. There may be a range of reasons why you don't want search engines to keep copies of old pages. The most prevalent one among webmasters is that they are cloaking pages and don't want it known that the page served to search engines differs from the one seen by surfers, but it's also possible to have perfectly "legitimate" reasons for wanting to keep parts of your site from public scrutiny.

Whatever your reason, if you want to stop your page being archived, the universal tag is:

<meta name="robots" content="noarchive">
For Google (the search engine whose cache feature you are most likely to want to keep your pages out of), the tag is:

<meta name="googlebot" content="noarchive">
To learn more about exclusion meta tags, visit:

Don't be put off by the jargon; writing these files and tags is one of the easiest and most useful technical tasks you can undertake as a webmaster - write a file today and save yourself hundreds of hours!

About The Author
Erika Lawal writes Daily Internet Marketing Tips for webmasters desperately in search of cutting edge site optimization and marketing advice that produces results. Get a FREE series of our Tips by visiting:





Copyright © Evrsoft Developer Network.