Microsoft KB Archive/193942

-

The information in this article applies to:


 * Microsoft FrontPage 98 for Windows

-

SUMMARY
This article describes the method used to prevent Web robots, also called Web spiders or Web crawlers (such as the FrontPage Import Web Wizard), from searching through your Web and retrieving files meant to be private.

This article also provides examples of how to use this method on your server to prevent a FrontPage 98 user (or any Web robot) from bypassing your security.

Overview
Web robots are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. FrontPage 98 has an Import Web Wizard that works just like a Web robot.

There have been occasions when Web robots have visited Web servers where they were not allowed.

As a result, many Web server administrators have implemented a method to prevent Web robots, such as the Import Web Wizard, from accessing areas where they are not allowed or wanted.

The Method
The method used to exclude Web robots from a server is to create a file on the server that specifies an access policy for them. For this method to be effective, the following criteria must be met:


 * The file is called robots.txt.
 * The file is located in the Root Web.

This approach was chosen because it can be easily implemented on any existing Web server, and a Web robot can find the access policy with only a single document retrieval.

The Format of Robots.txt
Each record starts with one or more User-agent lines, followed by one or more Disallow lines. The following lines describe the structure of the "robots.txt" file.

NOTE: Unrecognized headers are ignored.

# The pound sign (#) is used for comments.

User-agent: *

Disallow: /folder name/

The following example restricts access to the "bak" and "_private" folders located in the subweb named "myweb." To use the example, follow these steps:


 1. Create a file in the Root Web called robots.txt.
 2. Place the following lines of text in the "robots.txt" file:


# do not access these two folders
User-agent: *
Disallow: /myweb/_private/ # This is my private URL space
Disallow: /bak/ # these are backup folders
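One way to sanity-check rules like the ones above before deploying them is to feed them to the robots.txt parser in Python's standard library (a modern tool, shown here only for illustration; "SomeRobot" is a hypothetical User-agent name, and inline comments are omitted because not all parsers accept them):

```python
# Verify the article's example rules with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rules = """\
# do not access these two folders
User-agent: *
Disallow: /myweb/_private/
Disallow: /bak/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths under the disallowed folders are blocked for any robot...
print(parser.can_fetch("SomeRobot", "/myweb/_private/notes.htm"))  # False
print(parser.can_fetch("SomeRobot", "/bak/old.htm"))               # False
# ...while everything else remains reachable.
print(parser.can_fetch("SomeRobot", "/index.htm"))                 # True
```

Because the User-agent line is "*", the same answers apply no matter what robot name is checked.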

If you want to restrict access to the entire Web site, follow these steps:


 1. Create a file in the Root Web called robots.txt.
 2. Place the following lines of text in the "robots.txt" file:


# do not access anything on the Web
User-agent: *
Disallow: /
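A quick check with Python's standard-library parser (shown only for illustration; "SomeRobot" is a hypothetical User-agent name) confirms that "Disallow: /" blocks every URL on the site:

```python
# Verify that a site-wide Disallow blocks all paths.
from urllib.robotparser import RobotFileParser

rules = """\
# do not access anything on the Web
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Every path, including the home page, is now off limits to robots.
print(parser.can_fetch("SomeRobot", "/"))           # False
print(parser.can_fetch("SomeRobot", "/index.htm"))  # False
```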

NOTE: An empty "robots.txt" file imposes no access restrictions. It is treated as if it were not present, and Web robots are allowed throughout the Web.
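The same standard-library parser can illustrate this behavior: given an empty rules file it permits every request ("SomeRobot" is again a hypothetical User-agent name used only for the check):

```python
# An empty robots.txt imposes no restrictions.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # an empty robots.txt: no records at all

# With no records, every path is allowed to every robot.
print(parser.can_fetch("SomeRobot", "/myweb/_private/notes.htm"))  # True
```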