1. Follow the format, location, and syntax rules
To start, the file must be named “robots.txt”, and each site can have only one. The file must be plain text encoded in UTF-8. It must be placed at the root of a domain or subdomain; otherwise, crawlers will ignore it.
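For instance, here is where crawlers will and will not honor the file (example.com is a placeholder domain):

```
https://example.com/robots.txt         # governs https://example.com/
https://blog.example.com/robots.txt    # governs only the blog. subdomain
https://example.com/pages/robots.txt   # not at the root, so it is ignored
```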
Also, Google enforces a size limit of 500 kibibytes (KiB) and ignores content after that limit.
There may be one or more groups of directives, one directive per line. Each group of directives includes the following (see the example after this list):
- User-agent: the name of the crawler the group applies to. Many user-agent names are listed in the Web Robots Database.
- Disallow: a directory or page, relative to the root domain, that the user-agent should not crawl. A group can contain more than one Disallow entry.
- Allow: a directory or page, relative to the root domain, that the user-agent may crawl. A group can contain more than one Allow entry.
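For instance, a minimal group for Google's main crawler might look like this (the paths are placeholders):

```
User-agent: Googlebot
Disallow: /checkout/
Disallow: /tmp/
Allow: /checkout/help.html
```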
Keep in mind that groups are processed from top to bottom, and a crawler obeys only one group: the first, most specific one whose user-agent matches. Additionally, rules are case sensitive, and there is no error reporting, so a typo silently breaks a rule!
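As a sketch of that matching behavior (the paths are hypothetical), Googlebot-News would obey only the first group below, while every other crawler would fall through to the second:

```
User-agent: Googlebot-News
Disallow: /archive/

User-agent: *
Disallow: /drafts/
```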
You can also add comments: precede them with a #, and everything after it on that line will be ignored.
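For example (/staging/ is just an illustrative path):

```
# Keep all crawlers out of the staging area
User-agent: *
Disallow: /staging/  # a trailing comment is ignored too
```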
The robots.txt file also supports a limited form of “wildcards” for pattern matching:
- * designates 0 or more instances of any valid character.
- $ designates the end of the URL.
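For instance, these hypothetical rules block every PDF on the site and any URL containing a sessionid query parameter:

```
User-agent: *
Disallow: /*.pdf$        # any URL that ends in .pdf
Disallow: /*?sessionid=  # any URL containing "?sessionid="
```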
Avoid the use of “crawl-delay”, since Google and other crawlers might ignore it. Google in particular relies on its own sophisticated algorithms to determine the optimal crawl speed for a site.
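For reference, crawlers that do honor the directive (Bing, for example, has documented support for it) expect a number of seconds between requests:

```
User-agent: Bingbot
Crawl-delay: 10  # requests a 10-second pause between fetches
```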
For example, here is what the robots.txt file for Cape Air looks like: