Robots.txt is a plain text file that sits in the root directory of your website. It exists purely to communicate with search engine robots (also called spiders or crawlers), telling them which directories and files you do or do not want them to crawl.

Why do we use robots.txt?

Robots.txt is used to ask search engines not to crawl pages that you don’t want to appear in their results, such as payment confirmation pages, feedback forms, contact forms, directories that contain scripts, and adverts. Note that it controls crawling rather than indexing: a disallowed page can still be listed by a search engine if other sites link to it.

A robots.txt file is not compulsory, but even if you do not wish to disallow spiders from accessing your pages, it is worth saving one into your root folder anyway (see example 1). It makes your intentions explicit and stops robots’ requests for the file from ending in a ‘not found’ error.

How do we write a robots.txt file?

Writing a robots.txt file is very straightforward and requires only a plain text editor such as Notepad (Windows) or TextEdit (Mac, in plain text mode). Type in the command(s), save the file as robots.txt, and upload it to the root directory of your website using your FTP client or your web host’s cPanel.

Every set of commands must have a ‘User-agent’ line and one or more ‘Disallow’ lines. The ‘User-agent’ line specifies which search engine robots you are sending instructions to (e.g. Googlebot, Googlebot-Image, or Slurp, Yahoo’s robot). The ‘Disallow’ line specifies which directories and files you are disallowing them from crawling.
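The structure described above can be sketched in a few lines of Python. This is only an illustration: the function name build_robots_txt, the shape of its input, and the example directories are assumptions made for this sketch, not part of any standard tool.

```python
def build_robots_txt(groups):
    """Build robots.txt text from (user_agent, disallowed_paths) pairs.

    The function name and input shape are assumptions made for this sketch.
    """
    blocks = []
    for user_agent, disallowed_paths in groups:
        lines = [f"User-agent: {user_agent}"]
        # An empty 'Disallow:' line means nothing is disallowed.
        for path in disallowed_paths or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        blocks.append("\n".join(lines))
    # A blank line separates one set of commands from the next.
    return "\n\n".join(blocks) + "\n"


# Hypothetical directories, used only to show the output format.
print(build_robots_txt([("*", ["/cgi-bin/", "/booking-system/"])]))
```

Each path gets its own ‘Disallow’ line, and the text that is printed can be saved as robots.txt unchanged.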

Here are some examples of robots.txt files and what they mean.

Example 1:

User-agent: *
Disallow:

This means that all search engine robots are allowed to access all pages on your site. The asterisk (*) in line 1 stands for all search engine robots, and is the most commonly used value. Alternatively, you may name individual robots, as in example 4.

By leaving the ‘Disallow’ line blank, you indicate that you are happy for all the pages on your site to be crawled and indexed.

Example 2:

User-agent: *
Disallow: /

The forward slash in the ‘Disallow’ line means that all directories in the site are disallowed, and therefore that all search engine robots are disallowed from accessing all pages on your site.
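If you want to check how examples 1 and 2 are read, Python’s standard urllib.robotparser module can interpret the rules for you. The sketch below is a rough check rather than a guarantee of how any particular search engine behaves; the helper name can_crawl and the www.example.com address are placeholders invented for this illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, robot, url):
    """Return True if `robot` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot, url)


allow_all = "User-agent: *\nDisallow:"    # Example 1
block_all = "User-agent: *\nDisallow: /"  # Example 2

# www.example.com is a placeholder domain.
page = "https://www.example.com/products/index.html"
print(can_crawl(allow_all, "Googlebot", page))
print(can_crawl(block_all, "Googlebot", page))
```

With Python’s parser, the first check should report that the page may be crawled and the second that it may not.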

Example 3:

User-agent: *
Disallow: /cgi-bin/

This robots.txt file disallows all spiders from accessing the directory containing the site’s scripts, the cgi-bin directory. The forward slash at the beginning of the ‘Disallow’ line indicates that it is a directory in the root. The forward slash after the directory name indicates that it is a directory rather than an individual file.
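The effect of the trailing slash can be seen with a short Python sketch using the standard urllib.robotparser module. The paths in the list are hypothetical, chosen only to show which ones the rule catches, and real robots may interpret edge cases slightly differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# Hypothetical paths, chosen only to show how the rule is matched.
paths = [
    "/cgi-bin/",
    "/cgi-bin/contact.cgi",
    "/cgi-binary.html",
    "/index.html",
]

# Keep the paths that a robot obeying the file would not fetch.
blocked = [p for p in paths if not parser.can_fetch("Googlebot", p)]
print(blocked)
```

Only the directory itself and the file inside it should appear in the list; a page whose name merely begins with ‘cgi-bin’ is not caught, because the rule ends with a slash.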

Example 4:

User-agent: Googlebot-Image
Disallow: /images/christmas-party.jpg

You can also disallow certain spiders (in this case, the Google image spider) from accessing certain files or directories. You can find out the names of specific search engine robots by checking the search engine’s website.

The ‘Disallow’ line names a specific file that you do not wish to be crawled. You can do this with any file on your website.
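As a rough check of example 4, the sketch below asks Python’s standard urllib.robotparser module who may fetch what. The logo.jpg file is a made-up second image added for contrast, and the result reflects Python’s parser rather than a promise about how Google’s own robots behave.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot-Image",
    "Disallow: /images/christmas-party.jpg",
])

photo = "/images/christmas-party.jpg"
logo = "/images/logo.jpg"  # hypothetical file, not named in the rules

results = {
    "Googlebot-Image, party photo": parser.can_fetch("Googlebot-Image", photo),
    "Googlebot-Image, logo": parser.can_fetch("Googlebot-Image", logo),
    "Googlebot, party photo": parser.can_fetch("Googlebot", photo),
}

for check, is_allowed in results.items():
    print(check, "->", "allowed" if is_allowed else "disallowed")
```

Only the first check should come back as disallowed: the rule names one robot and one file, so other files and other robots are unaffected.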

You can also combine several sets of commands in one robots.txt file, as below:

Example 5:

User-agent: *
Disallow: /cgi-bin/
Disallow: /booking-system/

User-agent: Googlebot-Image
Disallow: /

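Here, all robots are kept out of the cgi-bin and booking-system directories, and the Google image spider is kept out of the whole site. However, do not put multiple directories or files on one ‘Disallow’ line; use a new line for each.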
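To see how the two sets of commands in example 5 interact, the file can be fed to Python’s standard urllib.robotparser module. In this sketch the robot name SomeBot and the page paths are invented for illustration, and the helper allowed is just a shorthand defined here.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE_5 = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /booking-system/

User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_5.splitlines())


def allowed(robot, path):
    """Shorthand for asking the parser about one robot and one path."""
    return parser.can_fetch(robot, path)


# 'SomeBot' is a made-up robot that falls under the '*' rules.
print(allowed("SomeBot", "/about.html"))
print(allowed("SomeBot", "/booking-system/pay.html"))
print(allowed("Googlebot-Image", "/about.html"))
```

A robot with no set of commands of its own falls back on the asterisk rules, so it may fetch the about page but not the booking page, while the image spider is refused everywhere.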

Pitfalls to avoid

Robots.txt works only as a guide for search engine robots, and not all of them bother to heed it. To guarantee that your files won’t be touched by a robot, restrict access on the server itself, for example with the .htaccess file on an Apache server.

Don’t list your secret directories in your robots.txt file. Robots.txt is visible to everyone, simply by typing in www.yourdomain.com/robots.txt, and is widely used by spammers to find secret directories.
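
To show how little effort it takes to read those entries back out, here is a minimal Python sketch. The function name listed_paths and the /secret-admin/ directory are invented for this illustration.

```python
def listed_paths(robots_txt):
    """Return every path named on a 'Disallow' line."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop any trailing comment
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# A hypothetical file that gives away a directory it meant to hide.
sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /secret-admin/\n"
print(listed_paths(sample))
```

Anything you write on a ‘Disallow’ line is this easy for anyone to collect, which is why truly private directories need protecting on the server instead.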

By Amy Fowler