If you manage a website or a blog, dealing with search engines is part of the job. You may need to control which content and URLs of your website are crawled and indexed by search engine bots, and that is where robots.txt comes into the picture.
What is Robots.txt?
The robots.txt file is the tool in your hands for telling search engine bots such as Googlebot what to crawl and index on your website and what to leave alone. The convention behind it is called “The Robots Exclusion Protocol (REP)” or “Robots Exclusion Standard”.
All the content that is publicly available on your website can be crawled by automated bots like Googlebot, which is Google's web crawling bot. Since your content sits on the publicly accessible World Wide Web, any automated crawler can access it, follow links recursively and index the pages.
Robots.txt is of particular importance when you need to be selective about the content indexed by the search engines.
Robots.txt uses simple directives like below –
Disallow: /readme.html
Disallow: /rpc_relay.html
The above directives mean that search engine bots should not crawl and index readme.html and rpc_relay.html pages on your website.
Something like below in robots.txt will direct all crawlers to not index your website.
User-agent: *
Disallow: /
The rules and guidelines for writing robots.txt have been standardized over time. Major search engine bots like Googlebot understand and follow the directives written in the robots.txt file in the same manner.
If you want each and every piece of content on your website to be available for indexing by the search engine bots, you don’t really need a robots.txt file, since your content is publicly available to web crawlers by default.
Robots.txt - SEO Impact and Best Practices
From a search engine optimization perspective, you should always ensure that duplicate content is not exposed to search engines. Your site can have duplicate content for many reasons. One example is printer-friendly pages. These pages need to stay on the website but should be kept away from search engines.
Many webmasters try to use robots.txt to keep duplicate content out of the index. However, the best practice is never to use robots.txt for handling duplicate content, because duplicate content URLs can reach search engines by other means, such as a link to the URL on another website.
Duplicate content URLs should instead be handled with other available techniques such as a 301 redirect, the canonical tag and meta tags. These techniques ensure that even if your URL is found, it will be crawled or indexed according to the tags used in your webpage.
Non Value Add URLs
There might be URLs on your website that add no value by appearing on search engine results pages (SERPs). One such example is a link to a users page like "Example.com/users". These URLs mostly lack content, or contain content that is not SEO friendly in general, and may hurt the website’s overall SEO performance. Although you can add directives in the robots.txt file to stop such URLs from being crawled, the best method still remains using meta tags to avoid indexing or crawling of such pages.
Below are examples of how meta tags can be used for this purpose -
<meta content="noindex, follow" name="robots"> (Do not index the page but pick other links from the page content)
<meta content="index, follow" name="robots"> (Index the page and pick other links from the page content)
<meta content="noindex, nofollow" name="robots"> (Neither index nor pick other links from the page content)
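If you want to check which robots meta directives a page actually carries, a small script can do it. Below is a minimal Python sketch using only the standard library's html.parser. The class name RobotsMetaParser, the helper robots_directives and the sample page string are my own hypothetical names for illustration, and the sketch only looks at name="robots" tags, not crawler-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page, using the first meta tag shown above.
page = '<html><head><meta content="noindex, follow" name="robots"></head></html>'
print(sorted(robots_directives(page)))
```

A page with no robots meta tag yields an empty set, which search engines treat the same as "index, follow".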
Avoid Exposing Website's Directory Structure
You generally do not want the directories of your website to be crawled or indexed by search engines. For example, two of the directories on the Mashable website are kept out of search engine indexing using robots.txt.
You can verify this by visiting Mashable.com/robots.txt.
Blocking Links Discovered From Other Sites
Robots might discover URLs of your website from other websites, since there are multiple ways in which other sites can link to URLs you never exposed yourself. You may not want many of these URLs to be crawled and indexed by robots. The best practice, again, is to use meta tags in your page and not to rely on the robots.txt file.
<meta name="robots" content="noindex">
Using this tag will ensure that robots like Googlebot drop the webpage completely from indexing even if it is discovered from links on other websites.
Blocking Bad Robots
This looks workable but never works in practice. You might want to place a directive like the one below to tell a BadBot not to crawl your website.
User-agent: BadBot
Disallow: /
But remember, BadBot is a bad robot and would choose to ignore your robots.txt directive completely and crawl your entire website. Bad robots will ignore meta tags as well, so meta tags are not a good way of protection here either.
Robots.txt and Data security
While webmasters can use the robots.txt file to direct search engine bots not to crawl a set of URLs, it should be noted that those URLs or paths can still be accessed on the web by anyone who knows the direct URL.
This essentially means that robots.txt cannot be used as a tool to secure or hide your data on the web.
Robots.txt doesn’t ensure that all automated web crawlers will follow it. Malicious crawlers might choose to ignore robots.txt and scrape the entire website that is publicly available on the web.
If some data is absolutely confidential and needs to be kept from the outside world, you need to apply other data security techniques at the code and infrastructure level.
Robots.txt Syntax and Usage Guidelines
The robots.txt file is placed in the top-level directory of your domain's web server. For example – "noeticforce.com/robots.txt"
You can open the link above to see what the robots.txt file for noeticforce.com looks like. This also means that the robots.txt file is publicly available and anyone can read the robots.txt file of your website.
For learning purposes, you can check the robots.txt file of any website by typing “example.com/robots.txt”, replacing example.com with the website name.
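Since the file always lives at the root of the host, its address can be derived from any page URL on the site. Here is a minimal Python sketch of that rule using the standard library's urllib.parse. The function name robots_txt_url and the sample URLs are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only to demonstrate the helper.
print(robots_txt_url("https://example.com/blog/post?id=1"))
```

Note that each host has its own file, so a subdomain such as blog.example.com would be governed by blog.example.com/robots.txt rather than the one on example.com.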
There are three main keywords used to construct the robots.txt file -
User-agent: This is used to specify the name of the robot.
Disallow: This is used to specify the URL path or directory of your website that you need to block from the specified User-agent.
Allow: This is used to allow a specific file within a subdirectory when a subdirectory itself is not allowed by usage of Disallow directive.
Robots.txt examples (snippets) –
To restrict all robots (web crawlers) access to all WebPages –
User-agent: *
Disallow: /
To restrict all robots from crawling select directories on the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude specific crawler completely
User-agent: corruptBot
Disallow: /
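To see how a rule written for one crawler is read by a parser that follows the convention, here is a minimal Python sketch using the standard library's urllib.robotparser. corruptBot is the placeholder name from the snippet above, while the example.com URL and the comparison with Googlebot are my own assumptions for illustration. It shows only how a well-behaved crawler would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# The per-crawler rule from the snippet above, one directive per line.
robots_txt = """\
User-agent: corruptBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URL on an assumed site.
url = "https://example.com/page.html"

corrupt_allowed = parser.can_fetch("corruptBot", url)
google_allowed = parser.can_fetch("Googlebot", url)

print("corruptBot allowed:", corrupt_allowed)
print("Googlebot allowed:", google_allowed)
```

Because the file has no group for other crawlers and no "User-agent: *" group, every crawler other than corruptBot is free to fetch the page.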
Disallow Googlebot from indexing a folder, except one file in the folder
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
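The reason the Allow line wins here is that Googlebot applies the most specific (longest) matching rule, with Allow winning a tie. Below is a rough Python sketch of that precedence for plain path prefixes. The function is_allowed and the rule list format are hypothetical, and wildcards such as * and $ are deliberately left out. Note that Python's own urllib.robotparser applies rules in file order instead, so it would not reproduce this behaviour for the snippet above.

```python
def is_allowed(rules, path):
    """Decide access using longest-prefix-match precedence.

    rules is a list of (directive, prefix) pairs for one user-agent group,
    for example ("Disallow", "/folder1/"). A path with no matching rule
    is allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# The Googlebot group from the snippet above.
googlebot_rules = [
    ("Disallow", "/folder1/"),
    ("Allow", "/folder1/myfile.html"),
]

for candidate in ("/folder1/myfile.html", "/folder1/other.html", "/index.html"):
    print(candidate, is_allowed(googlebot_rules, candidate))
```

With this precedence the order of the two lines in the file does not matter, which is why the Allow exception can be written after the Disallow it overrides.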
For Specifying Sitemap
While there are many tools available for testing your robots.txt file, I personally prefer Google webmaster’s robots.txt tester.
It automatically fetches the robots.txt file from your website and lets you test which links are blocked for Googlebot. Other search engines like Bing follow robots.txt in the same fashion, so if something is blocked for Googlebot by a rule that is not written for Googlebot alone, it is highly probable that it is blocked for Bing as well, since Bing too is an established search engine and follows the standards.
The image below was taken while using the robots.txt Tester of Google webmaster tools for this noeticforce.com website. It serves as an example for explanation purposes only and does not contain directives that demonstrate best practices. It is not the current live robots.txt file of noeticforce.com.
In the image above, I am checking whether the URL “noeticforce.com/users” is accessible to Googlebot. The answer given by the robots.txt Tester is no, and the reason is the directive Disallow: /users in the robots.txt file. The tool highlights the directive or rule in the file that tells Googlebot not to crawl the specific URL. You can see the highlighted row in red in the image.
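A similar check can be approximated offline. The sketch below uses Python's standard urllib.robotparser and assumes, since the full file in the image is not reproduced here, that the Disallow: /users rule sits in a "User-agent: *" group. The helper name check_urls and the /about URL are hypothetical, added only for comparison.

```python
from urllib.robotparser import RobotFileParser

# Assumed file content: only the Disallow: /users rule comes from the text.
robots_txt = """\
User-agent: *
Disallow: /users
"""


def check_urls(robots_text, user_agent, urls):
    """Return a dict mapping each URL to True (crawlable) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


results = check_urls(
    robots_txt,
    "Googlebot",
    ["https://noeticforce.com/users", "https://noeticforce.com/about"],
)

for url, crawlable in results.items():
    print(url, "allowed" if crawlable else "blocked")
```

This is not a replacement for Google's tester, which reflects how Googlebot itself reads the file, but it is a quick way to catch an accidental block before you publish a change.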
Good Luck with the SEO and Robots.txt generation!!