Robots.txt is a file that websites use to tell search bots whether, and how, the site should be crawled and indexed by the search engine. Some sites disallow crawling altogether, meaning that neither search engines nor other crawler bots should visit them. When you extract data from the web, it is critical to understand what robots.txt is and how to read and respect it to avoid legal ramifications.
Why Should You Read and Respect the Robots.txt File?
Respect for robots.txt should not rest only on the legal complications that violators may face. Just as you follow lane discipline while driving on a highway, you should respect the robots.txt file of any website you crawl. It is considered standard behavior on the web and is in the best interest of web publishers.
Many websites prefer to block crawlers because their content is sensitive or of little use to the public. If that’s not reason enough to comply with the robots.txt rules, note that crawling a website that disallows bots can lead to a lawsuit and end badly for the firm or individual involved. Let’s now move on to how you can follow robots.txt to stay in the safe zone.
Robots.txt Rules
1. Allow Full Access
User-agent: *
Disallow:
If you find this in the robots.txt file of a website you’re trying to crawl, you’re in luck. This means all pages on the site are crawlable by bots.
2. Block All Access
User-agent: *
Disallow: /
You should steer clear of a site with this in its robots.txt. It states that no part of the site should be visited by an automated crawler, and violating this could mean legal trouble.
3. Partial Access
User-agent: *
Disallow: /folder/
User-agent: *
Disallow: /file.html
Some sites disallow crawling of only particular sections or files. The first example above blocks everything under /folder/, and the second blocks the single page /file.html. In such cases, you should direct your bots to leave the blocked areas untouched.
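As a rough illustration of rules 1 to 3, here is a minimal sketch that checks a URL against each kind of policy using Python's standard-library urllib.robotparser. The bot name ExampleBot and the example.com URLs are placeholders, not values from any real site, and the two partial-access rules are merged under a single User-agent group for this sketch.

```python
from urllib.robotparser import RobotFileParser

# The three policies discussed above, written as robots.txt text.
FULL_ACCESS = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# Assumption: both partial rules placed under one User-agent group.
PARTIAL = "User-agent: *\nDisallow: /folder/\nDisallow: /file.html\n"


def is_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, rules in [("full", FULL_ACCESS), ("none", BLOCK_ALL), ("partial", PARTIAL)]:
        for path in ("/about.html", "/folder/page.html", "/file.html"):
            url = "https://example.com" + path
            print(name, path, is_allowed(rules, url))
```

A crawler would typically call a check like is_allowed before every request and skip any URL for which it returns False.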
4. Crawl Rate Limiting
Crawl-delay: 11
This limits how frequently crawlers hit the site. Because frequent hits by crawlers can place unwanted stress on the server and slow the site down for human visitors, many sites add this line to their robots.txt file. In this case, crawlers should wait 11 seconds between successive requests.
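One way a crawler might honor this value is sketched below, again with urllib.robotparser. The User-agent line is added because Python's parser only reads Crawl-delay inside a User-agent group. The one-second fallback and the names ExampleBot, crawl_delay_seconds and seconds_to_wait are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample file: the Crawl-delay line from above inside a User-agent group.
ROBOTS_TXT = "User-agent: *\nCrawl-delay: 11\nDisallow:\n"


def crawl_delay_seconds(robots_txt, user_agent="ExampleBot", default=1.0):
    """Return the Crawl-delay for user_agent, or a fallback if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when the directive is absent
    return float(delay) if delay is not None else default


def seconds_to_wait(last_request_at, now, delay):
    """Return how long to pause so that requests are at least `delay` seconds apart."""
    return max(0.0, delay - (now - last_request_at))


if __name__ == "__main__":
    delay = crawl_delay_seconds(ROBOTS_TXT)
    # A real crawler would pass time.monotonic() readings and time.sleep() for the result.
    print(delay, seconds_to_wait(last_request_at=100.0, now=104.0, delay=delay))
```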
5. Visit Time
Visit-time: 0400-0845
This tells crawlers the hours during which crawling is allowed. In this example, the site can be crawled between 04:00 and 08:45 UTC. Sites do this to avoid load from bots during their peak hours.
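Python's urllib.robotparser does not expose a Visit-time value, so the sketch below parses the HHMM-HHMM format shown above by hand. The function names are hypothetical, and the handling of a window that wraps past midnight is an assumption about how such a value would be read.

```python
from datetime import datetime, time, timezone


def parse_visit_time(value):
    """Split a value such as '0400-0845' into (start, end) times of day."""
    start, end = value.strip().split("-")
    return (
        time(int(start[:2]), int(start[2:])),
        time(int(end[:2]), int(end[2:])),
    )


def within_visit_window(value, moment):
    """Return True if the timezone-aware datetime `moment` falls inside the UTC window."""
    start, end = parse_visit_time(value)
    current = moment.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current <= end
    # Assumption: a window such as 2200-0200 wraps past midnight.
    return current >= start or current <= end


if __name__ == "__main__":
    print(within_visit_window("0400-0845", datetime.now(timezone.utc)))
```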
6. Request Rate
Request-rate: 1/10
Some websites do not tolerate bots fetching multiple pages simultaneously, and the request rate is used to limit this behavior. A value of 1/10 means the site allows crawlers to request one page every 10 seconds.
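The arithmetic behind a value like 1/10 can be sketched as follows. The example reads the directive through urllib.robotparser, which returns it as a (requests, seconds) pair. The surrounding User-agent group, the bot name ExampleBot and the function name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample file: the Request-rate line from above inside a User-agent group.
ROBOTS_TXT = "User-agent: *\nRequest-rate: 1/10\nDisallow:\n"


def min_interval_seconds(robots_txt, user_agent="ExampleBot"):
    """Return the minimum gap between requests implied by Request-rate, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    rate = parser.request_rate(user_agent)  # named tuple (requests, seconds) or None
    if rate is None or rate.requests == 0:
        return None
    # 1 request per 10 seconds means at least 10 seconds between requests.
    return rate.seconds / rate.requests


if __name__ == "__main__":
    print(min_interval_seconds(ROBOTS_TXT))
```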
Be a Good Bot
Good bots comply with the rules set by websites in their robots.txt file and follow best practices while crawling and scraping. It goes without saying that you should study the robots.txt file of every targeted website in order to make sure that you aren’t violating any rules.
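Since robots.txt is served from the root of a host, a crawler can work out where to look for it from any page URL on that site. The helper below is a small sketch of that step. The function name robots_txt_url and the example.com address are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://example.com/blog/post?id=3#top"))
```

The resulting address could then be handed to urllib.robotparser.RobotFileParser through set_url() followed by read(), which downloads and parses the file before any page on the site is requested.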
Confused?
It’s not uncommon to feel intimidated by all the complex tech jargon and rules associated with web crawling. If you find yourself in a situation where you need to extract web data but are confused about compliance issues, we’d be happy to be your data partner and take end-to-end ownership of the process.