The web is often described as an open place, but a closer look shows that this is an exaggeration. The web we know is only the tip of a huge iceberg. Search engine crawlers have access only to the ‘surface web’, the smaller portion of the web that crawlers are allowed to reach. Many other websites disallow crawling by stating so in their robots.txt file. Robots.txt is a file websites use to tell ‘bots’ whether and how the site may be crawled and indexed. Many sites simply disallow crawling, meaning the site shouldn’t be crawled by search engines or other crawler bots. When you are extracting data from the web, it is critical to understand what robots.txt is and how to respect it to avoid legal ramifications.
Respecting robots.txt shouldn’t be only about avoiding legal complications. Just as you follow lane discipline while driving on a highway, you should respect the robots.txt file of a website you are crawling. It is the expected standard of behavior on the web and is in the best interest of web publishers. Many websites block crawlers because their content is sensitive in nature and of little use to the public. If that isn’t reason enough to comply with robots.txt rules, note that crawling a website that disallows bots can lead to a lawsuit and end badly for the firm or individual behind the crawler. Let’s now look at how you can follow robots.txt to stay in the safe zone.
1. Allow full access
If you find a fully permissive rule like the one below in the robots.txt file of a website you’re trying to crawl, you’re in luck. It means all pages on the site are crawlable by bots.
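For illustration, a fully permissive robots.txt typically contains nothing more than a wildcard user-agent rule with an empty Disallow directive:

User-agent: *
Disallow: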
2. Block all access
You should steer clear of a site with a blanket block like the one below in its robots.txt. It states that no part of the site should be visited by an automated crawler, and violating this could mean legal trouble.
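Such a file usually blocks every bot from the entire site with a single Disallow rule covering the root path:

User-agent: *
Disallow: /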
3. Partial access
Some sites disallow crawling of only particular sections or files. In such cases, you should direct your bots to leave the blocked areas untouched, as in the example below.
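For example, a robots.txt that blocks one directory and one file while leaving the rest of the site open might look like this (the paths here are purely illustrative):

User-agent: *
Disallow: /private/
Disallow: /temp.html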
4. Crawl rate limiting
This is used to stop crawlers from hitting the site too frequently. Since frequent hits by crawlers can place unwanted stress on the server and make the site slow for human visitors, many sites set a crawl delay in their robots.txt file. In the example below, the site asks crawlers to wait 11 seconds between requests.
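The throttling is usually expressed with the non-standard Crawl-delay directive, which for an 11-second gap would read:

Crawl-delay: 11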
5. Visit time
This tells crawlers the hours during which crawling is allowed. In the example below, the site may be crawled between 04:00 and 08:45 UTC. Sites do this to avoid load from bots during their peak hours.
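This is commonly written with the non-standard Visit-time directive, which for the 04:00 to 08:45 UTC window would look something like:

Visit-time: 0400-0845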
6. Request rate
Some websites do not entertain bots trying to fetch multiple pages simultaneously, and the request rate is used to limit this behavior. A value of 1/10, as in the example below, means the site allows crawlers to request one page every 10 seconds.
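This is typically expressed with the non-standard Request-rate directive; one page every 10 seconds would be written as:

Request-rate: 1/10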
Good bots comply with the rules set by websites in their robots.txt file and follow best practices while crawling and scraping. It goes without saying that you should study the robots.txt file of every target website to make sure you aren’t violating any rules; a simple programmatic check is sketched below.
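To make that check concrete, here is a minimal sketch using Python’s built-in urllib.robotparser module; the user agent string and URLs are placeholders, not real endpoints.

from urllib import robotparser

USER_AGENT = "my-crawler"  # placeholder bot name

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder site
parser.read()  # fetch and parse the robots.txt file

page = "https://example.com/some/page.html"
if parser.can_fetch(USER_AGENT, page):
    # Honour throttling hints when the site declares them.
    delay = parser.crawl_delay(USER_AGENT)    # from Crawl-delay, if present
    rate = parser.request_rate(USER_AGENT)    # from Request-rate, if present
    print("Allowed to crawl", page, "- delay:", delay, "rate:", rate)
else:
    print("robots.txt disallows crawling", page)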
It’s not uncommon to feel intimidated by all the complex tech jargon and rules associated with web crawling. If you find yourself in a situation where you need to extract web data but are confused about compliance issues, we’d be happy to be your data partner and take end-to-end ownership of the process.