How to Read and Respect Robots.txt | Webscraping Techniques


Administrator

March 3, 2017


Robots.txt is a file that websites use to tell ‘search bots’ whether and how the site may be crawled and indexed. Some sites disallow crawling entirely, meaning that neither search engines nor any other crawler bots should crawl them. When you extract data from the web, it is critical to understand what robots.txt is and how to read and respect it to avoid legal ramifications.
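Before a crawler can read the file, it has to find it. By convention, robots.txt sits at the root of the host, so its address can be derived from any page URL on the site. The snippet below is a minimal sketch of that step; the function name robots_url and the example.com URLs are illustrative assumptions, not part of any particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
```

Note that each host and port combination has its own robots.txt, so a subdomain needs a separate lookup.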

Why Should You Read and Respect the Robots.txt File?

Fear of legal complications shouldn’t be the only reason to respect robots.txt. Just as you follow lane discipline while driving on a highway, you should respect the robots.txt file of any website you crawl. Doing so is considered standard behavior on the web and is in the best interest of web publishers.

Many websites block crawlers because the content is sensitive and of little use to the public. If that is not reason enough to comply with the robots.txt rules, note that crawling a website that disallows bots can lead to a lawsuit and end badly for the firm or individual involved. Let’s now look at how you can follow robots.txt and stay in the safe zone.

Robots.txt Rules

1. Allow Full Access

User-agent: *
Disallow:

If you find this in the robots.txt file of a website you’re trying to crawl, you’re in luck. This means all pages on the site are crawlable by bots.

2. Block All Access

User-agent: *
Disallow: /

You should steer clear of a site with this in its robots.txt. It states that no part of the site should be visited by an automated crawler, and violating this could mean legal trouble.
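The two extremes above can be checked programmatically. The following is a minimal sketch using Python's standard urllib.robotparser module; the rules are parsed from strings so that it runs without network access. The helper build_parser, the user agent name ExampleBot, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


ALLOW_ALL = "User-agent: *\nDisallow:"    # rule 1: full access
BLOCK_ALL = "User-agent: *\nDisallow: /"  # rule 2: no access

open_site = build_parser(ALLOW_ALL)
closed_site = build_parser(BLOCK_ALL)

url = "https://example.com/any/page.html"
print(open_site.can_fetch("ExampleBot", url))    # True
print(closed_site.can_fetch("ExampleBot", url))  # False
```

Against a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().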

3. Partial Access

User-agent: *
Disallow: /folder/

User-agent: *
Disallow: /file.html

Some sites disallow crawling only particular sections or files on their site. In such cases, you should direct your bots to leave the blocked areas untouched.
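One way to keep a bot out of the blocked areas is to filter its URL queue before fetching anything. The sketch below assumes both partial-access rules appear in a single group; allowed_urls, ExampleBot, and the candidate URLs are hypothetical names chosen for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /folder/
Disallow: /file.html
"""


def allowed_urls(robots_text, user_agent, urls):
    """Return only the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/index.html",
    "https://example.com/folder/report.html",  # inside the blocked folder
    "https://example.com/file.html",           # the blocked file
    "https://example.com/folder-notes.html",   # similar name, not blocked
]
crawlable = allowed_urls(ROBOTS_TXT, "ExampleBot", candidates)
print(crawlable)
# ['https://example.com/index.html', 'https://example.com/folder-notes.html']
```

Disallow values are matched as path prefixes, which is why /folder-notes.html stays crawlable while everything under /folder/ is skipped.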

4. Crawl Rate Limiting

Crawl-delay: 11

This limits how frequently crawlers hit the site. Frequent hits by crawlers can place unwanted stress on the server and make the site slow for human visitors, so many sites add this line to their robots file. In this case, a crawler should wait at least 11 seconds between successive requests. Crawl-delay is a non-standard extension, so not every crawler honors it, but a well-behaved scraper should.
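A crawler can read the delay and pause between requests accordingly. This is a rough sketch, assuming the Crawl-delay line belongs to a User-agent: * group (the parser ignores directives that appear outside a group). The function crawl_politely and its fetch and sleep parameters are hypothetical and exist only to illustrate the throttling.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumption: the directive sits inside a group that applies to all bots.
ROBOTS_TXT = "User-agent: *\nCrawl-delay: 11"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when the site declares no delay.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 11


def crawl_politely(urls, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

The sleep function is passed in as a parameter so the pause can be replaced in tests; in real use the default time.sleep applies.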

5. Visit Time

Visit-time: 0400-0845

This tells crawlers the hours during which crawling is allowed. In this example, the site can be crawled between 04:00 and 08:45 UTC. Sites do this to avoid load from bots during their peak hours.
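Visit-time is a non-standard directive, and Python's urllib.robotparser does not expose it, so a crawler that wants to honor it has to parse the value itself. The sketch below assumes the HHMM-HHMM format shown above, interpreted as UTC; parse_visit_time and within_visit_time are hypothetical helper names.

```python
from datetime import datetime, time, timedelta, timezone


def parse_visit_time(value: str):
    """Parse 'HHMM-HHMM' into a (start, end) pair of datetime.time objects."""
    start_raw, end_raw = value.strip().split("-")

    def to_time(raw: str) -> time:
        return time(int(raw[:2]), int(raw[2:]))

    return to_time(start_raw), to_time(end_raw)


def within_visit_time(value: str, now: datetime) -> bool:
    """Return True if the timezone-aware datetime 'now' is inside the window."""
    start, end = parse_visit_time(value)
    current = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current <= end
    # The window wraps past midnight, e.g. 2200-0300.
    return current >= start or current <= end


utc = timezone.utc
print(within_visit_time("0400-0845", datetime(2017, 3, 3, 5, 30, tzinfo=utc)))  # True
print(within_visit_time("0400-0845", datetime(2017, 3, 3, 9, 0, tzinfo=utc)))   # False

# A local time is converted to UTC first: 10:30 at UTC+05:30 is 05:00 UTC.
ist = timezone(timedelta(hours=5, minutes=30))
print(within_visit_time("0400-0845", datetime(2017, 3, 3, 10, 30, tzinfo=ist)))  # True
```

A crawler could run this check before each batch and sleep until the window opens when it returns False.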

6. Request Rate

Request-rate: 1/10

Some websites do not tolerate bots fetching many pages in quick succession or in parallel. The request rate is used to limit this behavior. A value of 1/10 means the site allows crawlers to request one page every 10 seconds.
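Python's urllib.robotparser can read this directive through request_rate(), which returns a named tuple with requests and seconds fields. The sketch below turns that into a minimum interval between requests. As with the other examples, the directive is assumed to sit inside a User-agent: * group, and min_interval is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the directive sits inside a group that applies to all bots.
ROBOTS_TXT = "User-agent: *\nRequest-rate: 1/10"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# request_rate() returns None when the site declares no rate.
rate = parser.request_rate("ExampleBot")


def min_interval(rate) -> float:
    """Seconds to wait between requests; 0.0 when no rate is declared."""
    if rate is None or rate.requests == 0:
        return 0.0
    return rate.seconds / rate.requests


print(rate.requests, rate.seconds)  # 1 10
print(min_interval(rate))           # 10.0
```

When a site sets both Crawl-delay and Request-rate, taking the larger of the two intervals is the cautious choice.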

Be a Good Bot

Good bots comply with the rules set by websites in their robots.txt file and follow best practices while crawling and scraping. It goes without saying that you should study the robots.txt file of every targeted website in order to make sure that you aren’t violating any rules.

Confused?

It’s not uncommon to feel intimidated by all the complex tech jargon and rules associated with web crawling. If you find yourself in a situation where you need to extract web data but are confused about compliance issues, we’d be happy to be your data partner and take end-to-end ownership of the process.