
Read and Respect Robots.txt File

March 3, 2017 | Category: Blog, Web Scraping

Robots.txt is a file used by websites to let ‘search bots’ know whether and how the site should be crawled and indexed by search engines. Many sites disallow crawling altogether, meaning the site shouldn’t be crawled by search engines or other crawler bots. When you are trying to extract data from the web, it is critical to understand what robots.txt is and how to read and respect it to avoid legal ramifications.
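
Before crawling any site, you can fetch and inspect its robots.txt programmatically. Below is a minimal sketch using Python's standard-library urllib.robotparser; the site URL and the "MyCrawler" user-agent name are placeholders you would replace with your own.

from urllib import robotparser

# Placeholder target site; replace with the site you intend to crawl.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

# Ask whether our (hypothetical) user agent may fetch a given page.
print(rp.can_fetch("MyCrawler", "https://www.example.com/some/page.html"))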

Why should you Read and Respect Robots.txt File?

Respecting robots.txt shouldn’t be only about avoiding legal complications. Just as you follow lane discipline while driving on a highway, you should respect the robots.txt file of any website you crawl. It is considered standard behaviour on the web and is in the best interest of web publishers.

Many websites block crawlers because their content is sensitive in nature or of little use to the public. If that isn’t reason enough to comply with robots.txt rules, note that crawling a website that disallows bots can lead to a lawsuit and end badly for the firm or individual behind the crawler. Let’s now move on to how you can follow robots.txt and stay in the safe zone.

Robots.txt Rules

1. Allow Full Access

User-agent: *
Disallow:

If you find this in the robots.txt file of a website you’re trying to crawl, you’re in luck. This means all pages on the site are crawlable by bots.
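
You can confirm this programmatically by feeding the rules straight into urllib.robotparser. A small sketch, assuming a generic "MyCrawler" user agent:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])

# With an empty Disallow, every path is crawlable.
print(rp.can_fetch("MyCrawler", "/any/page.html"))     # True
print(rp.can_fetch("MyCrawler", "/folder/file.html"))  # True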

2. Block All Access

User-agent: *
Disallow: /

You should steer clear of a site with this in its robots.txt. It states that no part of the site should be visited by an automated crawler, and violating this could mean legal trouble.
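
A cautious crawler can check for this case up front and refuse to proceed. A minimal sketch of that guard, again with a placeholder user agent:

import sys
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# If even the root path is off-limits, skip the site entirely.
if not rp.can_fetch("MyCrawler", "/"):
    sys.exit("robots.txt disallows all crawling on this site; skipping it.")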

3. Partial Access

User-agent: *
Disallow: /folder/

User-agent: *
Disallow: /file.html

Some sites disallow crawling of only particular sections or files. In such cases, you should direct your bots to leave the blocked areas untouched.
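
In code, this amounts to filtering your URL queue through can_fetch before making any requests. A rough sketch, with the two example directives combined under a single User-agent group and a placeholder user agent:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /folder/", "Disallow: /file.html"])

candidates = ["/index.html", "/folder/page.html", "/file.html", "/about.html"]

# Keep only the URLs the site allows bots to visit.
allowed = [url for url in candidates if rp.can_fetch("MyCrawler", url)]
print(allowed)  # ['/index.html', '/about.html']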

4. Crawl Rate Limiting

Crawl-delay: 11

This directive is used to keep crawlers from hitting the site too frequently. Since frequent hits by crawlers can place unwanted stress on the server and make the site slow for human visitors, many sites add this line to their robots.txt file. In this case, the site should be crawled with a delay of 11 seconds between requests.
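
urllib.robotparser exposes this value through crawl_delay(), so your crawler can pause between requests accordingly. A small sketch; the fetch step is just a placeholder print:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 11"])

delay = rp.crawl_delay("MyCrawler") or 0  # 11 here; None when the directive is absent
for url in ["/page1.html", "/page2.html"]:
    print("fetching", url)  # real download logic would go here
    time.sleep(delay)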

5. Visit Time

Visit-time: 0400-0845      

This tells crawlers the hours during which crawling is allowed. In this example, the site can be crawled between 04:00 and 08:45 UTC. Sites do this to avoid load from bots during their peak hours.
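
Visit-time is a non-standard extension that urllib.robotparser does not parse, so a crawler has to read it by hand. A rough sketch, assuming the HHMM-HHMM format shown above with times in UTC:

from datetime import datetime, time, timezone

def parse_visit_time(robots_txt):
    # Return (start, end) times from a Visit-time line, or None if absent.
    for line in robots_txt.splitlines():
        if line.lower().startswith("visit-time:"):
            start, end = line.split(":", 1)[1].strip().split("-")
            return (time(int(start[:2]), int(start[2:])),
                    time(int(end[:2]), int(end[2:])))
    return None

window = parse_visit_time("Visit-time: 0400-0845")
if window:
    now = datetime.now(timezone.utc).time()
    print("crawling allowed right now:", window[0] <= now <= window[1])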

6. Request Rate

Request-rate: 1/10

Some websites do not entertain bots trying to fetch multiple pages simultaneously; the Request-rate directive is used to limit this behaviour. A value of 1/10 means the site allows crawlers to request one page every 10 seconds.
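
Since Python 3.6, urllib.robotparser also reads this directive via request_rate(), which returns the requests-per-seconds pair. A brief sketch with a placeholder user agent:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Request-rate: 1/10"])

rate = rp.request_rate("MyCrawler")  # None when the directive is absent
if rate:
    # 1 request every 10 seconds, i.e. at most 0.1 requests per second.
    print(rate.requests, "request(s) every", rate.seconds, "seconds")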

Be a Good Bot

Good bots comply with the rules set by websites in their robots.txt files and follow best practices while crawling and scraping. It goes without saying that you should study the robots.txt file of every target website to make sure you aren’t violating any rules.
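
Putting the pieces together, a polite crawler loop might look roughly like the sketch below; the site URL, user-agent name, and fetch step are all placeholders rather than a definitive implementation.

import time
from urllib import robotparser

SITE = "https://www.example.com"  # placeholder target site
USER_AGENT = "MyCrawler"          # placeholder bot name

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

delay = rp.crawl_delay(USER_AGENT) or 1  # fall back to a modest 1-second pause

for url in [SITE + "/page1.html", SITE + "/page2.html"]:
    if not rp.can_fetch(USER_AGENT, url):
        print("skipping disallowed URL:", url)
        continue
    print("fetching", url)  # real download and parsing would go here
    time.sleep(delay)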

Confused?

It’s not uncommon to feel intimidated by all the complex tech jargon and rules associated with web crawling. If you find yourself in a situation where you need to extract web data but are confused about compliance issues, we’d be happy to be your data partner and take end-to-end ownership of the process.

