How to Read a Robots.txt File

September 13, 2015 · Category: Blog


Crawling public data from around the web has become integral to most organizations these days, small or large. As "web crawling" gains popularity across industries, concerns about its legal ramifications have grown too. Although these rules can differ across geographies (and are usually tested only under litigation), there is thankfully a general consensus around scraping rules on the borderless internet.

Most websites have an instruction document for bots that enumerates a set of rules for automated access to their pages. It is always hosted at the root location, so you can find it at domain/robots.txt. Consider this a legal document that your bot needs to abide by if you plan to crawl that particular site. Checking it has to be the foremost step before you decide to ethically crawl a site.

  • User agents– Instructions may be grouped by user agent, which is simply the name of a bot, such as Googlebot, Yahoobot, or MSNbot; * means everyone else. So if your user agent is not explicitly listed in a robots.txt file, you should conform to the instructions given for all other user agents (*).
  • Allow vs. Disallow– These are the key directives of the file. If the URL pattern you are interested in is disallowed, you are expected not to crawl it, and vice versa for Allow. Here’s an example.

User-agent: Googlebot
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/

Allow: /wishlist/universal*
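These rules can also be checked programmatically. Below is a minimal sketch using Python's standard-library `urllib.robotparser`; the rule set and URLs are illustrative, and only plain-prefix rules are used here because the stdlib parser does not expand * wildcards inside paths.

```python
from urllib import robotparser

# Illustrative rules modelled on the example above (prefix rules only).
rules = """\
User-agent: Googlebot
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A URL under a disallowed prefix is off limits for Googlebot:
print(rp.can_fetch("Googlebot", "https://example.com/gp/cdp/member-reviews/R1"))
# A URL outside the disallowed prefixes is fetchable:
print(rp.can_fetch("Googlebot", "https://example.com/wishlist/universal"))
```

Running this prints `False` for the disallowed review page and `True` for the wishlist URL.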

  • Crawl delays– Robots.txt files also at times specify what kind of delay is to be maintained between fetches. The numbers listed are typically in seconds. Conforming to these delays ensures you follow the target site’s politeness policy and do not overload it with requests. For example:

User-Agent: slurp
Crawl-delay: .25
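To honor such a delay in your own crawler, you can read it with `urllib.robotparser` and sleep between consecutive fetches. A small sketch, using an assumed whole-second value since the stdlib parser may not recognize fractional delays like .25:

```python
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: slurp
Crawl-delay: 2
""".splitlines())

# Read the delay (in seconds) that applies to this user agent:
delay = rp.crawl_delay("slurp")
print(delay)

# In a crawl loop, wait this long between consecutive requests:
if delay:
    time.sleep(delay)
```

`crawl_delay()` returns `None` when no delay is specified for the agent, so the `if` guard keeps the loop safe either way.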

  • Sitemaps– Auto-generated XML sitemaps are often listed in the robots.txt file. They help the relevant bots index the pages within the sitemap. And of course, your user agent can read them too, if allowed.
  • Wildcards and other characters– Many statements within this file use the wildcard character “*” along with “/”. So if a file reads as below, it means all user agents are allowed on all pages.

User-agent: *

Allow: /
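Such an allow-all file can be confirmed with the stdlib parser; the * group acts as the default for any user agent not listed explicitly (the bot name below is hypothetical):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Allow: /"])

# The * group applies to any bot, and Allow: / opens every path:
print(rp.can_fetch("SomeRandomBot", "https://example.com/any/page"))
```

This prints `True` for any bot name and any path on the site.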

  • Comments– Make sure the instruction statements are not commented out. If any of them are, they are ignored by crawlers and probably not in effect yet.

# User-agent: *

# Disallow: /
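You can verify that commented-out lines carry no weight. In this sketch (again with a hypothetical bot name), the parser discards both commented lines, so no rules are in force:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
# User-agent: *
# Disallow: /
""".splitlines())

# Both lines are comments, so no rules apply and any URL is fetchable:
print(rp.can_fetch("mybot", "https://example.com/anything"))
```

Uncomment the two lines and the same call would return `False` instead.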

Since all websites inherently want to be as visible as possible, you would seldom find them blocking bots outright via their robots.txt files. At PromptCloud, we’ve found only about 2% of the sites on the web disallowing access to bots. However, certain actions are intended to be performed only by humans (like login, add to cart, etc.), and those are more often than not blocked for bot access. So for anyone still wondering whether crawling is legal, it’s time to check the robots.txt file, because it is the authority telling you what you CAN crawl and what is better left to humans.

P.S. We’ll be rolling out a parser for robots.txt soon that will help you assess the crawl feasibility of a site. Keep an eye out 🙂

