Crawling public data from around the web is now integral to how most organizations operate, small or large. As “web crawling” has gained popularity across industries, concerns about its legal ramifications have grown too. The laws differ across geographies (and usually come into play only under litigation), but fortunately there is a general consensus around scraping rules on the borderless internet.
Most websites publish an instruction document for bots that enumerates the rules for automated access to the site. It’s always hosted at the root of the domain, so you can find it at domain/robots.txt (for example, https://amazon.com/robots.txt). Treat it as a legal document that your bot must abide by if you plan to crawl that particular site. Reading it has to be your first step if you intend to crawl a site ethically.
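Because the file always sits at the root, its address can be derived from any page URL on the site. Here is a minimal Python sketch of that step; the helper name robots_url is our own choice for illustration, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used only in this sketch.
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://amazon.com/some/deep/page?ref=home"))
```

Note that each host (and each subdomain) serves its own robots.txt, so the helper keeps the host exactly as given rather than trimming it to a parent domain.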
- User agents– Instructions may be grouped by user agent, which is nothing but the name of a bot, such as Googlebot, Yahoo’s Slurp, or MSNbot. The special name * stands for every bot that is not listed by name. So if your user agent is not listed in a robots.txt file, you should conform to the instructions given for *.
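- Allow vs. Disallow– These are the key directives in the file. If the URL pattern you are interested in is disallowed, you are expected not to crawl it; if it is allowed, you may. For example, a rule such as Disallow: /cart/ (a hypothetical path) tells the matching bots to stay away from every URL whose path begins with /cart/.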
- Crawl delays– Robots.txt files sometimes also specify the delay to maintain between fetches, typically in seconds. Conforming to these delays means you follow the target site’s politeness policy and do not overload it with requests. For example, a line such as Crawl-delay: 10 asks a bot to wait 10 seconds between successive requests.
- Sitemaps– Most auto-generated XML sitemaps are listed in the robots.txt file. They help the relevant bots discover and index the pages named in the sitemap. And of course, your user agent can read them too if allowed.
- Wildcards and other characters– Most statements in this file use the wildcard character “*” and the path character “/”. As a user agent, “*” matches any bot; as a path, “/” matches every page on the site. So if a file contains the lines below, all user agents are allowed on all pages.
User-agent: *
Allow: /
- Comments– Make sure the instruction statements you rely on are not commented out. A line that starts with “#” is a comment and is ignored by bots, so a rule like the one below is not in effect yet.
# User-agent: *
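The elements listed above can also be checked programmatically. What follows is a minimal sketch using Python’s standard-library urllib.robotparser (site_maps() needs Python 3.8 or later). The robots.txt content, the example.com URLs, and the bot names MyCrawler and ExampleBot are all hypothetical, made up for illustration. One assumption worth flagging: this parser applies the first matching rule in file order, so the sample lists its Disallow lines before the catch-all Allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /login
# Disallow: /products/   <- commented out, so this rule is not in effect
Allow: /
Crawl-delay: 5

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# MyCrawler is not listed by name, so the "*" group applies to it.
print(rules.can_fetch("MyCrawler", "https://example.com/products/42"))    # True
print(rules.can_fetch("MyCrawler", "https://example.com/cart/checkout"))  # False
print(rules.can_fetch("MyCrawler", "https://example.com/login"))          # False

# ExampleBot has its own group, which disallows the whole site.
print(rules.can_fetch("ExampleBot", "https://example.com/products/42"))   # False

# Politeness delay (in seconds) and sitemap locations, if the file declares them.
print(rules.crawl_delay("MyCrawler"))  # 5
print(rules.site_maps())               # ['https://example.com/sitemap.xml']
```

In a real crawler you would fetch the live file instead of using a string (RobotFileParser also offers set_url() and read() for that) and sleep for the reported crawl delay between requests.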
Since websites inherently want to be as visible as possible, you’ll seldom find them blocking bots via their robots.txt files. At PromptCloud, we’ve found that only ~2% of the sites on the web disallow access to bots. However, certain actions are meant to be performed only by humans (like logging in, adding to cart, etc.), and these are more often than not blocked for bots. So for everyone still wondering whether crawling is legal, it’s time to check the robots file, because it’s the authority telling you what you CAN CRAWL and what is better left to humans.
P.S. We’ll be rolling out a parser for robots.txt soon that will help you assess the feasibility of crawling a site. Keep an eye out 🙂