Crawling public data from around the web is an integral part of most organizations, small or large, these days. With “web crawling” gaining popularity across industries, concerns around legal ramifications of it have grown too. Although these rules differ across geographies (only if under litigation), gladly there has been a general consensus around scraping rules on the borderless internet.
Most of the websites have an instruction document for bots, that enumerates a set of rules for automated access of their sites. It’s always hosted at the root location so you can find it at domain/robots.txt (example- http://amazon.com/robots.txt). Consider this a legal document that your bot needs to abide by if you plan to crawl that particular site. This has to be the foremost step before you decide to ethically crawl a site.
User Agent: *
# User Agent:*
Since inherently all websites would want to get as visible as possible, you’d seldom find them blocking bots via their robots.txt files. At PromptCloud, we’ve found only ~2% of the sites on the web disallowing access to bots. However, there are certain actions that are intended to be performed only by humans (like login, add to cart, etc.) which are more often than not blocked for bot access. So for all those still concerned if crawling is legal, time to check out the robots file because it’s the authority telling you what you CAN CRAWL and what is better left for humans.
P.S. We’d be rolling out a parser for robots.txt soon that’ll help you conclude feasibility of a site. Keep an eye 🙂