Businesses that crawl the web get hit with this question on every possible platform. Our answer to it is another question: is viewing a web page in your browser legal? Hell yes!
Crawling means fetching content from web pages in an automated manner, as opposed to manually opening each page in your browser. The calls a browser makes to the server hosting a web page are similar to the way a bot hits a page to grab its content. So why is web crawling taboo among those who have only just learnt the term? Mostly because it is quite often used against a website’s policies and in ways that break the ground rules of web crawling.
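To make the browser-versus-bot comparison concrete, here is a minimal sketch using Python's standard library. The URL and both user-agent strings are made up for illustration, and nothing is actually sent over the network; the point is only that the two requests have the same shape and differ in how they identify themselves.

```python
from urllib.request import Request

# Hypothetical URL and user-agent strings, for illustration only.
URL = "https://example.com/products"

# What a browser would ask for...
browser_request = Request(URL, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
# ...and what a (well-behaved, self-identifying) bot would ask for.
bot_request = Request(URL, headers={"User-Agent": "example-crawler/1.0"})


def describe(request):
    """Summarize a request as (HTTP method, URL, user agent)."""
    return (request.get_method(), request.full_url, request.get_header("User-agent"))
```

Calling `describe()` on both objects gives the same method and URL; only the user agent differs.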
Here are some rules of thumb to follow if you want a bot to behave humanely (pun intended).
- Robots.txt – Consider this a filter-cum-consent form that you should abide by if you intend to crawl that site. It tells you which URLs you can and cannot crawl. These rules are rarely bot-specific: even Googlebot can’t crawl a blocked page, unless the site makes an exception for it because it is worried about that page’s SEO.
- Public content – Crawl only public content, keeping copyright policies in mind. If you’re crawling a site only to reproduce the same content on a new site of your own, good luck with that!
- Authentication-based sites – Some sites require authentication before you can access their content, and most of them discourage crawling because they only want real human beings logging in.
- Crawl delay – robots.txt may also list a delay to be maintained between consecutive requests, to ensure you’re not hitting the site’s servers too hard. If you overload them with requests, chances are that your IPs will be blocked.
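The robots.txt and crawl-delay rules above can be checked in code before a single page is requested. The following is a rough sketch using Python's standard `urllib.robotparser`; the robots.txt content, the bot name `example-crawler` and the helper names `load_rules` and `plan_crawl` are all hypothetical, and a real crawler would fetch the file from the target site instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# https://<site>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-crawler"  # hypothetical bot name


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_crawl(parser, urls, default_delay=1.0):
    """Return (allowed_urls, delay_in_seconds) for the candidate URLs."""
    allowed = [url for url in urls if parser.can_fetch(USER_AGENT, url)]
    delay = parser.crawl_delay(USER_AGENT)
    # Fall back to an assumed polite default when no Crawl-delay is listed.
    return allowed, (delay if delay is not None else default_delay)
```

A crawler built on this would only request the URLs in `allowed` and would sleep for `delay` seconds (for example with `time.sleep`) between consecutive requests.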
If you have followed the points above and are still uneasy about crawling, whether you are the one crawling or the one being crawled, let’s look at why crawling is possible in the first place.
- The content of a website is made public so that it reaches the public. The wider the reach, the better the publicity. Crawling only extends that reach, as long as it follows the rules above.
- Some websites host truckloads of information that is difficult to assimilate manually, so (as with every other technological intervention) a bot has to step in.
- Many businesses these days thrive on the data they collect from multiple other businesses (think data analytics). Although they haven’t traded with each of those website hosts, they have built or rented a complex technology stack by their own means in order to acquire this varied data. This process has always helped new businesses jump in.
Conclusion – Crawling is not an under-the-table activity. It’s just another way of collecting data, and it takes considerable technical know-how to do well.
P.S. We’re not lawyers, and this post reflects our limited knowledge of crawling, which is an integral part of our big data solutions.