As an evolving field, web data extraction is still a gray area, with no clear ground rules on the legality of web scraping. With growing concerns among companies about how others use their data, crawling the web is gradually becoming more complicated. The situation is further aggravated by the growing complexity of web pages, driven by techniques such as AJAX.
Here are a couple of ground rules that every web crawling solution should follow, because the line between being a 'crawler' and a 'hacker' is very thin:
Politeness – It’s easy to overburden small servers and effectively DDoS a target site, which can prove detrimental to any company (especially small businesses). As a rule of thumb, leave an interval of at least 2 seconds between successive requests to avoid hitting the target servers too hard.
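The throttling rule above can be sketched in a few lines. This is a minimal illustration, not a production crawler; the URL list and the `fetch` hook are hypothetical stand-ins for a real fetching routine:

```python
import time
import urllib.request

CRAWL_DELAY = 2.0  # rule of thumb: at least 2 seconds between requests

def polite_fetch(urls, delay=CRAWL_DELAY, fetch=None):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    # Default fetcher; a real crawler would add headers, retries, timeouts.
    fetch = fetch or (lambda u: urllib.request.urlopen(u).read())
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # throttle so we don't hammer the server
        results.append(fetch(url))
    return results
```

Injecting the `fetch` callable keeps the throttling logic testable without touching the network.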
Crawlability – A lot of websites restrict how much of their content (specific sections or the entire site) various search agents are allowed to crawl, via the robots.txt file (located at http://example.com/robots.txt). The first step in establishing the feasibility of crawling any site is to check whether it allows bots in the sections needed for extracting the desired data.
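Python's standard library can perform this check with `urllib.robotparser`. The sketch below parses a sample robots.txt body inline; in practice you would fetch the file from the target site first, and the user agent string `mybot` is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt body; in practice, download http://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check specific paths before crawling them
print(rp.can_fetch("mybot", "http://example.com/products"))   # True
print(rp.can_fetch("mybot", "http://example.com/private/x"))  # False
```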
Although crawling web data for our clients brings new and exciting challenges every day, here are a few notable ones that seem like the beginning of a trend:
The web is a dynamic space with inconsistencies in data formats and structures. There are no norms to be followed while building a web presence. Due to this lack of uniformity, collecting data in a machine-readable format can be difficult. The problem gets amplified with scale when you need structured data (a process known as data extraction), and it becomes especially challenging when many details must be extracted against a specific schema from thousands of web sources.
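One common way to cope with this lack of uniformity is to map each site's raw field names onto a single target schema. The sketch below is a simplified illustration; the field aliases are hypothetical examples of the naming variations seen across sites:

```python
# Hypothetical aliases for the same logical field across different sites.
FIELD_ALIASES = {
    "name": ["name", "title", "product_name"],
    "price": ["price", "cost", "amount"],
}

def normalize(record, aliases=FIELD_ALIASES):
    """Map a raw record with site-specific keys onto one fixed schema."""
    out = {}
    for field, candidates in aliases.items():
        # Take the first alias present in the record, else leave the gap.
        out[field] = next((record[k] for k in candidates if k in record), None)
    return out

print(normalize({"title": "Widget", "cost": "9.99"}))
# {'name': 'Widget', 'price': '9.99'}
```

Missing fields surface as `None` rather than crashing, which matters when processing thousands of inconsistent sources.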
Although AJAX and interactive web components make websites more user-friendly, the same is not true for crawlers. Even for Google's crawler it is not easy to extract information from AJAX-based pages, as such content is produced dynamically in the browser and is therefore not visible to crawlers. At PromptCloud we have successfully tackled this problem (learn more about it here), but there is still plenty of scope to make the approach more efficient and scalable.
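The core of the problem is easy to demonstrate: the HTML a server returns for an AJAX-heavy page often contains only an empty container and a script, so static extraction finds nothing. The snippet below is a simplified sketch with a made-up page and `/api/listings` endpoint:

```python
import re

# What the server actually returns for an AJAX-heavy page (simplified):
# the container is empty; the data arrives later via an XHR call.
STATIC_HTML = """
<div id="listings"></div>
<script>fetch('/api/listings').then(r => r.json()).then(render);</script>
"""

def extract_listings(html):
    """Naive static extraction: look for items inside the container."""
    body = re.search(r'<div id="listings">(.*?)</div>', html, re.S)
    items = re.findall(r'<li>(.*?)</li>', body.group(1)) if body else []
    return items

print(extract_listings(STATIC_HTML))  # [] -- the data simply isn't there
```

Workarounds typically involve either calling the underlying data API directly or rendering the page in a headless browser before extraction.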
Another interesting challenge is acquiring datasets in real time. This is especially necessary in fields such as security and intelligence, for predicting and reporting early signs of incidents. We have in fact crossed the barrier of near-real-time latency, but delivering data in true real time (<10 seconds) from thousands of sources at once will surely be a big step forward.
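Fetching thousands of sources within a tight latency budget is fundamentally a concurrency problem: requests must overlap rather than run one after another. The sketch below simulates this with a fake `fetch` and small delays; it is an illustration of the idea, not PromptCloud's actual architecture:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(source):
    """Simulated fetch standing in for one HTTP request to one source."""
    time.sleep(0.05)  # pretend network latency
    return f"data from {source}"

sources = [f"site-{i}" for i in range(20)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, sources))
elapsed = time.monotonic() - start
# With 20 workers the 20 fetches overlap, so total elapsed time stays
# close to one fetch's latency instead of 20x that.
```

At real scale the same idea is usually implemented with async I/O and distributed workers, but the principle of overlapping requests is the same.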
Even User-Generated Content (UGC) is claimed as proprietary by giants like Craigslist and Yelp and is usually out of bounds for commercial crawlers. These sites police web scraping and discourage bots, and here you are bound by legal constraints. Currently only 2-3% of sites on the web disallow bots, while the rest encourage the democratization of data, but this is beyond the control of crawling solution providers, and it is quite possible that more sites will follow suit, depriving you of the chance to mine this gold!
Even with all these limitations, web data still presents huge opportunities if you know how to put it to the right use. The challenges keep growing, but there are always ways to work around them.