Evolution of Web Crawling: How crawling the web emerged as a mainstream discipline
Web crawling as a market segment has come a long way, from an emerging technology to an integral part of many businesses – sometimes whole companies are built around crawling and extracting data. In effect, data translates into money when leveraged effectively, as is evident from the success of price comparison sites and of social media monitoring and reputation management companies.
Web crawlers visit web pages, acquire data, and discover new pages by following links from the ‘seed’ pages. Though most people believe that Google’s crawler was probably the first to crawl the web in its entirety, web crawling as a technology has a rather long and interesting history behind it (as we will illustrate later in this post). The initial crawlers could only collect data; modern web crawlers go further and can also monitor web applications for vulnerability and accessibility issues. The first crawlers were developed for a much smaller web (about 100,000 web pages), whereas today some popular sites alone have millions of pages. But not all crawlers are built for the entire web. Here’s how some of the web crawlers differentiate themselves:
Different Crawlers have different focus
Crawlers are built to solve specific problems, and hence each tends to prioritise one or more of the following three parameters:
1. Quality – these crawlers aim for comprehensive coverage: they start from good resources, discover new (linked) pages and close the loop by crawling those in turn.
2. Representation – these crawlers make sure that their copies of the target data are complete. They are usually site specific and crawl the deep web.
3. Freshness – in certain situations, having the latest copies of data is vital. These crawlers are suited to areas where the data must be current, possibly refreshed in near real time.
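Whatever the focus, the underlying loop is the one described earlier: visit a page, keep its data, and queue any newly discovered links, starting from the seeds. Below is a minimal Python sketch of that loop. It assumes a breadth-first order and uses a hypothetical in-memory site (`FAKE_WEB`) in place of real HTTP requests; the names `crawl`, `LinkExtractor` and `fetch` are illustrative and not taken from any particular crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)   # pages waiting to be visited
    seen = set(seeds)         # pages already queued, to avoid revisiting
    pages = {}                # url -> HTML collected so far
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue          # unreachable page: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page site standing in for the real web (an assumption).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

A quality-focused crawler would mostly tune how the frontier grows, while a freshness-focused one would revisit pages already in `seen` on a schedule; the sketch above does neither and is only meant to show the basic discovery loop.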
First Web Crawlers
Here’s a brief history of the first generation of web crawlers.
1. RBSE spider – developed and used in 1994 by the NASA-funded Repository Based Software Engineering (RBSE) program at the University of Houston, Clear Lake. It was built by David Eichmann using C, Oracle and WAIS. The primary purpose of this crawler was to build an index and gather statistics about the web. At the time this crawler was built, the web comprised only about 100,000 pages.
2. WebCrawler – created by Brian Pinkerton of the University of Washington and launched on April 20, 1994, WebCrawler was one of the earliest search engines powered by a web crawler. According to Wikipedia, WebCrawler was the first web search engine to provide full text search.
3. Archive.org – the Internet Archive, whose web archive is served through the Wayback Machine, uses Heritrix as its web crawler for archiving the web. Heritrix is written in Java, is released under a free software license, and can be operated either through a web browser interface or through a command line tool. It is also worth noting that Heritrix is not the only crawler that was used in building the Internet Archive. In fact, most of the data has been donated by Alexa Internet, which crawls data for its own purposes with a crawler called ia_archiver.
Second Generation Crawlers
The second generation crawlers were either a) focused crawlers or b) large scale crawlers. Focused crawlers, such as SPHINX, were site specific, personally customized and relocatable. Large scale crawlers, such as Mercator and those developed by search engine providers such as Google, Lycos and Excite, were capable of global-scale crawling and indexing of data.
From in-house development to DIY Tools
Once enterprises started taking web crawling seriously, they developed in-house expertise. A development team was tasked with crawling relevant data from the web. Developers leveraged open-source technologies and built upon them to suit their crawling needs. However, this demanded dedicated teams for end-to-end monitoring of the crawl pipeline, which turned out to be quite a challenge to manage at scale. Moreover, the development teams were then unable to focus on the core solution that the crawls were meant to supplement.
Given the addressable market for such solutions, there was a sudden rise of Do it Yourself (DIY) tools. They served a purpose: as the web still had relatively consistent structures, it was easy to extract data using these tools. Apache Nutch and Scrapy were among the more popular ones. Development teams started leveraging such tools for quicker access to web data with minimal technical barriers. However, with the increasing complexity of the web, the limitations of these tools surfaced rather quickly. The DIY tools had a different constraint altogether: they worked on a limited scale and, in case of failures, gave users much less control over the fixes.
The Outsourcing Model
Subsequently, customized web scraping solutions emerged – a model wherein end to end crawling and extraction is taken care of by an external provider. Clean and structured data is usually delivered on an ongoing basis through a pipeline. This means that companies can focus on their core business and let someone else take care of providing them with relevant data.
Although the initial web pages had limited functionality and a mostly uniform structure, the rapid expansion of the web and the added complexity of web page elements have made crawling considerably more challenging. Speciality web crawling companies offer either a SaaS product or a Data-as-a-Service model (such as ours). Both models have their typical advantages and shortcomings, and the right match depends on the user’s business scenario: the nature of the requirements and the availability of time, resources, manpower and know-how.
Additionally, significant tactical challenges exist today in the world of web crawling. Some of them are:
- IP address blocking by target websites,
- the risk of overloading the target sites’ servers, in effect an unintended DDoS,
- non-uniform web structures,
- AJAX elements, whose content loads only after the initial page,
- the need for real-time, low-latency data, and
- other anti-scraping tactics and tools such as ScrapeShield (provided by the popular CDN Cloudflare).
Nevertheless, solutions have evolved along with the challenges, and most of the above issues are handled at PromptCloud using various mechanisms.
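As one illustration of how the server-overload risk listed above is commonly mitigated, a crawler can enforce a minimum delay between requests to the same host. The Python sketch below is a generic example under assumed names (`PolitenessGate`, `wait_time`, `record_fetch`) and an assumed two-second delay; it is not a description of PromptCloud's mechanisms or of any specific tool.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the gate can be tested
        self.next_allowed = {}      # host -> earliest allowed request time

    def wait_time(self, url):
        """Seconds the caller should still wait before fetching this URL."""
        host = urlparse(url).netloc
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url):
        """Call right after a request so the host's delay window restarts."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay


# Illustrative usage: the 2-second delay is an arbitrary assumption.
gate = PolitenessGate(min_delay_seconds=2.0)
url = "http://example.test/page"
if gate.wait_time(url) == 0:
    gate.record_fetch(url)          # a real crawler would fetch the page here
print(gate.wait_time(url) > 0)      # the same host must now wait
```

Keeping the delay per host rather than global lets a crawler stay busy across many sites while remaining gentle with each individual server.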
As the complexities involved in web crawling increase, in-house teams might not be the most efficient way of performing crawls. In such a situation, an outsourcing partner can be a useful ally, as they have specialized experience in the field and have already overcome most of the common challenges. Thus the ROI of using an outsourced service is often higher than that of developing in-house capabilities.