All You Need to Know About Web Crawling
Web crawling, also called spidering, is the process of finding web pages and downloading them. A web crawler, also known as a spider or a robot, is a program that downloads the web pages associated with a given set of URLs, extracts the hyperlinks contained in them, and continues downloading the pages those hyperlinks point to. In a given period of time, a substantial fraction of the “surface web” can be crawled this way. A large-scale web crawler should be able to download thousands of pages per second, which requires distributing the work among hundreds of computers.
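The download-extract-follow cycle described above can be sketched as a simple breadth-first loop. This is a minimal sketch, not a production crawler; `fetch_page` is a hypothetical helper standing in for the network code that downloads a URL and returns the hyperlinks found in its HTML.

```python
from collections import deque

def crawl(seed_urls, fetch_page, max_pages=100):
    """Breadth-first crawl: fetch each page, extract its links,
    and enqueue any link that has not been seen before.

    fetch_page(url) is an assumed helper that downloads the page
    and returns a list of the hyperlinks it contains."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # never enqueue the same URL twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_page(url):
            if link not in seen:  # skip duplicates so link cycles cannot trap us
                seen.add(link)
                frontier.append(link)
    return crawled
```

The `seen` set is what keeps the crawler robust against pages that link back to each other in cycles; without it the frontier would grow forever.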
Companies like Google, Facebook, and LinkedIn use web crawling to collect data, because much of the data these companies need exists only as web pages with no API access.
Features of Crawler
- Politeness: It should limit how often it visits any single website, so that it does not overload the site's server.
- Robustness: It should take care not to get trapped in an infinite sequence of generated pages (a "spider trap").
- Distributed: The crawl should be distributed among hundreds of computers.
- Performance and efficiency: It should make efficient use of processor, bandwidth, and storage resources.
- Quality: It should prioritize downloading useful, high-quality pages first.
Algorithm executed by a Web Crawler
A web crawler should use only a small portion of a website server's bandwidth, i.e. it fetches only one page at a time from any given server. To implement this, the request queue is split into a single queue per web server; a server's queue is open only if that server has not been accessed within the specified politeness window.
For example: if a web crawler can fetch 100 pages per second, and the politeness policy dictates that it cannot fetch more than 1 page every 30 seconds from any one server, then each fetch keeps a server "closed" for 30 seconds, so we need URLs from at least 100 × 30 = 3,000 different servers for the crawler to reach its peak throughput.
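The per-server queue scheme can be sketched as follows. This is a minimal single-threaded sketch, assuming the 30-second politeness window from the example above; the class name `PoliteFrontier` and its methods are illustrative, not from any particular crawler.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """One request queue per web server. A server's queue is 'open'
    only if that server has not been contacted within the politeness
    window (30 seconds in the example from the text)."""

    def __init__(self, politeness_window=30.0):
        self.window = politeness_window
        self.queues = defaultdict(deque)  # hostname -> pending URLs
        self.last_access = {}             # hostname -> time of last fetch

    def add(self, url):
        # Route each URL into the queue for its own server.
        self.queues[urlparse(url).hostname].append(url)

    def next_url(self, now=None):
        """Return a URL from any open server queue, or None if every
        server with pending URLs was contacted too recently."""
        now = time.monotonic() if now is None else now
        for host, queue in self.queues.items():
            last = self.last_access.get(host, -self.window)
            if queue and now - last >= self.window:
                self.last_access[host] = now
                return queue.popleft()
        return None
```

A fetcher thread would call `next_url()` in a loop; a `None` result means every known server is still inside its politeness window, which is exactly the situation where throughput stalls unless the frontier holds URLs from enough distinct servers.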
Web crawlers play an important role in web search engines, where they collect the pages that are to be indexed.
Web crawlers have other uses as well, such as web data mining.
Examples of web data mining:
- ShopWiki, a price comparison service
- Attributor, a service that mines the web for copyright violations
Some websites are quite difficult for a crawler to find. Such sites are collectively called the deep web (or hidden web).
Categories of websites in the Deep or Hidden Web
- Private Sites: Sites that require a login ID and password. Access is restricted to a limited set of people, so these pages cannot be crawled without credentials.
- Form Results: Pages that are generated only after particular data is entered into a form, such as train or flight ticket availability. The constraint is that it is difficult for a crawler to discover the content behind the form, or to detect when that content changes.
About the Author:
Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.