Web crawling, also called spidering, is the process of finding web pages and downloading them. A web crawler, also known as a spider or a robot, is a program that downloads the web pages associated with a given set of URLs, extracts the hyperlinks they contain, and then keeps downloading the pages those hyperlinks point to. Over a given period of time, a crawler covers a substantial fraction of the “surface web”. To do so, a web crawler should be able to download thousands of pages per second, with the work distributed among hundreds of computers.
Companies such as Google, Facebook, and LinkedIn use web crawling to collect data, because most of the data they need is available only as web pages with no API access.
Features of Crawler
Algorithm executed by a Web Crawler
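The loop described in the introduction (download a page, extract its hyperlinks, follow them) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production crawler: the names `LinkExtractor`, `extract_links`, and `crawl` are hypothetical, the breadth-first visiting order is just one possible choice, and the `fetch` function is passed in by the caller so that the sketch makes no assumption about how pages are actually downloaded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (with #fragments removed) linked from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    `fetch(url)` is a caller-supplied function (an assumption of this
    sketch) that returns the page's HTML as a string, or None on failure.
    """
    frontier = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    pages = {}                    # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

To try it without touching the network, `fetch` can be the `get` method of a dictionary that maps URLs to HTML strings; a real crawler would plug in an HTTP client there and would also need the politeness handling discussed next.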
A polite web crawler uses only a small portion of a web server's bandwidth, i.e. it fetches one page at a time from any given server. To implement this, the request queue should be split into one queue per web server; a server's queue is open only if that server has not been accessed within the specified politeness window.
For example, if a web crawler can fetch 100 pages per second and the politeness policy dictates that it cannot fetch more than one page every 30 seconds from a server, then it needs URLs from at least 100 × 30 = 3,000 different servers to reach its peak throughput.
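The per-server queues and the politeness window can be sketched roughly as follows. The class `PoliteFrontier` and the helper `min_servers` are hypothetical names made up for this illustration, and the current time is passed in explicitly as `now` (in seconds) to keep the example deterministic; a real crawler would more likely read a clock such as `time.monotonic()`.

```python
import math
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """One URL queue per server; a queue is open only if its server has
    not been accessed within the politeness window (in seconds)."""

    def __init__(self, politeness_window):
        self.window = politeness_window
        self.queues = {}        # host -> deque of pending URLs
        self.last_access = {}   # host -> time of the most recent fetch

    def add(self, url):
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self, now):
        """Return a URL from any open server queue, or None if all
        non-empty queues are still inside their politeness window."""
        for host, queue in self.queues.items():
            last = self.last_access.get(host)
            if queue and (last is None or now - last >= self.window):
                self.last_access[host] = now
                return queue.popleft()
        return None


def min_servers(pages_per_second, politeness_window):
    """Smallest number of distinct servers needed to sustain the given
    fetch rate when each server is hit at most once per window."""
    return math.ceil(pages_per_second * politeness_window)


print(min_servers(100, 30))  # 3000, matching the example above
```

With a 30-second window, two URLs on the same server are served 30 seconds apart, while a URL on a different server can be fetched in between. This is why throughput depends on how many distinct servers are present in the frontier.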
Web crawlers play an important role in web search engines, where they collect the pages that are to be indexed.
Web crawlers have other uses as well, such as web data mining.
Example of Web Data Mining:
Some websites are quite difficult for a crawler to find. Such sites are referred to as the Deep or Hidden Web.
Categories of websites in Deep or Hidden Web
About the Author:
Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.