What is Web Crawling?
Web crawling, also called spidering, is the process of finding web pages and downloading them. A web crawler, also known as a spider or a robot, is a program that downloads the web pages associated with a given set of URLs, extracts the hyperlinks contained in them, and then downloads the pages those hyperlinks point to, repeating the process continuously. Over a given period of time, a substantial fraction of the “surface web” can be crawled this way. A large-scale web crawler should be able to download thousands of pages per second, with the work distributed among hundreds of computers. This blog will help you understand in detail what web crawling means in the business world.
Companies like Google, Facebook, and LinkedIn use web crawling to collect data because much of the data these companies need exists only as web pages with no API access. Data mining services rely heavily on crawling the web.
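The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the in-memory "web" and the URLs in it are made up for the example, and a real crawler would fetch each page over HTTP and parse its HTML for links.

```python
from collections import deque

# Hypothetical in-memory "web": maps a URL to the hyperlinks on that page.
# A real crawler would download each page over HTTP and parse out its links.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/"],
}

def crawl(seed_url, fetch_links):
    """Breadth-first crawl: download a page, extract its links, enqueue unseen ones."""
    frontier = deque([seed_url])   # URLs waiting to be downloaded
    visited = set()                # URLs already downloaded
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):   # hyperlinks extracted from the page
            if link not in visited:
                frontier.append(link)
    return visited

pages = crawl("http://example.com/", lambda u: FAKE_WEB.get(u, []))
```

Starting from the seed URL, the crawler discovers and visits all three pages, never downloading the same URL twice.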
Features of Crawler
- Politeness: Respect each website's limits by capping how often the crawler visits it.
- Robustness: The crawler should take care not to get trapped in sites that generate an effectively infinite number of pages (spider traps).
- Distributed: The downloading of pages should be distributed among hundreds of computers.
- Performance and efficiency: The crawler should make effective use of processor, memory, and network bandwidth.
- Quality: The crawler should prioritize downloading useful, high-quality pages first.
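One standard part of politeness is honoring a site's robots.txt rules. The sketch below uses Python's standard-library `urllib.robotparser`; the rules are supplied inline for illustration, whereas a real crawler would first download them from the site's `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt rules, supplied inline for illustration only; a polite
# crawler would normally fetch them from http://<host>/robots.txt first.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# "MyCrawler" is a made-up user-agent name for this example.
allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
```

Here `allowed` is `True` and `blocked` is `False`: the crawler may fetch the index page but must skip anything under `/private/`.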
The algorithm executed by a Web Crawler
A web crawler should use only a small portion of a website server's bandwidth, i.e. it fetches one page from any given server at a time. To implement this, the request queue is split into one queue per web server; a server's queue is open only if that server has not been accessed within the specified politeness window.
For example, if a web crawler can fetch 100 pages per second, and the politeness policy dictates that it cannot fetch more than one page every 30 seconds from any given server, then we need URLs from at least 100 × 30 = 3,000 different servers for the crawler to reach its peak throughput.
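The per-server queue idea can be sketched as follows. This is a simplified, single-threaded illustration under assumed names (`Frontier`, `next_url`); a real crawler would run many such queues concurrently across machines.

```python
from collections import defaultdict, deque

POLITENESS_WINDOW = 30.0  # seconds between requests to one server (assumed from the example)

class Frontier:
    """Sketch of a per-server request queue: a server's queue is 'open'
    only if that server has not been accessed within the politeness window."""

    def __init__(self, window=POLITENESS_WINDOW):
        self.window = window
        self.queues = defaultdict(deque)   # server -> pending URLs
        self.last_access = {}              # server -> time of last fetch

    def add(self, server, url):
        self.queues[server].append(url)

    def next_url(self, now):
        """Return a URL from any open server queue, or None if all are closed."""
        for server, queue in self.queues.items():
            last = self.last_access.get(server, -self.window)  # never-seen servers are open
            if queue and now - last >= self.window:
                self.last_access[server] = now
                return queue.popleft()
        return None
```

With two URLs queued for the same server, the second one becomes available only once the 30-second window has elapsed since the first fetch; this is why peak throughput requires URLs spread across thousands of servers.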
Web crawlers play an important role in web search engines. In a web search engine, the web crawlers collect the pages that are to be indexed.
Web crawlers have other uses as well, one of which is web data mining.
Examples of Web Data Mining
- ShopWiki, which is a price comparison service
- Attributor, which is a service that mines the web for copyright violations
Some websites are quite difficult for a crawler to find. Such sites make up what is called the Deep or Hidden Web.
Categories of Websites in Deep or Hidden Web
- Private Sites: Sites that require a login ID and password. They are restricted to a limited set of people and not available to everyone, so a crawler without credentials cannot access them.
- Form Results: Pages that are generated only after particular data is entered into a form, such as train or flight ticket availability. These too are effectively restricted; the main difficulty is detecting changes in the content that sits behind the form.
About the Author
Vaishnavi Agrawal loves pursuing excellence in writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.