All you need to know about WEB CRAWLING | PromptCloud
 

Web crawling, also called spidering, is the process of finding web pages and downloading them. A web crawler, also known as a spider or robot, is a program that downloads the web pages at a given set of URLs, extracts the hyperlinks they contain, and then keeps downloading the pages those hyperlinks point to. Over a given period of time, a crawler covers a substantial fraction of the “surface web”. A large-scale crawler should be able to download thousands of pages per second, with the work distributed among hundreds of computers.

Companies like Google, Facebook, and LinkedIn use web crawling to collect data because most of the data they need is available only as web pages, with no API access.

Features of a Crawler

  • Politeness: Keep track of visits to each website and stay within a maximum visit rate.
  • Robustness: The crawler should avoid getting trapped in an infinite number of pages.
  • Distributed: Downloaded pages should be distributed among hundreds of computers within a fraction of a second.
  • Scalability
  • Performance and efficiency
  • Quality: It is important to maintain the quality of the pages downloaded through the hyperlinks.
  • Freshness
  • Extensibility

 

Algorithm executed by a Web Crawler

[Figure: algorithm executed by a web crawler]
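
The loop described in the introduction (download the pages at the seed URLs, extract their hyperlinks, then download the pages those hyperlinks point to) can be sketched in Python roughly as follows. This is a minimal sketch, not necessarily the exact algorithm the author had in mind: the names `crawl`, `extract_links`, and `max_pages`, and the in-memory `FAKE_WEB` dictionary that stands in for real HTTP requests, are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the hyperlinks in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seed_urls)  # URLs waiting to be downloaded
    seen = set(seed_urls)        # URLs already queued, so none is queued twice
    pages = {}                   # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page site held in memory, standing in for real HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.com/b": "<p>no links</p>",
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from downloading the same page repeatedly when pages link back to each other, and `max_pages` is a simple, assumed safeguard against crawling forever. A real crawler would replace `FAKE_WEB.get` with a function that performs an HTTP request.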

 

Politeness Policy

A web crawler should use only a small portion of the bandwidth of a website's server, i.e. it fetches one page at a time from that server. To implement this, the request queue is split into one queue per web server; a server queue is open only if that server has not been accessed within the specified politeness window.

For example, if a web crawler can fetch 100 pages per second and the politeness policy dictates that it cannot fetch more than 1 page every 30 seconds from a server, then it needs URLs from at least 100 × 30 = 3,000 different servers to reach its peak throughput.
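
The per-server queues and the throughput arithmetic above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not a production scheduler: the class `PoliteFrontier`, the helper `servers_needed`, and the `example` host names are hypothetical, and the current time is passed in as a plain number (`now`) so the behaviour is easy to follow.

```python
import math
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """One FIFO queue per server; a queue reopens `window` seconds after a fetch."""

    def __init__(self, window):
        self.window = window
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_open = {}               # host -> earliest time of next fetch

    def add(self, url):
        """Append the URL to the queue of the server it belongs to."""
        self.queues[urlsplit(url).netloc].append(url)

    def next_url(self, now):
        """Return a URL from any open server queue, or None if all are closed."""
        for host, queue in self.queues.items():
            if queue and self.next_open.get(host, 0.0) <= now:
                self.next_open[host] = now + self.window  # close this queue
                return queue.popleft()
        return None


def servers_needed(pages_per_second, window_seconds):
    """Minimum number of distinct servers needed to keep the crawler busy."""
    return math.ceil(pages_per_second * window_seconds)


frontier = PoliteFrontier(window=30)
for url in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(url)

first = frontier.next_url(now=0)    # a.example is open
second = frontier.next_url(now=0)   # a.example is now closed, so b.example is used
third = frontier.next_url(now=1)    # both servers are inside their window
fourth = frontier.next_url(now=30)  # a.example has reopened

print(first, second, third, fourth)
print(servers_needed(100, 30))
```

The linear scan over all server queues keeps the sketch short; a crawler handling thousands of servers would more likely keep the queues ordered by the time at which they reopen. In real use, `now` would come from a clock such as `time.monotonic()`.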

Web crawlers play an important role in web search engines, where they collect the pages that are to be indexed.

Web crawlers are also used for other purposes, such as web data mining.

Examples of web data mining:

  1. ShopWiki, a price comparison service
  2. Attributor, a service that mines the web for copyright violations

 

Some websites are quite difficult for a crawler to find. Such sites make up the Deep Web, or Hidden Web.

 

Categories of websites in the Deep or Hidden Web

  • Private sites: Sites that require a login ID and password. Access is restricted to a limited set of people, so a crawler without credentials cannot reach these pages.
  • Form results: Pages that appear only after particular data is entered into a form, such as a search for train or flight tickets. Access is again limited, and the main difficulty is detecting changes in the content behind the form.
  • Scripted pages: Pages whose content is generated by a script written in JavaScript, Flash, or another language. The constraint here is that crawling slows down because the script has to be executed.

 

About the Author:

Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.

 
