
All You Need to Know About Web Crawling

May 2, 2016 | Category: Blog, Web Scraping

What is Web Crawling

Web crawling, also called spidering, is the process of finding web pages and downloading them. A web crawler, also known as a spider or a robot, is a program that downloads the web pages associated with a given set of URLs, extracts the hyperlinks they contain, and then keeps downloading the pages those hyperlinks point to. Over a given period of time, a crawler covers a substantial fraction of the “surface web”. A large-scale web crawler should be able to download thousands of pages per second, with the work distributed among hundreds of computers. That is web crawling in brief; the rest of this post explains it in more detail, including how it is used in the business world.


Companies like Google, Facebook and LinkedIn use web crawling to collect data, because most of the data these companies need is published as web pages with no API access. Data mining services help to a great extent in crawling the web.

Features of a Crawler

  • Politeness: It should limit how often it visits any one website.
  • Robustness: It should not get trapped in an infinite number of pages.
  • Distributed: The downloaded pages should be distributed among hundreds of computers in a fraction of a second.
  • Scalability
  • Performance and efficiency
  • Quality: It is important to maintain the quality of the hyperlinks downloaded.
  • Freshness
  • Extensibility

The algorithm executed by a Web Crawler

[Figure: the algorithm executed by a web crawler]
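
The loop described earlier (download a page, extract its hyperlinks, then download the pages they point to) can be sketched roughly as below. This is a minimal illustration in Python, not the algorithm from the original figure. The names `LinkExtractor`, `extract_links` and `crawl`, the `fetch` callback and the `max_pages` limit are all assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure.

    max_pages is a hypothetical safety limit so the sketch cannot run forever.
    """
    frontier = deque(seed_urls)  # URLs waiting to be downloaded
    seen = set(seed_urls)        # URLs already queued, so none is fetched twice
    pages = {}                   # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library. A real crawler would supply a function that downloads the page over the network and would add the politeness and robustness checks listed under the crawler features.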

 

Politeness Policy

A polite web crawler uses only a small portion of a web server's bandwidth, i.e. it fetches one page at a time from any given server. To implement this, the request queue is split into a single queue per web server; a server queue is open only if that server has not been accessed within the specified politeness window.

For example, if a web crawler can fetch 100 pages per second and the politeness policy dictates that it cannot fetch more than 1 page every 30 seconds from a server, it needs URLs from at least 100 × 30 = 3,000 different servers to reach its peak throughput.
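
The per-server queues and the politeness window can be sketched as follows. This is one possible arrangement under stated assumptions, not a definitive implementation. The class `PoliteFrontier`, the helper `servers_needed` and the injectable `clock` parameter are hypothetical names introduced here for illustration.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Request queue split into one FIFO queue per web server."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window = window_seconds      # politeness window, in seconds
        self.clock = clock                # injectable so the window can be tested
        self.queues = defaultdict(deque)  # server host -> pending URLs
        self.last_access = {}             # server host -> time of last fetch

    def add(self, url):
        """Queue a URL under the server (host) it belongs to."""
        self.queues[urlsplit(url).netloc].append(url)

    def next_url(self):
        """Return a URL from any open server queue, or None if all are closed."""
        now = self.clock()
        for host, queue in self.queues.items():
            last = self.last_access.get(host)
            is_open = last is None or now - last >= self.window
            if queue and is_open:
                self.last_access[host] = now
                return queue.popleft()
        return None


def servers_needed(pages_per_second, window_seconds):
    """Minimum number of distinct servers needed to sustain peak throughput."""
    return pages_per_second * window_seconds


print(servers_needed(100, 30))  # 3000, matching the example above
```

Scanning every queue on each call is fine for a sketch. A production crawler would more likely keep the servers in a priority queue ordered by the time at which each one becomes open again.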

Web crawlers play an important role in web search engines. In a web search engine, the web crawlers collect the pages that are to be indexed.

Web crawlers have other uses as well, such as web data mining.

Examples of Web Data Mining

  1. ShopWiki, which is a price comparison service
  2. Attributor, which is a service that mines the web for copyright violations

Some websites are quite difficult for a crawler to find. Together, such sites are called the Deep or Hidden Web.

Categories of Websites in Deep or Hidden Web

  • Private Sites: Sites that require a login ID and password. Access is restricted to a limited set of people, so these pages cannot be crawled without credentials.
  • Form Results: Pages that appear only after a user enters particular data into a form, and so are again restricted to a limited audience. Train and flight ticket searches are typical examples. The constraint is that it is difficult for a crawler to detect changes made behind the form.
  • Scripted Pages: Pages whose data is generated by scripts, which may be written in JavaScript, Flash or any other language. The constraint is that the script has to be executed, which slows down web crawling.

About the Author

Vaishnavi Agrawal loves pursuing excellence in writing and has a passion for technology. She has successfully managed and runs personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.

© Promptcloud 2009-2020 / All rights reserved.