Different Components of a Crawler-Based Search Engine
Finding a specific piece of information among the millions of pages on the web would be practically impossible without a search engine. The following table gives a sense of how dramatically search engines have changed the usability of the web.
| Year | Search engine | Indexed web pages | Avg. no. of queries/day |
|---|---|---|---|
| 1997 | e.g. AltaVista | 2 to 100 million | 20 million |
| 1998 to 2015 | e.g. Google | 30 trillion | 3.5 billion |
What’s a search engine?
A search engine is a system that retrieves the data a user asks for from the web according to a set of rules and criteria. Technically, the search query is matched algorithmically against very large data sets to retrieve the results that best fit the original search criteria, and those results are reported back to the user on a SERP (Search Engine Results Page).
Search engines can be classified into four groups:
- a) Crawler-based search engines,
- b) Human-powered directories,
- c) Hybrid search engines, and
- d) Meta search engines.
Today, we will discuss the various components of a crawler-based search engine and how they work together.
Components of a crawler-based search engine
Physical architectural components:
URL server: The URL server sends the crawler the list of URLs whose pages have to be fetched.
Crawler: A crawler (also called a robot or bot) is a specially designed software program that builds a list of words from the millions of web pages found on the internet. It automatically traverses the web, downloads web pages and follows the links it finds from page to page.
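The crawling loop described above can be sketched as a breadth-first traversal. This is only a minimal sketch under stated assumptions: the `FAKE_WEB` dictionary stands in for the real web so the example runs without network access, and the names `fetch`, `extract_links` and `crawl` are hypothetical helpers, not part of any real crawler.

```python
import re
from collections import deque

# Hypothetical stand-in for the web: URL -> HTML body.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a>',
    "http://c.example/": '<a href="http://a.example/">A</a>',
}


def fetch(url):
    """Return the page body, or None if the URL is unknown (assumed helper)."""
    return FAKE_WEB.get(url)


def extract_links(html):
    """Very rough link extraction; a real crawler would use an HTML parser."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: download each page once and follow its links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    downloaded = {}
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        for link in extract_links(html):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return downloaded


pages = crawl(["http://a.example/"])
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the three pages here do.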
Store server: The store server is where the documents downloaded by the crawlers are stored.
Barrel: A barrel stores the documents processed by the indexer together with detailed records (hit lists) of each word's occurrences: its position in the document, font, capitalisation and so on.
Sorter: The sorter takes the barrels, which are sorted by doc ID, and re-sorts them by word ID to generate the inverted index.
Anchor file: This file holds information about each link: its source, its destination and the text of the link.
Major data structural components:
Big files: Big files are virtual files that span multiple file systems and are addressable by 64-bit integers.
Repository: The repository contains the full HTML of every page in compressed form. The documents are stored one after another, and each is prefixed by its doc ID, length and URL.
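To make the repository layout more concrete, here is a minimal sketch of packing and reading back such records. The exact byte layout is an assumption made for illustration (doc ID, compressed length and URL length as fixed-width integers, then the URL, then the zlib-compressed HTML), and the function names `pack_record` and `read_records` are hypothetical.

```python
import struct
import zlib

# Assumed header layout: doc ID (4 bytes), compressed length (4 bytes),
# URL length (2 bytes), little-endian with no padding.
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL, compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def read_records(blob):
    """Walk the records stored one after another and decompress each page."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>first page</html>")
    + pack_record(2, "http://b.example/", "<html>second page</html>")
)
for record in read_records(repository):
    print(record)
```

Because every record carries its own lengths, a reader can skip from one document to the next without any separate index.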
Document index: The document index is a simple index sorted by doc ID. It is designed to be a reasonably compact data structure and to allow a record to be fetched in a single disk seek during a search. In addition, there is a program known as the URL resolver, which reads the anchor file and converts relative URLs into absolute URLs and, in turn, into doc IDs. It also puts the anchor text into the forward index under the doc ID that the link points to.
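The relative-to-absolute conversion can be illustrated with Python's standard `urllib.parse.urljoin`. This is a small sketch, not the resolver from the original system: the function name `resolve_links` and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Convert each href found on a page into an absolute URL."""
    return [urljoin(base_url, href) for href in hrefs]


resolved = resolve_links(
    "http://example.com/docs/index.html",
    ["intro.html", "../about.html", "/contact", "http://other.example/x"],
)
for url in resolved:
    print(url)
```

Links that are already absolute pass through unchanged, so the same function can be applied to every link on a page.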
Lexicon: The lexicon is small enough to fit in memory on a machine with 256 MB of physical memory. It has two parts: a) a list of the words (concatenated together but separated by nulls) and b) a hash table of pointers.
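The two-part lexicon can be sketched as follows. This is an illustrative assumption about how the parts fit together: a Python `dict` plays the role of the hash table, each pointer is a byte offset into the null-separated word list, and word IDs are simply assigned in order of first appearance. The names `build_lexicon` and `word_at` are hypothetical.

```python
def build_lexicon(words):
    """Return (null-separated word list, hash table of word -> (word ID, offset))."""
    blob = bytearray()
    table = {}
    for word in words:
        if word in table:
            continue  # each distinct word is stored once
        table[word] = (len(table), len(blob))
        blob += word.encode("utf-8") + b"\x00"
    return bytes(blob), table


def word_at(blob, offset):
    """Follow a pointer into the word list and read up to the next null."""
    end = blob.index(b"\x00", offset)
    return blob[offset:end].decode("utf-8")


blob, table = build_lexicon(["search", "engine", "crawler", "search"])
word_id, offset = table["engine"]
print(word_id, offset, word_at(blob, offset))
```

Storing the words in one contiguous block avoids per-word object overhead, which is one reason such a structure stays compact.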
Hit list: A hit list records the occurrences of a particular word in a particular document, with precise information about each occurrence such as its position, font and capitalisation. Hit lists are stored in the barrels and are used by both the forward and inverted indices.
Forward index: The forward index stores the words for each document and is only partially sorted. It is spread over a number of barrels, each of which holds a range of word IDs. If a document contains words that fall into a particular barrel, that barrel records the doc ID, followed by those word IDs and their hit lists. The forward index also holds the anchor text associated with a doc ID.
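A forward index spread over barrels might look like the sketch below. The partitioning rule (`word_id // BARREL_WIDTH`), the tiny lexicon and the use of plain word positions as hits are all assumptions chosen to keep the example small; `build_forward_index` is a hypothetical name.

```python
from collections import defaultdict

BARREL_WIDTH = 2  # assumed: each barrel covers two consecutive word IDs


def build_forward_index(docs, lexicon):
    """docs: doc ID -> list of words; lexicon: word -> word ID.

    Returns barrel ID -> list of (doc ID, {word ID: [positions]}),
    with each barrel's entries in doc ID order.
    """
    barrels = defaultdict(list)
    for doc_id in sorted(docs):
        per_barrel = defaultdict(lambda: defaultdict(list))
        for position, word in enumerate(docs[doc_id]):
            word_id = lexicon[word]
            per_barrel[word_id // BARREL_WIDTH][word_id].append(position)
        # A doc ID is recorded only in barrels that contain one of its words.
        for barrel_id, words in per_barrel.items():
            barrels[barrel_id].append((doc_id, dict(words)))
    return dict(barrels)


lexicon = {"web": 0, "crawler": 1, "index": 2, "search": 3}
docs = {1: ["web", "crawler", "web"], 2: ["search", "index", "web"]}
forward = build_forward_index(docs, lexicon)
for barrel_id in sorted(forward):
    print(barrel_id, forward[barrel_id])
```

Document 1 appears only in barrel 0 because none of its words has a word ID in barrel 1's range.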
Inverted index: The inverted index consists of the same barrels as the forward index, after they have been processed by the sorter. In the inverted index, entries are arranged by word ID rather than by doc ID.
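The sorter's job can be sketched as re-sorting one barrel from doc ID order into word ID order. This is a simplified in-memory illustration: the barrel format (a list of `(doc ID, {word ID: [positions]})` pairs) and the function name `sort_barrel` are assumptions, and a real sorter would work on disk-resident data.

```python
def sort_barrel(forward_barrel):
    """Turn a doc-ID-ordered barrel into word ID -> [(doc ID, positions)]."""
    postings = []
    for doc_id, words in forward_barrel:
        for word_id, positions in words.items():
            postings.append((word_id, doc_id, positions))
    # Primary key: word ID; secondary key: doc ID.
    postings.sort(key=lambda p: (p[0], p[1]))
    inverted = {}
    for word_id, doc_id, positions in postings:
        inverted.setdefault(word_id, []).append((doc_id, positions))
    return inverted


barrel = [(1, {0: [0, 2], 1: [1]}), (2, {0: [2]})]
inverted = sort_barrel(barrel)
print(inverted)
```

After sorting, answering "which documents contain word 0?" is a single lookup instead of a scan over every document.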
How do these components work together?
The whole process is described here according to the paper published by Sergey Brin and Lawrence Page at Stanford University. At the outset, a search engine has no idea where on the web the required documents are located. It therefore has to gather and index documents before any search can take place, and it does so through a sequence of steps.
The URL server hands over a list of URLs to the crawler to start its journey. The crawler visits those URLs, downloads whatever it finds and follows any links it encounters. This operation is called web crawling. The crawled documents are sent to the store server, which compresses them and stores them in the repository. Next, the indexer reads the repository, decompresses the documents and parses them into hits. The indexer distributes the hits into a set of barrels, creating the forward index. It also extracts all links with their corresponding information and stores them in the anchor file. Finally, the sorter re-sorts the barrels by word ID to produce the inverted index.
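The indexer step (decompress, parse into hits, extract links) can be sketched with Python's standard `zlib` and `html.parser` modules. This is a deliberately simplified illustration: a hit here is just `(word, position, is_capitalised)`, font information and punctuation handling are left out, and the names `PageParser` and `index_document` are hypothetical.

```python
import zlib
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect word hits and anchor (href, text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.hits = []      # (lower-cased word, position, is_capitalised)
        self.anchors = []   # (href, anchor text)
        self._href = None
        self._anchor_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_words = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, " ".join(self._anchor_words)))
            self._href = None

    def handle_data(self, data):
        for word in data.split():
            self.hits.append((word.lower(), len(self.hits), word[0].isupper()))
            if self._href is not None:  # text inside a link is also anchor text
                self._anchor_words.append(word)


def index_document(compressed_html):
    """Decompress a stored page and parse it into hits and anchors."""
    parser = PageParser()
    parser.feed(zlib.decompress(compressed_html).decode("utf-8"))
    parser.close()
    return parser.hits, parser.anchors


stored = zlib.compress(b'<p>Search engines <a href="/crawl">Crawl pages</a></p>')
hits, anchors = index_document(stored)
print(hits)
print(anchors)
```

The hits would then be routed to barrels to form the forward index, while the anchors would be appended to the anchor file.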
With the advent of Web 2.0, web surfing and web interaction have changed dramatically, and search engines have changed with them. This change is a continuous process: over time, both search engines and web users will evolve toward an even higher level of information exchange.