The Birth of a Web Crawling Bot

Stage 1. Understanding how the site reacts to human users

Before we can build a bot to crawl a new website, we need to know how the site interacts with a real human being. At this stage, our engineers take the new target website for a spin in a regular browser such as Google Chrome or Mozilla Firefox to get an idea of the site navigation. This sheds some light on the browser-server interaction and reveals how the server sees and processes an incoming request. Typically, it involves experimenting with request headers and request types over HTTP. This lays the foundation for building the bot, since the bot will in a way be mimicking a real user on the target website.
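To make the idea of mimicking a browser concrete, here is a minimal sketch using Python's standard urllib.request module. The header values and the build_request helper are illustrative assumptions, not our production code; in practice the values would be copied from what the browser's developer tools show for the target site.

```python
import urllib.request

# Hypothetical header values, as one might copy them from a browser's
# developer tools. Real values depend on the browser and the site.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, method="GET", extra_headers=None):
    """Build (but do not send) a request that carries browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers, method=method)


# Building the request does not touch the network.
req = build_request("https://example.com/")
print(req.get_method(), req.header_items())
```

Passing the request to urllib.request.urlopen would actually send it; varying the method and headers one at a time is a simple way to see which ones the server pays attention to.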

Stage 2. Getting the hang of how the site behaves with a bot

As part of the second step, our engineers send some automated test traffic to understand how differently the site interacts with a bot compared to a human user. This is necessary since most modern websites have built-in mechanisms to deal with bots differently. Understanding these mechanisms helps us choose the best course of action for building the bot. Some common examples are:

The site limits normal navigation after, say, 20 pages

The request returns a 301 status code (a permanent redirect to a different URL)

The site responds with a captcha

Server returns a 403 status code – this means the site refuses to serve our request despite understanding it

Access is restricted from certain geographies (this is where proxies come into the picture)
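The signals listed above can be turned into a small classifier that labels each test response. This is only a sketch: the function name, the labels and the captcha keyword check are assumptions made for illustration, and geographic restrictions are left out because detecting them means comparing responses from different locations rather than inspecting a single one.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_response(status, body="", pages_fetched=0, page_limit=None):
    """Map one observed response to a coarse label (labels are hypothetical)."""
    if status == 403:
        # The server understood the request but refuses to serve it.
        return "forbidden"
    if status in REDIRECT_CODES:
        return "redirect"
    if "captcha" in body.lower():
        # Naive keyword check; real captcha pages vary from site to site.
        return "captcha"
    if page_limit is not None and pages_fetched >= page_limit:
        # Navigation cut off after a known number of pages.
        return "page_limit"
    return "ok"


print(classify_response(403))
print(classify_response(200, "<p>Please solve the CAPTCHA</p>"))
print(classify_response(200, "<html>listing</html>", pages_fetched=20, page_limit=20))
```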

Most websites are two-faced: they treat human users and bots differently. In their defence, this protects them from bad bots and various forms of cyber attacks. You might have at some point come across a website asking you to prove your humanity before accessing a certain page or feature. Bots face this a lot. This is why we perform this test: to understand the site completely from a bot’s point of view.

We also have an automated layer that identifies the best approach for building the bot to crawl a particular website. It runs a mild stress test on the site to detect its tipping points and then returns crucial information that goes into making the crawler bot, such as the sleep interval between requests, whether a proxy is needed, whether a captcha is served, the number of possible parallel requests and more.
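As a rough illustration of what such a layer might return, the sketch below condenses probe results into a small profile. The CrawlProfile fields, the signal labels and the thresholds in derive_profile are all hypothetical; they only show the shape of the output, not how our internal tool decides.

```python
from dataclasses import dataclass


@dataclass
class CrawlProfile:
    sleep_seconds: float        # pause between requests
    use_proxy: bool             # route traffic through a proxy or not
    expects_captcha: bool       # the site served a captcha during probing
    max_parallel_requests: int  # highest concurrency that stayed clean


def derive_profile(probes):
    """Summarise probe results into a CrawlProfile.

    probes is a list of (parallel_requests, signals) tuples, where signals
    holds labels such as "ok", "captcha", "forbidden" or "geo_blocked"
    observed at that concurrency level.
    """
    safe_levels = [n for n, signals in probes if all(s == "ok" for s in signals)]
    all_signals = [s for _, signals in probes for s in signals]
    trouble_seen = any(s != "ok" for s in all_signals)
    return CrawlProfile(
        sleep_seconds=1.0 if trouble_seen else 0.0,  # arbitrary example values
        use_proxy="geo_blocked" in all_signals,
        expects_captcha="captcha" in all_signals,
        max_parallel_requests=max(safe_levels, default=1),
    )


profile = derive_profile([
    (1, ["ok"]),
    (4, ["ok", "ok", "ok", "ok"]),
    (8, ["ok", "captcha", "ok", "forbidden"]),
])
print(profile)
```

In this made-up run the site stayed clean at 4 parallel requests and started pushing back at 8, so the profile caps concurrency at 4 and adds a pause between requests.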

Stage 3. Building the bot

Once our engineers have a clear blueprint of the target site, it’s time to start building the crawler bot. The complexity of the build depends on the results of our previous tests. For example, if the target site is only accessible from, say, Germany, we have to include a German proxy to fetch the site. Likewise, depending on the specific demands of the site, there can be up to 10 modules working together in a bot.
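One way to picture modules working together is as small functions that each adjust the bot's configuration, chained in order. The sketch below assumes this design purely for illustration; the module names, the config keys and the example URL are hypothetical.

```python
def proxy_module(country):
    """Return a module that routes requests through a proxy in the given country."""
    def apply(config):
        return {**config, "proxy_country": country}
    return apply


def throttle_module(seconds):
    """Return a module that adds a pause between requests."""
    def apply(config):
        return {**config, "sleep_seconds": seconds}
    return apply


def build_bot_config(base_url, modules):
    """Apply each selected module in turn to a base configuration."""
    config = {"base_url": base_url}
    for module in modules:
        config = module(config)
    return config


# A site reachable only from Germany gets a German proxy module.
config = build_bot_config(
    "https://example.de/",
    [proxy_module("DE"), throttle_module(1.5)],
)
print(config)
```

A simple site would get an empty module list, while a demanding one would get a longer chain; the bot itself stays the same.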

Stage 4. Putting the bot to the test

As an enterprise-grade web scraping service provider, we give utmost priority to reliability and data quality. To ensure both, it’s important to test the crawler bot under different conditions, during both peak and off-peak hours of the target site, before the actual crawls can start. For this test, we try to fetch a random number of pages from the live site. After gauging the outcome, we make further modifications to the crawler to improve its stability and scale of operation. If everything works as expected, the bot can go into production.
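A test of this kind could be sketched as follows: pick a random number of pages, fetch them, and measure how many succeed. The function names, the page-count bounds and the stub fetcher are assumptions for illustration; a real run would plug in the actual bot and repeat the test at peak and off-peak hours.

```python
import random


def sample_pages(urls, rng, min_pages=5, max_pages=50):
    """Pick a random number of distinct pages from the candidate list."""
    upper = min(max_pages, len(urls))
    lower = min(min_pages, upper)
    count = rng.randint(lower, upper)  # inclusive on both ends
    return rng.sample(urls, count)


def success_rate(urls, fetch):
    """Fraction of pages for which fetch(url) reports success."""
    if not urls:
        return 0.0
    succeeded = sum(1 for url in urls if fetch(url))
    return succeeded / len(urls)


def stub_fetch(url):
    """Stand-in for the real bot: pretends every page loads fine."""
    return True


candidates = [f"https://example.com/page/{i}" for i in range(100)]
picked = sample_pages(candidates, random.Random())
print(len(picked), success_rate(picked, stub_fetch))
```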

Stage 5. Extracting data points and data processing

Our crawler bots work differently from the search engine crawlers that most people are familiar with. While search engine bots like Googlebot simply crawl web pages and add them to their index with some metadata, our bots fetch the full HTML content of the pages to a temporary storage space, where they undergo extraction and various other processes depending on client requirements.

We call this stage Extraction, and this is where the required data points get extracted from the pre-downloaded web pages. Once extracted, the data is automatically scanned for duplicate entries and deduplicated. The next process in line is normalization, where certain changes are made to the data for ease of consumption. For example, if the price data extracted is in dollars, it can be converted to a different currency before being delivered to the client.
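The deduplication and normalization steps can be illustrated with a short sketch. The record fields, the choice of deduplication key and the exchange rate are made-up assumptions; actual fields and rates depend on the client's requirements.

```python
from decimal import Decimal


def deduplicate(records, key_fields=("name", "price")):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def convert_price(record, rate, target_currency):
    """Return a copy of the record with its price converted and rounded to cents."""
    converted = dict(record)
    converted["price"] = (record["price"] * rate).quantize(Decimal("0.01"))
    converted["currency"] = target_currency
    return converted


extracted = [
    {"name": "Widget", "price": Decimal("19.99"), "currency": "USD"},
    {"name": "Widget", "price": Decimal("19.99"), "currency": "USD"},  # duplicate
    {"name": "Gadget", "price": Decimal("5.00"), "currency": "USD"},
]

unique = deduplicate(extracted)
# Hypothetical USD-to-EUR rate, used only for the example.
delivered = [convert_price(r, Decimal("0.90"), "EUR") for r in unique]
print(delivered)
```

Decimal is used instead of float so that the converted prices round predictably to two places.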

That was a quick walkthrough of how our engineers approach a new web crawling bot build. Note that the high efficiency of the bots also depends on the server environment and the level of optimization we’ve achieved over the years. A stable tech stack and infrastructure are essential for extracting millions of data records on a daily basis, given that no two bots are alike.