Tips on properly extracting web data from even the most challenging sites by using best practices of web scraping

May 15, 2018 | Category: Blog
Web crawling comes with its challenges, which should come as no surprise if you’ve ever tried your hand at scraping the web for data. Data on the web follows no fixed rules, structure or standards, and this alone makes it difficult to predict the kinds of issues one might run into while crawling. The difficulty grows manyfold when complex web scraping needs to be done at scale.

Web data, despite holding invaluable insights for businesses, remains a hard nut to crack for many. This is where a specialized web scraping service like ours comes into the picture. At PromptCloud, we receive requirements of all sorts, and each data scraping task is a challenge in itself. The complexity of extracting web data, however, varies a lot depending on several factors. Here are some of the most challenging scraping tasks we’ve handled so far.

Project 1: Scrape Telecom Sector Data for a Business Intelligence Company  

Target sites: Websites of cell phone carriers

Data points required: All offers available for various customer segments

The company wanted to gather data on the offers available on various cell phone carriers’ websites in order to give its customers a competitive edge in this domain. The requirement was feasible despite being extremely complex. The following issues made this a challenging project.

Project Challenges

1. Too many steps to get to data 

The offer information on the source sites was displayed only after certain variables, such as the customer’s zip code and the offer type, were entered. This made for a long path before the actual data was displayed. As a result, the crawler had to be programmed to select every possible combination of inputs to get the site to display all the available data.
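
Enumerating every combination of inputs can be sketched as a Cartesian product. The zip codes, offer types and field names below are hypothetical placeholders, not values from the actual project; a real crawler would read the available options from each carrier's forms.

```python
from itertools import product

# Hypothetical input values, used only for illustration.
ZIP_CODES = ["10001", "94105"]
OFFER_TYPES = ["prepaid", "postpaid", "family"]


def input_combinations(zip_codes, offer_types):
    """Yield one dict of form inputs per (zip code, offer type) pair."""
    for zip_code, offer_type in product(zip_codes, offer_types):
        yield {"zip_code": zip_code, "offer_type": offer_type}


combos = list(input_combinations(ZIP_CODES, OFFER_TYPES))
print(len(combos))  # number of zip codes times number of offer types
```

The number of requests grows multiplicatively with each extra input variable, which is one reason a long input path makes a crawl expensive.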

2. Frequent site changes

Since the mobile industry is fast-paced, the data available on these websites tends to change very often. Mobile network providers frequently change their existing offers, discontinue some and introduce new ones. This demanded close monitoring and an automated web scraper that could handle site changes.
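
One simple way to monitor for site changes is to check how many extracted records are missing fields that should always be present. This is a minimal sketch of that idea, not a description of our monitoring system; the field names and the 20% threshold are assumptions chosen for illustration.

```python
# Hypothetical fields that every extracted offer is expected to carry.
REQUIRED_FIELDS = ("offer_name", "price")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def looks_like_site_change(records, required=REQUIRED_FIELDS, max_bad_ratio=0.2):
    """Flag a crawl whose output is empty or mostly incomplete.

    A sudden rise in incomplete records often means the page layout
    changed and the extraction rules need to be updated.
    """
    if not records:
        return True
    bad = sum(1 for record in records if missing_fields(record, required))
    return bad / len(records) > max_bad_ratio
```

A check like this can run after every crawl and raise an alert before stale or broken data reaches the client.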

3. Character encoding issues

The character encoding of a website is typically declared in its HTML code. However, certain websites declare the wrong character encoding or use more than one encoding across the site. Either problem makes the web crawler setup more complex, and it keeps causing issues for as long as the site remains inconsistent in its character encoding.
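
A common defensive pattern is to try the declared encoding first and fall back to a short list of likely alternatives. The sketch below assumes UTF-8 and Windows-1252 as fallbacks; that list is an assumption and would need tuning per site.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, returning (text, encoding_used).

    Tries the declared encoding first, then each fallback in order.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong declaration or unknown encoding name: try the next one.
            continue
    # latin-1 maps every byte to a character, so this never fails,
    # although the result may be wrong for other encodings.
    return raw.decode("latin-1"), "latin-1"


# A page that declares ASCII but actually contains UTF-8 bytes.
text, used = decode_page("café".encode("utf-8"), declared="ascii")
```

Recording which encoding was actually used per page makes it easier to spot sites that mix encodings.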

4. Redundant data on the site

Redundant data can be a real problem, especially when the scale of web data extraction is large. While we have a cleaning system that finds and removes redundant entries from the dataset, redundant data on the site itself makes the extraction all the more difficult to handle.
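
As an illustration of the general idea (not our cleaning system itself), redundant entries can be dropped by comparing records on a normalized key. The choice of key fields is an assumption that depends on the dataset.

```python
def normalize(value):
    """Collapse whitespace and ignore case so near-identical values compare equal."""
    return " ".join(str(value).split()).casefold()


def deduplicate(records, key_fields):
    """Keep the first record seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(normalize(record.get(field, "")) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


offers = [
    {"name": "Plan A", "price": "10"},
    {"name": " plan  a ", "price": "10"},  # same offer, different formatting
    {"name": "Plan B", "price": "10"},
]
unique_offers = deduplicate(offers, key_fields=("name", "price"))
```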

Project 2: Extract Data from Hotel Discovery and Price Comparison Platforms

Target sites: Online travel portals and hotel websites

Data required: Hotel listings and reviews

The client wanted to extract hotel data from hundreds of travel websites across the globe to build a one-stop hotel search engine. Every target site needed its own crawler setup, and the individual challenges to be overcome while setting up crawlers for 100+ sites made this a challenging project to embark upon.

Project Challenges

1. Blocking 

Certain sites in the target list had blocking mechanisms aimed at automated crawlers. We handled this by keeping the frequency of GET requests moderate and requesting only a small number of pages at a time. Following these best practices of web scraping kept us clear of the blocking mechanisms.
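
Keeping request frequency in check usually comes down to enforcing a minimum delay between consecutive requests to the same site. The following is a minimal sketch of such a throttle; the class name and the interval are illustrative assumptions, and the right delay depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would call `throttle.wait()` immediately before each GET request, typically with one `Throttle` instance per target site.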

2. Discovery

Discovering the URLs to be fetched is a critical stage in the web crawling and data extraction process, and the poor navigational structure of some target sites made it tough for the web crawlers to traverse pages seamlessly. We handled this by setting up multiple fallback rules for the URL discovery operation.
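
Fallback rules can be modeled as an ordered list of functions, where the first rule that finds anything wins. The two rules below (links under a `/hotel/` path, then any site-relative link) are hypothetical examples, not the rules used in the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def all_hrefs(html):
    parser = _LinkCollector()
    parser.feed(html)
    return parser.hrefs


def listing_links(html):
    # Preferred rule (hypothetical): links that look like hotel detail pages.
    return [href for href in all_hrefs(html) if "/hotel/" in href]


def internal_links(html):
    # Fallback rule: any site-relative link.
    return [href for href in all_hrefs(html) if href.startswith("/")]


def discover_urls(html, base_url, rules=(listing_links, internal_links)):
    """Apply rules in order and return absolute URLs from the first rule that matches."""
    for rule in rules:
        found = rule(html)
        if found:
            return [urljoin(base_url, href) for href in found]
    return []
```

Ordering the rules from most to least specific keeps the crawl focused when the site is well structured and still makes progress when it is not.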

3. Character encoding issues

Character encoding issues were a challenge with this task as well. We had to manually ensure that the encodings we used matched those of each target site. For sites that were inconsistent in their character encoding, we also set up some automation to handle the problem.

4. Redundant data on the target sites

Redundant data on several sites added to the challenge of this project. We let our cleaning system take care of the redundancy in the extracted data, and this approach seemed to work for the client too.

Project 3: A Big 4 Consulting Firm Needs Product Data to Build a Price Intelligence System

Target sites: Popular eCommerce portals

Data required: Product information

The client was looking to help one of their customers with price intelligence and needed a service that could not only deliver the product data but also do the matching. Although we usually don’t handle processes outside data extraction and delivery, we decided to take this up considering the scale and interesting nature of the requirement.

Project Challenge

1. Product matching 

Product matching is a highly challenging task that falls outside the scope of web scraping expertise. A strong matching system is essential here, as ecommerce portals differ slightly from one another in their product descriptions, including product name and brand name.

However, to meet the demands of this unique project, we developed an algorithm that could do the matching once the data had been extracted and indexed at our end.
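
To give a feel for the problem, here is a deliberately simple matching sketch based on normalized brand and product names and Python's standard `difflib`. It is not the algorithm we built for the project; the field names and the 0.8 threshold are assumptions, and a production system would need far more signals (model numbers, attributes, price ranges).

```python
import re
from difflib import SequenceMatcher


def normalize_title(brand, name):
    """Lowercase, split into word tokens, drop duplicates and sort them."""
    tokens = re.findall(r"\w+", f"{brand} {name}".casefold())
    return " ".join(sorted(set(tokens)))


def similarity(a, b):
    """Return a 0..1 similarity score between two product dicts."""
    left = normalize_title(a["brand"], a["name"])
    right = normalize_title(b["brand"], b["name"])
    return SequenceMatcher(None, left, right).ratio()


def best_match(product, candidates, threshold=0.8):
    """Return the most similar candidate, or None if none reaches the threshold."""
    scored = [(similarity(product, candidate), candidate) for candidate in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None
```

Sorting the tokens makes the comparison insensitive to word order, so "Phone X 64GB" and "X Phone, 64gb" from the same brand normalize to the same string.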

Web Scraping Service is All about Solving Challenges

Given the lack of standardization in the data displayed by websites, web scraping is and always will be a challenging task that calls for skill, experience and expertise. This is exactly why we stress the importance of going with a fully managed solution for web data requirements, irrespective of a business’s size and domain.

