Web crawling comes with its own challenges, and this should come as no surprise if you have ever tried scraping the web for data yourself. Data on the web follows no common rules, structure or standards, which alone makes it difficult to predict the kind of issues you might run into while crawling for data. The difficulty grows manifold when complex web scraping needs to be done at scale.
Web data, despite holding invaluable insights for businesses, still remains a hard nut to crack for many. This is where a specialized web scraping service like ours comes into the picture. At PromptCloud, we receive requirements of all sorts, and each data scraping task is a challenge in itself. However, the complexity of extracting web data varies widely depending on several factors. Here are some of the most challenging scraping tasks we have handled so far.
Project 1: Scrape Telecom Sector Data for a Business Intelligence Company
Target sites: Websites of cell phone carriers
Data points required: All offers available for various customer segments
The company wanted to gather data on the offers available on various cell phone carriers' websites to provide a competitive edge to their customers in this domain. The requirement was feasible despite being extremely complex. The following issues made this an extremely challenging project.
Project Challenges
1. Too many steps to get to data
The offer information on the source sites was displayed only after certain variables, such as the customer's zip code and the offer type, were entered. This meant a long path before the actual data appeared. As a result, the crawler had to be programmed to select every possible combination of inputs in order to get the site to display all the available data, as sketched below.
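To illustrate the idea, here is a minimal sketch of a crawler that walks through every combination of form inputs. The endpoint, parameter names and value lists are hypothetical placeholders; the real crawler maps them from each carrier site's offer form.

```python
import itertools
import requests

BASE_URL = "https://www.example-carrier.com/offers"  # hypothetical endpoint
ZIP_CODES = ["10001", "30301", "60601", "94105"]      # sample customer segments
OFFER_TYPES = ["prepaid", "postpaid", "family", "business"]

def crawl_all_combinations():
    results = []
    # Visit every (zip code, offer type) pair so that no customer
    # segment's offers are left unfetched.
    for zip_code, offer_type in itertools.product(ZIP_CODES, OFFER_TYPES):
        response = requests.get(
            BASE_URL,
            params={"zip": zip_code, "offerType": offer_type},
            timeout=30,
        )
        if response.ok:
            results.append((zip_code, offer_type, response.text))
    return results
```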
2. Frequent site changes
Since the mobile industry is a fast-paced one, the data available on these websites tends to change very often. Mobile network providers frequently modify their existing offers, discontinue some and introduce new ones. This demanded close monitoring and an automated web scraper setup that could cope with site changes.
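One simple way to flag a site change, shown purely as an illustration rather than our production setup, is to check that the CSS selectors the extractor depends on still match something on the page. The selectors and URL below are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

# Selectors the extractor relies on; placeholders for illustration.
EXPECTED_SELECTORS = {
    "offer_card": "div.offer-card",
    "offer_price": "span.offer-price",
    "offer_terms": "div.offer-terms",
}

def detect_layout_change(url: str) -> list:
    """Return the selectors that no longer match anything on the page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [name for name, css in EXPECTED_SELECTORS.items()
            if not soup.select(css)]

broken = detect_layout_change("https://www.example-carrier.com/offers")
if broken:
    print("Possible site change, failing selectors:", broken)
```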
3. Character encoding issues
A website's character encoding is typically declared in its HTML code. However, some websites carry a wrong character encoding declaration or use more than one encoding across the site. This makes the web crawler setup more complex and keeps causing issues as long as the site remains inconsistent with its character encoding.
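A rough sketch of one way to work around unreliable encoding declarations: try the charset declared in the HTTP response first, fall back to the encoding detected from the raw bytes, and only then to UTF-8. This is an illustrative approach, not necessarily the exact handling in our pipeline.

```python
import requests

def fetch_text(url: str) -> str:
    response = requests.get(url, timeout=30)
    declared = response.encoding            # from the HTTP Content-Type header
    detected = response.apparent_encoding   # sniffed from the raw bytes
    for encoding in (declared, detected, "utf-8"):
        if not encoding:
            continue
        try:
            return response.content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters rather than fail.
    return response.content.decode("utf-8", errors="replace")
```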
4. Redundant data on the site
Redundant data can be a real problem, especially when web data extraction happens at a large scale. While we have a cleaning system that finds and removes redundant entries from the dataset, a site that itself serves redundant data makes the extraction all the more difficult to handle.
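The core of such a cleaning step can be as simple as the sketch below: hash a normalised tuple of the fields that define "the same record" and keep only the first occurrence. The field names are illustrative, not the actual schema.

```python
import hashlib

def dedupe(records, key_fields=("plan_name", "price", "zip_code")):
    seen = set()
    unique = []
    for record in records:
        # Build a normalised key from the identifying fields.
        key = "|".join(str(record.get(f, "")).strip().lower()
                       for f in key_fields)
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```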
Project 2: Extract Data from Hotel Discovery and Price Comparison Platforms
Target sites: Online travel portals and hotel websites
Data required: Hotel listings and reviews
The client wanted to extract hotel data from hundreds of travel websites across the globe to build a one-stop hotel search engine. Every target site needed its own crawler setup, and the individual challenges involved in setting up data crawlers for 100+ sites made this a demanding project to embark upon.
Project Challenges
1. Blocking
Some sites in the target list had blocking mechanisms aimed at automated crawlers. We handled this by following the best practices of web scraping: keeping the frequency of GET requests optimal and requesting only a nominal number of pages at a time.
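A rough sketch of such polite crawling follows: a fixed delay between GET requests, a small batch of pages per pass, and backing off when the server signals throttling. The delay, batch size and user agent string are examples, not our actual settings.

```python
import time
import requests

REQUEST_DELAY = 5          # seconds between requests (example value)
PAGES_PER_BATCH = 20       # nominal number of pages per pass (example value)

def fetch_batch(urls, user_agent="ExampleCrawler/1.0 (+contact-url)"):
    pages = {}
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    for url in urls[:PAGES_PER_BATCH]:
        response = session.get(url, timeout=30)
        if response.status_code == 429:       # server asks us to slow down
            retry_after = response.headers.get("Retry-After", "60")
            time.sleep(int(retry_after) if retry_after.isdigit() else 60)
            response = session.get(url, timeout=30)
        if response.ok:
            pages[url] = response.text
        time.sleep(REQUEST_DELAY)             # keep the request frequency low
    return pages
```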
2. Discovery
Discovery of the URLs to be fetched is a critical stage in the web crawling and data extraction process, and the poor navigational structure of some target sites made it tough for the crawlers to traverse pages seamlessly. We handled this by setting up multiple fallback rules for the URL discovery operation.
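As a hedged illustration of layered URL discovery, the sketch below tries the sitemap first and falls back to harvesting in-page links when the sitemap is missing; the real fallback rules are site-specific.

```python
import requests
from urllib.parse import urljoin
from xml.etree import ElementTree
from bs4 import BeautifulSoup

def discover_urls(base_url: str) -> set:
    urls = set()
    # Rule 1: the sitemap, the cheapest and most complete source when present.
    try:
        sitemap = requests.get(urljoin(base_url, "/sitemap.xml"), timeout=30)
        if sitemap.ok:
            tree = ElementTree.fromstring(sitemap.content)
            urls.update(loc.text for loc in tree.iter()
                        if loc.tag.endswith("loc") and loc.text)
    except (requests.RequestException, ElementTree.ParseError):
        pass
    # Rule 2 (fallback): harvest links from the landing page itself.
    if not urls:
        html = requests.get(base_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        urls.update(urljoin(base_url, a["href"])
                    for a in soup.find_all("a", href=True))
    return urls
```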
3. Character encoding issues
Character encoding issues were a challenge in this task as well. We had to manually ensure that the encodings we used matched those of each target site. For sites that were inconsistent with their character encoding, we also set up some automation to handle the problem.
4. Redundant data on the target sites
Redundant data present on several sites added to the challenge of this project. We let our cleaning system take care of the redundancy in the extracted data, and this approach worked for the client too.
Project 3: A Big4 Consulting Firm Needed Product Data to Build a Price Intelligence System
Target sites: Popular eCommerce portals
Data required: Product information
The client was looking to help one of their customers with price intelligence and needed a service that could not only deliver the product data but also do the matching. Although we usually don't handle processes outside data extraction and delivery, we decided to take this up considering the scale and the interesting nature of the requirement.
Project Challenge
1. Product matching
Product matching is a highly challenging task that lies outside the scope of web scraping expertise. A strong matching system is essential here, as every ecommerce portal describes the same product slightly differently, including variations in the product name and brand name.
However, to meet the demands of this unique project, we developed an algorithm that could do the matching once the data had been extracted and indexed at our end.
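As a simplified illustration of the idea (not the production algorithm), the sketch below normalises brand and product names and pairs records from two portals when their string similarity clears a threshold. The field names and the threshold are assumptions.

```python
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def match_products(portal_a, portal_b, threshold=0.85):
    """Each portal is a list of dicts with 'brand' and 'name' keys."""
    matches = []
    for item_a in portal_a:
        key_a = f"{item_a['brand']} {item_a['name']}"
        # Pick the closest candidate from the other portal.
        best = max(
            portal_b,
            key=lambda item_b: similarity(key_a, f"{item_b['brand']} {item_b['name']}"),
            default=None,
        )
        if best and similarity(key_a, f"{best['brand']} {best['name']}") >= threshold:
            matches.append((item_a, best))
    return matches
```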
Web Scraping Service is All about Solving Challenges
Given the lack of standardization in the data displayed by websites, web scraping is, and always will be, a challenging task that needs to be tackled with skill, experience and expertise. This is exactly why we stress the importance of going with a fully managed solution for business web data requirements, irrespective of the company's size and domain.