Web crawling comes with its challenges, and this should come as no surprise if you’ve ever tried your hand at scraping data from a website. Data on the web follows no common rules, structure, or standards, and this alone makes it difficult to predict the kinds of issues one might run into while crawling the web for data. The difficulty grows manifold when complex crawling needs to be done at scale.
Web data, despite holding invaluable insights for businesses, still remains a hard nut to crack for many. This is where a specialized service like ours comes into the picture. At PromptCloud, we receive requirements of all sorts from all over the world, and each scraping task is a challenge in itself. However, the complexity of extracting web data varies a lot depending on several factors. Here are some of the most challenging scraping tasks we’ve handled so far.
Project 1: A business intelligence provider catering to the telecom sector
Target sites: Websites of cell phone carriers
Data points required: All offers available for various customer segments
The company wanted to gather data on the offers available on various cell phone carriers’ websites in order to provide a competitive edge to their customers in this domain. The requirement was feasible, though extremely complex. The following issues made this a particularly challenging project.
1. Too many steps to get to the data
The offer information on the target sites was displayed only after certain variables, such as the customer’s zip code and the offer type, had been submitted, which meant a long path before the actual data appeared. The crawler therefore had to be programmed to select every possible combination of inputs in order to get the site to display all the available data.
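Enumerating every input combination can be sketched with `itertools.product`. The input dimensions and values below are hypothetical; a real crawler would read them from each carrier's site.

```python
from itertools import product

# Hypothetical input dimensions -- real values would be scraped from
# each carrier's selection forms.
zip_codes = ["10001", "60601", "94105"]
offer_types = ["prepaid", "postpaid", "family"]

def input_combinations(zips, offers):
    """Yield every (zip_code, offer_type) pair the crawler must submit
    so that no offer segment is left unvisited."""
    yield from product(zips, offers)

combos = list(input_combinations(zip_codes, offer_types))
# 3 zip codes x 3 offer types = 9 distinct crawl paths
```

Each yielded pair corresponds to one form submission the crawler replays before extracting the resulting offer page.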
2. Frequent site changes
Since the mobile industry is fast-paced, the data on these websites tends to change very often. Mobile network providers modify their existing offers, discontinue some, and introduce new ones. This demanded close monitoring and automated means of handling site changes.
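One common way to automate site-change detection, sketched here as a minimal example, is to fingerprint a page's tag structure rather than its text: offer text is expected to change, but a change in layout is what breaks extraction rules. This is an illustrative approach, not necessarily the monitoring system described above.

```python
import hashlib
import re

def page_fingerprint(html: str) -> str:
    """Reduce a page to a structural fingerprint: keep only the opening
    tag names, so ordinary content updates are ignored while layout
    changes (which break extraction rules) alter the hash."""
    tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html)
    return hashlib.sha256("|".join(tags).encode()).hexdigest()

# Same layout, different offer text -> same fingerprint:
baseline = page_fingerprint("<div><ul><li>Offer A</li></ul></div>")
text_only = page_fingerprint("<div><ul><li>Offer B</li></ul></div>")
# Page redesigned -> fingerprint changes, flagging the crawler for review:
redesigned = page_fingerprint("<section><table><tr><td>B</td></tr></table></section>")
```

A monitoring job can store the fingerprint from the last successful crawl and raise an alert whenever the new value differs.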
3. Character encoding issues
A website typically declares its character encoding in its HTML code. However, some websites have an incorrect encoding declaration or use more than one encoding across the site. This makes the crawler setup more complex and continues to cause issues as long as the site isn’t consistent with its character encoding.
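A defensive decoder can be sketched as follows: try the declared encoding first, then fall back to common alternatives when decoding fails. The helper name and fallback list are illustrative; production crawlers often add a charset-detection library on top.

```python
from typing import Optional

def decode_page(raw: bytes, declared: Optional[str]) -> str:
    """Try the declared encoding first, then common fallbacks.
    Hypothetical helper for sites whose encoding declaration is
    wrong or inconsistent across pages."""
    candidates = ([declared] if declared else []) + ["utf-8", "latin-1"]
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding -- try the next one
    return raw.decode("utf-8", errors="replace")  # last resort

# A page that declares utf-8 but is actually latin-1 encoded:
text = decode_page("Offre câblée".encode("latin-1"), declared="utf-8")
```

Here the utf-8 attempt fails on the latin-1 bytes, and the latin-1 fallback recovers the original text intact.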
4. Redundant data on the site
Redundant data can be a real problem, especially when the scale of extraction is large. While we have a cleaning system meant to find and remove redundant entries from the dataset, redundant data on the site itself makes the extraction all the more difficult to handle.
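Deduplication of extracted records can be sketched by hashing a normalized key built from the identifying fields. The field names below are hypothetical placeholders, not the schema of any particular project.

```python
import hashlib

def record_key(record: dict) -> str:
    """Normalize the identifying fields and hash them, so cosmetic
    differences (case, stray whitespace) don't defeat deduplication.
    The field names are illustrative."""
    normalized = "|".join(
        str(record.get(k, "")).strip().lower()
        for k in ("name", "price", "url")
    )
    return hashlib.md5(normalized.encode()).hexdigest()

def dedupe(records):
    """Keep the first occurrence of each logical record."""
    seen, unique = set(), []
    for r in records:
        k = record_key(r)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique

listings = [
    {"name": "Unlimited Plan", "price": "40", "url": "https://example.com/a"},
    {"name": "  unlimited plan ", "price": "40", "url": "HTTPS://EXAMPLE.COM/A"},
]
unique = dedupe(listings)
```

The second listing differs only in case and whitespace, so it hashes to the same key and is dropped.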
Project 2: A hotel discovery and comparison platform
Target sites: Online travel portals and hotel websites
Data required: Hotel listings and reviews
The client wanted to extract hotel data from hundreds of travel websites across the globe to build a one-stop hotel search engine. Every target site needed its own crawler setup, and the individual challenges involved in setting up crawlers for 100+ sites made this a demanding project to embark upon.
1. Site blocking mechanisms
Certain sites in the target list had various blocking mechanisms targeted at automated crawlers. This had to be handled by using an optimal frequency of GET requests and requesting only a nominal number of pages at a time. We avoided the blocking mechanisms by following the best practices of web scraping.
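A minimal sketch of request throttling, one of the practices described above, enforces a fixed minimum interval between GET requests to the same site. The class name and interval value are illustrative, not the settings used on any particular target.

```python
import time

class PoliteFetcher:
    """Throttle requests so no two GETs to a site are closer together
    than min_interval seconds. Values here are illustrative."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self.last_request = 0.0  # monotonic timestamp of the last GET

    def wait_time(self, now: float) -> float:
        """Seconds to sleep before the next request is allowed."""
        return max(0.0, self.last_request + self.min_interval - now)

    def fetch(self, url: str) -> None:
        time.sleep(self.wait_time(time.monotonic()))
        self.last_request = time.monotonic()
        # ... issue the actual GET request for `url` here ...
```

Keeping the delay computation in `wait_time` separate from the sleep makes the throttling logic easy to test without actually waiting.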
2. Poor navigational structure
Discovery of the URLs to be fetched is a critical stage in the web scraping process, and the poor navigational structure of some target sites made it tough for the crawlers to traverse pages in a seamless fashion. We handled this by setting up multiple fallback rules for the URL discovery operation.
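Fallback rules for URL discovery can be sketched as an ordered list of strategies, where the crawler moves to the next strategy when one fails or yields nothing. The strategy functions below are hypothetical stand-ins for real discovery logic such as sitemap parsing or category-page traversal.

```python
def discover_urls(strategies):
    """Run discovery strategies in priority order; fall back to the
    next one when a strategy fails or returns no URLs."""
    for strategy in strategies:
        try:
            urls = strategy()
        except Exception:
            continue  # a broken navigation path falls through to the next rule
        if urls:
            return urls
    return []

# Hypothetical strategies for a site whose sitemap is missing:
def from_sitemap():
    raise IOError("sitemap missing")  # e.g. /sitemap.xml returns 404

def from_category_pages():
    return ["https://example.com/hotels?page=1"]

urls = discover_urls([from_sitemap, from_category_pages])
```

Here the sitemap strategy fails, so discovery falls back to crawling the category pages instead.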
3. Character encoding issues
Character encoding issues were a challenge in this task as well, and we had to manually ensure that the encodings we used matched those of each target site. For sites with inconsistent character encoding, we also set up automation to handle the problem.
4. Redundant data on the target sites
Redundant data present on several sites added to the challenge of this project. We let our cleaning system take care of the redundancy in the extracted data, and this approach worked for the client as well.
Project 3: A popular business consulting firm looking to build a price intelligence system
Target sites: Popular ecommerce portals
Data required: Product information
The client was looking to help one of their customers with price intelligence and needed a service that could not only deliver the product data but also do the matching. Although we usually don’t handle processes outside data extraction and delivery, we decided to take this up considering the scale and interesting nature of the requirement.
1. Product matching
Product matching is a highly challenging task that lies outside the scope of web scraping expertise. A strong matching system is essential here, as every ecommerce portal will have minor differences in its product descriptions, including the product name and brand name. To meet the demands of this unique project, we developed an algorithm that could do the matching once the data had been extracted and indexed at our end.
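A simple baseline for title-based product matching, shown here only as an illustrative sketch and not the algorithm described above, normalizes the titles and compares them with a string-similarity ratio. The threshold value is a placeholder; real systems tune it per product category.

```python
import re
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    near-identical listings compare cleanly across portals."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_same_product(a: str, b: str, threshold: float = 0.85) -> bool:
    # Threshold is illustrative; tune against labeled match pairs.
    return similarity(a, b) >= threshold
```

On real catalogs, a baseline like this is usually combined with structured attributes (brand, model number, capacity) because title text alone produces both false matches and misses.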
Web scraping is all about solving challenges
Given the lack of standardization in the data displayed by websites, web scraping is, and always will be, a challenging task that needs to be tackled with skill, experience, and expertise. This is exactly why we stress the importance of going with a fully managed solution for the web data requirements of businesses, irrespective of their size and domain.