Maintain Data Quality While Handling Large Scale Extraction

Contact information

PromptCloud Inc, 16192 Coastal Highway, Lewes De 19958, Delaware USA 19958

We are available 24/ 7. Call Now. marketing@promptcloud.com

April 12, 2017
Last updated: December 30, 2024
how to maintain data quality
Blog

Table of Contents

How to maintain data quality, The demand for high-quality data is increasing along with the rise in products and services that require data to run. Although the information available on the web is increasing in terms of quantity and quality, extracting it in a clean, usable format remains challenging to most businesses. Having been in the web data extraction business for long enough, we have come to identify the best practices and tactics that would ensure high-quality data from the web.

At PromptCloud, we not only make sure data is accessible to everyone, but we also make sure it’s of high quality, clean, and delivered in a structured format. Here is how we maintain the quality while handling zettabytes of data for hundreds of clients from across the world.

Manual QA Process

1. Crawler Review

Every web data extraction project starts with the crawler setup. Here, the quality of the crawler code and its stability is of high priority as this will have a direct impact on how to maintain data quality. The crawlers are programmed by our tech team members who have high technical acumen and experience. Once the crawler is made, two peers review the code to make sure that the optimal approach is used for extraction and to ensure there are no inherent issues with the code. Once this is done, the crawler is deployed on our dedicated servers.

2. Data Review

The initial set of data starts coming in when the crawler is run for the first time. This data is manually inspected, first by the tech team and then by one of our business representatives before the setup is finalized. This manual layer of quality check is thorough and weeds out any possible issues with the crawler or the interaction between the crawler and website. If issues are found, the crawler is tweaked to eliminate them before the setup is marked complete.

Automated Monitoring

Websites get updated over time, quite frequently than you’d imagine. Some of these changes could break the crawler or cause it to start extracting the wrong data. This is why we have developed a fully automated monitoring system to watch over all the crawling jobs happening on our servers. This monitoring system continuously checks the incoming data for inconsistencies and errors. There are three types of issues it can look for:

1. Data Validation Errors

Every data point has a defined value type. For example, the data point ‘Price’ will always have a numerical value and not text. In cases of website changes, there can be class name mismatches that might cause the crawler to extract the wrong data for a certain field. The monitoring system will check if all the data points are in line with their respective value types. If an inconsistency is found, the system immediately sends out a notification to the team members handling that project, and the issue is fixed promptly.

2. Volume-Based Inconsistencies

There can be cases where the volume count for records significantly drops or increase irregularly. This is a red sign as far as web crawling goes. The monitoring system will already have the expected record count for different projects. If inconsistencies are spotted in the data volumes, the system sends out a prompt notification.

3. Site Changes

Structural changes happening to the target websites are the main reason why crawlers break. This is monitored by our dedicated monitoring system, quite aggressively. The tool performs frequent checks on the target site to make sure nothing has changed since the previous crawl. If changes are found, it sends out notifications for the same.

High-End Servers

It is understood that web crawling is a resource-intensive process that needs high-performance servers. The quality of servers will determine how smooth the crawling happens and this, in turn, has an impact on the quality of data. Having firsthand experience in this, we use high-end servers to deploy and run our crawlers. This helps us avoid instances where crawlers fail due to the heavy load on servers.

Data Cleansing

The initially crawled data might have unnecessary elements like HTML tags. In that sense, this data can be called crude. Our cleansing system does an exceptionally good job at eliminating these elements and cleaning up the data thoroughly. The output is clean data without any of the unwanted elements.

Structuring

Structuring is what makes the data compatible with databases and analytics systems by giving it a proper, machine-readable syntax. This is the final process before delivering the data to the clients. With structuring done, the data is ready to be consumed either by importing it to a database or plugging it into an analytics system. We deliver the data in multiple formats – XML, JSON, and CSV which also adds to the convenience of handling it.

Prompt Support

Apart from a thorough QA process that includes both manual and automatic layers, our support engineers are always ready to help clients in case something goes unnoticed, but the chances are slim. Clients can get support through our dedicated ticketing system by raising tickets for the issues they’re facing. Critical problems will be marked as a high priority and solved within just 24 hours.