Maintaining Data Quality in Web Archiving
Technology has changed radically over the years, and the lives of individuals and the workings of businesses have changed with it. The forces that influence markets, inform decision-making, and can change the direction of a company have shifted as well. In the last few years, many important business processes and domains have been completely rewritten. With the arrival of concepts like the Internet of Things and Big Data, the equation has changed for most businesses. Companies that work these new factors into their operations correctly are gaining a significant competitive edge.
Data has evolved the smart way
Information is king, and the volume of high-quality, actionable information available to a company can often be the measure of its success. With activities like web data extraction becoming routine business processes, standards of data quality are also starting to take shape. New web crawling techniques are being developed, and the tools for the job are becoming more targeted and precise.
Web archiving is a large part of web scraping and of using the resulting data as business information and insight. It is the process of collecting and storing web pages as they appear to users on a particular day, and repeating the process over time as those pages change and evolve. The stored information is then used for research and information preservation.
Web archiving is often carried out by businesses that intend to use the data they harvest and archive to further their own business interests. It is also sought after by companies that provide data as a service (DaaS) solutions. With web archiving, there is always the question of maintaining high data quality at all times. Let’s take a look at why that is important, and a few things to keep in mind to achieve that.
The Basics of Web Archiving
Web archiving is a process that starts with web crawling. Crawling software scrapes data from a set of web pages sequentially, storing the information for later use. The goal is to collect these web pages the way they appear on a certain date, and to repeat the effort later when anything on a page changes. During crawling and archiving, the aim is to collect entire web pages, including embedded items like images. The emphasis is on doing it in a manner that completely preserves the inherent link structure. Used carefully, web archiving can yield a lot of relevant, actionable data that can empower businesses in different ways, provided the quality of the data is maintained throughout.
Web archiving can be carried out at a micro or macro level, depending on the requirements of the company using it. At the micro level, it involves capturing one or a few particular websites with the intention of using the data for research. The purpose is usually to chronicle the way a website changes over time, and to look for elements that are added to it or taken away from it. Companies sometimes even archive their own websites so they can later look for ways to improve their web presence. Dynamic web pages are often difficult to harvest completely, and require some human interaction to yield the desired results. Consequently, micro archiving tends to be a labor-intensive, time-consuming process.
Macro level web archiving, on the other hand, happens at a very large scale. When companies want to study the internet, analyze link patterns, and track changes over time, there is a need for large-scale data collection. Locating particular data types based on form or content also requires the collection and storage of large volumes of data, and web archiving is the ideal way to achieve that. This is usually the choice for companies providing data as a service solutions, and it brings a lot of information to the table.
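To make the idea of capturing embedded items and preserving link structure concrete, here is a minimal sketch using only the Python standard library. The class and function names are illustrative assumptions, not part of any particular archiving tool, and a real crawler would handle many more tag types.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageReferenceParser(HTMLParser):
    """Collects outgoing links and embedded resources from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # pages to visit next, which preserves the link structure
        self.embedded = []  # images, scripts and stylesheets needed to render the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "script") and attrs.get("src"):
            self.embedded.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.embedded.append(urljoin(self.base_url, attrs["href"]))


def extract_references(html, base_url):
    """Return (links, embedded) as lists of absolute URLs found in the page."""
    parser = PageReferenceParser(base_url)
    parser.feed(html)
    return parser.links, parser.embedded
```

An archiving crawler would fetch every URL in the embedded list together with the page itself, and queue the URLs in the links list for later capture.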
Finer Points Regarding Data Quality in Web Archiving
In a normal, routine web archiving process, web scraping tools crawl full web pages and record them as is. The harvesting tool is most accurate when it behaves like a human user. That way, it captures all embedded content, follows the inherent link structure to other pages, and captures those as well. When the collection is complete, the system can put a timestamp on it and save it in a database for later use.
To preserve high quality of information and to keep things authentic, there are certain aspects of web archiving that need to be paid attention to. Assuming that web pages render the regular way, some of the other important aspects that matter in preserving data quality during web archiving follow below.
Recording Harvesting Context
When a web page or a website is captured for a specific purpose or as a part of a process that is large in scale, it can be very easy to lose track of things. Such archiving without context can very easily render the stored information incomplete or irrelevant. To preserve data quality and ensure that archiving efforts bear fruit, there must always be attention on recording the context and particular circumstances of the harvest.
This extra information can be of great help when the archived data is used in the future, giving researchers context that they can factor into their work. Companies should prefer harvesting methods that can store harvesting context, including details like:
- the date and time of harvest
- the duration of time it takes to complete the harvest
- the HTTP requests used to ask for the information
- the way the response was generated
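The contextual details listed above could be kept alongside each harvested page in a small record. The following is a rough sketch under stated assumptions: `HarvestRecord`, `harvest`, and the `fetch` callable are hypothetical names introduced for illustration, and `fetch` stands in for whatever HTTP client the crawler actually uses.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class HarvestRecord:
    url: str
    harvested_at: str        # date and time of harvest (ISO 8601, UTC)
    duration_seconds: float  # how long the harvest took
    request_method: str      # how the information was requested
    request_headers: dict
    response_status: int     # how the server responded
    response_headers: dict
    body: bytes


def harvest(url, fetch, request_headers=None):
    """Fetch a URL and record the context of the harvest.

    `fetch` is an assumed callable: (url, headers) -> (status, headers, body).
    """
    request_headers = dict(request_headers or {})
    started_at = datetime.now(timezone.utc)
    t0 = time.monotonic()  # monotonic clock, so the duration cannot go negative
    status, response_headers, body = fetch(url, request_headers)
    duration = time.monotonic() - t0
    return HarvestRecord(
        url=url,
        harvested_at=started_at.isoformat(),
        duration_seconds=duration,
        request_method="GET",
        request_headers=request_headers,
        response_status=status,
        response_headers=dict(response_headers),
        body=body,
    )
```

Passing the fetch function in as a parameter keeps the sketch independent of any one HTTP library and makes the record-keeping easy to test without network access.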
To further improve the quality of data, companies can also look for a web archiving tool with the following features:
- The ability to correctly store not just the response information, but also the control information resulting from the harvesting protocol (like request headers)
- The ability to efficiently store metadata that is linked to other data which has already been stored (like encoding, language and subject classifiers)
- The ability to store not only the payload content, but also the control information resulting from the usual protocols involved in internet applications (like HTTP, FTP and DNS)
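One way to picture these features is an archive in which the response, the request that produced it, and any later metadata are separate records linked by identifier. The sketch below is an in-memory illustration only; `ArchiveRecord`, `Archive`, and the record type names are assumptions made for the example rather than the layout of any specific tool.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchiveRecord:
    record_id: str
    record_type: str  # "response", "request" or "metadata"
    target_url: str
    payload: bytes
    refers_to: Optional[str] = None  # id of the stored record this one describes


class Archive:
    def __init__(self):
        self.records = {}

    def add(self, record_type, target_url, payload, refers_to=None):
        """Store a record, optionally linked to one that is already stored."""
        if refers_to is not None and refers_to not in self.records:
            raise KeyError(f"unknown record: {refers_to}")
        record = ArchiveRecord(str(uuid.uuid4()), record_type, target_url, payload, refers_to)
        self.records[record.record_id] = record
        return record.record_id

    def related(self, record_id):
        """Return every record that refers to the given record."""
        return [r for r in self.records.values() if r.refers_to == record_id]


archive = Archive()
url = "https://example.com/"
# Payload content plus control information (status line and headers) of the response
response_id = archive.add("response", url, b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>")
# Control information from the harvesting protocol: the request that was sent
archive.add("request", url, b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n", refers_to=response_id)
# Metadata added later and linked to the already stored response
archive.add("metadata", url, b"language: en\nencoding: utf-8\n", refers_to=response_id)
```

Keeping metadata as its own linked record means classifiers or language detection can be run long after the harvest without rewriting the original capture.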
Efficiency at Scale
Macro archiving is a process that demands simplicity and transparency. The process needs enough flexibility to accommodate many types of data, and the resulting data must be stored in a format that makes later processing easier. Since any harvesting process can be interrupted, there must be no requirement for harvested pages to be in any particular order.
There should also be no limit on file size, but the ability to segment a large harvest into chunks of a pre-configured size is useful, because it keeps individual files manageable for the storage and processing technology available at the time of the harvest. The right format should also allow for simple, efficient merging of aggregations. Since web pages are likely to be crawled and harvested multiple times, a system to detect and avoid duplicate content needs to be in place as well. Keeping these points in mind, the following abilities are desirable in any web archiving system for better quality of data:
- Storage in the right format for efficient future processing
- Ability to have an integrated duplicate detection system linked to already stored data to reduce unnecessary storage load
- Ability to integrate key processes like data compression and built-in data integrity checks
- Support for implementing deterministic processing of long and information-heavy records through the use of segmentation and truncation
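Several of these abilities can be sketched together in a few lines. The toy store below is an assumption-laden illustration, not a production design: it detects duplicates by SHA-256 digest, compresses payloads with gzip, verifies the digest on read as an integrity check, and starts a new segment once a configured size is reached. It keeps everything in memory and does not cover truncation.

```python
import gzip
import hashlib


class SegmentedStore:
    """Toy in-memory store with deduplication, compression and size-based segments."""

    def __init__(self, max_segment_bytes=1_000_000):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [[]]    # each segment is a list of (digest, compressed bytes)
        self.segment_sizes = [0]
        self.index = {}         # digest -> (segment number, position in segment)

    def put(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.index:
            return digest       # duplicate content: nothing new is stored
        blob = gzip.compress(payload)
        # Roll over to a new segment if this blob would overflow a non-empty one
        current = self.segment_sizes[-1]
        if current and current + len(blob) > self.max_segment_bytes:
            self.segments.append([])
            self.segment_sizes.append(0)
        self.segments[-1].append((digest, blob))
        self.segment_sizes[-1] += len(blob)
        self.index[digest] = (len(self.segments) - 1, len(self.segments[-1]) - 1)
        return digest

    def get(self, digest: str) -> bytes:
        segment, position = self.index[digest]
        stored_digest, blob = self.segments[segment][position]
        payload = gzip.decompress(blob)
        # Integrity check: the decompressed payload must match its recorded digest
        if hashlib.sha256(payload).hexdigest() != stored_digest:
            raise ValueError("integrity check failed")
        return payload
```

Because records are located through the digest index rather than by position in a sequence, pages can be stored in whatever order the harvest happens to deliver them.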
To wrap up
For companies that carry out enterprise grade web archiving processes, paying attention to these aspects can make it a lot easier to obtain high quality data and to preserve its quality over time across multiple resources. Good quality data is something that comes in handy for important research at a later time, and this should be kept in mind right from the start of the harvesting process. Never losing focus on the final use of the obtained data can be a great way to fine-tune web archiving processes to yield the right kind of results.
To enhance access and improve the usability of data in the long run, you can also consider an archiving system that records important metadata about the harvested resources. Assigning topical subject classifications through simple textual analysis gives researchers an additional way to find material, and features like subset generation can come in handy at a later time. Data also comes in a variety of formats, some of which can become obsolete or unusable over time. The right web extraction and archiving tool should therefore be able to convert information into a standardized format across the board. This helps improve the overall quality of the data and keeps it usable and relevant in the long run.