Using Web Scraping for Investigative Journalism
Web scraping has become a valuable tool for generating data and insight, and it has added value to businesses across many industry verticals. From healthcare to automotive, and from life sciences to government agencies, few verticals have remained untouched by it. What is interesting to note is how web scraping and data extraction are now being put to newer uses. One such exciting avenue is the field of investigative journalism.
What is investigative journalism?
Investigative journalism is a crucial part of fact-based reporting. It is the field in which a journalist investigates a single topic in depth, often one related to law and order or to criminal activity. What is striking is the amount of time and effort a journalist will spend on that one topic: researching it and preparing a detailed investigation report may take weeks, months, or even years.
A crucial aspect of investigative journalism is research, and this is where high-quality data mining improves the final reporting. Since much of the relevant data is hidden from plain view, a journalist has to peel back layer after layer of what is provided to uncover the facts. While considerable data is available through press releases, comments, press conferences, and corporate announcements, a true-blue investigative journalist will not rely on these alone. He or she will dig deeper to uncover the truths hidden behind the mostly rosy picture presented to the general public, and data mining is one way to accomplish this difficult task.
This is the backbone of data journalism: powering investigative journalism with the help of data.
What is data journalism?
The term data-driven journalism was coined in 2009. However, its practical application is nearly as old as the concept of data itself. Find it hard to believe? A report from 1858 on the wartime conditions that British troops had to face shows how a story was woven around facts and data to present a compelling visualization meant to elicit prompt action from leaders. And yes, the report is more than 150 years old!
Data journalism can be defined as the journalistic practice suited to today’s age of data explosion. In this practice, a journalist analyzes huge data sets and generates insights from them. The aim is to create a fact-filled news story that relies on data rather than hearsay. You may ask why this practice is gathering so much steam now, when news stories have been written for decades. The answer is simple: today a great deal of data is being generated, stored, curated, and consumed. The main factors that have driven data journalism include:
- Availability of open source tools that bring down the cost of computer-based data analysis and insight generation
- Open access to data and published content, which has helped remove restrictions on access (e.g. access charges or subscription fees) and on usage (e.g. copyright and licensing restrictions)
- The concept of open data, which makes much data freely available on channels like the Internet and trade or government publications
The easy access to open data means that data journalism need not be limited to professional data scientists. Anyone possessing a familiarity with a spreadsheet can carry out investigative journalism to uncover hidden facts. However, this also means that the practice should have a well-defined process so that the wider spread of users doesn’t dilute the efficacy of investigative journalism.
Data journalism – The key steps
As discussed above, data journalism needs to be a well-thought-out process built from a few key steps. At a very basic level, the workflow says that information must first be sourced or found, and then made sense of. This may involve the use of tools like SQL. The information must then be analyzed, which may require getting terminology and technical jargon right. After this, the data must be visualized so that the collected information is presented in a pictorial format that is easier to digest. Once this is ready, it can be distributed to the intended audience or stakeholders. This is the final stage, where the facts, reports, and trends are brought to a larger audience in the form of a news story.
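As a rough illustration of the sourcing step with SQL, the sketch below loads a few records into an in-memory SQLite database and totals them by group. The table name, column names, and figures are all made up for the example; a real investigation would query an existing data set instead.

```python
import sqlite3

# Hypothetical records: (district, school, students). Not real data.
ROWS = [
    ("North", "School A", 120),
    ("North", "School B", 80),
    ("South", "School C", 60),
]

def students_by_district(rows):
    """Load rows into an in-memory SQLite table and total students per district."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE schools (district TEXT, name TEXT, students INTEGER)"
        )
        # Parameterized inserts keep the data separate from the SQL text.
        conn.executemany("INSERT INTO schools VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT district, SUM(students) FROM schools "
            "GROUP BY district ORDER BY district"
        )
        return cursor.fetchall()
    finally:
        conn.close()

print(students_by_district(ROWS))
```

The same query shape (filter, group, total) carries over to larger data sets; only the connection target and the schema would change.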
The most well-known study on the workflow of data journalism was released in 2011 by Paul Bradshaw. It outlined six different phases under an “inverted pyramid of data journalism”. Let’s look at a typical workflow involving data journalism in this inverted pyramid:
- Find: Sourcing the information or data online
- Clean: Add filters and logic to transform data
- Visualize: The transformed data then reveals inferences, trends, statistics, or patterns, presented as a static or animated visual
- Publish: Joining the visuals together to weave a compelling story
- Distribute: Sharing the story on various distribution channels such as the Internet, social media, smartphones, or tablets
- Measure: Monitoring consumption of the content to see trends and the types of users reading it
We will now explore these steps in more detail.
Finding data – Gathering data is the first step in investigative journalism. There are many ways of finding data, from field trips that establish the actual cause of criminal wrongdoing to studies of the impact of a long-term issue. To find the data, you will first need to determine the right sources. If somebody has already published on the issue you are investigating, it makes sense to use that secondary research as a starting point. If, however, you are investigating something sensitive, you may need to bypass the grapevine and rumors and carry out your own impartial research to find the data.
Take the example of the controversial investigative journalism work carried out by a certain ‘NH’ in 1821 (yes, almost 200 years ago!). It presented a list of the students enrolled in schools in Manchester and Salford and the fees they paid. By collecting the data manually, the data journalist tried to work out how many students were receiving free education. The list showed nearly 25,000 students receiving free education, while the official records put the number at just 8,000. This exposed a massive flaw in the official statistics collected by clergymen (the data entry clerks of their day). It is a classic case of finding data that triggered action.
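Today the same kind of tally could be automated. The following is a minimal sketch, assuming the enrollment list is published as a simple HTML table; the page fragment, school names, and numbers are invented for illustration and do not reproduce the 1821 figures. It uses only Python's standard `html.parser` module to read the rows and count students who pay no fee.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real investigation would fetch this from a source site.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Students</th><th>Annual fee</th></tr>
  <tr><td>School A</td><td>120</td><td>0</td></tr>
  <tr><td>School B</td><td>80</td><td>5</td></tr>
  <tr><td>School C</td><td>60</td><td>0</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

def free_students(html):
    """Total the students enrolled at schools whose fee is zero."""
    parser = TableParser()
    parser.feed(html)
    return sum(int(students) for _, students, fee in parser.rows if float(fee) == 0)

print(free_students(SAMPLE_HTML))
```

Comparing a total computed this way against the officially reported figure is the automated equivalent of the check described above.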
Data cleaning – Data from different sources will usually arrive in different formats. It needs to be cleaned and normalized to make later analysis easier. For instance, when extracting data on weight among obese children, US data will typically be in pounds, while UK data will typically be in kilograms. Before analysis, these values need to be converted to a single unit of measurement.
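A minimal sketch of this kind of unit normalization is shown below. The record layout, source labels, and weights are hypothetical; the only fixed fact is the conversion factor, since one pound is defined as exactly 0.45359237 kilograms.

```python
LB_TO_KG = 0.45359237  # one international avoirdupois pound in kilograms

def to_kilograms(value, unit):
    """Convert a weight to kilograms; unit is a label such as 'kg' or 'lb'."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return float(value)
    if unit in ("lb", "lbs", "pound", "pounds"):
        return float(value) * LB_TO_KG
    # Fail loudly rather than silently mixing units in the cleaned data.
    raise ValueError(f"unknown weight unit: {unit!r}")

# Hypothetical records from two sources: (source, weight, unit).
records = [("source_a", 110, "lb"), ("source_b", 50, "kg")]

normalized = [
    (source, round(to_kilograms(weight, unit), 1))
    for source, weight, unit in records
]
print(normalized)
```

Raising an error on an unrecognized unit is a deliberate choice here: in investigative work, a dropped or misread value is harder to catch later than a failed cleaning run.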
Data visualization – This is an important link, where the data moves from being just numbers to a visual representation that can lead to quick inferences. Once the data has been put into spreadsheets in a meaningful format, typically after cleaning with a tool like OpenRefine, it is passed through a data visualization tool like Tableau Public.
Publishing – Using a Content Management System, the visualization is published strategically, based on the expected readership.
Data distribution – Specialized content marketplaces provide access to this investigative visualization. Through this channel, others can pick up the data stories and carry on their own line of investigation.
Evaluating the impact of investigative journalism – The entire point of conducting in-depth investigative journalism is to create a profound impact. How do you know whether your story is creating one? By using tools built specifically to monitor the impact of data stories.
To sign off
Many case studies point to the immense impact of investigative journalism driven by data extraction. The most well-known of these is the WikiLeaks publication of classified government data. The way it affected public and welfare policy at the highest level in countries like the USA speaks volumes about the deep influence of investigative journalism.
Today it is no longer sufficient to collect data and derive insight. The insight needs to be backed by a creative visualization, but more important than that, it has to be backed by a solid story created to support your viewpoint. Data journalism, with the aid of data scraping, is increasingly being viewed as a key insight generation tool and is becoming a trusted aid for data visualization and data backed news story reporting.