How many of us have done this – we were browsing the web and liked some information presented on a webpage. The immediate next step was to copy the portion of text we liked and paste it into a local document editor like Notepad or Word, for storage or future use. If you have raised your hand, then welcome to the most rudimentary form of screen scraping, better known as HTML scraping or web scraping.
In its most basic form, web scraping helps us collect information from the web (online) and store it locally for future use or analysis (offline). Many data entry executives will be well versed in this concept. They browse webpage after webpage with one singular aim: to collect as much of the required data as possible and store it locally (usually in an Excel spreadsheet) in a pre-defined format. This data can then be analyzed at an aggregate level for business intelligence.
While basic usage of data may justify a manual extraction process, things get more complex at an enterprise level. The immense volume, variety, veracity, and velocity of the resulting big data make it humanly impossible to do this activity manually. What is needed here is a targeted and well-thought-out data scraping solution that can automatically crawl multiple webpages and extract the information the program requests.
Introduction to web scraping
Right from shopping, booking movie tickets, and paying bills to tracking the financial performance of your favorite stocks or seeking medical information for an ailing loved one, everything is accessible online nowadays. As one of the biggest drivers of big data, the internet spews out information by the gigabyte every second. For instance, here is how intense the action is on the internet every second:
- 2 new accounts are created on LinkedIn
- 3 million emails are sent
- 23 Uber rides are booked
- 40,000 Google searches are performed
- 11,690 Facebook logins are executed
- 5,787 new tweets are posted
That is a massive data explosion happening every single second on the internet. Now imagine a brand like Adidas needs to know what people are saying about its range of footwear, clothing, or sports accessories. It cannot simply go to every site, search for every instance of the word ‘Adidas’, and select the relevant portions of text.
Before we look at a much cleaner and better way to crawl HTML tables, let’s go back to our example and try to figure out why Adidas would want this information.
Some ways Adidas can use this information include:
Competitive analysis – Suppose Nike (Adidas’ competitor) is launching a pair of shoes with laces that tie by themselves. If people are going crazy over the new feature, Adidas would want to know about it. It is easy to imagine the Adidas R&D labs working fervently to come up with a similarly appealing feature to catch users’ fancy and thus increase the top line. However, without the information available on the internet (in the form of user reviews, press releases, or trending news), it would be difficult for Adidas to judge what its competitor is doing to acquire users.
Market analysis – A market analysis lets Adidas gauge the distinct and dynamic nature of the market for a particular industry (in Adidas’ case, the footwear and sports accessories business). This information can help it tweak the four P’s of marketing (product, promotion, place, and price) to great effect. By using these mounds of information, Adidas can accelerate business efficiency at much lower cost overheads. With the data explosion happening on the web, Adidas can also apply customer segmentation with precision, altering its marketing campaigns to ensure the right ads are shown to the right audience for the right kind of impact.
These points make it amply clear that data extracted from the web is worth its weight in gold. Because of its impact on the fortunes of a business, web scraping has quickly emerged as a separate specialty within the big data chain. This is the cleaner and better way of scraping HTML tables that we were discussing earlier.
Characteristics of HTML scraping
Multiple approaches to HTML scraping exist in the market today. With web scraping, data extraction experts can easily extract large amounts of information (both structured and unstructured) and store it in an easy-to-comprehend format for further BI/big data analytics use. While the extracted data is usually stored in a database or spreadsheet, the key characteristic of HTML scraping is worth noting: this type of data extraction service works primarily on web-based browser pages and is built specifically to mine data from the internet.
Some instances of data displayed specifically on the internet include classifieds, yellow pages, directory listings, e-commerce reviews and comments, and contact databases. One more incredible source of information (most importantly, unstructured information) worth considering for mining is social media. With 310 million users on Twitter alone, the social network offers an amazingly vast array of information in the form of tweets, favorites, replies, and images. Using a dedicated service to crawl this goldmine of data has given decision makers and marketing teams insights that would not have been possible with traditional sources of data collection and extraction.
HTML scraping software interacts with the web server the same way a standard web browser does (sending requests and receiving page content). But instead of displaying the webpage on screen like a browser, it acts a bit differently: it saves the data from the webpage into a local storage location or database.
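The request-then-save cycle described above can be sketched in a few lines of Python using only the standard library. To keep the sketch self-contained (and runnable offline), it fetches a `file://` URL pointing at a sample page written to a temporary directory; against a real site you would pass an `http://` or `https://` URL instead. The file names and sample markup are invented for the example.

```python
import os
import tempfile
from urllib.request import urlopen

# Write a sample page to disk so the example needs no network access.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "page.html")
with open(sample, "w", encoding="utf-8") as f:
    f.write("<html><body><h1>Hello</h1></body></html>")

# Request the page the same way a browser would (same request/response cycle).
url = "file://" + sample  # stand-in for a real web address
with urlopen(url) as response:
    html = response.read().decode("utf-8")

# Instead of rendering the page, persist it locally for later analysis.
saved = os.path.join(workdir, "saved_page.html")
with open(saved, "w", encoding="utf-8") as f:
    f.write(html)
```

In a production scraper, the save step would typically write into a database or spreadsheet rather than a flat file, but the shape of the program is the same.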
HTML Scraping – What the experts didn’t tell you
Composition of an HTML page
An HTML page is like a standard container element that holds multiple boxes, rows, and columns. The boxes are defined by what we know as HTML tags, so the biggest box (the HTML window) contains smaller boxes (similar to a table having multiple cells, rows, and columns). Other elements include anchors, images, and videos. Obviously, a person scripting a scraper will need elementary knowledge of HTML. An HTML scraping script looks for these elements and then extracts the necessary information according to how the script was programmed.
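Here is a minimal sketch of that idea, using Python's standard `html.parser` module: the parser walks the nested "boxes" of a page and records the text inside every `<td>` table cell. The sample markup is invented for the example.

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text content of every <td> cell encountered."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

page = """
<html><body>
  <table>
    <tr><td>Shoe</td><td>99.90</td></tr>
    <tr><td>Jersey</td><td>49.50</td></tr>
  </table>
</body></html>
"""

parser = CellExtractor()
parser.feed(page)
print(parser.cells)  # -> ['Shoe', '99.90', 'Jersey', '49.50']
```

Real-world scripts often reach for richer parsing libraries, but the principle is the same: identify the tags of interest, then pull out the data they enclose.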
How a script to scrape HTML tables works
A scraping script is typically developed in PHP, Ruby, or Python. The trick to getting precise outcomes from the script is knowing which webpages, which elements, and which tables to look at for a given query. So an expert designing a web scraping script also needs a good understanding of the layout, formatting, and structure of the target webpage. An HTML scraping expert will know not only HTML (the web's markup language) but also HTTP (the web's communication protocol).
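Knowing *which* table to read is half the job. The following sketch, again built on Python's standard `html.parser` module, collects rows only from the table whose `id` attribute matches a target; the id `"prices"` and the markup are invented for the example, and a real script would look up whatever identifier the target page actually uses.

```python
from html.parser import HTMLParser

class TargetTableParser(HTMLParser):
    """Collect rows of cell text, but only inside the table with the given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_target = False
        self.in_cell = False
        self.row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_target = True      # entered the table we care about
        elif self.in_target and tag == "tr":
            self.row = []              # start a fresh row
        elif self.in_target and tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_target = False     # left the table
        elif self.in_target and tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.row.append(data.strip())

page = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="prices">
  <tr><td>Sneaker</td><td>120</td></tr>
  <tr><td>Track suit</td><td>80</td></tr>
</table>
"""

parser = TargetTableParser("prices")
parser.feed(page)
print(parser.rows)  # -> [['Sneaker', '120'], ['Track suit', '80']]
```

Note how the navigation table is skipped entirely: that selectivity, driven by an understanding of the page's structure, is what separates a targeted scraper from a blunt copy of the whole page.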
As is evident, a strategically crafted web scraping setup is key to getting better-quality data in less time. Without this process, getting insights from data becomes a tedious affair that will not yield the kind of RoI that management and stakeholders at client organizations want to see. To get an incredible RoI on your investment in big data and web scraping technologies, it is important to know what the experts didn’t tell you about HTML scraping.
Stay tuned for our next article to learn how to use social media scraping to improve user experience.
Planning to acquire data from the web? We’re here to help. Let us know about your requirements.