In the Big Data world, web scraping and data extraction services are primary requisites for Big Data analytics. Pulling data from the web has become almost unavoidable for companies that want to stay competitive. The next question that comes up is how to go about web scraping as a beginner.
Data can be extracted or scraped from a web source using a number of methods. Popular websites like Google, Facebook, or Twitter offer APIs to view and extract the available data in a structured manner, which removes the need for other methods that the API provider may not prefer. However, the need to crawl a website arises when the website does not readily offer the information. Python, an open-source programming language, is often used for web scraping because of its simple syntax and rich ecosystem. That ecosystem includes a library called “BeautifulSoup”, which handles this task. Let’s take a deeper look into web scraping using Python.
Setting up a Python Environment:
To carry out web scraping using Python, you will first have to install the Python environment, which enables you to run code written in the Python language. The scraping itself is performed by libraries:
Beautiful Soup is a convenient-to-use Python library and one of the finest tools for extracting information from a webpage. Professionals can crawl information from web pages in the form of tables, lists, or paragraphs, and filters can be added to extract specific information. Urllib2 is a Python module that fetches URLs and can be used in combination with the BeautifulSoup library for fetching the web pages. Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request.
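As a rough sketch of the fetching step, assuming Python 3 (where urllib2's role is played by urllib.request), a page can be downloaded as shown here. The function name fetch_html, the timeout value, and the User-Agent string are illustrative choices and not part of the original walkthrough.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    # Some sites reject requests that carry no User-Agent, so send a simple one.
    # The agent string is an arbitrary placeholder.
    request = Request(url, headers={"User-Agent": "beginner-scraper-example"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps this demo offline; swap in a real http(s) address.
    demo_url = "data:text/html;charset=utf-8," + quote("<p>Hello</p>")
    print(fetch_html(demo_url))
```

The returned string can be handed directly to BeautifulSoup for parsing.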
For Mac OS X:
To install the Python libraries on Mac OS X, users need to open a terminal window and type in the following commands, one command at a time:
pip install BeautifulSoup4
pip install lxml
For Windows 7 & 8 users:
Windows 7 & 8 users need to ensure that the Python environment is installed first. Once the environment is installed, open the command prompt, navigate to the root C:\ directory, and type in the install commands for the same two libraries (pip install BeautifulSoup4 and pip install lxml), one at a time.
Once the libraries are installed, it is time to write the data scraping code.
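Before writing the scraper, it can be worth confirming that the installation succeeded. The sketch below is one possible check using only the standard library; the helper name missing_modules is made up for this example, and "bs4" is the import name of the BeautifulSoup4 package.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "bs4" is the import name of the BeautifulSoup4 package.
    missing = missing_modules(["bs4", "lxml"])
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All scraping libraries are installed.")
```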
Data scraping should be done with a distinct objective, such as crawling the current stock of a retail store. First, use a web browser to navigate to the website that contains this data. After identifying the table that holds it, right-click anywhere on the table and select “Inspect element” from the dropdown menu. This opens a panel at the bottom or side of your screen displaying the website’s HTML code. You might need to scan through the HTML until you find the line of code that highlights the table on the webpage.
Python offers other alternatives for HTML scraping apart from BeautifulSoup. Well-known examples include Scrapy, lxml, and mechanize.
Web scraping converts unstructured data from HTML code into a structured form, such as tabular data in an Excel worksheet. Web scraping can be done in many ways, ranging from the use of Google Docs to programming languages. People who do not have any programming knowledge or technical competencies can still acquire web data through web scraping services that provide ready-to-use data from websites of their preference.
To perform web scraping, users must have a sound knowledge of HTML tags. It helps to know that HTML links are defined using the anchor tag, i.e. the <a> tag: <a href="https://…">The link text goes here</a>. An HTML list is either unordered (<ul>) or ordered (<ol>), and each list item starts with <li>.
HTML tables are defined with <table>, rows with <tr>, and each row is divided into data cells with <td>. Other tags worth knowing:
- <!DOCTYPE html>: An HTML document starts with a document type declaration
- The visible part of the HTML document is defined between the <body> and </body> tags
- The headings in HTML are defined using the heading tags from <h1> to <h6>
- Paragraphs are defined with the <p> tag in HTML
- An entire HTML document is contained between <html> and </html>
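To make these tags concrete, here is a minimal sketch that uses only Python's built-in html.parser module (so it runs even before BeautifulSoup is installed) to count the tags in a small page and collect its link targets. The sample page, the TagCounter class name, and the example.com address are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Invented sample page that uses the tags described above.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
<body>
<h1>Fruit stock</h1>
<p>See the <a href="https://example.com/fruit">full list</a>.</p>
<ul><li>Apples</li><li>Pears</li></ul>
<table>
<tr><td>Apples</td><td>12</td></tr>
<tr><td>Pears</td><td>7</td></tr>
</table>
</body>
</html>"""


class TagCounter(HTMLParser):
    """Count opening tags and collect link targets."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.links.extend(value for name, value in attrs if name == "href")


counter = TagCounter()
counter.feed(SAMPLE_PAGE)
print(dict(counter.tags))
print(counter.links)
```

The counts show the nesting described above: one table, two rows, and four data cells.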
Using BeautifulSoup in Scraping:
While scraping a webpage using BeautifulSoup, the main concern is to identify the final objective. For instance, if you would like to extract a list from a webpage, a stepwise approach is required:
- The first step is to import the required libraries and fetch the page:
#import the library used to query a website (urllib2 in Python 2; in Python 3 it is urllib.request)
from urllib.request import urlopen
#specify the url
wiki = "https://"
#Query the website and return the html to the variable 'page'
page = urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
- Use the function “prettify” to visualize the nested structure of the HTML page:
In: print(soup.prettify())
- Working with Soup tags:
soup.<tag> returns the content between the opening and closing tag, including the tag itself.
In: soup.title
Out: <title>List of Presidents in India and Brazil till 2010 – Wikipedia, the free encyclopedia</title>
- soup.<tag>.string: Returns the string within the given tag
- In: soup.title.string
- Out: 'List of Presidents in India and Brazil till 2010 – Wikipedia, the free encyclopedia'
- Find all the links within the page’s <a> tags: A link is marked up with the tag “<a>”. Note that soup.a returns only the first <a> tag on the page; to get every link, use soup.find_all("a").
- In: soup.find_all("a")
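The tag operations above can be tried offline on a small page. This is only a sketch: the HTML string is a made-up stand-in for a downloaded page, and the built-in "html.parser" backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
html = """<html><head><title>Sample list page</title></head>
<body>
<p>Intro with a <a href="/first">first link</a>.</p>
<p>And a <a href="/second">second link</a>.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())      # indented view of the nested tags
print(soup.title)           # the whole <title> tag
print(soup.title.string)    # only the text inside <title>
print(soup.a)               # only the first <a> tag

# find_all returns every <a> tag; get("href") reads the link target.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```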
- Find the right table:
Since we are looking for a table with information about Presidents in India and Brazil till 2010, identifying the right table first is important. Here’s a command to crawl the information enclosed in all table tags:
all_tables = soup.find_all('table')
Identify the right table by filtering on the table’s “class” attribute. To find the class name, right-click on the required table of the web page and proceed as follows:
- Inspect element
- Copy the class name or find the class name of the right table from the last command’s output.
right_table = soup.find('table', class_='wikitable sortable plainrowheaders')
That’s how we can identify the right table.
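The table lookup can be sketched on a small page as follows. The HTML here is invented (only the class name "wikitable sortable plainrowheaders" comes from the walkthrough), and "html.parser" is used as the parser backend.

```python
from bs4 import BeautifulSoup

# Invented page with two tables; only the second is the one we want.
html = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable plainrowheaders">
<tr><th>No.</th><th>Name</th></tr>
<tr><td>1</td><td>First person</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <table> on the page, in document order.
all_tables = soup.find_all("table")
print(len(all_tables))

for table in all_tables:
    # The class attribute comes back as a list of class names.
    print(table.get("class"))

# A multi-word class_ string matches the exact attribute value.
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")
print(right_table.find("th").string)
```

Because the multi-word string must match the class attribute exactly, searching for a single class such as "wikitable" can be less brittle if the page reorders its class names.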
- Extract the information to a DataFrame: Iterate through each row (tr), assign each cell of the row (td) to a variable, and append it to a list. First analyze the HTML structure of the table; the table headings are stored in <th> tags and need to be extracted separately.
To access the value of each element, use the “find(text=True)” option with each element (newer BeautifulSoup releases name the argument “string” instead of “text”). Finally, the lists are loaded into a DataFrame.
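The row-by-row extraction might look like the sketch below. The table contents are invented sample data, and get_text(strip=True) is used here in place of find(text=True) because it also trims surrounding whitespace; treat it as one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

# Invented sample table; a real page would be fetched and parsed first.
html = """<table class="wikitable sortable plainrowheaders">
<tr><th>No.</th><th>Name</th><th>Took office</th></tr>
<tr><td>1</td><td>Person A</td><td>1950</td></tr>
<tr><td>2</td><td>Person B</td><td>1962</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")

rows = right_table.find_all("tr")

# The first row holds the headings in <th> tags.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
columns = {header: [] for header in headers}

for row in rows[1:]:
    cells = row.find_all("td")
    if len(cells) != len(headers):
        continue  # skip rows that do not match the header layout
    for header, cell in zip(headers, cells):
        columns[header].append(cell.get_text(strip=True))

print(columns)
```

If pandas is installed, the resulting dictionary of equal-length lists can be passed to pandas.DataFrame(columns) to obtain the DataFrame mentioned above.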
There are various other ways to crawl data using “BeautifulSoup” that reduce the manual effort of collecting data from web pages. Code written with BeautifulSoup is generally considered more robust than regular expressions. The web scraping method we discussed uses the “BeautifulSoup” and “urllib2” libraries in Python (urllib2 is named urllib.request in Python 3). That was a brief beginner’s guide to start using Python for web scraping.
Stay tuned for our next article on how web scraping affects your revenue growth.