When it comes to web scraping, some programming languages are preferred over others. One of the most popular of these is Python. Besides being one of the easiest languages to learn thanks to its gentle learning curve, it has the advantage of massive developer support, which has led to numerous third-party packages. These packages cover functionality that would be difficult to build with core Python alone: OpenCV for image processing and computer vision, TensorFlow for machine learning, and Matplotlib for plotting graphs, to name a few common examples.
When it comes to web scraping, one of the most commonly used libraries is BeautifulSoup. The library does not fetch data from the internet itself, but once you have the HTML of a webpage, it can help extract specific data points from it. In general, the library is used to extract data points from HTML and XML documents.
Before you go on to write code in Python, you have to understand how BeautifulSoup works. Once you have extracted the HTML content of a webpage and stored it in a variable, say html_obj, you can convert it into a BeautifulSoup object with just one line of code:
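A minimal sketch of that conversion, with a stand-in HTML string in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# html_obj holds the raw HTML of a page (fetched earlier); here we use
# a tiny hypothetical document for illustration
html_obj = "<html><body><h1>Hello</h1></body></html>"

# Parse the HTML into a BeautifulSoup object using the built-in parser
soup_obj = BeautifulSoup(html_obj, "html.parser")
```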
Here html_obj is the HTML data, soup_obj is the resulting BeautifulSoup object, and “html.parser” is the parser used to do the conversion. Once you have the soup_obj, traversing it is straightforward, and because traversal is straightforward, data extraction becomes simple as well.
Let us take an example. Say you need to fetch a data point called the product title that is present on every page of an eCommerce website. You download a single HTML product page from that website and realise that each page has the product name inside a span element whose id is productTitle. So how will you fetch this data from, say, 1000 product pages? You will get the HTML data for each page and fetch the data point in this manner:
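A sketch of that extraction, using a hypothetical snippet of product-page HTML in place of a real download; for 1000 pages you would run the same two lines inside a loop over the pages' HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical product-page fragment; on a real site you would fetch
# each page's HTML first
sample_html = '<span id="productTitle"> Acme Wireless Mouse </span>'

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first element matching the tag name and attributes;
# get_text() extracts the text between its opening and closing tags
title = soup.find("span", {"id": "productTitle"}).get_text().strip()
print(title)  # Acme Wireless Mouse
```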
While this is how you get the textual data inside a given tag, you can fetch data from a tag's attributes as well.
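Attributes of a tag can be read like dictionary keys. A small sketch, with a hypothetical link element:

```python
from bs4 import BeautifulSoup

html = '<a href="/products/42" class="product-link">View product</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
# Square brackets raise a KeyError if the attribute is missing
href = link["href"]    # "/products/42"
# .get() returns None instead, which is safer for optional attributes
rel = link.get("rel")  # None
```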
Now that we have a basic understanding of how a BeautifulSoup object is traversed, let us write some code and see how it works. Using the code snippet below, you can easily scrape data from Zillow, a leading real estate marketplace based in the USA. You can run this code and input the URL of a listing to get the output data in JSON format. Let's understand the code, line by line. First things first, make sure you have Python 3.7 or above installed on your machine. Use pip to install BeautifulSoup (the package is named beautifulsoup4). All the other packages used here come bundled with Python, so you will not need to install anything else. Once done, install a code editor like Atom or VS Code, and you are ready to go.
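The snippet itself is not reproduced here, so below is a hedged reconstruction of the kind of script the article describes. The specific fields pulled from each JSON-LD blob (name, address, and so on) are assumptions for illustration; Zillow's actual markup differs between listings and changes over time.

```python
import json
import ssl
import urllib.request

from bs4 import BeautifulSoup

# Ignore SSL certificate errors that can occur when fetching pages from code
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def parse_listing(html):
    """Pull data points out of a listing page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    # The listing data sits in two <script type="application/ld+json">
    # tags: the first holds some fields, the second holds the rest,
    # hence the i == 0 / i == 1 checks
    for i, tag in enumerate(
        soup.find_all("script", {"type": "application/ld+json"})
    ):
        blob = json.loads(tag.string)
        if i == 0:
            data["title"] = blob.get("name")
            data["address"] = blob.get("address")
        elif i == 1:
            data.update(blob)
    return data


def scrape_listing(url):
    """Download one listing page and parse it."""
    # Send a browser-like User-Agent so the site does not reject the request
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req, context=ctx).read()
    return parse_listing(html)


# Example usage (hits the network, so it is not run here):
# print(json.dumps(scrape_listing(input("Listing URL: ")), indent=2))
```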
Since understanding the code matters, we will start from the very first line. You need the four import statements for specific functionality. Next come three lines starting with “ctx”. These are specifically for ignoring the SSL certificate errors you might face when accessing websites from your code. Next, we take the website URL as input from the user. You could hardcode the URL here instead, or even loop over an array of multiple URLs.
Next, we access the webpage using urllib's Request class. Make sure to add a User-Agent in the headers so the website believes you are using a browser. The reason is that websites are meant to be accessed by browsers, not code, and they may block your IP if they catch you. Once this is done, all the basic steps are complete; next, we convert the HTML into a BeautifulSoup object and prettify it into UTF-8 to handle special characters and symbols on the webpage.
Once this is done, we extract the title, the short details, and other properties by parsing the BeautifulSoup object. As you can see, the script tag with attribute type = application/ld+json contains multiple data points, all stored in JSON format. You will also notice the i == 0 and i == 1 checks. This is because there are two such script tags (with the same attribute) on a page: the first gives us some data points, while the second gives the rest.
Once we have extracted all the data points, you can store them in a JSON file and save it, as we have. You could also upload the data to a site, or even send it to an API if you wanted.
The output JSON should look somewhat like this:
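The exact fields depend on the listing; all values below are hypothetical, shown only to illustrate the shape of the result:

```json
{
    "title": "123 Main St, Springfield",
    "address": {
        "addressLocality": "Springfield",
        "addressRegion": "IL"
    },
    "price": "500000"
}
```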
Using BeautifulSoup for your web scraping needs can be easy, as long as you analyze the HTML pages manually first and decide on the tags that need to be targeted. It works on pages that have no dynamic content and do not sit behind a login page. For more complex web pages, you will need more powerful tools. Our team at PromptCloud helps companies that are looking to leverage data and make data-backed decisions. We not only set up fully automated web scraping engines that run at frequent intervals on the cloud, but also help companies analyze the data to extract trends and other useful information.