When it comes to web scraping, some programming languages are preferred over others. One of the most popular of these is Python. Besides being one of the easiest languages to learn thanks to its gentle learning curve, it has the advantage of massive developer support, which has led to numerous third-party packages. These packages cover functionality that would be difficult to build with core Python alone: OpenCV for image processing and computer vision, TensorFlow for machine learning, and Matplotlib for plotting graphs, to name a few common examples.
When it comes to web scraping, one of the most commonly used libraries is BeautifulSoup. The library does not fetch data from the internet itself, but once you have the HTML of a webpage, it can help extract specific data points from it. In general, the library is used to extract data points from HTML and XML documents.
Before you go on to write code in Python, you have to understand how BeautifulSoup works. Once you have extracted the HTML content of a webpage and stored it in a variable, say html_obj, you can convert it into a BeautifulSoup object with just one line of code:
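A minimal sketch of that conversion, with a stand-in HTML string in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# html_obj holds the raw HTML of a page (fetched earlier); here we use
# a tiny hypothetical document for illustration
html_obj = "<html><body><h1>Hello</h1></body></html>"

# Parse the HTML into a BeautifulSoup object using the built-in parser
soup_obj = BeautifulSoup(html_obj, "html.parser")
```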
Here html_obj is the HTML data, soup_obj is the resulting BeautifulSoup object, and “html.parser” is the parser used to do the conversion. Once you have the soup_obj, traversing it is straightforward, and because traversal is straightforward, data extraction becomes simple as well.
Let us take an example. Say you need to fetch a data point called the product title that is present on every page of an eCommerce website. You download a single HTML product page from that website and realise that each page has the product name inside a span element whose id is productTitle. So how will you fetch this data from, say, 1000 product pages? You will get the HTML data for each page and fetch the data point in this manner:
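A sketch of that extraction, using a hypothetical snippet of product-page HTML in place of a real download; for 1000 pages you would run the same two lines inside a loop over the pages' HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical product-page fragment; on a real site you would fetch
# each page's HTML first
sample_html = '<span id="productTitle"> Acme Wireless Mouse </span>'

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first element matching the tag name and attributes;
# get_text() extracts the text between its opening and closing tags
title = soup.find("span", {"id": "productTitle"}).get_text().strip()
print(title)  # Acme Wireless Mouse
```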
While this is how you get the textual data inside a given tag, you can fetch data from a tag's attributes as well.
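Attributes of a tag can be read like dictionary keys. A small sketch, with a hypothetical link element:

```python
from bs4 import BeautifulSoup

html = '<a href="/products/42" class="product-link">View product</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
# Square brackets raise a KeyError if the attribute is missing
href = link["href"]    # "/products/42"
# .get() returns None instead, which is safer for optional attributes
rel = link.get("rel")  # None
```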
Now that we have a basic understanding of how a BeautifulSoup object is traversed, let us write some code and see how it works. Using the code snippet below, you can easily scrape data from Zillow, a leading real estate marketplace based in the USA. You can run this code and input the URL of a listing to get the output data in JSON format. Let's understand the code, line by line. First things first, make sure you have Python 3.7 or above installed on your machine. Use pip to install BeautifulSoup (the package is named beautifulsoup4). All the other packages used here come bundled with Python, so you will not need to install anything else. Once done, install a code editor like Atom or VS Code, and you are ready to go.
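The snippet itself is not reproduced here, so below is a hedged reconstruction of the kind of script the article describes. The specific fields pulled from each JSON-LD blob (name, address, and so on) are assumptions for illustration; Zillow's actual markup differs between listings and changes over time.

```python
import json
import ssl
import urllib.request

from bs4 import BeautifulSoup

# Ignore SSL certificate errors that can occur when fetching pages from code
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def parse_listing(html):
    """Pull data points out of a listing page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    # The listing data sits in two <script type="application/ld+json">
    # tags: the first holds some fields, the second holds the rest,
    # hence the i == 0 / i == 1 checks
    for i, tag in enumerate(
        soup.find_all("script", {"type": "application/ld+json"})
    ):
        blob = json.loads(tag.string)
        if i == 0:
            data["title"] = blob.get("name")
            data["address"] = blob.get("address")
        elif i == 1:
            data.update(blob)
    return data


def scrape_listing(url):
    """Download one listing page and parse it."""
    # Send a browser-like User-Agent so the site does not reject the request
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req, context=ctx).read()
    return parse_listing(html)


# Example usage (hits the network, so it is not run here):
# print(json.dumps(scrape_listing(input("Listing URL: ")), indent=2))
```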
Since understanding the code matters, we will start from the very first line. You need the four import statements for specific functionality. Next come three lines starting with “ctx”. These are specifically for ignoring the SSL certificate errors you might face when accessing websites from your code. Next, we take the website URL as input from the user. You could hardcode the URL here instead, or even loop over an array of multiple URLs.
Next, we access the webpage using urllib's Request class. Make sure to add a User-Agent in the headers so the website believes you are using a browser. The reason is that websites are meant to be accessed by browsers, not code, and they may block your IP if they catch you. Once this is done, all the basic steps are complete; next, we convert the HTML into a BeautifulSoup object and prettify it into UTF-8 to handle special characters and symbols on the webpage.
Once this is done, we extract the title, the short details, and other properties by parsing the BeautifulSoup object. As you can see, the script tag with attribute type = application/ld+json contains multiple data points, all stored in JSON format. You will also notice the i == 0 and i == 1 checks. This is because there are two such script tags (with the same attribute) on a page: the first gives us some data points, while the second gives the rest.
Once we have extracted all the data points, you can store them in a JSON file and save it, as we have. You could also upload the data to a site, or even send it to an API if you wanted.
The output JSON should look somewhat like this:
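The exact fields depend on the listing; all values below are hypothetical, shown only to illustrate the shape of the result:

```json
{
    "title": "123 Main St, Springfield",
    "address": {
        "addressLocality": "Springfield",
        "addressRegion": "IL"
    },
    "price": "500000"
}
```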
Using BeautifulSoup for your web scraping needs can be easy, as long as you analyze the HTML pages manually first and decide on the tags that need to be targeted. It works on pages that have no dynamic content and do not sit behind a login page. For more complex web pages, you will need more powerful tools. Our team at PromptCloud helps companies that are looking to leverage data and make data-backed decisions. We not only set up fully automated web scraping engines that run at frequent intervals on the cloud, but also help companies analyze the data to extract trends and other useful information.