The notable rise and exponential growth of web data have unlocked new avenues for various sectors. Right from manufacturing units to service sectors, data is an essential component adopted by businesses around the globe to stay relevant to the evolving times. Web data not only holds a goldmine of information about the competition and market but also offers insights that can be used to improve internal processes and operations.
Web scraping extracts targeted online data for further use by an analytics engine or BI tool. The objectives of web scraping vary from one business to another.
Imagine the benefits if we could structure the web data, get rid of the noise, and export them to machine-readable formats. Let’s see how this can be done by using Ruby.
Choice of the coding script
Data extraction and the actual implementation of web scraping aren’t easy. Elementary knowledge of CSS and HTML, together with the right scripting language, will make your journey smoother. Your choice of language plays a crucial role here. Let’s find out why Ruby is creating a buzz in the market.
If you are planning to launch your first web scraping program, Ruby is a reliable scripting language to start with. Several factors account for its popularity among people writing scrapers.
These factors make Ruby a strong option for web scraping. The setup and installation need to be done carefully, because every later step of the extraction depends on them. Here’s a tutorial to help you through the process.
The step-by-step guide
Before we begin, let’s be clear about two points. First, this tutorial is aimed at Mac users; if you use a different machine, the initial setup could be slightly different. Second, the program uses Nokogiri, which turns webpages into ‘Ruby objects’ and so simplifies the web scraping process. With these two points in mind, you can embark on your project.
In this guide, we will be scraping the headlines of the first 100 listings for used cars on OLX.
The setup process
Here are the basic requisites to develop a complete setup for web extraction using Ruby.
Step 1: Installing dependencies
The scraper depends on three Ruby gems: Nokogiri, HTTParty, and Pry.
Since we have already explained a bit about Nokogiri, let’s discuss HTTParty and Pry. HTTParty is a gem which our web scraper will use to send HTTP requests to the pages we’re scraping. We will be using HTTParty to send GET requests, which return the HTML content of the page as a string. For debugging we use Pry, a Ruby gem that opens an interactive console inside the running program. It lets us inspect the fetched page and the variables we create, and is an essential component in this setup.
Run the following commands to install these gems on your computer.
gem install nokogiri
gem install httparty
gem install pry
Step 2: The creation of scraper files
Create a folder named nokogiri_tutorial anywhere on your computer; the desktop is a convenient place. Then open a text editor such as ‘Sublime Text’, or any other editor of your choice, create a new file, and save it in this folder as “web_scraper.rb”. Once you complete these steps, you are ready to work with the dependencies.
Step 3: Sending HTTP requests to the page
Start by creating a variable named ‘page’ and set it to the result of an HTTParty GET request for the page that we’re scraping.
In this case: https://www.olx.in/all-results/q-cars/
After this line, add “Pry.start(binding)” so that the program pauses in a Pry session when it runs. Save the file as web_scraper.rb in the nokogiri_tutorial folder on your desktop, then open the terminal and change into that folder.
Your web scraping program is now ready to run. Run it with the ruby command followed by the file name, web_scraper.rb.
The terminal should now show a Pry prompt. Check that the page was fetched as expected before working on further steps. Before you move on, type ‘exit’ in the terminal to leave Pry and return to the program folder.
Step 4: Moving on to Nokogiri
The objective here is to convert the page of car listings into a Nokogiri object, as that is required for parsing. Create a new variable named “parse_page” and set it to the result of passing the HTML string to Nokogiri::HTML, which is how Nokogiri converts HTML strings into Nokogiri objects. You can leave the Pry line at the bottom of the code.
Next, save the file and run it again with the ruby command. Pry will open automatically; enter the new variable name “parse_page” at the prompt. This will return the OLX page as a Nokogiri object.
Go ahead and create an HTML file in the same folder with the name ‘cars.html’ and copy-paste the results of the parse_page command into this file. This formatted HTML data will come in handy for reference later on.
Before starting with the next step, exit from Pry in your terminal.
Step 5: Data Parsing
Data parsing requires an elementary knowledge of programming. Since you’re looking to extract the headline text of every car listing, the cars.html file will come in handy for cross-checking. Locate the relevant elements on the page using the browser’s ‘inspect element’ tool, or by viewing the page source.
We found that the listings are within a div with the class name ‘content’, so the parsing commands select that div and collect the headline text of each listing into an array named cars_array.
Check the contents of the array each time you run a command. Once parsing is complete, you will have to export the data to a CSV file.
Step 6: Exporting data files to CSV
By the time you reach step 6, you should have completed the scraping process and turned the unstructured page into a structured data set. Let’s now head back to the terminal. Exit Pry if you’re still in it, so that your terminal is in the nokogiri_tutorial folder which contains the scraping program and the cars.html file. Now create an empty CSV file in that folder from the terminal (on a Mac, the touch command followed by a file name does this).
You will now have a blank CSV file to which you can save the data from cars_array. A simple script can write this data to the new CSV file, giving you structured car listings data that is easier to process and manipulate whenever you need it.
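One possible version of that script is sketched below using Ruby’s standard csv library. The file name cars.csv, the ‘headline’ header row, and the sample contents of cars_array are assumptions made for illustration; in the real program cars_array would come from the parsing step.

```ruby
require 'csv'

# Sample data standing in for the array built in step 5.
cars_array = ['Maruti Swift 2015', 'Honda City 2012']

# Hypothetical file name; the file is created if it does not exist.
csv_path = 'cars.csv'

# Write a header row, then one row per headline.
CSV.open(csv_path, 'w') do |csv|
  csv << ['headline']
  cars_array.each { |headline| csv << [headline] }
end

# Read the file back to confirm what was written.
rows = CSV.read(csv_path)
puts rows.inspect
```

Each headline is wrapped in its own array because the csv library treats every appended array as one row.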
Hopefully, this should have given you a rough picture of how you can go about scraping a website using Ruby. It’s time to explore and crawl more complex and challenging sites using this newly mastered skill.