The exponential growth of web data has unlocked new avenues for many sectors. From manufacturing to services, businesses around the globe rely on data to stay relevant as markets evolve. Web data not only holds a goldmine of information about the competition and the market but also offers insights that can be used to improve internal processes and operations.
Web scraping extracts targeted online data for further use by an analytics engine or BI tool. Its objectives vary:
- Data extraction is an effective way of advertising your business and promoting products/services.
- Users, consumers, and web visitors can get the desired information about a service or product.
- Companies can gain competitive intelligence about the strategies and plans their competitors have in place to grow market share.
- Brands can gauge the general perception of their brand through social media interactions among people. This helps marketing teams devise and deploy messages tailored to the persona of this audience, thus boosting the likelihood of conversion.
- Businesses can gain more clarity on the needs, pain points, and preferences of their target audience. They can then drive product development in the right direction with this valuable intelligence.
Imagine the benefits if we could structure web data, get rid of the noise, and export it to machine-readable formats. Let’s see how this can be done using Ruby.
Choice of the scripting language
Data extraction and the practical implementation of web scraping aren’t easy. Elementary knowledge of CSS and HTML, along with the right scripting language, will make your journey smoother. Let’s find out why Ruby is a popular choice.
If you are planning to write your first web scraping program, Ruby is a reliable scripting language to start with. The following reasons explain why it is so effective:
- A powerful language: Ruby is a powerful and effective language for web scraping, and it is approachable for first-timers and newbies. (Ruby on Rails is a web framework built on Ruby; this tutorial needs only Ruby itself.)
- Reliable community: Ruby has a strong and dependable developer community. With extensive documentation available, most issues you run into will already have an answer.
- Easy installation: The installation procedure is well documented and fairly easy to follow.
These are some of the factors that make Ruby a strong option for web scraping. Take care with the setup and installation, since every later step depends on them. Here’s a step-by-step tutorial to help you through the process.
The step-by-step guide
Before we begin, let’s be clear about two points. First, this tutorial is aimed at Mac users; if you use a different machine, the initial setup process could be slightly different. Second, the program uses Nokogiri, a gem that parses webpages into Ruby objects and thus simplifies the web scraping process. With these two points in mind, you can embark on your project.
In this guide, we will be scraping the headlines of the first 100 listings on OLX for used cars.
The setup process
Here are the basic requirements for a complete web scraping setup using Ruby.
- Your computer, whether it’s a desktop or a laptop, should have Ruby installed. If you are a Mac loyalist, half the job is done, since macOS has traditionally shipped with Ruby preinstalled.
- You will need a text editor for writing the program. If your computer doesn’t have a suitable built-in option, try downloading Sublime Text, whose features make coding more comfortable.
- You will need a working knowledge of HTML and CSS. If you are planning to master web scraping, this knowledge is crucial for locating the data you want on a page.
- Get familiar with Ruby. A basic grounding is essential here, and online courses can help you build it.
With these requirements in place, it is time to begin the steps themselves.
Step 1: Installing dependencies
The scraper depends on three Ruby gems: Nokogiri, HTTParty, and Pry.
Since we already explained a bit about Nokogiri, let’s discuss HTTParty and Pry. HTTParty is a gem which our web scraper will use to send HTTP requests to the pages we’re scraping. We will be using HTTParty to send GET requests, which return the HTML content of the page as a string. Pry is a Ruby gem used for debugging: it pauses the program and opens an interactive console in which we can inspect variables such as the downloaded page. It does not parse anything itself, but it is an essential aid in this setup.
Run the following commands to install these gems on your computer.
gem install nokogiri
gem install httparty
gem install pry
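Before moving on, it can help to confirm that all three gems actually load. The following is a minimal sketch of such a check; the helper name gem_available? is made up for this illustration and is not part of any gem.

```ruby
# Returns true if the named library can be loaded, false otherwise.
# (Hypothetical helper, written only for this check.)
def gem_available?(name)
  require name
  true
rescue LoadError
  false
end

# Report the status of each gem this tutorial relies on.
%w[nokogiri httparty pry].each do |name|
  status = gem_available?(name) ? 'installed' : "missing (try: gem install #{name})"
  puts "#{name}: #{status}"
end
```

If any gem is reported as missing, rerun the matching install command before continuing.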
Step 2: The creation of scraper files
Create a folder named nokogiri_tutorial anywhere on your computer; the desktop works well. Then open a text editor such as ‘Sublime Text’ (or any other editor of your choice), create a new file, and save it in this folder as “web_scraper.rb”. Once you complete these steps, you are ready to require the dependencies at the top of this file.
Step 3: Sending HTTP requests to the page
Start by creating a variable named ‘page’ and set it to the result of an HTTParty GET request for the page that we’re scraping.
In this case: https://www.olx.in/all-results/q-cars/
After this, add “Pry.start(binding)” on the next line. Save web_scraper.rb, then open the terminal and change into the nokogiri_tutorial folder that contains it (use cd followed by the folder’s path).
Your web scraping program is now ready to run. From the nokogiri_tutorial folder, run it with the command ruby web_scraper.rb.
The terminal should switch into a Pry session, and it is worth inspecting the page variable there to confirm that the HTML was returned before going further. You can then move on to the next step. Before you do, type ‘exit’ in the terminal to leave Pry and return to the prompt in the program folder.
Step 4: Moving on to Nokogiri
The objective here is to convert the HTML of these car listings into a Nokogiri object, as that is crucial for parsing. Create a new variable named “parse_page” and set it to the result of passing the page’s HTML to Nokogiri::HTML, which converts an HTML string into a Nokogiri object. Leave the Pry line at the bottom of the code.
Next, save the file and run the program again. Pry will open automatically; enter “parse_page” at its prompt. This will return the OLX page as a Nokogiri object.
Go ahead and create an HTML file in the same folder with the name ‘cars.html’ and copy-paste the results of the parse_page command into this file. This formatted HTML data will come in handy for reference later on.
Before starting with the next step, exit from Pry in your terminal.
Step 5: Data Parsing
Data parsing requires an elementary knowledge of programming. Since you’re looking to extract the headline text of every car listing, the cars.html file will come in handy for cross-checking. Locate the relevant elements in the page using your browser’s ‘inspect element’ tool, or view the page source.
We found that the listings sit within a div with the class name ‘content’, so the parsing commands select that div, pick out the headline element of each listing inside it, and collect the headline text into an array named cars_array.
Check the contents of the array each time you run the command. Once parsing is complete, you will have to export the data set to a CSV file.
Step 6: Exporting data files to CSV
By step 6, you should have completed the scraping process successfully and turned the unstructured data into a structured data set. Let’s now head back to the terminal. Exit out of Pry if you’re still in it so that your terminal is in the nokogiri_tutorial folder, which contains the scraping program and the cars.html file. Now create an empty CSV file in this folder, for example with the touch command followed by a file name of your choice.
You will now have a blank CSV file to which you can save the data from cars_array. Write a simple script that writes this data to the new CSV file, and you have your structured car listings in a format that is easy to process and manipulate whenever you want.
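One way to write that script is with Ruby’s standard CSV library, as in the sketch below. The write_headlines_csv helper and the ‘headline’ column name are invented for this example, and the file name is whichever one you chose for your CSV file.

```ruby
require 'csv'

# Writes one headline per row to `path`, under a single header column.
# (Hypothetical helper; the column name "headline" is an assumption.)
def write_headlines_csv(headlines, path)
  CSV.open(path, 'w') do |csv|
    csv << ['headline']
    headlines.each { |headline| csv << [headline] }
  end
  path
end
```

Calling write_headlines_csv(cars_array, 'cars.csv'), with cars.csv as an example file name, fills the file with one listing per row. The CSV library quotes any headline that contains a comma, so the rows stay intact when the file is opened in a spreadsheet.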
Hopefully, this has given you a rough picture of how you can go about scraping a website using Ruby. It’s time to explore and crawl more complex and challenging sites using this new skill.