This post is about DIY web scraping tools. If you are looking for a fully customizable web scraping solution, you can add your project to CrawlBoard.
Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. Excel is like the Swiss army knife of data, with its great features and capabilities. Here is how MS Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.
Web queries feature of MS Excel is used to fetch website data to excel and can be extracted into the worksheet easily. It can automatically find tables on the webpage and would let you pick the particular table you need the data from. Web queries can also be handy in situations where an ODBC connection is impossible to maintain, apart from just extracting data from the web pages. Let’s see how web queries work and how you can crawl HTML tables and use MS Excel as a web scraper tool.
Steps to Use MS Excel to Scrape Websites
We’ll start with a simple web query to crawl data from the Yahoo Finance page. This page is particularly easier to crawl and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the tutorial video for using MS Excel as webscraping tool.
A. Create a New Web Query
1. Select the cell in which you want the data to appear
2. Click on Data> From Web
3. The New Web query box will pop up as shown below
4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button
5. Click on the yellow-black buttons next to the table you need to extract data from
6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet
Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting, etc. to better understand or present the data in a simpler way.
B. Customizing the Query
Once you have created a web query, you have the option to customize it according to your requirements. To do this, access web query properties by right-clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will display where you can customize how the web query interacts with the target page. The options here let you change some of the basic things related to web pages like the formatting and redirections.
Apart from this, you can also alter the date range options by right-clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.
Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can make the extracted data auto-refreshing so that your Excel worksheet will update the data whenever the source website changes. You can set how often you need the data to be updated from the source web page in the data range options menu. The auto-refresh feature can enable by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Web Scraping at Scale
Although web data extraction using MS Excel can be a great way to crawl HTML tables from the websites into excel, it is nowhere close to the enterprise web scraping solution. This can prove to be useful if you are collecting data for your college research paper or you are a hobbyist looking for a budgeted way to get your hands on some data. If data for business is your need, you will have to depend on web scraping services with expertise in dealing with web scraping at scale. Outsourcing the complicated web scraping process, will also give you more room to deal with other things that need extra attention.
Also check: Why MS Excel is a Poor Choice for Data Projects.