Did you know that there are 12 factors to be considered while acquiring data from the web? If no, fret not! Download our free guide on web data acquisition to get started!
Microsoft Excel is undoubtedly one of the best scraping tools python to manage information in a structured form. Excel is like the Swiss army knife of data, with its great features and capabilities. Here is how MS Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.
Web queries feature of MS Excel is used to fetch website data to excel and can be extracted into the worksheet easily. It can automatically find tables on the webpage and would let you pick the particular table you need the data from. Web queries can also be handy in situations where an ODBC connection is impossible to maintain, apart from just extracting data from the web pages. Let’s see how web queries work and how you can crawl HTML tables and use MS Excel as the best scraping tools python.
We’ll start with a simple web query to crawl data from the Yahoo Finance page. This page is particularly easier to crawl and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the tutorial video for using MS Excel as web scraping tool.
1. Select the cell in which you want the data to appear
2. Click on Data> From Web
3. The New Web query box will pop up as shown below
4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button
5. Click on the yellow-black buttons next to the table you need to extract data from
6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet
Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting, etc. to better understand or present the data in a simpler way.
Once you have created a web query, you have the option to customize it according to your requirements. To do this, access web query properties by right-clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will display where you can customize how the web query interacts with the target page. The options here let you change some of the basic things related to web pages like the formatting and redirections.
Apart from this, you can also alter the date range options by right-clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.
Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can make the extracted data auto-refreshing so that your Excel worksheet will update the data whenever the source website changes. You can set how often you need the data to be updated from the source web page in the data range options menu. The auto-refresh feature can enable by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Although web data extraction using best scraping tools python can be a great way to crawl HTML tables from the websites into excel, it is nowhere close to the enterprise web scraping solution. This can prove to be useful if you are collecting data for your college research paper or you are a hobbyist looking for a budgeted way to get your hands on some data. If data for business is your need, you will have to depend on web data extraction service with expertise in dealing with web scraping at scale. Outsourcing the complicated web scraping process, will also give you more room to deal with other things that need extra attention.
Also check: Why MS Excel is a Poor Choice for Data Projects.