Web scraping is one of the most effective ways to gain business intelligence and insights into your target market. Many companies have substantially improved their overall performance by making data-backed decisions. But the value of those insights depends entirely on the quality of the data you acquire, which makes data quality mission critical.
In theory, anyone can start a website and publish whatever they want on the web. It is up to you to evaluate source websites and make sure they provide trustworthy content that you can benefit from. The data you extract from the web is only as good as the sources you crawl it from, so finding websites with trustworthy data deserves the utmost care when you start out on your big data journey. Here are some things to keep in mind while choosing your sources for web scraping.
Avoid Sites that Discourage Bots
Although it’s technically feasible to crawl and extract data from sites that block automated bots using IP blocking or similar technologies, it is not recommended to include such websites in your list. Apart from the legal risk associated with scraping a site that discourages automated scraping, you also run the risk of losing data if the site implements better blocking mechanisms in the future. Using this kind of website as a source means putting in extra effort to overcome the barriers it puts up, and you might still end up with incomplete or useless data. It is better to leave such sites alone and look for more reliable sources to scrape.
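One common way a site signals that it discourages bots is its robots.txt file. The sketch below is a minimal illustration, not a complete policy check: it assumes the robots.txt text has already been downloaded, and the function name `is_allowed`, the user agent `my-crawler`, and the example.com URLs are hypothetical. It relies only on Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content used purely for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for candidate in ("https://example.com/products/", "https://example.com/private/data"):
        print(candidate, is_allowed(SAMPLE_ROBOTS, "my-crawler", candidate))
```

To check a live site, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file directly. Keep in mind that robots.txt is only one signal: a site's terms of service or active blocking can discourage scraping even when robots.txt is permissive.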
Sites with Too Many Broken Links
Links are the connective tissue of the World Wide Web. It goes without saying that a website with too many broken links is a bad choice as a web scraping source. Broken links are a strong indicator of negligence on the website administrator’s part, which means the data you extract from the site deserves less trust. A web scraping setup that doesn’t handle errors gracefully can also stall when it encounters a broken link, and even a robust one will waste requests and miss pages. These issues can seriously undermine your web scraping plan. In the long run, you are better off with a different source that has similar data and better housekeeping.
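A rough way to put a number on "too many broken links" is to sample some links from a candidate site and measure what fraction of them fail. The following is a sketch under stated assumptions: the helper names `http_status` and `broken_link_ratio` are made up for this example, any status outside the 200 to 399 range counts as broken, and what ratio is acceptable is left for you to decide.

```python
from typing import Callable, Iterable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the request fails outright."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return 0


def broken_link_ratio(
    urls: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> float:
    """Return the fraction of urls whose status is outside the 200-399 range."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    broken = sum(1 for url in url_list if not 200 <= get_status(url) < 400)
    return broken / len(url_list)
```

Passing the status lookup in as a parameter makes it easy to try the logic on canned data before pointing it at a real site. Some servers reject HEAD requests, so a production check might need to fall back to GET.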
Site Design and Navigation
This might sound a bit subjective, since site design has no direct effect on the web scraping process. A web crawler can crawl and extract data from a site regardless of how good it looks to a human visitor. Still, it is generally understood that websites with a clean, simple design and user experience tend to be good and reliable sources of information, while websites with a cluttered, poor user interface often carry low-quality information. If your source website falls short on user experience, it’s a good idea to double check that the data available on it is reliable.
Freshness of the Data
This should be one of the most important criteria while choosing sources for web scraping. The data you acquire has to be fresh and relevant to the current time period to be of any use. If the sources you choose carry old and outdated data, you are putting your business at risk by working with results that no longer reflect reality, and basing business decisions on such data could end in disaster if the problem goes unnoticed. Always look for websites that are regularly updated with fresh and relevant data. If dates are not displayed on the site, you can look for other clues, such as the Last-Modified header the server sends with the HTML document or date metadata in the page source.
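As a minimal sketch of such a freshness check, the function below converts the value of an HTTP Last-Modified header into an age in days. It assumes you have already read the header from the server's response; the function name `age_in_days` is illustrative, and the sample date is arbitrary.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def age_in_days(last_modified: str, now: Optional[datetime] = None) -> float:
    """Return how many days ago the given Last-Modified header value was."""
    modified = parsedate_to_datetime(last_modified)
    if modified.tzinfo is None:
        # Treat a value without usable zone information as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return (current - modified).total_seconds() / 86400


if __name__ == "__main__":
    # Arbitrary sample header value, not taken from any real site.
    print(age_in_days("Wed, 21 Oct 2015 07:28:00 GMT"))
```

Treat the result as a hint rather than proof. Many dynamically generated pages omit this header, and others report the time the page was rendered rather than the time the underlying data last changed.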
Does the Site Get Any Search Engine Love?
Search engines have been getting smarter at identifying great websites over the last few years. Google’s changing algorithms have largely succeeded in pushing bad and spammy websites out of the top of the search engine results pages. This means you can use a site’s search engine rankings to get an idea of how authoritative it is in its niche. If a site appears nowhere in the search results, that may reflect poorly on the credibility of the website and its data. This shouldn’t, however, be the sole reason for rejecting a site as a source for web scraping.
Some reliable sources may not comply with some or all of the points above. The kind of websites you want to use as sources for web scraping will also vary depending on your niche and purpose. Either way, it is always a good idea to double check your sources for their relevance to your data acquisition plans in order to get the best results from your big data investments.