Since the web is an ever-growing reservoir of data that expands with each passing minute, deciding which sites to crawl and how frequently can be tough. When web crawling is your primary tool for acquiring data, there comes a point where you have to make these mission-critical decisions. If you are just starting out with web scraping, figuring these out could cost you time and wasted effort that could easily be avoided. We’re happy to share our experience from crawling thousands of websites belonging to every possible industry to help you out.
Establishing the optimal numbers for your project: Why is it important?
Before starting out with your web data acquisition project, there are a few questions you should ask yourself.
- Is the data requirement time sensitive?
- How frequently is data updated on the target site(s)?
- How reliable are the target site(s) in terms of uptime?
- Will the data from your target site(s) give you a full picture?
These questions should help you identify the key things you should be focusing on while selecting your target sites and crawl frequency.
When determining the number of sites to crawl, it’s generally better to go with a higher number. As you may already know, a web crawler setup can break due to many external causes. Some common scenarios include:
- Target website changing its structure
- Website downtime
- Scheduled maintenance of the website
- Blocking issues
There’s no way to make the crawler setup immune to such scenarios, and the only practical way out is to add more websites to your target site list, so that a failure on one site does not leave you without data.
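To illustrate why a longer target list helps, here is a minimal Python sketch of a crawl loop that keeps going when individual sites fail. The function name `crawl_all` and the `fetch` callable are hypothetical, introduced only for this example; `fetch` stands in for whatever crawler you actually use.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def crawl_all(
    sites: Iterable[str], fetch: Callable[[str], str]
) -> Tuple[Dict[str, str], List[str]]:
    """Crawl every site, keeping results from the ones that succeed.

    `fetch` is a placeholder for your own crawler (an assumption of this
    sketch). It should return the extracted data or raise on failure.
    """
    results: Dict[str, str] = {}
    failed: List[str] = []
    for site in sites:
        try:
            results[site] = fetch(site)
        except Exception:
            # Downtime, blocking, or a layout change: record the failure
            # and move on so the remaining sites still get crawled.
            failed.append(site)
    return results, failed
```

With several sites in the list, one site going down or changing its structure only shrinks the result set instead of emptying it, and the `failed` list tells you which setups need attention.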
When it comes to the crawl frequency, the key determining factor should be how often new data gets added to the sites you’re crawling. If the sites get updated often, you should go for a higher crawl frequency to not miss any data.
A time-sensitive use case will also demand a higher crawl frequency. However, you don’t have to do the hard math since we’ve already determined the optimal crawl frequency and number of sites for different use cases.
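If you do want a rough estimate for a specific site, the update-rate rule can be sketched in Python as below. This is only an illustrative heuristic, not the method behind the recommendations in this article: the function name `suggest_crawl_interval`, the use of the median gap, and the `safety_factor` of 0.5 are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List


def suggest_crawl_interval(
    update_times: List[datetime], safety_factor: float = 0.5
) -> timedelta:
    """Suggest a crawl interval from observed update timestamps.

    Heuristic (an assumption of this sketch): crawl at a fraction of the
    typical gap between updates, so that new data is unlikely to be missed.
    """
    if len(update_times) < 2:
        raise ValueError("need at least two observed update times")
    ordered = sorted(update_times)
    # Gaps between consecutive updates, in seconds.
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    # The median is less sensitive to a single unusually long or short gap.
    return timedelta(seconds=median(gaps) * safety_factor)
```

For a site observed to update once a day, this suggests crawling every 12 hours. A time-sensitive use case could lower `safety_factor` further.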
Price comparison engines now act as the first destination for online shoppers. The value of a price comparison site is that it lets the user compare the price of a product across multiple ecommerce portals from one place. This means you need a certain minimum number of websites in your target list for your price comparison to be of any value to the end user. The crawl frequency should also be very high for the price comparison use case, considering how often ecommerce prices change.
Optimal number of sites to crawl: 5+
Optimal crawl frequency: Daily
Sentiment analysis is typically done as a one-time activity by companies to gauge how their customers respond to a new product or service they have introduced in the market. Another popular application is gathering mass opinion on important world events. The number of sites to crawl here depends entirely on the specific industry you’re targeting. Forums, social media sites and ecommerce portals are commonly used as sources for sentiment analysis.
Optimal number of sites to crawl: 3+
Optimal crawl frequency: Weekly or monthly
The applications of data aggregation are increasing worldwide in every aspect of business. When you are acquiring data for business intelligence, more is generally better. Since business intelligence applications need large volumes of data to be effective, crawling more sites makes sense. Apart from this, it is quite possible that the data available on some sites is skewed, which again calls for a larger target list. You should identify and include more relevant sites in your target list to have exhaustive data at your disposal.
Optimal number of sites to crawl: 10+
Optimal crawl frequency: Weekly
Brand monitoring has become an absolute necessity in today’s customer-centric business world. To keep your customers happy and thus drive the growth of your company, it is essential to track your brand and product mentions across the web. Since brand monitoring is an ongoing activity, the crawl frequency should be high in this case. The number of sites you should crawl will depend on your product and where its users share their opinions. If you are a consumer products brand, the best sources to monitor are ecommerce sites and social media.
Optimal number of sites to crawl: 5+
Optimal crawl frequency: Daily
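For reference, the recommendations above can be collected into a small lookup table. The Python sketch below is one possible way to encode them; the class name `CrawlPlan`, the dictionary keys, and the helper `meets_recommendation` are labels chosen for this example and are not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlPlan:
    min_sites: int   # "5+" in the text means five or more sites
    frequency: str   # recommended crawl frequency, as stated in the text


# Values copied from the recommendations above; key names are illustrative.
RECOMMENDED = {
    "price_comparison": CrawlPlan(5, "daily"),
    "sentiment_analysis": CrawlPlan(3, "weekly or monthly"),
    "business_intelligence": CrawlPlan(10, "weekly"),
    "brand_monitoring": CrawlPlan(5, "daily"),
}


def meets_recommendation(use_case: str, site_count: int) -> bool:
    """Return True if the planned number of sites reaches the minimum."""
    return site_count >= RECOMMENDED[use_case].min_sites
```

A quick check such as `meets_recommendation("business_intelligence", 8)` returns `False`, flagging a target list that is shorter than the suggested ten sites.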
The effectiveness of a web crawling project is heavily affected by the choice of target sites and crawl frequency, so it’s imperative to make informed decisions on both. As a general rule, it’s better to crawl more sites, since more data gives you better coverage and protects you when individual sites fail.