Scraping is becoming a rather mundane job, with every other organization getting its feet wet for its own data gathering needs. Plenty of crawlers have been built, some open-sourced and others kept internal as in-house utilities. Although crawling might seem like a simple technique at the outset, doing it at a large scale is the real deal. You need a distributed stack to handle huge volumes of data, deliver it with low latency, and deal with failovers. This is still achievable after crossing the initial tech barrier and with continuous optimization. (P.S. Not underestimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)
Focused crawls on a predefined list of sites
However, you land in completely new territory if your goal is to generate clean and usable datasets from these crawls, i.e. to “extract” data in a format that your database can process and use to generate insights. There are two ways of tackling this:
a. site-specific extractors, which give the desired results
b. generic extractors, which result in a few surprises
Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios where you have to pick between the two:
1. Mass-scale crawls; high-level metadata –
Use generic extractors when you have a continuous, large-scale crawling requirement. Large-scale here means crawling sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle for just the document-level information from such crawls, such as the URL, meta keywords, blog or news title, author, date, and article content, which is still enough to be happy with if your requirement is analyzing the sentiment of the data.
A generic extractor case
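For illustration, here is a minimal sketch of what such a generic, document-level extractor could look like, assuming Python with requests and BeautifulSoup. The meta-tag names it checks and the crude body-text fallback are assumptions for the sake of the example, not a prescription.

```python
import requests
from bs4 import BeautifulSoup

def extract_document_metadata(url: str) -> dict:
    """Pull only document-level fields that most pages expose in roughly the same way."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    def meta(*names):
        # Check both <meta name="..."> and <meta property="..."> variants.
        for name in names:
            tag = soup.find("meta", attrs={"name": name}) or soup.find(
                "meta", attrs={"property": name}
            )
            if tag and tag.get("content"):
                return tag["content"]
        return None

    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "keywords": meta("keywords"),
        "author": meta("author", "article:author"),
        "date": meta("article:published_time", "date"),
        # Crude guess at the article body: all paragraph text concatenated.
        "content": " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p")),
    }
```

Notice that nothing in this sketch knows anything about the site it is parsing; that generality is exactly where things start to break down.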
Generic extractors don’t yield accurate results and often mess up the datasets, rendering them unusable. The reason is that programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that merely list blog posts and only extract the ones carrying the complete article? Delineating the article content from the title on a blog page isn’t easy either.
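To make that concrete, one common (and far from foolproof) heuristic is link density: listing pages are mostly anchor text pointing to posts, while article pages carry a lot of plain body text. The sketch below assumes BeautifulSoup and arbitrary thresholds; it illustrates the problem rather than describing how any particular extractor works.

```python
from bs4 import BeautifulSoup

def looks_like_full_article(html: str, min_words: int = 250,
                            max_link_density: float = 0.3) -> bool:
    """Heuristic: articles have plenty of text and little of it sits inside links."""
    soup = BeautifulSoup(html, "html.parser")
    total_words = len(soup.get_text(" ", strip=True).split())
    link_words = sum(len(a.get_text(" ", strip=True).split())
                     for a in soup.find_all("a"))
    if total_words < min_words:
        return False  # too little text to be a complete article
    # Listing/index pages are dominated by anchor text (post titles linking out).
    return link_words / total_words <= max_link_density
```

Every such heuristic has counter-examples, such as navigation-heavy articles or long category descriptions, which is why generic extraction keeps producing surprises.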
To summarize, below is what to expect of a generic extractor.
Pros-
- minimal manual intervention
- low on effort and time
- can work on any scale
Cons-
- data quality compromised
- inaccurate and incomplete datasets
- fewer details, suited only for high-level analyses

Suited for gathering – blogs, forums, news
Uses – Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring
2. Low/mid-scale crawls; detailed datasets –
If precise extraction is the mandate, there’s no getting away from site-specific extractors. But realistically, this is doable only if your scope of work is limited, i.e. a few hundred sites or fewer. Using site-specific extractors, you can extract any number of fields from any nook and corner of the web pages. Most of the time, most pages on a website share similar templates; if not, they can still be accommodated with site-specific extractors.
Designing extractor for each website
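By contrast, a site-specific extractor is essentially a handful of hand-written selectors tied to one template. The sketch below assumes a hypothetical e-commerce product page; every class name in it is made up and would be rewritten for the actual site (and again whenever that site changes its structure).

```python
from bs4 import BeautifulSoup

# Hypothetical selectors for one particular site's product-page template.
PRODUCT_SELECTORS = {
    "name": "h1.product-title",
    "price": "span.price-current",
    "rating": "div.review-summary span.rating-value",
}

def extract_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in PRODUCT_SELECTORS.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node else None
    # Specification table: first cell is the label, second cell is the value.
    specs = {}
    for row in soup.select("table.spec-table tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:
            specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    record["specs"] = specs
    return record
```

The payoff is field-level precision; the price is that this code is useless on any other site and fragile even on the one it was written for.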
Pros-
- High data quality
- Better data coverage on the site
Cons-
- high on effort and time
- site structures keep changing from time to time, and keeping the extractors up to date requires a lot of monitoring and manual intervention
- only for limited scale
Suited for gathering – any data from any domain on any site, be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.
Uses – Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis
Conclusion
Quite obviously, you need both kinds of extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone on the web employs standard data formats (read our post on standard data formats here). However, given the internet's penetration among the masses and the variety of things folks like to do on the web, that is overly futuristic.
So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.
What have your challenges been? Do drop in your comments.