If you deal with data on the web, more often than not you will have chosen to employ an external provider for your data acquisition needs.
Here are the SLAs you’d need to consider when having a DaaS provider do the crawls for you:
1. Crawlability– If you’re in the crawling business, this is the primary attribute to be assured of. Crawls should run smoothly irrespective of the technical variety of the websites, so the crawlers need to be adept at handling the roadblocks and their corresponding workarounds. Here’s a post discussing these roadblocks and this one digging into AJAX pages.
2. Scalability– Crawling might seem like an overrated problem when it’s done for a few web pages, or a couple of sites at most, but the problem changes by an order of magnitude when it needs to be done at scale. Managing multiple clusters, distributing crawls across them, monitoring them, collating the results and then grouping them is where the devils of crawling lie. Make sure your provider can handle the scale you anticipate (look for cues like thousands of sites or millions of pages). Even if your current need is a low-scale arrangement, it’s better to go with a scalable solution so that you have a reasonably thought-out setup at your disposal with all nuts and bolts in place.
3. Data structuring capabilities– Crawling is only half the problem if your requirement is ready-to-use structured data. Every web page is different, and so are the requirements of every project. How detailed your provider can be in extracting information from any nook of the page is something for you to validate. This becomes especially critical when your vendor uses a generic crawler, in which case the number of fields is limited, as opposed to writing custom rules per site, wherein you define the data schema as per your needs. It’s also a good idea to add quality checks at your end, because with web scale and automation there could be surprises.
4. Data accuracy– This is in line with the above point on structuring capabilities. You’d like access to untouched and uncontaminated information from the web pages. Most providers extract data as-is from the site for this very reason: in most cases, even a minor modification might defeat the purpose of extracting the data. However, sometimes you might not want the excess new lines, spaces, tabs, etc. that come from the web page itself, and hence some level of cleaning could be asked for.
5. Data coverage– Crawls can end up with a few pages being missed or skipped for various reasons: the page does not exist, the page times out or takes too long to load, or the crawler simply never got to that page. Although such issues are unavoidable, especially at scale, they can be managed by keeping logs and, at the least, being aware of which ones crept in. Discuss the tolerance levels that you’re comfortable with so that the provider can configure their system accordingly.
6. Availability– Data acquisition, at its core, demands availability of the right data at the right time. Let your provider know beforehand the uptime that you expect. Most providers who run data acquisition as their primary business should be able to guarantee ~99% availability of their data delivery channels.
7. Adaptability– Let’s come to terms with the fact that whichever process you have adopted, from waterfall to agile, requirements do change because of market dynamism. When acquiring data, you might discover that adding more information to the data feeds will give you a competitive edge, or you might simply have become aware of other data sources. How easily your provider can adapt to such dynamics (if at all) is something to check for upfront.
8. Maintainability– Monitoring the pipeline for regular automated feeds is as big a deal as crawling and structuring the data. Although it depends purely on your provider’s business model, it’s better to be aware of what’s included with the project. Given how often websites change, it’s better to employ someone who gets notified of changes and does the fixes, so that your team can avoid the hassle of maintaining it.
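A couple of the checks mentioned in the list, namely cleaning stray whitespace (point 4) and tracking missed pages against an agreed tolerance (point 5), can be done at your end with very little code. The Python below is only a rough sketch under stated assumptions: the function names, the status labels ("ok", "not_found", "timeout", "never_reached") and the shape of the crawl log are all hypothetical, and a real provider’s logs will look different.

```python
import re
from collections import Counter


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and new lines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def coverage_report(expected_urls, crawl_log, tolerance=0.02):
    """Compare the pages you expected against what the crawl log says.

    crawl_log is assumed to map URL -> status string (hypothetical labels:
    "ok", "not_found", "timeout"). URLs absent from the log are treated as
    "never_reached". tolerance is the largest acceptable share of missed pages.
    """
    statuses = {url: crawl_log.get(url, "never_reached") for url in expected_urls}
    missed = {url: status for url, status in statuses.items() if status != "ok"}
    miss_rate = len(missed) / len(statuses) if statuses else 0.0
    return {
        "miss_rate": miss_rate,
        "by_reason": dict(Counter(missed.values())),
        "missed_urls": sorted(missed),
        "within_tolerance": miss_rate <= tolerance,
    }


if __name__ == "__main__":
    # Made-up example: ten expected pages, one timed out, one never crawled.
    expected = [f"https://example.com/page/{i}" for i in range(1, 11)]
    log = {url: "ok" for url in expected[:8]}
    log[expected[8]] = "timeout"

    report = coverage_report(expected, log, tolerance=0.1)
    print(report["by_reason"], report["within_tolerance"])
    print(normalize_whitespace("  Acme \t Widget\n\n Pro  "))
```

Grouping the misses by reason is the useful part: a batch of timeouts and a batch of pages that were never reached usually call for different conversations with your provider.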
Do you think there’s more to this? We welcome your comments.