With budget allocation deadlines for 2017 just around the corner, it’s the right time to talk about budgets. Now is also the right time to revisit allocations and make the necessary changes to avoid pitfalls later in the year. Spending less and getting more return is the key idea most businesses focus on while allocating budgets. However, certain activities get overshadowed by more prominent ones, which eventually affects cost allocation. One such example is the budget for business intelligence. Most companies tend to allocate a common budget for data analytics without taking into account the important, standalone stages that are part of it. We’re talking about big data acquisition, which is a challenging activity in its own right and deserves an exclusive budget.
Data acquisition deserves more care and focus because the quality of the data is a big deciding factor in the overall ROI of your data analytics project. Your project will only be as good as the data you have. Many companies are unaware of the technical challenges associated with data extraction and settle for something like a DIY tool or a poorly set up in-house crawling solution to harvest data for their business. There is more than one thing wrong with this approach. Here are some reasons why you should allocate a dedicated budget for data acquisition rather than one umbrella budget for the entire analytics project.
As the cost of data acquisition can grow in proportion to the amount of data you require, it’s not a good idea to cram it into your whole data analytics budget. Most Data as a Service (DaaS) providers set costs bound to volume. This is actually a good thing, as you only pay for what you get. However, it’s wise to keep a separate budget for data acquisition that anticipates your variable data requirements over an entire year.
The web is a goldmine of data, but it also takes a lot of effort to derive relevant data from this unstructured pile of information. Since websites don’t follow any standard structure for rendering data on their pages, figuring out how each site stores its data and writing code to fetch it takes skilled technical labour. Here are some of the lesser-known pain points involved in data acquisition.
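To see why non-standard structures demand per-site engineering, consider a minimal sketch in Python. The two HTML snippets and the extraction rules below are hypothetical; real crawlers would use a proper HTML parser rather than regular expressions, but the point stands: the same data point (a price) needs a different rule for every site.

```python
import re

# Two retailers expose the same data point (a price) in completely
# different markup -- hypothetical snippets for illustration.
site_a = '<span class="price">$19.99</span>'
site_b = '<div data-role="cost"><b>USD</b> 19.99</div>'

# Each site needs its own extraction rule; neither generalises to the other.
def parse_site_a(html):
    m = re.search(r'class="price">\$([\d.]+)<', html)
    return float(m.group(1)) if m else None

def parse_site_b(html):
    m = re.search(r'data-role="cost"><b>USD</b>\s*([\d.]+)', html)
    return float(m.group(1)) if m else None

print(parse_site_a(site_a))  # 19.99
print(parse_site_b(site_b))  # 19.99
```

Multiply this per-site effort by the number of target websites and the labour cost becomes clear.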
Irrespective of whether you have an in-house crawling setup or are outsourcing data acquisition to a vendor, you must be aware of how resource-intensive a crawling setup is. As a web crawling setup has to make GET requests on a continuous basis to fetch data from target servers, the process needs multiple high-performance servers fine-tuned for the project to run smoothly. Such servers are also crucial to the quality and completeness of the data, which is not something you want to compromise on. High-performance servers are, as expected, very costly; they make up about 40% of the costs associated with data acquisition.
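The continuous-GET loop at the heart of a crawl can be sketched as follows. This is a simplified, hypothetical skeleton: the `fetch` callable stands in for whatever HTTP client the setup uses (it is injected here so the loop logic can be shown without real network traffic), and the delays are placeholders for a real politeness policy.

```python
import time

def crawl(urls, fetch, delay=1.0, max_retries=3):
    """Continuously fetch pages, retrying with exponential backoff.

    `fetch` is any callable returning a page body (e.g. a thin wrapper
    around urllib.request.urlopen) -- injected for testability.
    URLs that fail every retry are simply absent from the result.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except IOError:
                time.sleep(delay * (2 ** attempt))  # back off on failure
        time.sleep(delay)  # politeness delay between requests
    return results
```

Run at the scale of millions of pages, a loop like this is exactly what keeps fleets of high-performance servers busy around the clock.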
Web crawling is a technically complex task. Identifying the HTML tags in which the data points are enclosed, finding the source files behind AJAX calls, writing programs that are not resource-intensive and smartly rotating IP addresses all require a skilled team of programmers. The costs of hiring and retaining such a team can easily add up to a significant share of the data acquisition project.
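Of the tasks listed above, IP rotation is the easiest to illustrate. A minimal sketch, assuming a hypothetical pool of proxy addresses, is simple round-robin rotation; production setups add health checks, per-domain scheduling and much larger pools, which is where the engineering effort goes.

```python
import itertools

# Hypothetical proxy pool; real setups draw from hundreds of IPs.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
_proxy_cycle = itertools.cycle(PROXIES)

def next_proxy():
    """Round-robin rotation so no single IP hammers the target site."""
    return next(_proxy_cycle)

# Usage: route each outgoing request through next_proxy().
```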
A huge pile of data cannot simply be plugged into a data visualisation system. Data acquired from the web is not immediately compatible with data analytics engines; it has to be in a structured, machine-readable form, and getting it there takes a lot of resources. Giving the data a proper structure requires adding tags to each data point. DaaS vendors use customised programs to format huge data sets so that they are ready to consume. This is another cost-incurring factor in data acquisition.
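What "adding tags to each data point" means in practice can be sketched with a toy example. The field names and rows below are hypothetical; the idea is that raw positional values get labelled and type-normalised so an analytics engine can consume them.

```python
import json

# Raw scraped rows: positional values with no labels or types.
raw = [
    ["Acme Widget", "$19.99", "in stock"],
    ["Foo Gadget", "$5.49", "out of stock"],
]

FIELDS = ["name", "price", "availability"]  # tags added to each data point

def structure(rows):
    """Turn positional rows into labelled, machine-readable records."""
    records = []
    for row in rows:
        rec = dict(zip(FIELDS, row))
        rec["price"] = float(rec["price"].lstrip("$"))  # normalise the type
        records.append(rec)
    return records

print(json.dumps(structure(raw), indent=2))
```

At DaaS scale this labelling and normalisation runs over millions of records, which is why it shows up as a distinct line item in the cost.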
Web crawlers are bound to break at some point because of frequent changes to websites’ structure and design. Such changes mean that a crawler programmed against the old code no longer works. Web crawling service providers use monitoring technologies to spot these changes so that the crawler can be modified promptly, and this promptness is key to the quality of the service. The process is only semi-automated, so it again incurs labour costs along with time. Keeping a web crawler setup in good shape is a demanding task that adds to the cost of data acquisition.
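One simple form such monitoring can take is checking a freshly fetched page for the markup the crawler depends on. The sentinel patterns and pages below are hypothetical, and real monitoring systems are far more sophisticated, but the principle is the same: when an expected marker disappears, the site layout has likely changed and a developer must update the crawler.

```python
# Markup the crawler's extraction rules depend on (hypothetical sentinels).
REQUIRED_MARKERS = ['class="price"', 'class="product-title"']

def layout_changed(html):
    """Return the markers missing from the page; non-empty means trouble."""
    return [m for m in REQUIRED_MARKERS if m not in html]

old_page = '<span class="product-title">X</span><span class="price">$1</span>'
new_page = '<span class="title-v2">X</span><span class="price">$1</span>'

assert layout_changed(old_page) == []                        # all good
assert layout_changed(new_page) == ['class="product-title"'] # redesign detected
```

The detection can be automated like this, but fixing the crawler afterwards cannot, which is why maintenance remains a recurring labour cost.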
As business intelligence and competitive analysis evolve around the newfound resource that is big data, it is unwise to finalise a data analytics budget without factoring in the cost of data acquisition. The ideal course of action is to recognise data acquisition as a distinct process within the big data project and allocate a dedicated budget for it, so that you don’t run out of funds to acquire data.