The term “big data” covers more than the collection of large quantities of interconnected and interdependent data; it also encompasses the fast handling, processing, and analysis of that data. Most companies are not ready for this, and most service providers have yet to take on the challenge fully. The result is a wide gap between demand and supply, and a colossal opportunity for those in the service market to make a neat profit by providing these services to everyone who needs them. However, that is easier said than done.
The service industry started by providing software to help businesses, moved on to providing platforms and ready-made infrastructure, and is now helping companies move to the cloud. In all these cases, the issues it faced could be solved once and learned from, so that a similar problem took less effort the next time. Big-data projects, however, are almost never similar from one organization to the next: some collect data from thousands of sensors; for some, the data is paper records collected over decades; for others, it is digitally stored documents, pictures, or even sound and video recordings. What these companies want to make of the data also varies.
With such a variety of structured and unstructured data and such an assortment of problem statements, every problem and every client is different and needs a custom effort and approach. A specialized team is needed; service companies can’t just build a team of mass-recruited freshers with basic skill sets.
In this initial stage, business analysts and data scientists have to sit down together and decide which business problem will be solved, which metrics are expected to improve, and which dataset will be used. Unless this is done at the very beginning, roadblocks and confusion will set in at later stages.
Although this might not seem to be a part of big data analysis, it is an important one. Without the data, what would you even analyze? Most companies and organizations have saved petabytes of data, but mostly in unstructured formats, with duplicate entries and other errors. The first service required is data collection, followed by data cleansing, since we all know the bane of dirty data. When we hear “big data”, the first things that come to mind are complex models and colorful inferences in the form of 3D graphs. The reality is far from it. By commonly cited estimates, data scientists spend 60-80% of total project time preparing their data, cleaning it, and storing it in an organized fashion.
Indeed, most data scientists find cleaning and preparing data the least enjoyable part of their work, yet it is the most important one. Unless your data is spick and span, it is more or less guaranteed that your inferences won’t be first grade either. From Excel to Python or R, there are several ways to clean and structure data so that it can be used as required later on. When there is more than one source of data, say a company collects data from both video feeds and sensors, there has to be a point where the data meet, or where one dataset complements the other. For this, the data has to be structured properly, which is part of the cleaning phase as well. It is important that all the collected data, even when it comes from more than one source, supports the same inference or at least points in the same direction.
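To make the cleaning step concrete, here is a minimal sketch in plain Python. It is an illustration, not a prescribed pipeline. The field names (`name`, `value`, `ts`, `temp`, `people`) and both helper functions are hypothetical. The sketch assumes records arrive as dictionaries, and that two sources such as sensors and a video feed can "meet" on a shared key such as a timestamp.

```python
from typing import Iterable


def clean_records(records: Iterable[dict]) -> list[dict]:
    """Normalize names, drop incomplete or non-numeric rows, remove duplicates.

    Assumes each record has hypothetical 'name' and 'value' fields.
    The order of first occurrences is preserved.
    """
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip().lower()
        value = rec.get("value")
        if not name or value is None:
            continue  # incomplete row
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue  # value is not numeric
        key = (name, value)
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        cleaned.append({"name": name, "value": value})
    return cleaned


def join_on_key(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Combine rows from two sources that share the same value for `key`.

    Rows from `left` without a match in `right` are dropped.
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
```

For example, `join_on_key(sensor_rows, video_rows, "ts")` would keep only the timestamps that both sources recorded, which is one simple way to find the point where the two datasets meet.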
It is at this step that the so-called “magic” happens. Different models are built and the data is split into training and test sets. Then, with much difficulty and many trips around the same problem in an effort to increase accuracy, the team has to converge on the one model it considers ideal for the problem at hand. It might also happen that more than one model is used and the most common result is chosen. It is a test-and-retest phase where experience helps more than theory.
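The mechanics described above (splitting the data, measuring accuracy, and taking the most common result from several models) can be sketched with the standard library alone. This is only an illustration under simple assumptions: the function names are hypothetical, labels are plain Python values, and a real project would more likely rely on a dedicated machine-learning library.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_vote(predictions):
    """Return the most common label among several models' predictions.

    Ties go to the label that was encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]
```

A team comparing candidate models would fit each one on the training rows, score it with `accuracy` on the test rows, and either keep the best scorer or combine several models with `majority_vote`.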
What looks good in a book might not work in real life. A modeling algorithm rarely strikes gold on the first go. The model has to be closely monitored and its results documented and stored, so that it can be retrained continuously and keep getting better. Other optimizations might also be needed from time to time, as the data science team sees fit.
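One simple way to monitor a deployed model is to keep a rolling record of whether its recent predictions were right, and to flag it for retraining when accuracy drops. The sketch below assumes that the true outcome eventually becomes known for each prediction. The class name, window size, and threshold are all hypothetical choices, not recommendations.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)  # old results fall off the left
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.results.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy, or None if nothing has been recorded yet."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        """True when rolling accuracy has fallen below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold
```

Logging the value of `accuracy()` at regular intervals also produces the documented history of results that later retraining decisions can draw on.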
Like any other software product, the model needs maintenance, to make sure that it does not train itself on junk in the incoming data and that it can adapt to new changes in the data stream.
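As a rough sketch of this kind of maintenance, the snippet below filters obviously invalid values out of an incoming batch and flags a batch whose mean has shifted far from a reference sample. The function names, the valid range, and the three-standard-deviation cutoff are illustrative assumptions. Real drift detection is usually more involved.

```python
from statistics import mean, stdev


def filter_junk(batch, low, high):
    """Keep only numeric values inside the expected [low, high] range."""
    return [
        x for x in batch
        if isinstance(x, (int, float))
        and not isinstance(x, bool)
        and low <= x <= high  # also rejects NaN, which fails every comparison
    ]


def has_drifted(reference, batch, max_shift=3.0):
    """True if the batch mean lies more than `max_shift` reference
    standard deviations away from the reference mean.

    `reference` needs at least two values; `batch` needs at least one.
    """
    ref_mean = mean(reference)
    ref_sd = stdev(reference)
    if ref_sd == 0:
        return mean(batch) != ref_mean
    return abs(mean(batch) - ref_mean) / ref_sd > max_shift
```

A batch that passes `filter_junk` but trips `has_drifted` is a cue for a person to look at the data stream before the model is retrained on it.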
Data science is a relatively new domain, and it is highly unlikely that companies around the globe will try to build their own data-science teams from scratch. The challenges start right at the hiring process, since you need people with a specialized skill set and a bit of experience as well. You might even need help from someone who has put together a data-science team before. Sounds almost like putting together a task force for the military, right? Tackling big data is no less challenging, let me assure you. This leaves a lot of playing field for service providers, and it is time for them to train capable individuals within their organizations and seize the day.