No other form of technology evolution has added such a huge impetus and impact on business fortunes, as data mining. When done strategically and with a pre-defined plan, it has the capability of uncovering pearls of insight not known to the senior management and decisions makers of the company. The benefits of a visual, easy to understand, and easy to integrate with your company data warehouse have the potential to provide visibility on preferences, patterns and pain points in different departments in business. This helps the managers to devise and develop data backed action points to give a much needed thrust to businesses.
When looking at data mining, we need to look at relational database management systems (RDBMS). This is the core building block that is subjected to data mining to uncover intelligence and insights. When exploring relational databases, two key components include tables and relations. Lets review these in detail now –
Tables – The data in RDBMS in objects called tables. Of course only related data can be stored in one table. So if a table is for customer name, it cannot store order values of the customer.
Relations – If you have 500 customer names and 500 different order values (in two separate tables), how do you know which customer had placed what order value? This is done by relationship – it helps connect multiple tables in a meaningful way.
With the help of structuring, you can fine tune the quality of the database that will be used in the data analysis. With its help you get to clear out noisy data, inaccurate data and inconsistent data. By removing all occurrences of ‘bad data’ what is left behind is the leaned up data that can then be passed through a further preprocessing stage of normalization, generalization and aggregation. Some of the examples of bad data can be
- Salary = “-135” (noisy data – contains incorrect or error data)
- Name = “” (incomplete data – lacks primary attribute of interest)
- Age = “10”, Date of Birth = “10/09/1955” (Inconsistent – two separate instances of the data do not match up)
It is important to clean the data and have the relational database in a meaningful and usable format. When we talk about data structuring it is pertaining to ‘meaningful format’ i.e. The data warehouse requires uniform integration of good quality data, so that the other steps through which it passes later on also deliver good quality output.
Structuring the data involves two key steps –
- De-duplication – As is obvious, this step involves removing duplicate records so that the integrity of the database can be maintained. If same records are present in multiple data sources the next steps of normalization and aggregation will not yield proper results.
- Standardization – Imagine a pile of records that says “Saint Thomas”, “St. Thomas”, or “St Thomas” randomly. From a data mining point of view, all of these should be classified as a single type of entity “St. Thomas”. Thus, data standardization helps to devise and implement business rules around abbreviations, synonyms, patterns, casing, or order matching. This type of data cleaning ensures that redundancies and inconsistencies are wiped out to lead to a better quality data.
Ah! The much loved technique of the data mining expert. With the help of normalization you get to-
- Achieve unambiguous and accurate interpretation of the data and its various relationships
- Ensure the atomicity of the data is preserved at all times.
The first step can be achieved by removing issues with insertion, update or deletion of the data or records. The second step can be achieved when structured data is integrated together sans any ambiguity, duplication, or inconsistencies. Normalization also helps to scale the data of every record so that it is scaled to a clearly defined range. For instance the field income might range from “Rs. 4000” to “Rs. 3,00,000” across multiple records of an enterprise. A data mining expert will help scale the values so that it falls within a prescribed range, to help in further mining and analysis. This scaling can be achieved by
- z-score normalization
- b.min-max normalization
- decimal scaling
With help of data normalization, a data scientist will also be able to ensure optimal mining time by reducing the terabytes of data that might be present in the data warehouse. This not only speeds up the overall data mining process, but also helps to improve TaT of delivery of insights. The way the data is reduced (by statistical representation) ensures that the lower number of records still yield the same analytical output as with the master database.
Difference between the two
Together, data structuring and data normalization help in ensuring that the data you collected is given a sense of semblance so that further analytics and BI can be executed on this ‘spruced up’ data. The obvious advantage and importance of these two steps is apparent – good data leads to good quality business intelligence; the converse too is true. Some other enterprise level benefits these two steps provide in overall data mining process include –
- Redundancies are reduced to improve the performance of the database
- Enhanced data quality and precision
- Better efficiencies in operations
- Intelligent projection and benchmarking for future performance
- Better level of data accessibility
- Better decision making based on quality data
There are certain differences between data structuring and data normalization worth knowing about.
- In the overall data mining preprocessing hierarchy, data structuring comes before data normalization. Thus normalization can be carried out on structured data only. Also the efforts put in during data structuring (data cleaning, de-duplication, formatting tables) will serve as an input during the data normalization stage.
- While data structuring is concerned with the arrangement of data, tables, and records within the database, data normalization is concerned with scaling the data and removing ambiguity and thus preparing it for the next step of passing the data through analytical and BI tools
- In data structuring the formatting is limited to records. Hence all activities at a higher record-level – integrating multiple databases, removing duplicate records, adding new columns of data, etc is part of data structuring. On the other hand, data normalization concerns itself with how the data should look and behave when it is being processed by data mining and analytics tools. Thus formatting the actual values, scaling of values for better analytical relevance and accuracy, is part of data normalization
- With help of use of primary key identification and optimization, data structuring helps to maintain optimal database design. In data normalization this optimized database is processed further for removal of redundancies, anomalies, blank fields, and for data scaling.
- Simply having a structured data is not adequate for good quality data mining. Structured data has to be normalized to remove outliers and anomalies to ensure accurate and expected data mining output.
Both data structuring and data normalization helps in maintaining the overall integrity, uniformity and sanity of the data in the warehouse. With these advanced levels of pre-processing done on the data, getting to the next level of data mining and further analytics for decision making becomes easier and better. If you too have a data warehouse make sure it gets the expert touch of a reputed data mining expert so that the insights that are eventually generated provides stellar results to your business fortunes.