Supervised vs Unsupervised Machine Learning Techniques
Discovering patterns in data with the help of intelligent algorithms is the core idea of machine learning. These discoveries often lead to actionable insights and predictions of various trends, help businesses gain a competitive edge, and sometimes even power new and innovative products. We recently explained the concept of machine learning and how to train a machine learning algorithm in a previous blog post. Since we didn’t go deep into the different types of ML algorithms and how they work, this post explains how machine learning algorithms are classified based on the way they ‘learn’ to make predictions.
At a high level, there are two broad types of machine learning techniques – Supervised and Unsupervised. Let’s look at how they’re different from each other.
Supervised and unsupervised ML techniques
As we mentioned before, supervised and unsupervised ML techniques represent the ‘way’ a machine learning algorithm learns to make predictions.
In supervised learning, the creator of the ML algorithm has a well-defined output that’s expected from the machine. The inputs and their respective outputs are predefined, and the ML algorithm learns to produce the right output for a given input with increasing accuracy over time.
Supervised learning is like learning with a teacher. The teacher, in this case, is the training data set provided to the machine learning system.
While learning with a teacher, the student is told what represents what. For example, you could teach a kid the distinct characteristics of a dog that help distinguish it from other animals, such as:
- Shape of their faces (Long)
- How they sound (Bark)
- Body size (Small to medium)
- Other specific traits (Dogs wag their tail often)
With this data, the kid should be able to identify various breeds of dogs. Every time the kid spots a new and unknown breed of dog, the traits to look for get updated with more data. For example, a pug doesn’t have a long face like most other breeds of dogs, yet it is a dog. This is supervised learning, since we first gave the kid a set of traits to look for and the kid simply refined them with experience.
However, in the case of unsupervised learning, the kid is on his own. He is simply presented with various animals without any hints on what is what. He learns to identify different animals by grouping them on the basis of the traits that are observed. This is unsupervised machine learning in a nutshell.
Simply put, supervised learning is machine learning based on data with expected outcomes whereas in the case of unsupervised machine learning, the ML system learns to identify patterns from the data on its own.
Supervised machine learning
Most practical applications of machine learning use supervised learning. In supervised learning, you define the input variable (X) and the output variable (Y) and let an algorithm learn how to map the input to the output.
This can be defined as Y = f(X)
The idea is to make the machine good enough at this mapping that it can accurately predict the output variable (Y) for any new input data you throw at it. Training stops once the algorithm reaches an acceptable level of accuracy.
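To make the mapping Y = f(X) concrete, here is a minimal sketch of the supervised workflow: learn from labeled examples, then check accuracy on new inputs. The weights, the labels and the midpoint-threshold "learner" are all illustrative assumptions, not a method described in this post.

```python
# Toy supervised learning: learn f so that Y = f(X) from labeled examples.
# The data and the midpoint-threshold learner are illustrative assumptions.

def fit_threshold(xs, ys):
    """Learn a cut-off halfway between the mean input of each class."""
    small = [x for x, y in zip(xs, ys) if y == "small"]
    big = [x for x, y in zip(xs, ys) if y == "big"]
    return (sum(small) / len(small) + sum(big) / len(big)) / 2


def predict(threshold, x):
    """The learned mapping f: weight in kg -> category."""
    return "big" if x > threshold else "small"


def accuracy(threshold, xs, ys):
    """Fraction of inputs whose prediction matches the expected output."""
    hits = sum(predict(threshold, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Labeled training data: input X (weight in kg) and expected output Y.
train_x = [2.0, 3.0, 4.0, 20.0, 25.0, 30.0]
train_y = ["small", "small", "small", "big", "big", "big"]
threshold = fit_threshold(train_x, train_y)

# Unseen inputs with known answers, used only to measure accuracy.
test_x = [1.5, 5.0, 18.0, 40.0]
test_y = ["small", "small", "big", "big"]
test_accuracy = accuracy(threshold, test_x, test_y)
```

In practice you would keep training or tuning until the accuracy on such held-out data is acceptable for your use case.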
Supervised learning can further be grouped into classification and regression problems:
Classification: A classification problem has an output variable that’s a category, such as “big”, “medium” or “small”, or “red” or “green”.
Regression: In a regression problem, the output variable is a real, continuous value, such as a weight in kilograms or a price in dollars.
Some of the popular supervised machine learning algorithms are:
Linear regression
Regression algorithms are primarily meant for detecting statistical dependencies between numerical variables. The linear regression model tries to find the straight line that best approximates your data. When this approximation is a good fit, you can predict the value of the dependent variable for any value of the independent one. This way, the algorithm can be used to model the linear relationship between two numerical columns in your input dataset. For example, you can use linear regression to predict sales in the coming year by using historical data as input, or to project the number of people who will visit your website based on seasonal trends.
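As a rough illustration of the sales example, the sketch below fits a straight line with ordinary least squares and uses it to project the next year. The yearly sales figures are made up, and `fit_line` is a hypothetical helper written for this post, not part of any library.

```python
# Simple linear regression (ordinary least squares) with one input variable.
# The sales figures below are hypothetical.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


years = [1, 2, 3, 4, 5]                       # independent variable
sales = [112.0, 118.0, 131.0, 139.0, 150.0]   # dependent variable

slope, intercept = fit_line(years, sales)

# Project sales for year 6 from the fitted line.
forecast = slope * 6 + intercept
print(f"sales = {slope:.2f} * year + {intercept:.2f}; year 6 forecast: {forecast:.1f}")
```

A straight line is only a sensible model when the relationship is roughly linear, so it is worth plotting the data before trusting a forecast like this.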
Random forest
Random Forest is pretty much the Swiss Army knife of data science algorithms. On a lighter note, when you can’t think of a particular algorithm for your problem, go for random forest. Random Forest is another example of a supervised machine learning algorithm, most often used for classifying data points into groups. This is especially useful for large datasets with a high number of variables, where it becomes difficult to sort the data manually while taking all variables into account.
Due to its versatile nature, this machine learning algorithm can be used for both regression and classification tasks. It also helps with data exploration: it ranks features by importance, which can guide dimensionality reduction, and it is fairly robust to outliers; some implementations can handle missing values as well. Random Forest is an ensemble learning method in which many decision trees, each a relatively weak model on its own, are combined to act as a strong model.
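The ensemble idea can be sketched with simple majority voting. This is not a random forest: a real one trains many decision trees on random samples of the data, whereas the three hand-written rules and the animal data below are hypothetical stand-ins for weak models, reusing the dog example from earlier. The sketch only shows the combining step.

```python
from collections import Counter

# Hypothetical labeled animals: observed traits and the expected answer.
ANIMALS = [
    ({"barks": True,  "long_face": True,  "wags_tail": True},  "dog"),    # labrador
    ({"barks": True,  "long_face": False, "wags_tail": True},  "dog"),    # pug
    ({"barks": False, "long_face": True,  "wags_tail": False}, "other"),  # fox
    ({"barks": True,  "long_face": False, "wags_tail": False}, "other"),  # seal
    ({"barks": False, "long_face": False, "wags_tail": True},  "other"),  # cat
]


# Three weak models: each looks at a single trait and is wrong sometimes.
def rule_bark(animal):
    return "dog" if animal["barks"] else "other"


def rule_face(animal):
    return "dog" if animal["long_face"] else "other"


def rule_tail(animal):
    return "dog" if animal["wags_tail"] else "other"


WEAK_MODELS = [rule_bark, rule_face, rule_tail]


def majority_vote(animal):
    """Combine the weak models: the most common prediction wins."""
    votes = Counter(model(animal) for model in WEAK_MODELS)
    return votes.most_common(1)[0][0]


def accuracy(predict):
    hits = sum(predict(features) == label for features, label in ANIMALS)
    return hits / len(ANIMALS)


individual = [accuracy(model) for model in WEAK_MODELS]
combined = accuracy(majority_vote)
print("individual:", individual, "combined:", combined)
```

On this toy data each rule makes at least one mistake, but the rules make different mistakes, so the vote corrects them. That is the intuition behind combining many imperfect trees.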
Support vector machines
A support vector machine (SVM) is another supervised machine learning algorithm that can be used for regression or classification problems. In SVM, each data item is plotted as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that separates the two classes in the best way, that is, with the widest possible margin between the hyperplane and the nearest points of each class.
SVM is typically used for text classification tasks such as spam detection, sentiment analysis and category assignment. It is also useful in image recognition projects that rely on color-based classification and aspect-based recognition. Another notable application is handwritten digit recognition, which is useful in automating postal services.
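To illustrate what separating the classes "in the best way" means, the sketch below compares two hand-picked candidate hyperplanes (lines, in two dimensions) and keeps the one with the larger margin. The points and both candidates are hypothetical, and a real SVM searches over all hyperplanes by solving an optimization problem rather than comparing two guesses.

```python
import math

# Hypothetical 2-D training points with class labels +1 and -1.
POINTS = [
    ((4.0, 4.0), 1),
    ((5.0, 3.0), 1),
    ((1.0, 1.0), -1),
    ((2.0, 0.0), -1),
]


def decision(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1


def margin(w, b):
    """Smallest signed distance from any training point to the hyperplane.

    Negative if at least one point is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * decision(w, b, x) / norm for x, label in POINTS)


# Two candidate hyperplanes (w, b), chosen by hand for illustration.
CANDIDATES = {
    "A": ((1.0, 1.0), -5.0),  # the line x1 + x2 = 5
    "B": ((1.0, 0.0), -3.0),  # the line x1 = 3
}

best = max(CANDIDATES, key=lambda name: margin(*CANDIDATES[name]))
w, b = CANDIDATES[best]
print("best candidate:", best, "margin:", round(margin(w, b), 3))
print("new point (6, 6) ->", classify(w, b, (6.0, 6.0)))
```

Both candidates separate the training points, but the wider margin leaves more room for new points that fall near the boundary.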
Unsupervised machine learning
In unsupervised machine learning, there is only the input data (X), and no corresponding output variables are defined. The idea here is to reveal the underlying distribution or structure of the data without placing restrictions on the model. In unsupervised machine learning, there are no correct answers, just as there is no teacher. The algorithms are left on their own to discover and present interesting structures in the data.
Unsupervised learning can further be grouped into Clustering and Association problems:
Clustering: In a clustering challenge, you are basically trying to discover the underlying groupings in the data, such as grouping customers by their shopping behavior.
Association: In an association problem, the goal is to identify rules that describe large portions of the data, such as “people who bought iPhones also tend to buy battery packs”.
Popular examples of unsupervised algorithms are:
K-means clustering
K-means clustering is an unsupervised machine learning algorithm used in situations where the data you have is unlabeled (data with undefined groups or categories). The algorithm is meant for identifying groups in the data, where the number of groups is denoted by the variable K. K-means works by assigning each data point to one of K groups based on the provided features: each point joins the group whose center is nearest, the centers are then recomputed as the mean of their points, and the two steps repeat until the groups stop changing.
Simply put, K-means clustering reveals undefined groups from unlabeled data. This is especially useful for confirming business assumptions from large and complex datasets. Once the algorithm has run and the groups are defined, a new data point can easily be assigned to the group with the nearest center.
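A minimal K-means sketch is shown below for K = 2. The customer data (visits per month, average spend) and the starting centers are assumptions chosen for illustration; practical implementations usually pick the starting centers randomly and run several times.

```python
# Minimal K-means: alternate between assigning points to the nearest center
# and moving each center to the mean of its points.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabeled customers: (visits per month, average spend).
customers = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0),
             (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # K = 2 starting centers (assumed)

final_centroids, final_clusters = k_means(customers, initial)

# A new customer is simply assigned to the nearest final centroid.
new_customer_group = nearest((7.0, 8.0), final_centroids)
```

Note that the algorithm never sees a label: the two groups emerge purely from how close the points are to one another.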
Apriori
Apriori is a classic unsupervised machine learning algorithm used for mining frequent itemsets and relevant association rules. It is ideal for databases with a large number of transactions, such as items bought by customers from a store.
The apriori principle cuts down the number of itemsets that need to be examined. The principle states that if an itemset is not frequent, none of its supersets can be frequent either; equivalently, every subset of a frequent itemset is itself frequent. Being exceptionally good at association rule mining, the apriori algorithm is widely used by retail companies.
The interesting outcomes of association rule learning can be understood from the beer-diapers story. A retail store analyzed its data and found that young American males who bought diapers on Friday afternoons also tended to buy beer. The store then placed the beer aisle close to the diaper aisle and, as expected, beer sales went up.
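The pruning idea can be sketched in a few lines. The shopping baskets and the minimum support of 2 are assumptions for illustration, and this compact version favors readability over the efficiency tricks real implementations use.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori principle): a candidate with any infrequent
        # (k-1)-subset cannot be frequent, so it is never counted.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]

frequent = apriori(baskets, min_support=2)

# Confidence of the rule "diapers -> beer": of the baskets containing
# diapers, the fraction that also contain beer.
confidence = (frequent[frozenset({"diapers", "beer"})]
              / frequent[frozenset({"diapers"})])
```

With these baskets, neither three-item candidate is ever counted: each one contains a pair that already failed the support threshold, so the prune step discards it.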
This probably indicates that raising kids can be grueling and parents imprudently turned to beer to relieve their stress. Anyway, this story is a perfect example of association rules in machine learning.
Machine learning is helping businesses achieve never-before-seen levels of efficiency and paving the way for new technological innovations. Since the data available on the web is growing in quantity and quality with each passing minute, machine learning technologies are well placed to uncover valuable insights from these datasets. If you are looking to unlock the true potential of the data at your disposal, getting familiar with these machine learning techniques is essential.