Expert Insights: The Basics of Machine Learning

Saket Chaturvedi | August 21, 2019

Machine learning has been around for quite some time and we see or use it knowingly or unknowingly in our daily lives. The best example comes the moment when we open our emails – THE SPAM FILTER! It saves you a lot of time by automatically keeping the most important emails in your inbox and moving the suspicious ones to your spam folder. Let’s look at how machine learning is defined, how it helps our everyday processes, and the different types of machine learning.

How do you define machine learning (ML)?

Machine learning is a method of data analysis that automates analytical model building. Your system can learn from data and behaviors, identify patterns, and make decisions with minimal human intervention, which allows you to get the most out of the application you are using.

A more general definition – “Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel, 1959.

Okay, I know what machine learning is, but where can it help?

ML has applications in a wide spectrum of industries such as telecom, healthcare, IT, automobiles, etc.
ML helps you predict the outcome based on the patterns identified in your data.
ML algorithms learn from data and can keep learning as new data arrives, which can make them more accurate over time.
ML can operate over high-dimensional data and provides methods to reduce its complexity and get the most out of it.
ML can help you automate tasks which would otherwise require time and manual work. An example would be email classification, as we saw previously.
ML algorithms can help you understand your customers better by analyzing their data and giving you detailed insights, which in turn helps you provide much better services or products to them.

Are there types of machine learning?

In short, YES, there are different types of machine learning systems which can be categorized as:

Supervised and Unsupervised
Batch and Online
Instance and Model-based
Reinforcement Learning

Supervised and unsupervised learning

A supervised machine learning system learns from labeled examples, which usually have to be prepared by people, while an unsupervised system learns from unlabeled data. The data required for training a supervised machine learning model contains the desired output (labels). The algorithm learns using these labels and makes predictions on new data accordingly.

A typical example of a supervised learning task would be predicting categorical classes (Spam/Ham) or a numerical output such as house prices, salaries, etc. A supervised system would require considerable labeled data to train and identify appropriate patterns to predict accurate results.
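As a rough illustration of learning from labels, here is a minimal sketch of a supervised classifier. It uses a nearest-centroid rule, which is a swapped-in technique chosen for brevity and not one of the algorithms listed in this article. The two features (number of links, number of exclamation marks) and all data values are hypothetical.

```python
from math import dist

# Hypothetical labeled training data:
# (number of links, number of exclamation marks) -> label
training_data = [
    ((8, 6), "spam"),
    ((7, 9), "spam"),
    ((9, 7), "spam"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
    ((2, 1), "ham"),
]

def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in total)
        for label, total in sums.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(training_data)
print(predict(centroids, (6, 8)))  # many links and exclamation marks
print(predict(centroids, (1, 1)))  # few links and exclamation marks
```

The labels are the only reason the classifier knows what "spam" looks like; without them, the same feature vectors could only be grouped, not named.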

Some of the supervised learning algorithms are:

K-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines
Decision Trees & Random Forests
Neural Networks (although some architectures can be unsupervised too)

Unsupervised systems don’t need labeled data. They try to find out the underlying patterns in the supplied data. Some of the algorithms for unsupervised learning include:

Clustering – K-Means, Hierarchical, Expectation Maximization
Visualization and Dimensionality Reduction – Principal Component Analysis, Kernel PCA, etc.
Association Rule Learning – Apriori, Eclat

An example of unsupervised learning would be the segmentation of customers in a store. A clustering algorithm would help to sort the visitors into groups of similar customers based on the data, without being told in advance what those groups are. Another example is dimensionality reduction, which aims to simplify the data without losing too much information, for example by merging two correlated variables into a single new feature. This is called feature extraction.
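The customer segmentation idea can be sketched with a small K-Means implementation. This is a simplified one-dimensional version under stated assumptions: the spending figures are hypothetical, and the centroids are seeded with the first k points to keep the run deterministic (real implementations usually use a smarter or randomized initialization).

```python
def k_means(points, k, iterations=20):
    """Group 1-D points into k clusters with plain K-Means."""
    # Assumption: seed the centroids with the first k points for determinism.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; note there are no labels.
monthly_spend = [12, 15, 14, 95, 102, 98]
centroids, clusters = k_means(monthly_spend, k=2)
print(clusters)
```

The algorithm is never told which customers are "low spenders" or "high spenders"; it only discovers that the data falls into two groups.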

Batch and online learning

In batch learning, your system is not able to learn incrementally. You need to retrain it on all the available data every time there is something new. This process can be automated, but you still need to train the system on the whole dataset every time. This makes it tedious and costly, because training on a huge dataset requires a lot of computing resources.

With online learning, your system learns incrementally as data instances are fed to it sequentially, either one at a time or in mini-batches. Each learning step is therefore fast and cheap. It is good for systems where the data comes in continuously but the resources are limited. To control how the system adapts to new data, we set a ‘learning rate’, which decides how fast or slow the system learns from the data. If your learning rate is high, your system will adapt to new data quickly, but it will also tend to forget the patterns learned earlier. If it is low, the system will learn more slowly but will be less sensitive to noise in the new data. The goal is to set a suitable learning rate which is neither too low nor too high.
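To make the role of the learning rate concrete, here is a minimal sketch of online learning for a straight-line model, updated one instance at a time with a gradient step on the squared error. The function name, the data stream (generated from y = 2x + 1), and the learning rate of 0.05 are all assumptions chosen for illustration.

```python
def online_update(m, b, x, y, learning_rate):
    """Take one gradient step on the squared error of a single instance."""
    error = (m * x + b) - y
    m -= learning_rate * error * x
    b -= learning_rate * error
    return m, b

# Hypothetical data stream generated from y = 2x + 1.
stream = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0] * 500]

m, b = 0.0, 0.0
for x, y in stream:
    # The model sees each instance once and is updated immediately;
    # the full dataset is never held in memory for training.
    m, b = online_update(m, b, x, y, learning_rate=0.05)

print(round(m, 3), round(b, 3))
```

Each update moves the parameters a fraction of the way toward fitting the newest instance. A larger learning rate makes that fraction bigger, so recent data dominates; a smaller one averages over more of the history.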

Instance-based and model-based learning

In instance-based learning, there is no model created and the knowledge is represented by the training data itself. For any new data, a similarity measure is used to find out the most closely related training data. Hence, an instance-based method memorizes the examples and generalizes to new data using a similarity measure.
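K-Nearest Neighbors is a common instance-based method, and a minimal sketch looks like this. The training points, the labels "A" and "B", and the choice of Euclidean distance as the similarity measure are assumptions for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(training_data, query, k=3):
    """Label a query point by majority vote among its k nearest examples."""
    # No model is fitted: the training data itself is the "knowledge".
    neighbors = sorted(
        training_data, key=lambda example: dist(example[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stored examples: (feature vector, label)
examples = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.5), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(examples, (2.0, 2.0)))
print(knn_predict(examples, (6.0, 6.0)))
```

All the work happens at prediction time, which is why instance-based methods can become slow when the stored training set is large.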

In model-based learning, we use a set of data to create a model which is used to make predictions. Here we try to find the most useful variables which can be used to train the model and get optimal parameter values. Which parameters exist depends on the algorithm used to create the model. For example, for a linear regression model which is represented as y = mx + b, m and b are the parameters which are learned by the model. To find the parameters which make your model perform the best, you can use either a fitness function, which measures how good your model is, or a cost function, which measures how bad it is. Once everything is done, you can make predictions using your model.
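For the y = mx + b example, a minimal sketch of learning m and b might look like the following. It uses mean squared error as the cost function and the standard least-squares formulas to minimize it; the function names and the data points are hypothetical.

```python
def mean_squared_error(m, b, xs, ys):
    """Cost function: average squared gap between predictions and targets."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Find the m and b that minimize the mean squared error."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

# Hypothetical training data lying roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = fit_line(xs, ys)
print("learned parameters:", m, b)
print("cost of the fitted line:", mean_squared_error(m, b, xs, ys))
print("prediction for x = 6:", m * 6.0 + b)
```

Once m and b are learned, the training data is no longer needed to make predictions, which is the key difference from instance-based learning.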

In summary:

Analyze the data
Select a model
Train using training data and validate
Apply the model to make predictions

Reinforcement learning

In reinforcement learning, you have a learning system called an agent, which observes the environment and performs actions which can result in rewards or penalties. Over time, it learns by itself the best strategy (policy), which is the one that earns the most reward.
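One way to picture this is tabular Q-learning, a common reinforcement learning algorithm that the article does not name; it is used here only as an example. In this minimal sketch the environment is a hypothetical five-cell corridor where reaching the right end gives a reward of +1 and reaching the left end gives a penalty of -1. The constants, function names, and hyperparameter values are assumptions.

```python
import random

N_STATES = 5  # cells 0..4 in a corridor; cells 0 and 4 end the episode
ACTIONS = {"left": -1, "right": +1}

def step(state, action):
    """Environment: move one cell and return (next_state, reward, done)."""
    next_state = state + ACTIONS[action]
    if next_state == N_STATES - 1:
        return next_state, 1.0, True    # reward at the right end
    if next_state == 0:
        return next_state, -1.0, True   # penalty at the left end
    return next_state, 0.0, False

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values (Q) from trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = rng.randint(1, N_STATES - 2), False
        while not done:
            # Explore at random; Q-learning still learns the greedy values.
            action = rng.choice(list(ACTIONS))
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(
                q[(next_state, a)] for a in ACTIONS
            )
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

q_table = train()
# The policy picks, in each cell, the action with the highest learned value.
policy = {
    s: max(ACTIONS, key=lambda a: q_table[(s, a)])
    for s in range(1, N_STATES - 1)
}
print(policy)
```

Nobody tells the agent that moving right is correct; it infers that from the rewards and penalties it collects.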

What are some of the challenges with machine learning?

Since your model is only as good as your data, there can be some challenges when trying to find the right fit. The following factors must be taken into consideration:

A poor sample for training your model to generalize to a population will result in bad predictions.
Bad data – Data that is noisy, erroneous, or full of outliers makes it difficult for the model to identify the underlying patterns. Hence, it is good to do some basic analysis and data cleaning before using the data in the model.
Overfitting – This is when your model has high accuracy on training data and poor accuracy on test data. This means your model fits too well on your training data but it will not generalize well to new samples. This can be a result of fitting a model which is too complex for the data. We can rectify this by using a simpler model with fewer parameters, gathering more training data, or cleaning up the data (removing noise, outliers, etc.).
Underfitting – If you use a model which is too simple for your problem, it will fail to learn from your data and will make poor predictions. This can be fixed by using a more powerful model and better features. Overfitting can also be tackled using techniques such as regularization, which constrains the model complexity based on hyperparameters; if the constraint is too strong the model underfits, so loosening it can help with underfitting.
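As a small illustration of regularization, the sketch below fits a line y = mx + b while adding a penalty on the size of the slope (the idea behind ridge regression, named here as an example technique). The hyperparameter name `penalty` and the data are assumptions; a penalty of 0 gives the ordinary least-squares fit.

```python
def fit_ridge_line(xs, ys, penalty):
    """Fit y = m*x + b, minimizing squared error + penalty * m**2."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # The penalty enlarges the denominator, shrinking the slope toward 0.
    m = sxy / (sxx + penalty)
    b = y_mean - m * x_mean
    return m, b

# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

for penalty in (0.0, 10.0, 1000.0):
    m, b = fit_ridge_line(xs, ys, penalty)
    print(f"penalty={penalty}: m={m:.3f}, b={b:.3f}")
```

A very large penalty flattens the line almost completely, which is underfitting; the hyperparameter has to be tuned between the two extremes.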

How do I know if my model is a good fit?

The best way to estimate how well your model performs is to train, validate, and test. This is done by dividing the data into a training set, a test set and, optionally, a validation set.

The training set, as the name suggests, serves as the data source for training the model. It should be a good sample which represents the population for which we are fitting the model. If your training data is bad, your model will turn out bad.

We train our model using the training set and validate on the validation set and finally test using the test set. Until we are satisfied with our model performance on the validation set, it is advisable to not use the test set. Finally, the train, validation, and test errors are calculated which give information on how the model is behaving. Based on these calculations we can find where there’s room for improvement.
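A minimal sketch of such a split is shown below. The function name, the 60/20/20 proportions, and the fixed random seed are assumptions; the right proportions depend on how much data you have.

```python
import random

def train_validation_test_split(data, validation_fraction=0.2,
                                test_fraction=0.2, seed=42):
    """Shuffle the data and cut it into three non-overlapping sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test

# Hypothetical dataset of 100 samples, represented by their ids.
train_set, validation_set, test_set = train_validation_test_split(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before splitting matters when the data is stored in some meaningful order, since otherwise the three sets would not represent the same population.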

A common pitfall of modeling is overfitting. Overfitting is when a model memorizes details about the training set that aren’t present in the overall population. In other words, your model learns the details of the training data so well that its performance on unseen data degrades. If your training error is low and test error is high, you are overfitting. A useful method to validate your model is ‘cross-validation’, which splits your training data into subsets (folds). Each fold takes a turn as the validation set while the model is trained on the remaining folds. Once the best parameters are selected, the final model can be tested against the test set, and problems like overfitting can be detected more easily.
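Cross-validation can be sketched as follows. The helper names are hypothetical, and the "model" is deliberately trivial (it always predicts the mean of its training values) so that the fold logic stays in focus.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(values, k):
    """Average validation error of a 'predict the training mean' model."""
    fold_errors = []
    for train_idx, validation_idx in k_fold_splits(len(values), k):
        prediction = sum(values[i] for i in train_idx) / len(train_idx)
        error = sum(
            (values[i] - prediction) ** 2 for i in validation_idx
        ) / len(validation_idx)
        fold_errors.append(error)
    return sum(fold_errors) / len(fold_errors)

# Hypothetical target values.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(cross_validate(values, k=3))
```

Every sample is used for validation exactly once, so the averaged error is a less noisy estimate than a single train/validation split would give.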

So now that you know what machine learning is and how it can help you become more efficient and effective, let us get you set up for success. Check out what our customers are saying about Atrium.