Understanding Machine Learning Methodology

Motivation

Consider a human cell sample extracted from a patient. The cell has a number of measurable characteristics, and an interesting question is what those characteristics can tell us about it. One could easily presume that only a doctor with years of experience could diagnose a tumor and say whether the patient is developing cancer or not.

Let’s imagine that we’ve obtained a dataset containing characteristics of thousands of human cell samples extracted from patients who were believed to be at risk of developing cancer. Analysis of the original data showed that many of the characteristics differed significantly among different samples.

We can use the values of these cell characteristics to give an early indication of which type a sample from a new patient belongs to. To do that, we clean our data, select a suitable algorithm for building a prediction model, and train the model to recognize the patterns of the different kinds of cells within the data.

Once the model has been trained by going through the data iteratively, it can be used to classify a new or unknown cell with rather high accuracy.

This is what we call machine learning! A machine learning model can take on part of a doctor’s task, or at least help the doctor make the process faster.

What is Machine Learning?

Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.

To see what “without being explicitly programmed” means, suppose we have to recognize animals in images. Before machine learning, each image would be transformed into a vector of features, and we would then have to write down a lot of rules or methods by hand to get the computer to detect the animals. Such an approach often fails, because the rules depend heavily on the dataset they were written for and do not generalize well to new cases.

This is where machine learning comes in. It allows us to build a model that looks at all the feature sets and their corresponding animal types, and learns the pattern of each animal. The model is built by a machine learning algorithm, and it detects animals without being explicitly programmed to do so. In essence, machine learning follows the same process that a 4-year-old child uses to learn, understand, and differentiate animals.
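
The contrast between hand-written rules and learning from examples can be sketched in a few lines of plain Python. Everything here is hypothetical and chosen only for illustration: the body weights, the two labels, the hand-picked 10 kg cut-off, and the very simple "learning" step that places the cut-off halfway between the two class averages.

```python
# Hypothetical toy data: (body weight in kg, animal label).
examples = [
    (4.0, "cat"), (5.0, "cat"), (3.5, "cat"),
    (20.0, "dog"), (25.0, "dog"), (30.0, "dog"),
]


def rule_based(weight):
    # Explicitly programmed: a human picked the 10 kg cut-off by hand.
    return "dog" if weight > 10.0 else "cat"


def learn_threshold(data):
    # "Learning": derive the cut-off from the labeled examples instead,
    # here as the midpoint between the two class averages.
    cats = [w for w, label in data if label == "cat"]
    dogs = [w for w, label in data if label == "dog"]
    return (sum(cats) / len(cats) + sum(dogs) / len(dogs)) / 2


threshold = learn_threshold(examples)


def learned(weight):
    # Uses the cut-off that was derived from data, not typed in by a person.
    return "dog" if weight > threshold else "cat"


print(round(threshold, 2))            # 14.58
print(learned(6.0), learned(22.0))    # cat dog
```

If the examples change, the learned cut-off moves with them, while the hand-written rule has to be edited by a person.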

So, machine learning algorithms, inspired by the human learning process, iteratively learn from data and allow computers to find hidden insights. These models help us in a variety of tasks, such as object recognition, summarization, recommendation, and so on.

Machine learning has a strong impact on society. For example:

PayPal uses machine learning to detect fraud.
Amazon uses machine learning to suggest what you might buy next.
Banks use machine learning when deciding whether to approve loans.
Telcos use customer data to segment their customers.

Applications of Machine Learning

There are many applications of machine learning, such as search engine results, voice recognition, number plate recognition, and Dream Reader. This small sampling is just the beginning: from self-driving cars to scientific discovery, all of these are part of today’s world of machine learning.

Take the search engine. When we search on Google, we get reliable results quickly. The process is automated, and as time goes on and more data is collected, the search engine returns better and better results.

The same holds for voice recognition, which keeps getting better at recognizing what we say and transcribing it, whether for Google voice commands or for home devices that recognize our voice. We can see this in a number of recognition apps.

So we use machine learning because it helps make life easier, and it helps our processes be more consistent and reliable.

Major Techniques of Machine Learning

So, let’s quickly examine a few of the more popular techniques.

Regression / Estimation; Predicting continuous values.
- This technique is used for predicting a continuous value.
  - E.g. predicting the price of a house based on its characteristics, or estimating the CO2 emission from a car’s engine.

Classification; Predicting the item class/category of a case.
- A Classification technique is used for Predicting the class or category of a case.
  - E.g. if a cell is benign or malignant, or whether or not a customer will churn.

Clustering; Finding the structure of data; summarization.
- Clustering finds groups of similar cases.
  - E.g. it can find similar patients, or be used for customer segmentation in the banking field.

Anomaly Detection; Discovering abnormal and unusual cases.
- Anomaly detection is used to discover abnormal and unusual cases.
  - E.g. It is used for credit card fraud detection.

Sequence mining; Predicting next events, e.g. in a click-stream (Markov Model, HMM).
- Sequence mining is used for predicting the next event.
  - E.g. the click-stream in websites.

Dimension Reduction; Reducing the size of data (PCA).
- Dimension reduction is used to reduce the number of features in the data while keeping as much of the useful information as possible.

Recommendation Systems; Recommending Items.
- This associates people’s preferences with others who have similar tastes and recommends new items to them.
  - E.g. Recommended Books or Food.
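
As one illustration of the techniques listed above, anomaly detection can be sketched with a simple z-score test in NumPy. This is a minimal sketch under stated assumptions: the transaction amounts are made up, and the cut-off of 2.5 standard deviations is an arbitrary choice for this example rather than a universal rule. Real fraud detection systems use far richer features and models.

```python
import numpy as np

# Hypothetical card transaction amounts; the last one is unusually large.
amounts = np.array([12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 500], dtype=float)

# z-score: how many standard deviations each value lies from the mean.
z_scores = (amounts - amounts.mean()) / amounts.std()

# Assumed cut-off for this sketch: flag anything beyond 2.5 standard deviations.
anomalies = np.where(np.abs(z_scores) > 2.5)[0]

print(anomalies)            # index of the flagged transaction
print(amounts[anomalies])   # the flagged amount
```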

How does Machine Learning work?

Machine Learning works in different phases.

Phase#1: Learning

Phase #1, “Learning”, is broken up into three different steps:

Pre-Processing: The first step is to clean and format the data. Computers are not smart when it comes to figuring out whether we have sent in a picture or text, so we usually start by cleaning and organizing the data so that, for example, all our pictures are handled together and text is processed separately. If we tried to process text the way we process a picture, we would not get the right answer, and vice versa. Once the data is pre-processed and nicely clean, we can start learning.
Learning: In this step, we take that data and learn from it. This is where supervised and unsupervised learning come in.
Testing: In this step, we give the model a test to make sure we are getting the right answers out of it.

Phase#2: Prediction

In this phase, we actually use the model, for instance by putting it into commercial use to make predictions. The trained model and our new data come together, and the output is a prediction of what we are looking for.

Machine Learning Workflow (it works iteratively)

Define Objective
Collect Data
Prepare the Data
Select Algorithm
Train Model
Test Model
Predict
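
The workflow can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the breast cancer dataset bundled with scikit-learn stands in for real data collection, and the decision tree, its depth, and the 25% test split are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Define objective: predict whether a cell sample is benign or malignant.
# Collect data: a bundled dataset stands in for real data collection here.
X, y = load_breast_cancer(return_X_y=True)

# Prepare the data: hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Select an algorithm and train the model on the training samples only.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Test the model on samples it has not seen during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")

# Predict: classify one "new" sample (here simply the first held-out row).
prediction = model.predict(X_test[:1])
print(prediction)
```

If the test accuracy is not good enough, we go back to an earlier step (more data, better preparation, a different algorithm) and repeat, which is what makes the workflow iterative.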

Machine Learning with Python

Python is a preferred language among data scientists. We could write our machine learning algorithms from scratch in Python, and that works very well. However, there are many modules and libraries already implemented in Python that make our life much easier.

NumPy

NumPy is a math library for working with n-dimensional arrays in Python. It enables you to do computation efficiently and effectively. For numerical work it is typically much faster than plain Python lists and loops, because operations are applied to whole arrays at once in compiled code.

E.g. for working with arrays, their data types, and images (which are stored as arrays of pixel values), we need to know NumPy.
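
A short sketch of what working with NumPy arrays looks like. The tiny 2x3 "image" is made up for illustration; a real image would simply be a larger array of the same kind.

```python
import numpy as np

# A tiny 2x3 "grayscale image": each entry is a pixel intensity from 0 to 255.
image = np.array([[0, 128, 255],
                  [64, 32, 16]], dtype=np.uint8)

print(image.shape)   # (2, 3)
print(image.dtype)   # uint8

# Vectorized arithmetic: scale every pixel to the range [0, 1] without a loop.
scaled = image / 255.0
print(scaled.max())  # 1.0

# Aggregations work over the whole array or along one axis (here: per column).
column_sums = image.sum(axis=0)
print(column_sums)
```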

SciPy

SciPy is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more. SciPy is a good library for scientific and high-performance computation.
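
Two of the toolboxes mentioned above, optimization and statistics, can be sketched as follows. The function being minimized and the two small samples are made up for illustration.

```python
from scipy import optimize, stats

# Optimization: find the x that minimizes f(x) = (x - 2)^2 + 1.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)
print(round(result.x, 4), round(result.fun, 4))

# Statistics: Pearson correlation between two small hypothetical samples.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r, p_value = stats.pearsonr(hours, scores)
print(round(r, 3))
```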

Matplotlib

Matplotlib is a very popular plotting package that provides 2D plotting as well as 3D plotting.

Pandas

Pandas is a high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for importing, manipulating, and analyzing data. In particular, it offers data structures and operations for manipulating numerical tables and time series.
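
A minimal sketch of a Pandas table and two typical operations on it. The customer records and column names are hypothetical; in practice the table would often be imported from a file, for example with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical customer records, built in memory for this example.
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "segment": ["retail", "retail", "business", "business"],
    "monthly_spend": [20.0, 30.0, 200.0, 100.0],
})

# Manipulation: add a derived column.
df["yearly_spend"] = df["monthly_spend"] * 12

# Analysis: average monthly spend per segment.
mean_by_segment = df.groupby("segment")["monthly_spend"].mean()
print(mean_by_segment)
```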

Scikit-learn

Scikit-learn is a free machine learning library for the Python programming language. It is a collection of algorithms and tools for machine learning.

It has most of the classification, regression and clustering algorithms, and it’s designed to work with the Python numerical and scientific libraries:
- NumPy
- SciPy

Most of the tasks that need to be done in a machine learning pipeline are already implemented in scikit-learn, including:

Pre-processing of data
Feature selection
Feature extraction
Train/test splitting
Defining the algorithms
Fitting models
Tuning parameters
Prediction
Evaluation
Exporting the model
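
Most of the tasks listed above can be chained together in a few lines. This is a sketch under stated assumptions rather than a recommended setup: it uses the bundled iris dataset, a k-nearest-neighbors classifier, and a small parameter grid, all chosen only for illustration. Feature extraction is left out, and Python’s standard `pickle` module is used for exporting.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Train/test splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-processing, feature selection and the algorithm chained in one object.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("knn", KNeighborsClassifier()),
])

# Tuning parameters: try a few neighbor counts with 5-fold cross-validation.
search = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3, 5]}, cv=5)
search.fit(X_train, y_train)  # fitting

# Prediction and evaluation on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 2))

# Exporting the model: serialize it to bytes and load it back.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
```

Putting the scaler and the feature selector inside the pipeline means they are fitted on the training folds only, so no information from the test data leaks into training.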

Tools for Hands-On

There are a number of tools and IDEs available to start with. One of the cool ones is “Jupyter Notebook”. An easy way to get it is to install Anaconda, which includes it. We then get a simple interface where we can easily run and test our code.

Supervised vs Unsupervised vs Reinforcement

Supervised Learning

Supervised learning is “task driven” (e.g. predict the next value). Here, we teach the model, and then with that knowledge it can predict unknown or future instances.

To supervise means to observe and direct the execution of a task, project, or activity. Obviously, we aren’t going to be supervising a person. Instead, we’ll be supervising a machine learning model that might, for example, learn to separate data into classification regions.

So, how do we supervise a machine learning model? We do this by “teaching” the model, i.e. we load the model with knowledge so that we can have it predict future instances.

But how exactly do we teach a model? We teach the model by training it with some data from a labeled dataset. It’s important to note that the data is labeled.

And what does a labeled dataset look like? Well, it can look something like a spreadsheet with a header row. The names in the top row are called attributes, and the columns, which contain the data, are called features.

If you plot this data and look at a single data point on the plot, it will have a value for each of these attributes. That data point corresponds to one row of the table, also referred to as an observation.

Looking directly at the values of the data, there are two kinds.

The first is numerical. When dealing with machine learning, the most commonly used data is numeric.
The second is categorical. It’s non-numeric because it contains characters rather than numbers.
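
The two kinds of data can be told apart, and categorical values turned into numbers, with a few lines of Pandas. The patient records below are hypothetical, and one-hot encoding is only one of several ways to encode categories.

```python
import pandas as pd

# Hypothetical patient records with one numerical and one categorical feature.
df = pd.DataFrame({
    "age": [34, 51, 47],              # numerical
    "smoker": ["yes", "no", "yes"],   # categorical
})

# Select the columns by kind of data.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)   # ['age'] ['smoker']

# One-hot encoding turns each category into its own indicator column,
# so that algorithms expecting numbers can use it.
encoded = pd.get_dummies(df, columns=["smoker"])
print(encoded.columns.tolist())         # ['age', 'smoker_no', 'smoker_yes']
```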

Supervised Learning Types

There are two types of Supervised Learning techniques.

Classification: the process of predicting a discrete class label or category.

Regression: the process of predicting a continuous value, as opposed to the categorical value predicted in classification.
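
The difference shows up directly in what the two kinds of model return. In this sketch the hours, exam scores, and pass/fail labels are made up, and linear and logistic regression from scikit-learn are used only as convenient examples of a regressor and a classifier.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: hours studied by eight students.
hours = [[1], [2], [3], [4], [5], [6], [7], [8]]
exam_score = [35, 40, 45, 50, 55, 60, 65, 70]   # continuous target
passed = [0, 0, 0, 0, 1, 1, 1, 1]               # discrete class label

# Regression: the prediction is a number on a continuous scale.
regressor = LinearRegression().fit(hours, exam_score)
predicted_score = regressor.predict([[9]])[0]

# Classification: the prediction is one of the known class labels.
classifier = LogisticRegression().fit(hours, passed)
predicted_class = classifier.predict([[9]])[0]

print(round(predicted_score, 1), predicted_class)
```

Both models are trained on labeled data, which is what makes them supervised; only the type of the label differs.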

Unsupervised Learning

Unsupervised learning is data driven (e.g. identify clusters). Here, we do not supervise the model, but we let the model work on its own to discover information that may not be visible to the human eye.

In other words, the unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data.

Generally speaking, unsupervised learning is more difficult than supervised learning, since we know little to nothing about the data or the outcomes that are to be expected.

Dimension reduction, density estimation, market basket analysis, and clustering are the most widely used unsupervised machine learning techniques.

Dimensionality reduction: Together with feature selection, it plays a large role by reducing redundant features, which makes a later task such as classification easier.
Market basket analysis: It is a modeling technique based upon the theory that if you buy a certain group of items, you’re more likely to buy another group of items.
Density estimation: It is a very simple concept that is mostly used to explore the data to find some structure within it.
Clustering: It is considered to be one of the most popular unsupervised machine learning techniques, used for grouping data points or objects that are somehow similar. It is used for:
- Discovering structure
- Summarization
- Anomaly detection

Cluster analysis has many applications in different domains, whether it be a bank’s desire to segment its customers based on certain characteristics, or helping an individual to organize and group his/her favorite types of books!
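
Customer segmentation of this kind can be sketched with k-means, one common clustering algorithm available in scikit-learn. The six customers and the choice of two clusters are assumptions made for the example; in practice the number of clusters is usually not known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: (age, yearly income in thousands).
customers = np.array([
    [22, 20], [25, 24], [24, 22],   # younger, lower income
    [55, 90], [60, 95], [58, 88],   # older, higher income
])

# No labels are given; k-means groups the points by distance alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # one cluster id per customer
print(kmeans.cluster_centers_)  # the "average customer" of each segment
```

The cluster ids themselves are arbitrary (0 or 1); what matters is which customers end up sharing an id.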

Comparison

The biggest difference between supervised and unsupervised learning is that supervised learning deals with labeled data while unsupervised learning deals with unlabeled data.

In supervised learning, we have machine learning algorithms for Classification and Regression.
In unsupervised learning, we have methods such as clustering.
In comparison to supervised learning, unsupervised learning has fewer models and fewer evaluation methods that can be used to ensure that the outcome of the model is accurate.
As such, unsupervised learning creates a less controllable environment, as the machine is creating outcomes for us.

Reinforcement Learning

Reinforcement learning involves teaching the machine to make decisions for itself based on the rewards it received for its past actions.
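
The idea of learning from past rewards can be sketched with a very small example, the so-called multi-armed bandit with an epsilon-greedy strategy. This is a deliberately simplified setting and not a full reinforcement learning system: the three actions, their hidden reward probabilities, the exploration rate, and the number of steps are all assumptions made for illustration.

```python
import random

random.seed(0)

# Hypothetical environment: three actions, each paying a reward of 1
# with a fixed probability that the agent does not know in advance.
reward_probability = [0.2, 0.5, 0.8]

estimates = [0.0, 0.0, 0.0]   # the agent's running estimate of each action's value
counts = [0, 0, 0]            # how often each action has been tried
epsilon = 0.1                 # fraction of steps spent exploring at random

for _ in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)               # explore a random action
    else:
        action = estimates.index(max(estimates))   # exploit the best estimate so far
    reward = 1.0 if random.random() < reward_probability[action] else 0.0
    counts[action] += 1
    # Incremental mean: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(best_action, [round(e, 2) for e in estimates])
```

Nobody tells the agent which action is best. It finds out by trying actions, observing the rewards, and gradually favoring the action that has paid off most.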

Comparison of Machine Learning with Other Key Technologies

AI tries to make computers intelligent in order to mimic the cognitive functions of humans. So, Artificial Intelligence is a general field with a broad scope including Computer Vision, Language Processing, Creativity, Summarization.
Machine Learning is the branch of AI that covers the statistical part of artificial intelligence. It teaches the computer to solve problems by looking at hundreds or thousands of examples, learning from them, and then using that experience to solve the same problem in new situations.
Deep Learning is a specialized field of Machine Learning, based on neural networks with many layers, where computers can learn useful representations from the data and make intelligent decisions on their own. Deep learning involves a deeper level of automation in comparison with most machine learning algorithms.