Algorithm selection

March 27, 2021

The work of the learning algorithm

This may sound silly, but first we need to understand what algorithms actually do in the machine learning process. So let's see.

The learning algorithm can be compared to an engine that drives the entire machine learning process. After data preprocessing, the next important task is the work of the learning algorithm.

We first divide our dataset into two parts:

Training data (the larger part; it contains none of the testing data)

Testing data (the smaller part; it contains none of the training data)
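The split described above can be sketched in plain Python. This is only an illustration under assumptions: the 70/30 ratio, the fixed seed, and the helper name split_dataset are not from this post, and the rows are stand-in numbers rather than real Pima data.

```python
import random


def split_dataset(rows, test_fraction=0.3, seed=42):
    """Shuffle rows and split them into (training, testing) lists.

    Hypothetical helper for illustration; test_fraction and seed are
    assumed values, not something this post specifies.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatable splits
    n_test = int(len(shuffled) * test_fraction)
    testing = shuffled[:n_test]            # the smaller part
    training = shuffled[n_test:]           # the larger part
    return training, testing


rows = list(range(10))                     # stand-in for 10 dataset rows
training, testing = split_dataset(rows)

print(len(training), len(testing))
# True when no row appears in both parts.
print(set(training).isdisjoint(testing))
```

In Scikit-learn, the train_test_split function from sklearn.model_selection does this job, so you rarely need to write it by hand.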

Now we feed the training data to the algorithm. In Scikit-learn, this is usually done by calling the fit() function, which hands the data to the algorithm and runs its analysis.

A mathematical model works behind this algorithm. Using this model, the algorithm adjusts its internal parameters while it analyzes the dataset. Understanding this properly requires some math, but for the time being we can treat everything as magic and keep going. Of course, we will look at the mathematical analysis, but now is not the right time. You might lose interest because of the math, so we will learn to drive first and then see how the engine, that is, the learning algorithm, works.

The next task is simple: by calling the predict() function we can make predictions for cases that are not in the dataset. For example, to diagnose diabetes, we first train the model on the Pima Indian dataset with the fit() function; then, given the same measurements for any person, the predict() function tells us whether that person is likely to have diabetes.
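To make the fit-then-predict idea concrete, here is a toy sketch. ThresholdClassifier is a hypothetical class written for this illustration, not part of Scikit-learn; it only mimics the fit() and predict() calling style. The glucose readings and labels are made up, and the model assumes that higher values point toward diabetes.

```python
class ThresholdClassifier:
    """Toy one-feature model (hypothetical, for illustration only)."""

    def fit(self, values, labels):
        # "Training": learn one internal parameter (a threshold) from the data.
        positives = [v for v, y in zip(values, labels) if y == 1]
        negatives = [v for v, y in zip(values, labels) if y == 0]
        mean_pos = sum(positives) / len(positives)
        mean_neg = sum(negatives) / len(negatives)
        # Place the threshold halfway between the two class averages.
        self.threshold_ = (mean_pos + mean_neg) / 2
        return self

    def predict(self, values):
        # Predict 1 when the value is above the learned threshold, otherwise 0.
        return [1 if v > self.threshold_ else 0 for v in values]


# Made-up glucose readings and labels (1 = diabetic, 0 = not diabetic).
train_values = [85, 90, 100, 150, 160, 170]
train_labels = [0, 0, 0, 1, 1, 1]

model = ThresholdClassifier().fit(train_values, train_labels)
predictions = model.predict([95, 155])     # values not seen during training
print(model.threshold_, predictions)
```

The point is the shape of the workflow: fit() adjusts the internal parameters from training data, and predict() applies them to new data.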

Which algorithm will train and predict?

There are over 50 established learning algorithms. You can also create your own custom algorithm by combining them. But how do we know which algorithm is right for our work? That is the topic of this discussion.

Everyone has their own factors for selecting algorithms. When you become an expert, you will work out for yourself which algorithm is best for a job.

For now, you can choose based on these factors.

Algorithm Decision Factors

Learning Type (Supervised or Unsupervised)

Result (Value or Yes / No type answer)

Simple or Complex

Basic or Enhanced

We will use the solution statement together with the workflow process to decide which algorithm is the better choice.

Learning type

The learning process of each algorithm is different. Let's look at the solution statement first and then decide what kind of learning we need.

Using the machine learning workflow, process and transform the Pima Indian data to create a predictive model. This model must predict, with an accuracy of 80% or more, who is likely to have diabetes.

The key part of the above statement is that it talks about building a predictive model.

We know

Predictive Model => Supervised Machine Learning

That tells us the learning type of the algorithm we will choose.

So we will not use algorithms that work with Unsupervised Learning.

In doing so,

22 algorithms were dropped, leaving 27 algorithms

That's still a lot! No problem, we can keep filtering step by step.

Result type

As I said before, we usually want one of two kinds of answers. One is a value (Regression: for example, what the price will be for a house of a given size) and the other is a yes / no type answer (Classification).

Diagnosing diabetes is a classification problem, because we want to know whether a person has diabetes or not. Grouped, discrete values such as 1-100, 101-200, 201-300, or small, medium, large, etc. also fall under classification.
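As a small sketch of why ranges count as classification: once a number is mapped to a range label, the thing we predict is the label (one of a fixed set of classes), not the number itself. The helper name to_bucket is hypothetical; the ranges are the ones mentioned above.

```python
def to_bucket(value):
    """Map a number from 1 to 300 onto one of three range labels."""
    if not 1 <= value <= 300:
        raise ValueError("value must be between 1 and 300")
    if value <= 100:
        return "1-100"
    if value <= 200:
        return "101-200"
    return "201-300"


# Each number becomes one of three possible classes.
labels = [to_bucket(v) for v in (42, 150, 300)]
print(labels)
```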

Let's see how many algorithms are omitted,

6 algorithms were dropped because the result type is classification, leaving 20 algorithms

Since we are just starting to learn machine learning, we should avoid complex algorithms. Apply the KISS (Keep It Short and Simple) principle.

What are the complex algorithms?

Ensemble Algorithms:

These are special algorithms because each Ensemble Algorithm is a combination of several algorithms.

Very good performance

Debugging is difficult
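A rough sketch of the ensemble idea: several simple models each cast a vote, and the majority wins. Everything here is assumed for illustration: the three rules, their thresholds, and the sample people are made up, and majority voting is only one of several ways real ensemble algorithms combine their members.

```python
from collections import Counter


# Three deliberately simple "models"; the thresholds are made-up values.
def glucose_rule(person):
    return 1 if person["glucose"] > 125 else 0


def bmi_rule(person):
    return 1 if person["bmi"] > 30 else 0


def age_rule(person):
    return 1 if person["age"] > 45 else 0


def ensemble_predict(person, rules):
    """Return the label that most of the rules voted for."""
    votes = [rule(person) for rule in rules]
    return Counter(votes).most_common(1)[0][0]


RULES = [glucose_rule, bmi_rule, age_rule]   # odd count, so no tied votes

person_a = {"glucose": 150, "bmi": 33, "age": 30}   # votes: 1, 1, 0
person_b = {"glucose": 100, "bmi": 31, "age": 40}   # votes: 0, 1, 0

vote_a = ensemble_predict(person_a, RULES)
vote_b = ensemble_predict(person_b, RULES)
print(vote_a, vote_b)
```

This also hints at why debugging is harder: when the final answer is wrong, you have to work out which member models voted wrongly and why.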

This reduced the list to 14 algorithms

Basic or Enhanced?

Enhanced

Variations of Basic

Better performance than Basic (needless to say :P)

Extra convenience

Complex

Basic

Easy

So it is easy to understand

As you have guessed, since we are beginners, it is better to stick with Basic.

Three candidate algorithms at the end of the filtering

We now have three algorithms,

Naive Bayes

Logistic Regression

Decision Tree
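The whole elimination process can be sketched as a chain of filters. Note the assumptions: CATALOGUE is a tiny hand-tagged sample made up for this sketch, not the full list of algorithms filtered in this post, and tagging Complement Naive Bayes as "enhanced" is my own labeling for the sake of the example.

```python
# Each entry: (name, learning type, result type, is_ensemble, is_basic).
# A small illustrative sample with hand-assigned tags (assumptions).
CATALOGUE = [
    ("K-Means", "unsupervised", "clustering", False, True),
    ("Linear Regression", "supervised", "regression", False, True),
    ("Random Forest", "supervised", "classification", True, False),
    ("Complement Naive Bayes", "supervised", "classification", False, False),
    ("Naive Bayes", "supervised", "classification", False, True),
    ("Logistic Regression", "supervised", "classification", False, True),
    ("Decision Tree", "supervised", "classification", False, True),
]


def select_candidates(catalogue):
    """Apply the four decision factors one after another."""
    remaining = [a for a in catalogue if a[1] == "supervised"]      # learning type
    remaining = [a for a in remaining if a[2] == "classification"]  # result type
    remaining = [a for a in remaining if not a[3]]                  # drop ensembles
    remaining = [a for a in remaining if a[4]]                      # keep basic only
    return [a[0] for a in remaining]


candidates = select_candidates(CATALOGUE)
print(candidates)
```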

We will choose one of them. After a short discussion of the three, we will decide which one is better to use. All three are basic, classic machine learning algorithms. Complex algorithms are largely built using these as building blocks. Let's start with Naive Bayes.

Naive Bayes

The Naive Bayes algorithm was developed by applying Bayes' Theorem. For those who have not heard of it, Bayes' Theorem is one of the fundamental theorems of probability. As it is a very important theorem, it will be discussed in detail later. (Mathematics again)

The Naive Bayes algorithm determines the probability of something happening. For example: what are the chances of having diabetes given high blood pressure? The algorithm combines the probabilities associated with the different features (input variables) to determine the probability of an event (based on previously seen data, of course).

Here are some of its features

Determines the probability of occurrence

Each feature or input variable (in our problem: number of pregnancies, insulin, etc.) is treated as independent of the others and as equally important.

Therefore, Blood Pressure and BMI (Body Mass Index) are equally important (as are all the other variables).

A small amount of training data is enough to make predictions
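Here is a minimal sketch of the idea, not a full Naive Bayes implementation. The function name naive_bayes_posterior and all the probabilities are assumptions: they come from an imaginary group of 100 people (30 diabetic, 70 not), where high blood pressure appears in 18 of the diabetic and 14 of the non-diabetic people, and high BMI in 15 and 21 of them.

```python
import math


def naive_bayes_posterior(prior, likelihoods_pos, likelihoods_neg):
    """Probability of the positive class given the observed features.

    prior            P(diabetes)
    likelihoods_pos  P(feature | diabetes) for each observed feature
    likelihoods_neg  P(feature | no diabetes) for each observed feature

    The "naive" step is multiplying the per-feature likelihoods as if
    the features were independent of each other.
    """
    score_pos = prior * math.prod(likelihoods_pos)
    score_neg = (1 - prior) * math.prod(likelihoods_neg)
    return score_pos / (score_pos + score_neg)


# Made-up numbers from the imaginary group described above.
p_diabetes = 30 / 100
p_high_bp = (18 / 30, 14 / 70)    # given diabetes, given no diabetes
p_high_bmi = (15 / 30, 21 / 70)

# One feature: this is plain Bayes' Theorem.
p_one = naive_bayes_posterior(p_diabetes, [p_high_bp[0]], [p_high_bp[1]])

# Two features: likelihoods are simply multiplied together.
p_two = naive_bayes_posterior(
    p_diabetes,
    [p_high_bp[0], p_high_bmi[0]],
    [p_high_bp[1], p_high_bmi[1]],
)
print(round(p_one, 4), round(p_two, 4))
```

With these made-up counts, the one-feature case works out to 18 / 32, a little over one half, and adding the second feature pushes the probability higher.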

Logistic Regression

The name is confusing: we learned that regression means a continuous value, while classification means a discrete value. You may be wondering why we are discussing a regression method for classification.

In fact, Logistic Regression outputs a probability between 0 and 1 (for example 0.99999 or 0.00001), which is then read as the class 1 or 0.

Features

Binary results

Each input variable (feature) is weighted (not all features have to be equally important)
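The two features above can be sketched in a few lines: a weighted sum of the inputs is squashed into the range 0 to 1 by the logistic (sigmoid) function and then cut at a threshold. The weights, the bias, the 0.5 threshold, and the input values are all made up for illustration; in practice the weights are learned during training.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    # Weighted sum: each feature contributes according to its own weight.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    # Turn the probability into a binary answer.
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Made-up parameters: the first feature matters four times as much.
weights = [2.0, 0.5]
bias = -3.0

prob_high = predict_probability([2.0, 1.0], weights, bias)   # z = 1.5
prob_low = predict_probability([0.5, 1.0], weights, bias)    # z = -1.5
print(prob_high, predict_class([2.0, 1.0], weights, bias))
print(prob_low, predict_class([0.5, 1.0], weights, bias))
```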

Let's see the next algorithm.

Decision Tree

Its structure is like a binary tree (easy to picture if you have studied data structures)

Each node is actually a decision

It needs a lot of data to work out where to split at each decision
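A hand-built tree shows the structure. This sketch only covers prediction with a tree that already exists: the features, thresholds, and labels in TREE are made up, whereas a real decision tree algorithm learns them from the training data.

```python
# Each inner node is a decision (feature + threshold); each leaf holds a label.
# Hand-written, made-up tree for illustration only.
TREE = {
    "feature": "glucose",
    "threshold": 125,
    "below": {"label": 0},
    "above": {
        "feature": "bmi",
        "threshold": 30,
        "below": {"label": 0},
        "above": {"label": 1},
    },
}


def tree_predict(node, person):
    """Walk from the root to a leaf, taking one branch per decision."""
    while "label" not in node:
        value = person[node["feature"]]
        node = node["above"] if value > node["threshold"] else node["below"]
    return node["label"]


result_a = tree_predict(TREE, {"glucose": 150, "bmi": 33})
result_b = tree_predict(TREE, {"glucose": 150, "bmi": 25})
result_c = tree_predict(TREE, {"glucose": 100, "bmi": 40})
print(result_a, result_b, result_c)
```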

Finally, I selected Naive Bayes

Why?

Easy to understand

Works fast (it can be up to roughly 100 times faster than more complex algorithms)

The model remains stable even if the data changes

Debugging is relatively easy

The biggest reason is that this algorithm matches our problem perfectly: we want to find out how likely diabetes is, and determining likelihood is exactly what this algorithm does :)

Summary

Many learning algorithms are available

We narrowed the selection using these factors:

Learning Type - Supervised

Result - Binary Classification

Complexity - Non-Ensemble

Basic or Enhanced - Basic

I selected Naive Bayes for training, because

it is easy, fast, and stable.
