Algorithm selection
The work of the learning algorithm
This may sound silly, but we first need to understand what an algorithm actually does in the machine learning process. So let's see.
The learning algorithm can be compared to an engine that drives the entire machine learning process. After data preprocessing, the next important task is the work of the learning algorithm.
We first divide our dataset into two parts:
Training data (the larger part; it contains none of the testing data)
Testing data (the smaller part; it contains none of the training data)
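The split described above can be sketched in plain Python. This is only an illustration under assumptions: the function name split_dataset, the 70/30 ratio, and the fixed seed are choices made for this example and do not come from the original workflow.

```python
import random

def split_dataset(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows, then cut them into a training part and a testing part."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    testing = shuffled[:n_test]             # the smaller part
    training = shuffled[n_test:]            # the larger part, no overlap with testing
    return training, testing

rows = list(range(10))                      # stand-in for 10 records
training, testing = split_dataset(rows)
print(len(training), len(testing))          # 7 3
```

In practice, Scikit-learn's train_test_split function from sklearn.model_selection does the same job, so a hand-written version like this is rarely needed.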
Now we feed the training data to the algorithm. In Scikit-learn this is usually done by calling the fit() function, which passes the data to the algorithm and lets the algorithm analyze it.
A mathematical model works behind the algorithm. Using this model, the algorithm adjusts its internal parameters as it analyzes the dataset. Understanding these steps requires some math, but for the time being we can treat everything as magic and keep going. We will look at the mathematical analysis, of course, but now is not the right time. Too much math too early may make you lose interest, so we will learn to drive first and then see how the engine, that is, the learning algorithm, works.
The next task is simple: by calling the predict() function we can make predictions about cases that are not in the dataset. For example, to diagnose diabetes, we first train the model on the Pima Indian dataset with the fit() function; then, given the same parameters for any person, the predict() function tells us whether that person is likely to have diabetes.
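To make the fit-then-predict pattern concrete, here is a toy classifier in plain Python that only borrows the method names. It is not Scikit-learn code: the class ThresholdClassifier and the glucose readings are invented for this sketch, and it assumes that the positive class has the higher values.

```python
class ThresholdClassifier:
    """Toy one-feature model: learns a single cut-off value during fit()."""

    def fit(self, values, labels):
        # Average the feature separately for each class ...
        positives = [v for v, y in zip(values, labels) if y == 1]
        negatives = [v for v, y in zip(values, labels) if y == 0]
        mean_pos = sum(positives) / len(positives)
        mean_neg = sum(negatives) / len(negatives)
        # ... and store the midpoint as the model's internal parameter.
        self.threshold_ = (mean_pos + mean_neg) / 2
        return self

    def predict(self, values):
        # Anything above the learned cut-off is labelled 1, the rest 0.
        return [1 if v > self.threshold_ else 0 for v in values]

# Made-up glucose readings with known outcomes (1 = diabetic, 0 = not).
train_values = [85, 90, 100, 150, 160, 170]
train_labels = [0, 0, 0, 1, 1, 1]

model = ThresholdClassifier().fit(train_values, train_labels)
print(model.predict([95, 155]))    # [0, 1]
```

The point is the division of labour: fit() sets the internal parameter from known examples, and predict() uses that parameter on people the model has never seen.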
Which algorithm will train and predict?
There are more than 50 established learning algorithms, and you can also create your own custom algorithm by combining them. But how do we know which algorithm is right for our work? That is the topic of this discussion.
Everyone has their own criteria for selecting algorithms. Once you become an expert, you will find out for yourself which algorithm is best for a job.
For now, you can choose based on these factors.
Algorithm Decision Factors
Learning Type (Supervised or Unsupervised)
Result (Value or Yes / No type answer)
Simple or Complex
Basic or Enhanced
We will weigh these factors against our solution statement and workflow to decide which algorithm is the better choice.
Learning type
Each algorithm learns in a different way. Let's look at the solution statement first and then decide what kind of learning we need.
A predictive model needs to be created after the Pima Indian data has been processed and transformed using the machine learning workflow. This model must identify, with an accuracy of 80% or more, who may be suffering from diabetes.
The key part of the above statement is that it talks about building a predictive model.
We know:
Predictive Model => Supervised Machine Learning
This tells us the learning type of the algorithm we will choose.
Consequently, we will not use algorithms that work with Unsupervised Learning.
In doing so,
22 algorithms are dropped, leaving 27 algorithms.
That's still a lot! No problem, we can filter further, one step at a time.
Result type
As I said before, we usually want one of two kinds of answers. One is a value (Regression: for example, what the price will be for a given house size) and the other is a yes / no type answer (Classification).
Diagnosing diabetes is a classification problem, because we want to know whether a person has diabetes or not. Discrete values, such as the ranges 1-100, 101-200, 201-300, or the labels small, medium, large, also fall under classification.
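A short sketch shows why ranges count as classification: the answer is one label out of a fixed set, not a free number. The function name size_class and the pairing of each range with small, medium, or large are assumptions made for this example.

```python
def size_class(value):
    """Map a number from 1 to 300 onto one of three discrete classes."""
    if 1 <= value <= 100:
        return "small"
    if 101 <= value <= 200:
        return "medium"
    if 201 <= value <= 300:
        return "large"
    raise ValueError("value outside the 1-300 range")

print([size_class(v) for v in (42, 150, 300)])  # ['small', 'medium', 'large']
```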
Let's see how many algorithms are omitted:
6 algorithms were omitted because the result type is classification, leaving 20 algorithms.
We should avoid complex algorithms while we are just beginning to learn machine learning. Apply the KISS (Keep It Short and Simple) principle.
What are the complex algorithms?
Ensemble Algorithms:
These are special algorithms because each ensemble algorithm is a combination of many algorithms.
Very good performance
Debugging is not convenient
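One way to picture "a combination of many algorithms" is majority voting, sketched below in plain Python. Voting is only one of several ways ensembles combine their members, and the three rules with their cut-off values are hypothetical, chosen just for this illustration.

```python
from collections import Counter

# Three deliberately simple "models"; each looks at a single measurement.
def rule_glucose(person):
    return 1 if person["glucose"] > 125 else 0

def rule_bmi(person):
    return 1 if person["bmi"] > 30 else 0

def rule_age(person):
    return 1 if person["age"] > 50 else 0

def ensemble_predict(person, rules):
    """Ask every simple rule for a vote and return the most common answer."""
    votes = [rule(person) for rule in rules]
    return Counter(votes).most_common(1)[0][0]

person = {"glucose": 140, "bmi": 27, "age": 55}
print(ensemble_predict(person, [rule_glucose, rule_bmi, rule_age]))  # 1
```

It also hints at why debugging is harder: a wrong final answer could come from any of the members, or from the way their votes are combined.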
This reduced the number of algorithms to 14
Basic or Enhanced?
Enhanced
Variations of Basic
Better performance than Basic (needless to say :P)
Extra convenience
Complex
Basic
Easy
Therefore easy to understand
Yes, you guessed it: since we are beginners, it is better to stay with Basic.
Three candidate algorithms at the end of the filtering
We now have three algorithms,
Naive Bayes
Logistic Regression
Decision Tree
We will choose one of them. After a short discussion of all three, we will decide which one is better to use. All three are basic, classic machine learning algorithms, and complex algorithms are essentially built using these as building blocks. Let's start with Naive Bayes.
Naive Bayes
The Naive Bayes algorithm was developed by applying Bayes' Theorem. For those who have not heard of it, Bayes' Theorem is one of the fundamental theorems of probability. As it is a very important theorem, it will be discussed in detail later. (Mathematics again!)
The Naive Bayes algorithm determines the probability of something happening. For example: what are the chances of having diabetes given high blood pressure? The algorithm combines the probabilities associated with the different features / input variables to determine the probability of an event (based on the previous dataset, of course).
Here are some of its features
Determines the probability of occurrence
Each feature or input variable (in our problem: number of pregnancies, insulin, etc.) is treated as equally important.
Therefore, Blood Pressure and BMI (Body Mass Index) are equally important (as are all the other variables).
A small amount of data is enough for the prediction
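The question "what are the chances of diabetes given high blood pressure?" can be worked through with Bayes' Theorem in a few lines. This is a sketch, not the Scikit-learn implementation: the function name naive_bayes_posterior and every probability below are made-up numbers for illustration, not values taken from the Pima Indian dataset.

```python
def naive_bayes_posterior(prior, likelihoods_if_yes, likelihoods_if_no):
    """Combine one prior with per-feature likelihoods, treating features alike."""
    score_yes = prior                 # P(diabetes)
    for p in likelihoods_if_yes:      # P(feature | diabetes), one per feature
        score_yes *= p
    score_no = 1.0 - prior            # P(no diabetes)
    for p in likelihoods_if_no:       # P(feature | no diabetes), one per feature
        score_no *= p
    # Normalise so the two outcomes add up to 1.
    return score_yes / (score_yes + score_no)

# Hypothetical numbers: 35% of people in the data have diabetes; high blood
# pressure appears in 60% of diabetic and 30% of non-diabetic records.
one_feature = naive_bayes_posterior(0.35, [0.6], [0.3])
print(round(one_feature, 3))          # 0.519

# Adding a second hypothetical feature (high BMI: 70% vs 40%).
two_features = naive_bayes_posterior(0.35, [0.6, 0.7], [0.3, 0.4])
print(round(two_features, 3))         # 0.653
```

Every feature enters the product in exactly the same way, which is the sense in which the algorithm treats all input variables alike.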
Logistic Regression
The name is confusing: we know that Regression means continuous values, whereas Classification means discrete values. You may be wondering why we are discussing a regression method for classification.
In fact, Logistic Regression outputs a value between 0 and 1 (for example 0.99999 or 0.00001), which is then read as the class 1 or 0.
Features
Binary results
Input variables / features are weighted (not all features need to be equally important)
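The two features listed above, a binary result and weighted inputs, can be seen in a small sketch. The weights and the intercept below are hypothetical values picked by hand; in a real model they would be learned from the training data during fit().

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    # Weighted sum: a larger weight means that feature matters more.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights = [0.05, 0.10]   # hypothetical weights for glucose and BMI
bias = -9.0              # hypothetical intercept

p = predict_probability([150, 32], weights, bias)
label = 1 if p >= 0.5 else 0          # cut the probability into a yes / no answer
print(round(p, 3), label)             # about 0.846, so the label is 1
```

The regression part produces the continuous value p; the classification part is the final comparison against 0.5.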
Let's see the next algorithm.
Decision Tree
Its structure is like a binary tree (easy to picture if you have studied data structures)
Each node is actually a decision
It needs a lot of data to work out where to split the decisions
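A hand-written tree makes the "each node is a decision" idea visible. In the sketch below the tree is a nested dictionary; the features, cut-off values, and the helper tree_predict are all assumptions for illustration. A real decision tree algorithm would work out these splits from the training data.

```python
# Each inner node asks one yes/no question; each leaf holds the final answer.
tree = {
    "feature": "glucose", "threshold": 125,
    "yes": {"feature": "bmi", "threshold": 30,
            "yes": {"label": 1},
            "no": {"label": 0}},
    "no": {"label": 0},
}

def tree_predict(node, person):
    """Walk from the root to a leaf, following the answer at every node."""
    while "label" not in node:
        branch = "yes" if person[node["feature"]] > node["threshold"] else "no"
        node = node[branch]
    return node["label"]

print(tree_predict(tree, {"glucose": 140, "bmi": 33}))  # 1
print(tree_predict(tree, {"glucose": 140, "bmi": 25}))  # 0
print(tree_predict(tree, {"glucose": 100, "bmi": 35}))  # 0
```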
Finally, I selected Naive Bayes
Why?
Easy to understand
Works fast (it can be roughly 100 times faster than many other algorithms)
The model remains stable even if the data changes
Debugging is relatively easy
The biggest reason is that this algorithm matches our problem perfectly: we want to find out a likelihood, and determining likelihood is exactly what this algorithm does :)
Summary
Lots of learning algorithms are available
I made the selection based on:
Learning Type - Supervised
Result - Binary Classification
Complexity - Non-Ensemble
Basic or Enhanced - Basic
I selected Naive Bayes for training, because it is
Easy, fast, and stable.

