Machine Learning Model Training



By definition,

Training means letting a specific dataset teach a machine learning algorithm to produce a specific prediction model.

In other words, machine learning training is the process that turns a general machine learning algorithm into a specific prediction model by fitting it to a specific dataset.

Naturally, we can't predict whether it will be sunny or rainy by training a model on a diabetes dataset. For that, we need a separate, task-specific dataset.

Often a model needs to be retrained.

Why do we retrain the model?

Suppose the Pima Indian Diabetes dataset is updated every few days with new observations (i.e., new rows are added). As we know, machine learning generally benefits from more data, so if we retrain the model on the enlarged dataset, we can expect better results than before.

Existing data + new data => better predictions

From the updated dataset we can again verify performance by reserving some of the data for training and some for testing.




Training overview

Dataset splitting

The first step of training is to split the dataset. Typically we keep about 70% of the data for training and the remaining 30% for testing.
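As a sketch of this 70/30 split in plain Python (the ten dummy rows here are made up; in practice each row would be one observation from the dataset):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the rows, then split them into a training set and a testing set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # deterministic shuffle for reproducibility
    n_test = int(len(rows) * test_ratio)   # 30% of the rows go to the test set
    return rows[n_test:], rows[:n_test]    # (train, test)

# Ten dummy observations; any list of rows would do.
data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting matters: if the rows are ordered (for example, all diabetic patients first), a straight cut would give the model a biased view of the data.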


We train the algorithm by feeding it the training data. Training an algorithm really means setting the algorithm's internal parameters for that specific dataset. This will become clear when we look at the mathematical analysis.

First, let's see: what is our training goal?

Training goals and training data

To understand the training goal, we will use a hypothetical dataset. For now, we are not using the Diabetes dataset for this.

Let's say two input variables/features, X and Y, are enough to tell whether someone will catch a cold or not. If we plot a scatter of Y against X:


Explanation of the scatter plot:

Each blue dot marks a combination of X and Y for which there is no cold, and each red dot marks a combination for which there is a cold.

We can draw a simple decision boundary to separate the dots. This boundary is essentially your trained algorithm: training produces an algorithm that can draw such a decision boundary for the dataset above.


But after drawing the boundary, we see that some red dots have ended up on the blue side and some blue dots on the red side.


So the training is not 100% accurate. But 100% accuracy is not our goal either: a model that fits the training data too exactly has a high chance of overfitting. Overfitting and underfitting are important aspects of machine learning, so as always we will discuss them in detail later.
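A tiny sketch of the idea, with made-up points and a made-up straight-line boundary (nothing here comes from a real dataset; the numbers are chosen only so that a couple of points land on the wrong side):

```python
# (x, y, label): label 1 = cold (red), 0 = no cold (blue)
points = [
    (1.0, 1.0, 0), (1.5, 2.0, 0), (2.0, 1.5, 0), (2.5, 4.5, 0),  # last blue point strays
    (4.0, 4.0, 1), (4.5, 5.0, 1), (5.0, 4.5, 1), (3.5, 1.0, 1),  # last red point strays
]

def predict(x, y):
    """Classify by which side of the line x + y = 6 the point falls on."""
    return 1 if x + y > 6.0 else 0

correct = sum(1 for x, y, label in points if predict(x, y) == label)
accuracy = correct / len(points)
print(f"accuracy = {accuracy:.0%}")  # accuracy = 75%
```

Two of the eight points fall on the wrong side of the line, so the boundary is imperfect, and that is fine: our goal is a boundary that generalizes, not one that memorizes every point.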

By training the model on this training data, we can create such a decision boundary. For now, that is the purpose of the training data.

What is the function of testing data?

A natural question arises: why aren't we using 100% of the data for training? Why split it 70-30?! Doesn't that reduce the training data? Won't performance suffer? What's the problem with training on all of the data?

Yes, there are many questions and I will try to answer them all.

Why aren't we using 100% of the data in training? Why split it 70-30?! Doesn't that reduce the training data?

The questions are actually the same, we try to understand the matter through an example discussed earlier.

Suppose I'm teaching someone the multiplication table of 2 (assume he doesn't know how to multiply, only how to add). Notice that I have written 'teaching': I want to teach him the logic behind the table. Now if I show him,

2 X 1 = 2

2 X 2 = 4

2 X 3 = 6

2 X 4 = 8

2 X 5 = 10

2 X 6 = 12

2 X 7 = 14

2 X 8 = 16

2 X 9 = 18

2 X 10 = 20

and make him memorize this table of 2 by repeating it again and again, then if I now ask him, "What is 2 X 3?", he can answer instantly that 2 X 3 = 6. But if, as a test, I ask him to recite the table of 5, he won't be able to answer immediately, and in some cases he may not be able to answer at all.

What this example shows is that I didn't really teach him anything. By supplying 100% of the data, I actually prevented him from figuring out the logic. But suppose I did this instead,

2 X 1 = 2

2 X 2 = 4

2 X 3 = 6


2 X 5 = 10

2 X 6 = 12

2 X 7 = 14

2 X 8 = 16


2 X 10 = 20

Two values are missing here, and I give him the job of figuring out what they should be. This time he will try to work out the logic rather than recite memorized answers. Not only that: from his answers I can actually tell whether he has learned the logic at all.

This means we can use the held-out testing data to verify whether the model we created can really predict: we know some data that we deliberately did not provide during training. If the model can predict it, then we have succeeded, because we were able to train a model. After all, how else could we tell whether someone is giving the right answer except by asking a question whose answer they haven't already seen?
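The multiplication-table analogy can be sketched directly in code. Here "learning the logic" means fitting the single parameter m in y = m * x by least squares on the incomplete table, then checking the held-out rows (the split and the one-parameter model are my illustration, not something from the original text):

```python
# Training data: the table of 2 with "2 X 4" and "2 X 9" held out for testing.
train = [(1, 2), (2, 4), (3, 6), (5, 10), (6, 12), (7, 14), (8, 16), (10, 20)]
test = [(4, 8), (9, 18)]

# Least-squares fit of m in y = m * x:  m = sum(x*y) / sum(x*x)
m = sum(x * y for x, y in train) / sum(x * x for x, y in train)
print(m)  # 2.0

# A learner who found the logic predicts the held-out values correctly.
for x, y in test:
    print(x, "->", m * x, "expected", y)
```

If the "learner" had merely memorized the eight training rows, it would have no way to answer for 4 and 9; predicting them correctly is evidence that the underlying rule was learned.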

Input variable or feature selection

Feature selection or feature engineering is actually a huge topic in data science. As mentioned before, a dataset often contains features that are useless, and predictions improve if they are left out. Selecting the useful features by trimming away these useless variables is feature engineering.

We already applied feature engineering while cleaning the dataset, that is, we trimmed features based on their correlation.
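A minimal sketch of correlation-based trimming, assuming a tiny made-up dataset (the "Glucose" name echoes the Pima dataset, but the numbers and the "Noise" feature are invented for illustration; the 0.5 threshold is likewise arbitrary):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up numbers purely for illustration.
features = {
    "Glucose": [85, 168, 183, 89, 137, 116],
    "Noise":   [7, 3, 9, 4, 6, 5],   # a feature with little signal
}
target = [0, 1, 1, 0, 1, 0]          # diabetes outcome

# Keep only features whose |correlation| with the target passes a threshold.
kept = [name for name, col in features.items()
        if abs(pearson(col, target)) >= 0.5]
print(kept)  # ['Glucose']
```

Here the weakly correlated "Noise" column is trimmed away, which is the same idea as dropping low-correlation variables during dataset cleaning.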

Features selected in Pima Indian Diabetes Dataset:

No of Pregnancies

Glucose Concentration

Blood Pressure

Skin Thickness

Insulin

Body Mass Index

Diabetes Pedigree Function

Age
