What is meant by machine learning workflow?

By definition,

An orchestrated and repeatable pattern that systematically transforms and processes information to create prediction solutions.



An orchestrated and repeatable pattern:

This means the same workflow is used end to end: we define the problem with it, and we build the solution with it.

Transforms and processes information:

Before a model can be trained on data, the data has to be made usable for training.

Suppose we want to create a predictive model that answers yes or no. If the input data is numerical, the output should be numerical too. For this reason, we can replace the yes/no labels in the training data with 1 and 0. This is called data preprocessing.
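The yes/no replacement above can be sketched in a few lines of Python; the labels here are made-up examples, not data from any real problem:

```python
# A minimal sketch of label preprocessing: replacing yes/no answers
# with 1/0 so a numeric model can train on them.
raw_labels = ["yes", "no", "no", "yes", "yes"]

# Map each text label to a number the model can work with.
label_map = {"yes": 1, "no": 0}
numeric_labels = [label_map[label] for label in raw_labels]

print(numeric_labels)  # [1, 0, 0, 1, 1]
```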

Create prediction solutions:

The ultimate goal of any machine learning system is to make predictions. However, the prediction must also meet the needs of the customer.

For example, suppose my model takes 2 days to train on new data and 1 more day to make a prediction. If more new data arrives during that day, I need even more time to retrain, and the delay before the model can predict keeps growing. Would anyone adopt such a model? Of course not. So the faster a model can be trained to predict accurately, the better the algorithm and the machine learning system.


Machine Learning Workflow:

Asking the right questions



Before starting any work, I should be clear about what I actually want to do. In the house size and price problem, for example, I want to give the size of a house as input and get its price as output.

Complex work becomes easier once we frame the right question and then look for its answer.

Sorting data


Now I have to identify the data needed to solve the problem, that is, to train the model. A machine can tell the difference between a good deed and a bad one only if it has been trained on examples of both; only then, when it sees a new event, can it match it against its training data and decide whether the deed is good or bad.

Select the algorithm


The most difficult task is selecting the algorithm. There is no point in using artificial neural nets for problems that can be solved with simple linear regression.

Since the same work can be done with different models, the model that shows the least error is usually selected.

To select the algorithm, you must understand both the problem and the available dataset. Since the house price problem is a regression problem, if I apply a clustering (data grouping) algorithm here, the predictions will of course be very bad.

Model training


I will divide the dataset in my hand into two parts: a Training Dataset and a Testing Dataset.

The Testing Dataset must not contain any data from the Training Dataset. If it did, the main purpose of machine learning would be defeated. (The model would learn to memorize data instead of learning anything general, just as we test ourselves on questions we haven't seen before, don't we?)

Model testing


The reason for dividing the dataset into two parts is that the model first has to be taught with the training data. Then inputs are chosen from the Testing Dataset and given to the model to find out how accurate or inaccurate its predicted outputs are.

So I am not giving all the data in my hands to training; some data stays with me so I can verify whether the model I created has actually learned anything at all.
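A minimal hand-rolled split can illustrate the idea, assuming nothing beyond the standard library; real projects usually use a ready-made helper such as scikit-learn's train_test_split, which does the same job:

```python
import random

def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (training, testing)."""
    shuffled = rows[:]                     # copy; leave the original intact
    random.Random(seed).shuffle(shuffled)  # mix the rows reproducibly
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = [f"example-{i}" for i in range(10)]  # ten placeholder data rows
train, test = split_dataset(rows)
print(len(train), len(test))  # 8 2
```

Because every row appears exactly once after shuffling, no testing row can also be a training row, which is exactly the property described above.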

It goes something like this. Let's say I want to teach the ML model I created the multiplication table of 3. Then I would create a dataset like this,

3 X 1 = 3

3 X 2 = 6

3 X 3 = 9

3 X 4 = 12

3 X 5 = 15

3 X 6 = 18

3 X 9 = 27

3 X 10 = 30

Now if I separate the training data and the testing data from here, the datasets would look like this,

Note: There are also dedicated algorithms for splitting training data and testing data out of a dataset. Sophisticated data-selection algorithms help a great deal when the dataset is small. We will see this in detail later.

Training data

3 X 1 = 3

3 X 2 = 6

3 X 3 = 9

3 X 4 = 12

3 X 6 = 18

3 X 9 = 27

3 X 10 = 30

Testing data

3 X 5 = 15

It is clear here that I did not include 3 X 5 in the training data. That means I will train the model with all the data except 3 X 5. Then, giving 3 and 5 as input, we will see whether the model produces an output close to 15. If it does, it means my algorithm selection, data preprocessing, and model training were done properly. If it doesn't, I have to try again, starting from data preprocessing, possibly with a new algorithm.
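The whole loop on this toy data can be sketched with a one-weight model y = w * x fitted by least squares; this is plain Python, not the author's specific tooling:

```python
# Training pairs for the table of 3, with 3 x 5 held out for testing.
train = [(1, 3), (2, 6), (3, 9), (4, 12), (6, 18), (9, 27), (10, 30)]

# Closed-form least-squares slope for a line through the origin:
# w = sum(x*y) / sum(x*x)
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Check the held-out example: the model should answer close to 15.
prediction = w * 5
print(round(prediction, 2))  # 15.0
```

If the prediction lands close to 15, the training worked; a prediction far from 15 would send us back to preprocessing or algorithm selection, exactly as described above.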

Creating a model is an entirely iterative process, meaning you have to do it over and over again. It is not right to assume a model will be perfect on the first attempt.


Workflow Guidelines

Be careful from the very first step, not just the last

You have to think carefully from the beginning when building the model, because each step depends on the previous one. Much like a chain: if you make a mistake, you may have to start all over again.

Therefore, the ultimate target, the dataset in hand, the algorithm selection, everything has to be done with care and thought.

You may need to return to a previous step at any time

Suppose you have a high-quality dataset, but you want the output to be the sum of two numbers, and the dataset doesn't match that goal. Then you have to replace it with a suitable addition dataset and train the model again.

Data must be preprocessed

Raw data will almost never be organized to your liking; it must be preprocessed before model training.

Data preprocessing is usually the most time-consuming step.
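As one small example of such preprocessing, here is min-max scaling of a numeric feature into the 0..1 range, so features measured on very different scales don't dominate training; the house sizes below are made-up numbers for illustration:

```python
sizes = [600, 900, 1200, 1500, 3000]  # house sizes in square feet

# Rescale each value to (value - min) / (max - min), so the smallest
# size becomes 0.0 and the largest becomes 1.0.
lo, hi = min(sizes), max(sizes)
scaled = [(s - lo) / (hi - lo) for s in sizes]

print(scaled)  # [0.0, 0.125, 0.25, 0.375, 1.0]
```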

The more data, the better

In fact, the more data you can feed the model, the better the prediction accuracy tends to be. This is almost self-evident.

A bad solution is worse than no solution

Never settle for a bad solution. If, after many attempts at solving a problem, you do not get the expected performance, then question the other steps:

Am I asking the right question?

Do I have the necessary data to solve the problem?

Is the algorithm I selected correct?

If you do not get satisfactory answers, it is better not to ship the solution at all. A model that gives the correct answer 50% of the time and a wrong answer the other 50% is good for creating confusion, not for solving the problem.

So far so good. In the next episode we will look at the first step of machine learning: "How to ask the right questions?"
