The first step in creating a machine learning model.
Workflow review
Since today's topic is 'Asking the right question' or 'Creating the right question', let's revisit the workflow from the last episode.
[Figure: machine learning workflow]
Problems
To apply general methods, we first need a problem to solve with them. Let's take a look at our selected problem:
Details of the problem
Predict whether a person will be diagnosed with diabetes
Didn't we already have the right question at the start?
Looking at the details of the problem, it may seem that we already have the question we want. So what does it mean to ask a new question? Does that only happen after this problem is solved?
The answer is no. It is a mistake to define a machine learning problem with a single one-line question. Instead, break the question down into smaller, more specific questions and then solve those.
So what will be the questions?
The questions have to be such that solving them actually gives us a fully functional predictive model.
We are building the predictive model around the question, so the question needs to be to the point. We will create the solution based on its criteria.
More useful than a one-line question is a set of statements that define where building the solution starts and ends (e.g., how good the model's prediction success rate must be, and whether we will make predictions without optimizing the model) and how we will reach our goal.
So let's look at the Solution Statement Goals:
Scope & Dataset
Performance Score and Performance Target
Where to use (Context)
How to create a solution
Working through these points one by one will give us the questions we want.
Scope and data source:
The American Diabetes website highlights a number of diabetes risk factors that will help us identify the important input variables for our dataset.
Age
Adults are more likely to develop diabetes.
Race
African-American, Asian-American, American-Indian people are more likely to have diabetes.
Gender
Gender has no effect on diabetes.
Why are the above factors important?
Input variable filtering: The factors show that we should focus more on 'Race' and less on 'Gender'. If we do not have Race as an input variable, the prediction is likely to suffer.
Dataset selection: We need to select a dataset that covers Age and Race (it may contain more).
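As a rough illustration of this selection step, the sketch below checks whether a candidate dataset's columns cover the risk factors we care about. The dataset names and column lists are hypothetical placeholders made up for the example, not real schemas.

```python
# Risk factors we want a dataset to cover (taken from the list above).
REQUIRED_FACTORS = {"age", "race"}


def missing_factors(columns, required=REQUIRED_FACTORS):
    """Return the required factors that are absent from `columns`."""
    available = {name.strip().lower() for name in columns}
    return sorted(required - available)


# Hypothetical candidate datasets; names and columns are invented for illustration.
candidates = {
    "survey_a": ["Age", "Gender", "Glucose"],
    "survey_b": ["Age", "Race", "Glucose", "BMI"],
}

for name, columns in candidates.items():
    gaps = missing_factors(columns)
    status = "usable" if not gaps else f"missing {gaps}"
    print(f"{name}: {status}")
```

A check like this is only a first pass: a dataset collected from a single population covers a factor such as Race through how it was gathered rather than through a column.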
For this work we will select the Pima Indian Diabetes Study from the repository of the University of California Irvine (UCI), because this dataset meets our criteria.
Modified statement
Our modified problem statement after looking at the scope and dataset:
Using the Pima Indian Diabetes Dataset, find out who will be affected by diabetes
Performance Score and Performance Target
However complex the solution to the problem, the output is easy to guess: the answer will come as yes/no, i.e. a binary result (True or False).
Once we build the model, we will get a performance score, which tells us how well the model predicts. But there is a limit to this performance. 100% accuracy is usually not achievable, but we should try to see how close we can get.
So we have to think about accuracy while building the model. There is no guarantee that we will reach the accuracy we want.
For a yes/no prediction, 50% accuracy is effectively the worst case: a model that scores 50% is doing no better than tossing a coin. So we must aim for more than 50% accuracy.
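To make the 50% baseline concrete, here is a minimal sketch of how accuracy is computed for yes/no predictions. The labels are made up for illustration and assume both outcomes are equally common.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one label")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels for illustration: True means "has diabetes".
actual = [True, False, True, False, True, False, True, False]

# A "model" that always answers yes is right only on the positive half.
always_yes = [True] * len(actual)
print(accuracy(actual, always_yes))  # 0.5
```

If the two outcomes are not equally common, always guessing the more common one scores above 50%, so the baseline to beat would be higher.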
50% is an extremely poor performance score for disease prediction. Genetic differences are a big factor here. It has been found that even identical twins have genetic differences. Because of such differences, the risk of developing diabetes varies from person to person.
So 80% accuracy is a fairly reasonable target. With that end point set, we can proceed.
So let's make some more changes to our target statement.
Modified statement
Using the Pima Indian Diabetes Dataset, find out who has diabetes with 80% or more accuracy.
Context field
Our problem is a medical one, so we need to bring in the relevant context. This will make the solution better.
Everyone is genetically different, so many factors, known and unknown, are at work. Even when two people appear to share the same risk factors, one may develop diabetes and the other may not.
This is where a hedging word such as "likely" comes in: we cannot be 100% sure whether a person will develop diabetes at all.
If we add "likely", that is, probability, to our statement, it changes to:
Modified statement
Using the Pima Indian Diabetes Dataset, find out who is likely to get diabetes with 80% or more accuracy
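The idea of saying "likely" rather than "will" can be sketched as turning a predicted probability into a hedged yes/no statement. The function name and the 0.5 cut-off are assumptions chosen for illustration, not something fixed by the problem statement.

```python
def describe_risk(probability, threshold=0.5):
    """Turn a predicted probability into a hedged yes/no statement.

    `threshold` is an assumed cut-off chosen for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    label = "likely" if probability >= threshold else "unlikely"
    return f"{label} to develop diabetes (estimated probability {probability:.0%})"


print(describe_risk(0.82))
print(describe_risk(0.10))
```

Note that this probability describes one person, while the accuracy target in the statement describes how often the model is right across many people.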
Creating solutions
Our statement does not yet mention the machine learning procedure. Bringing in the machine learning workflow will give us a better idea of how to create the solution.
Machine Learning Workflow:
Pima Indian data preprocessing
Data transformation (if required)
Statement changed again
Modified statement
Using the machine learning workflow, preprocess the Pima Indian data and apply any required transformation to create a predictive model.
This model must predict, with an accuracy of 80% or more, who is likely to develop diabetes.
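The shape of this statement can be sketched in a few lines. This is only an outline under stated assumptions: the feature rows are made up (not taken from the Pima data), dropping incomplete rows and min-max scaling are just one possible choice of preprocessing and transformation, and the model-training step itself is left out.

```python
def preprocess(rows):
    """Drop rows that contain a missing value (None)."""
    return [row for row in rows if None not in row]


def transform(rows):
    """Min-max scale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = (high - low) or 1  # avoid dividing by zero for constant columns
        scaled_columns.append([(value - low) / span for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def meets_target(accuracy, target=0.80):
    """Check a measured accuracy against the target from the statement."""
    return accuracy >= target


# Made-up feature rows (age, glucose), not taken from the Pima data.
raw_rows = [(50, 148), (31, None), (21, 89), (33, 137)]

clean_rows = preprocess(raw_rows)   # the row with a missing value is dropped
features = transform(clean_rows)    # each column now lies between 0 and 1
print(features)
print(meets_target(0.83))  # True
print(meets_target(0.74))  # False
```

Keeping the target check as its own function makes the end point of the work explicit: the model is good enough only when its measured accuracy passes this test.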
Final statements and questions
Which dataset to use? - Pima Indian dataset
What is the target performance? - 80%
How to create a solution? - Use the machine learning workflow to preprocess and transform the data, create a predictive model, and then predict with it.
As the discussion shows, by examining one small one-line question from several angles we arrived at answers to several important questions, and with those we can solve the problem much more easily. We will solve this problem using this workflow in later episodes.


