Input variable or feature selection [Machine Learning]
Feature selection, and feature engineering more broadly, is a huge topic in data science. As I said before, a dataset often contains features that carry no useful signal, and the prediction gets better when those are left out. Selecting the useful features and trimming away the useless variables is the core of feature engineering.
We already did some feature engineering during dataset cleaning, when we dropped the skin column because of its near-perfect correlation with thickness.
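As a quick reminder, here is a minimal sketch of that correlation check; the file name pima-data.csv and the skin column are assumptions based on the earlier cleaning step:

import pandas as pd

# load the raw dataset (file name assumed from the cleaning step)
data_frame = pd.read_csv("pima-data.csv")

# inspect pairwise correlations; skin and thickness correlate almost perfectly
print(data_frame.corr())

# drop the redundant column
del data_frame['skin']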
Features selected from the Pima Indian Diabetes Dataset:
No. of Pregnancies
Glucose Concentration
Blood Pressure
Skin Fold Thickness
Insulin
Body Mass Index
Diabetes Pedigree Function
Age
Model training using Scikit-learn
Finally, after reading through the theory and building up the background knowledge, we sit down to write code. Get ready: model training starts now.
Remember what needs to happen at the very beginning of training? If not, here is a quick reminder.
Data splitting
We will split the data 70-30 with the code below: 70% becomes training data, the rest becomes testing data. Open your Jupyter notebook and continue in the code you wrote earlier.
from sklearn.model_selection import train_test_split

feature_column_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness', 'insulin', 'bmi', 'diab_pred', 'age']
predicted_class_name = ['diabetes']

# Getting feature variable values
X = data_frame[feature_column_names].values
y = data_frame[predicted_class_name].values

# Saving 30% for testing
split_test_size = 0.30

# Splitting using scikit-learn train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_test_size, random_state=42)
Setting random_state=42 guarantees that the split lands in exactly the same place every time the program is run.
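To see that reproducibility in action, here is a tiny sketch (the names X, y, and split_test_size come from the code above):

# splitting twice with the same random_state yields identical rows
X_a, _, _, _ = train_test_split(X, y, test_size=split_test_size, random_state=42)
X_b, _, _, _ = train_test_split(X, y, test_size=split_test_size, random_state=42)
print((X_a == X_b).all())  # prints True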
Is the split really 70-30? Let's check.
print ("{0: 0.2f}% in training set" .format ((len (X_train) / len (data_frame.index)) * 100))
print ("{0: 0.2f}% in test set" .format ((len (X_test) / len (data_frame.index)) * 100))
Output:
69.92% in training set
30.08% in test set
Close! The split cannot land on exactly 70-30 because 30% of 768 rows is 230.4, and train_test_split rounds to whole rows: 231 rows (30.08%) go to the test set and 537 rows (69.92%) to the training set.
Is there any missing data? (0 value, not null value)
Sometimes a column holds a range of values, but on inspection you find many entries that are 0 where a 0 is physically impossible (nobody has a blood pressure of zero). How do we deal with that? There is a technique for replacing those 0 values with a sensible average so the data becomes usable. Before we get to it, let's see how many of our values actually are 0.
print ("# rows in dataframe {0}". format (len (data_frame)))
print ("# rows missing glucose_conc: 0}". format (len (data_frame.loc [data_frame ['glucose_conc'] == 0])))
print ("# rows missing diastolic_bp: 0}". format (len (data_frame.loc [data_frame ['diastolic_bp'] == 0])))
print ("# rows missing thickness: 0}". format (len (data_frame.loc [data_frame ['thickness'] == 0])))
print ("# rows missing insulin: 0}". format (len (data_frame.loc [data_frame ['insulin'] == 0])))
print ("# rows missing bmi: 0}". format (len (data_frame.loc [data_frame ['bmi'] == 0])))
print ("# rows missing diab_pred: 0}". format (len (data_frame.loc [data_frame ['diab_pred'] == 0])))
print ("# rows missing age: 0". format (len (data_frame.loc [data_frame ['age'] == 0])))
Output:
# rows in dataframe 768
# rows missing glucose_conc: 5
# rows missing diastolic_bp: 35
# rows missing thickness: 227
# rows missing insulin: 374
# rows missing bmi: 11
# rows missing diab_pred: 0
# rows missing age: 0
Imputation is the technique of replacing a missing value (here, a 0) with a substitute value. Scikit-learn already ships ready-made code for this, and we will use it for now.
from sklearn.impute import SimpleImputer  # the old sklearn.preprocessing.Imputer has been removed; SimpleImputer is its replacement

# Impute with mean all 0 readings
fill_0 = SimpleImputer(missing_values=0, strategy="mean")

X_train = fill_0.fit_transform(X_train)  # learn the column means from the training data
X_test = fill_0.transform(X_test)        # reuse those means; re-fitting on the test set would leak information
Here fill_0 is itself a kind of model, whose job is to replace 0 values with a sensible value via the mean strategy. We fit it on the training data and apply the same learned means to the test data, so we train with the modified training values and evaluate with the modified test values.
Why not run the imputer on y_train or y_test?
Simple: because there is no missing data in the labels.
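A quick sanity check, just to convince ourselves the imputer did its job (a sketch; note that it also counts legitimate zeros such as num_preg = 0, which the mean strategy has replaced as well):

import numpy as np

# sanity check: after imputation no zeros should remain anywhere in X_train
print(np.count_nonzero(X_train == 0))  # prints 0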
Model training
At last, we train the model by calling that magical function.
from sklearn.naive_bayes import GaussianNB
# create Gaussian Naive Bayes model object and train it with the data
nb_model = GaussianNB()
nb_model.fit(X_train, y_train.ravel())
We have already discussed and decided that our algorithm of choice is Naive Bayes, and Gaussian Naive Bayes is one model from that family. We created an empty model object, then trained it by calling the fit() function with the training data.
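The trained model can now make predictions. Here is a minimal usage sketch on the held-out test rows; evaluating how good these predictions are comes next:

# predict the class (0 = no diabetes, 1 = diabetes) for the test rows
prediction = nb_model.predict(X_test)
print(prediction[:10])  # first ten predicted labels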
