Input variable or feature selection [Machine Learning]

Feature selection (and, more broadly, feature engineering) is a huge topic in data science. As mentioned before, datasets often contain features that are useless, and predictions improve when they are left out. Trimming these useless variables while keeping the useful ones is part of feature engineering.



We already did some feature engineering while cleaning the dataset: we dropped the skin column because of its correlation with thickness, which made it redundant.
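A minimal sketch of that kind of correlation-based trimming, using a toy frame (not the real dataset) where a `skin` column is just `thickness` rescaled:

```python
import pandas as pd

# Toy frame where 'skin' is 'thickness' in different units,
# mirroring the redundant column we dropped earlier.
df = pd.DataFrame({
    "thickness": [35, 29, 0, 23],
    "skin": [1.3790, 1.1426, 0.0, 0.9062],  # thickness * 0.0394
    "glucose_conc": [148, 85, 183, 89],
})

# A correlation of 1.0 means one column carries no extra information.
corr = df["thickness"].corr(df["skin"])
print(round(corr, 4))  # 1.0

# Drop the redundant column before training.
df = df.drop(columns=["skin"])
```

Dropping a perfectly correlated column loses nothing: the model can recover the same signal from the column that remains.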

Features selected in Pima Indian Diabetes Dataset:

No of Pregnancies

Glucose Concentration

Blood Pressure

Thickness

Insulin

Body Mass Index

Diabetes Pedigree Function

Age

Model training using Scikit-learn

Finally, after reading all that theory and gaining some background, we sit down to code. Get ready: we will start model training now.

Remember what to do at the beginning of training? If not, here's a refresher.

Data splitting

We will split the data 70-30 with the code below: 70% becomes training data, the rest testing data. Open your Jupyter notebook and add this to the code you wrote earlier.

from sklearn.model_selection import train_test_split

feature_column_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness', 'insulin', 'bmi', 'diab_pred', 'age']

predicted_class_name = ['diabetes']

# Getting feature variable values
X = data_frame[feature_column_names].values
y = data_frame[predicted_class_name].values

# Saving 30% for testing
split_test_size = 0.30

# Splitting using scikit-learn train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_test_size, random_state=42)

random_state=42 guarantees that the split is identical every time the program runs.
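You can verify that reproducibility yourself with a tiny sketch (toy data, not the diabetes frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)

# Same random_state -> identical split on every call.
a_train, _, _, _ = train_test_split(X, X, test_size=0.3, random_state=42)
b_train, _, _, _ = train_test_split(X, X, test_size=0.3, random_state=42)
print((a_train == b_train).all())  # True
```

Omit random_state and the two calls would generally shuffle the rows differently.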

Is the split really 70-30? Let's check.

print("{0:0.2f}% in training set".format((len(X_train) / len(data_frame.index)) * 100))
print("{0:0.2f}% in test set".format((len(X_test) / len(data_frame.index)) * 100))

Output:

69.92% in training set

30.08% in test set

Close!

Is there any missing data? (0 values, not null values)

Sometimes a column holds many 0 values even though 0 is physically impossible there (blood pressure, for example). How do we deal with that? There is a technique that replaces 0 values with an average value and puts the data into a usable state. Before we get to that, let's see how many of our values actually are 0!

print("# rows in dataframe {0}".format(len(data_frame)))
print("# rows missing glucose_conc: {0}".format(len(data_frame.loc[data_frame['glucose_conc'] == 0])))
print("# rows missing diastolic_bp: {0}".format(len(data_frame.loc[data_frame['diastolic_bp'] == 0])))
print("# rows missing thickness: {0}".format(len(data_frame.loc[data_frame['thickness'] == 0])))
print("# rows missing insulin: {0}".format(len(data_frame.loc[data_frame['insulin'] == 0])))
print("# rows missing bmi: {0}".format(len(data_frame.loc[data_frame['bmi'] == 0])))
print("# rows missing diab_pred: {0}".format(len(data_frame.loc[data_frame['diab_pred'] == 0])))
print("# rows missing age: {0}".format(len(data_frame.loc[data_frame['age'] == 0])))

Output:

# rows in dataframe 768

# rows missing glucose_conc: 5

# rows missing diastolic_bp: 35

# rows missing thickness: 227

# rows missing insulin: 374

# rows missing bmi: 11

# rows missing diab_pred: 0

# rows missing age: 0
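The repetitive print calls above can be condensed into a single loop. A sketch with a toy stand-in frame (the real `data_frame` has 768 rows):

```python
import pandas as pd

# Toy stand-in for data_frame, just to show the loop.
df = pd.DataFrame({
    "glucose_conc": [148, 0, 183, 0],
    "insulin": [0, 0, 0, 94],
    "age": [50, 31, 32, 21],
})

# Count 0 values per column without repeating the print call.
for col in ["glucose_conc", "insulin", "age"]:
    n_zero = (df[col] == 0).sum()
    print("# rows missing {0}: {1}".format(col, n_zero))
```

`(df[col] == 0).sum()` works because the boolean comparison yields True/False per row, and summing counts the Trues.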

Imputation is a technique for replacing missing values with substitute values. Scikit-learn already ships ready-made code for imputation, and we will use it as-is for now.

from sklearn.preprocessing import Imputer

# Impute with mean all 0 readings
fill_0 = Imputer(missing_values=0, strategy="mean", axis=0)

X_train = fill_0.fit_transform(X_train)
X_test = fill_0.transform(X_test)

Here fill_0 is itself a kind of model, whose job is to replace 0 values with a sensible number via the mean strategy. We train with the transformed training values and test with the transformed test values.

Why not use the Imputer on y_train or y_test?

Simple: there is no missing data there.
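To see what the mean strategy actually does, here is a minimal sketch on toy data. Note this uses SimpleImputer from sklearn.impute, the replacement for the older Imputer class in newer scikit-learn versions; the behaviour shown is the same:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with 0s standing in for missing readings.
X_train = np.array([[1.0, 0.0],
                    [3.0, 4.0],
                    [0.0, 8.0]])

# Treat 0 as missing and replace it with each column's mean
# of the non-missing values.
fill_0 = SimpleImputer(missing_values=0, strategy="mean")
X_train_filled = fill_0.fit_transform(X_train)
print(X_train_filled)
# column 0: mean of [1, 3] = 2 replaces the 0
# column 1: mean of [4, 8] = 6 replaces the 0
```

The mean is computed only from the values that are not marked missing, which is exactly why a column full of real readings passes through unchanged.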

Model training

Finally, we train the model by calling that magical function.

from sklearn.naive_bayes import GaussianNB

# create Gaussian Naive Bayes model object and train it with the data
nb_model = GaussianNB()

nb_model.fit(X_train, y_train.ravel())

We have already discussed and decided that our algorithm of choice would be Naive Bayes, and Gaussian Naive Bayes is one model of that algorithm. We created an empty model object, then trained it by calling the fit() function with the training data.
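Once fit() has run, the model can make predictions with predict(). A self-contained sketch on made-up, well-separated toy features (not the real Pima data) to show the fit/predict cycle:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up feature matrix: two clearly separated clusters.
X_train = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 9.0], [7.9, 9.2]])
y_train = np.array([0, 0, 1, 1])

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# predict() returns one class label per input row.
print(nb_model.predict([[1.1, 2.1], [8.1, 9.1]]))  # [0 1]
```

The same call on our trained diabetes model, `nb_model.predict(X_test)`, is what we will use when evaluating how well the training went.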
