Model Performance Testing - Final Episode [ML]
Model Performance Revision - ROC
Before understanding ROC, you need to know about the confusion matrix; if you don't, read the previous episode first.
To draw an ROC curve in ROC space, FPR (False Positive Rate) goes on the X-axis and TPR (True Positive Rate) goes on the Y-axis.
So if we plot the TPR and FPR obtained from the confusion matrix, we get one point; in this way we get one point for each of the models we train on the same dataset.
Plotting these points and drawing the graph gives you the ROC curve you want.
The TPR of a perfect classifier is 1 and the FPR is 0.
ROC can be understood with an example.
Let's look at a scenario,
I took a diabetes dataset with 1000 observations and split it 80%-20%: 80% of the data is training data and 20% is testing data.
Suppose further that 100 of the 200 testing observations are positive (simply put, the output of those 100 observations is diabetes) and the other 100 are negative.
I built four models; I will train these four models and test their performance. The four models are:
Gaussian Naive Bayes Model
Logistic Regression Model
Random Forest Model
Artificial Neural Network Model
We haven't covered artificial neural networks yet, and that is not a problem for this episode.
I can train each model as in the previous episode and then compute its confusion matrix, right? In the same way I will train each model on the 80% training data and then test its performance on the remaining 20% (i.e., find the confusion matrix).
Also keep in mind that I computed each model's confusion matrix along with its TPR and FPR.
Gaussian Naive Bayes Model
TP = 63
FP = 28
FN = 37
TN = 72
TPR = 0.63
FPR = 0.28
Logistic Regression Model
TP = 77
FP = 77
FN = 23
TN = 23
TPR = 0.77
FPR = 0.77
Random Forest Model
TP = 24
FP = 88
FN = 76
TN = 12
TPR = 0.24
FPR = 0.88
Artificial Neural Network Model
TP = 76
FP = 12
FN = 24
TN = 88
TPR = 0.76
FPR = 0.12
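As a quick sanity check, TPR and FPR come straight from these confusion-matrix counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Here is a minimal sketch (the function is just an illustration, using the Gaussian Naive Bayes numbers above):

def tpr_fpr(tp, fp, fn, tn):
    tpr = tp / (tp + fn)  # true positive rate (recall / sensitivity)
    fpr = fp / (fp + tn)  # false positive rate (1 - specificity)
    return tpr, fpr

# Gaussian Naive Bayes counts from above
print(tpr_fpr(tp=63, fp=28, fn=37, tn=72))  # (0.63, 0.28)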
We already know that for an ROC curve the Y-axis is TPR and the X-axis is FPR, so we can easily place these four points in ROC space.
Coordinates
Coordinate -> Model (X = FPR, Y = TPR)
--------------------------------------
G point -> Gaussian Naive Bayes (0.28, 0.63)
L point -> Logistic Regression (0.77, 0.77)
R point -> Random Forest (0.88, 0.24)
A point -> Artificial Neural Network (0.12, 0.76)
We will now plot these points.
import numpy as np
import matplotlib.pyplot as plt

# [fpr, tpr] for each model
naive_bayes = np.array([0.28, 0.63])
logistic = np.array([0.77, 0.77])
random_forest = np.array([0.88, 0.24])
ann = np.array([0.12, 0.76])

# plotting
plt.scatter(naive_bayes[0], naive_bayes[1], label='Naive Bayes', facecolors='black', edgecolors='orange', s=300)
plt.scatter(logistic[0], logistic[1], label='Logistic Regression', facecolors='orange', edgecolors='orange', s=300)
plt.scatter(random_forest[0], random_forest[1], label='Random Forest', facecolors='blue', edgecolors='black', s=300)
plt.scatter(ann[0], ann[1], label='Artificial Neural Network', facecolors='red', edgecolors='black', s=300)

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc='lower center')
plt.show()
I used a scatter plot here so that each point could be explained separately. If you have many models, or if you plot the performance of the same model as its parameters change, the plotted points will trace out a proper ROC curve.
Since I have only 4 models here, drawing a line through the points would not make things any clearer, which is why I used a scatter plot.
ROC Curve Explanation
A 100% accurate model has FPR = 0 and TPR = 1. Judged against that ideal, it is easy to see that the ANN performed best as a model, then Naive Bayes, then Logistic Regression, and finally Random Forest.
As I said before (and will say again), it will not always be ANN > NB > LR > RF; a model's performance varies with the type of dataset and problem. The numbers here are entirely made up for illustration.
The dashed diagonal line in the middle is called the line of no-discrimination; it is what random guessing would produce. The further a point sits above this line, the better; the further below it, the worse.
AUC or Area Under Curve
You can see the ROC space plotted above. The more of that area an ROC curve covers, the better the model. The AUC of a 100% accurate model is the area of the entire graph, i.e., 1 × 1 = 1.
There are quite a few criticisms of measuring performance with AUC alone, and these days many people prefer to look at the ROC curve itself, so I have not gone into AUC in detail.
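That said, if you do want the full ROC curve and the AUC for a single model, scikit-learn can compute both from predicted probabilities. A minimal sketch, assuming a fitted classifier clf that has a predict_proba method plus test data X_test, y_test (these names are placeholders, not from the original text):

import matplotlib.pyplot as plt
from sklearn import metrics

# probability of the positive class (second column of predict_proba)
y_scores = clf.predict_proba(X_test)[:, 1]

# FPR/TPR pairs over many thresholds, plus the area under that curve
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_scores)
auc = metrics.roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label='ROC (AUC = {0:.3f})'.format(auc))
plt.plot([0, 1], [0, 1], 'k--')  # line of no-discrimination
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()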
Overfitting
As mentioned earlier, sometimes a model performs so well that its accuracy on the training data is about 95-99%. But when you test it on the testing data, the accuracy may not even reach 40%.
The question is: why does this happen?
In fact, the dataset we train with contains noise along with the actual signal. In other words, you will never get a 100% pure dataset.
To take a classic example: suppose I collected a dataset of my exam marks against how many hours I studied and how many hours I slept. Now I sit down to build a model on this dataset, and somehow the model concludes that I get higher marks when I study less. Based on that, before the next exam I go to sleep early and do not study at all (because my home-made AI says I get more marks when I sleep). You can imagine what the result will be.
So what is the reason for this wrong prediction? There are two: (1) not enough data, and (2) too few columns (variables, such as hours studied and hours slept) in the dataset. There could be many other reasons for the marks going up: maybe the exam was multiple choice and I got lucky with my guesses, maybe the questions were easier, and so on. Since I trained the model without these as inputs, the model cannot possibly know about them; not knowing the real reasons, it adapts itself to the given dataset in whatever way makes the error smallest.
Model training means error reduction: a model's parameters are set through mathematical analysis so as to reduce the error, and the parameter values that yield the least error are the ones the model uses (this is normal). But if, in order to reduce the error, the model adapts itself to the noise in the dataset, then we are in real trouble.
We will see more details about overfitting later in a few steps.
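A quick practical way to spot overfitting is simply to compare training accuracy with testing accuracy. Here is a hedged sketch, assuming the X_train, X_test, y_train, y_test split used later in this episode, and using a deliberately unrestricted decision tree (my choice of example, not the author's) as a model that can memorize noise:

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# an unrestricted tree can grow deep enough to memorize the training data, noise included
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train.ravel())

train_acc = metrics.accuracy_score(y_train, tree.predict(X_train))
test_acc = metrics.accuracy_score(y_test, tree.predict(X_test))

# a big gap between these two numbers is the usual symptom of overfitting
print("Training accuracy: {0:.4f}".format(train_acc))
print("Testing accuracy:  {0:.4f}".format(test_acc))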
Reduce overfitting
The most direct things you can do to reduce overfitting are to gather more data and to increase the number of columns (features). The purer the dataset, the better the predictions it can support; that much is straightforward. Beyond that, it is also possible to get better results by tuning the algorithm itself. We will look at one such method.
Regularization & Regularization Hyperparameter
We can control how an algorithm learns. A machine learning algorithm always has some mathematical model working behind it, so the learning behaviour of that mathematical model can be controlled with certain parameters.
Assume that a model learns by minimizing an error, i.e., some measure of the gap between its predictions and the true outputs.
To control its learning, we can create a regularized model by attaching a penalty term to that error (in the usual formulation the penalty is added to the error being minimized):
    regularized error = error + λ × (sum of squared weights)
Here, λ (lambda) is the regularization hyperparameter.
Notably, the regularized model's training accuracy will be slightly lower than before. But that is good! Why? Because this time the model is not memorizing the dataset: the regularization hyperparameter will not let it memorize, and the larger that hyperparameter's value, the more the model's fit is penalized. We can therefore call it a penalized machine learning model.
Whenever the model tries to bend itself around the dataset just to reduce the error, this lambda pushes back with a penalty. Our real task will be to tune lambda so that accuracy on the testing dataset is good; the training-dataset accuracy can take the hit :P
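To make this concrete, here is a minimal numpy sketch of such an L2 (ridge-style) penalty; the function name and numbers are purely illustrative, not from the original text. One detail worth knowing for the next section: in scikit-learn's LogisticRegression the parameter C is the inverse of the regularization strength, so a smaller C means a heavier penalty.

import numpy as np

def penalized_error(y_true, y_pred, weights, lam):
    # ordinary error term (mean squared error, just as an illustration)
    error = np.mean((y_true - y_pred) ** 2)
    # L2 penalty: the larger the weights, the larger the penalty
    penalty = lam * np.sum(weights ** 2)
    return error + penalty

# hypothetical predictions and weights
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
w = np.array([0.5, -1.2, 2.0])

print(penalized_error(y_true, y_pred, w, lam=0.0))  # no regularization
print(penalized_error(y_true, y_pred, w, lam=0.1))  # penalized: larger lambda, larger value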
Increase accuracy through tuning Regularization Hyperparameter in Logistic Regression model
The title of this topic got a bit long. As we learned a moment ago, by adjusting the mathematical model we can reduce overfitting through regularization. The regularization hyperparameter differs from model to model. The scikit-learn library already implements the logistic regression model and provides a convenient interface for changing its regularization hyperparameter.
Our job will be to vary the value of the regularization hyperparameter and collect the prediction score for each value, then keep the hyperparameter value that gives the best score.
We have seen the theory; now it is time for the practical part. Take out your notebook and write the code.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(C=0.7, random_state=42)
lr_model.fit(X_train, y_train.ravel())
lr_predict_test = lr_model.predict(X_test)

# performance metrics on the test set
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))
print("Confusion Matrix")
print(metrics.confusion_matrix(y_test, lr_predict_test, labels=[1, 0]))
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test, labels=[1, 0]))
Output
Accuracy: 0.7446
Confusion Matrix
[[44 36]
[23 128]]
Classification Report
precision recall f1-score support
1 0.66 0.55 0.60 80
0 0.78 0.85 0.81 151
avg / total 0.74 0.74 0.74 231
This is the same thing we did for the Naive Bayes model. Here C is our regularization hyperparameter; we picked a value for it at the start, and we will check the accuracy for different values of it shortly.
Determining the value of C (Regularization Hyperparameter)
C_start = 0.1
C_end = 5
C_inc = 0.1

C_values, recall_scores = [], []

C_val = C_start
best_recall_score = 0
while C_val < C_end:
    C_values.append(C_val)
    lr_model_loop = LogisticRegression(C=C_val, random_state=42)
    lr_model_loop.fit(X_train, y_train.ravel())
    lr_predict_loop_test = lr_model_loop.predict(X_test)
    recall_score = metrics.recall_score(y_test, lr_predict_loop_test)
    recall_scores.append(recall_score)
    if recall_score > best_recall_score:
        best_recall_score = recall_score
        best_lr_predict_test = lr_predict_loop_test
    C_val = C_val + C_inc

best_score_C_val = C_values[recall_scores.index(best_recall_score)]
print("1st max value of {0:.3f} occurred at C = {1:.3f}".format(best_recall_score, best_score_C_val))

%matplotlib inline
plt.plot(C_values, recall_scores, "-")
plt.xlabel("C value")
plt.ylabel("recall score")
Since C is the regularization hyperparameter, and I want to see recall_scores for different values of C (the higher the recall_score the better), I set C_start = 0.1 and C_end = 5, and increased the value of C by 0.1 on each pass of the loop.
For each value of C I checked the predictions on the test set; whenever the recall is higher than the best seen so far, best_recall_score is assigned recall_score, i.e., the current score.
The code is not difficult if you understand the previous issues.
I kept two lists called C_values and recall_scores to store the values.
Output
The graph shows how performance changes as the value of C increases.
The recall score is highest when C is between 2 and 3; it is lower when C is between 0 and 1 or between 4 and 5.
Model performance with class_weight='balanced' and varying C
There is no reason the regularization-related hyperparameter has to be just one; there can be more than one. A little while ago we worked out the value of C. Now we will set another parameter (class_weight) to 'balanced' and see what performance that gives.
The main aim is to measure performance with class_weight='balanced' while again changing the value of C.
C_start = 0.1
C_end = 5
C_inc = 0.1

C_values, recall_scores = [], []

C_val = C_start
best_recall_score = 0
while C_val < C_end:
    C_values.append(C_val)
    lr_model_loop = LogisticRegression(C=C_val, class_weight="balanced", random_state=42)
    lr_model_loop.fit(X_train, y_train.ravel())
    lr_predict_loop_test = lr_model_loop.predict(X_test)
    recall_score = metrics.recall_score(y_test, lr_predict_loop_test)
    recall_scores.append(recall_score)
    if recall_score > best_recall_score:
        best_recall_score = recall_score
        best_lr_predict_test = lr_predict_loop_test
    C_val = C_val + C_inc

best_score_C_val = C_values[recall_scores.index(best_recall_score)]
print("1st max value of {0:.3f} occurred at C = {1:.3f}".format(best_recall_score, best_score_C_val))

%matplotlib inline
plt.plot(C_values, recall_scores, "-")
plt.xlabel("C value")
plt.ylabel("recall score")
Output:
With class_weight='balanced' the recall score has increased to 0.73+, which is exactly what we were looking for!
Confusion matrix
Code:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(class_weight="balanced", C=best_score_C_val, random_state=42)
lr_model.fit(X_train, y_train.ravel())
lr_predict_test = lr_model.predict(X_test)

# performance metrics on the test set
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))
print(metrics.confusion_matrix(y_test, lr_predict_test, labels=[1, 0]))
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test, labels=[1, 0]))
print(metrics.recall_score(y_test, lr_predict_test))
Output:
Accuracy: 0.7143
[[ 59  21]
 [ 45 106]]
Classification Report
precision recall f1-score support
1 0.57 0.74 0.64 80
0 0.83 0.70 0.76 151
avg / total 0.74 0.71 0.72 231
0.7375
Through regularization we can thus increase accuracy (reduce overfitting).
K-Fold / N-Fold Cross-validation
Another effective technique for reducing overfitting is K-Fold cross-validation. The name sounds difficult, but what it does is quite simple.
In our diabetes dataset, the negative answers (no diabetes) are more numerous. K-Fold cross-validation helps produce a more trustworthy accuracy estimate in cases like this, where the dataset is not well balanced.
K-Fold and N-Fold cross-validation are the same idea; they are literally the same thing when K = N, i.e., when K equals the number of observations.
In K-Fold cross-validation, the complete dataset is split into K equal-sized subsamples.
Then each of these K subsamples is taken, one at a time, and used for testing.
For example, suppose I have a dataset of 25 observations and I divide it into 5 groups.
Each group then contains 5 observations. In the first pass I hold out the first of these five groups, send the remaining groups to training, and test with the held-out group.
In the second pass I hold out the second group (it does not go into training) and send the rest to training.
In the same way, in the third, fourth, and fifth passes I hold out the corresponding group and send the rest to training.
So with 5-Fold I train 5 times. Since the dataset is divided into 5 groups, it is called 5-Fold cross-validation.
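Here is a small sketch of that 25-observation, 5-group example using scikit-learn's KFold, just to show which observations get held out in each pass (the data is a dummy array, not the diabetes dataset):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(25, 1)  # 25 dummy observations
kf = KFold(n_splits=5, shuffle=False)

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # in each pass one group of 5 is held out for testing,
    # the remaining 20 observations go into training
    print("Pass", fold, "held-out observations:", test_idx)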
Model training and testing using cross-validation
This kind of cross-validation enabled model is built into scikit-learn: you get the cross-validation enabled version by appending CV to the name of an ordinary model.
For example, the cross-validation enabled version of LogisticRegression is LogisticRegressionCV, and the same pattern holds for several other estimators.
Let's see its performance,
from sklearn.linear_model import LogisticRegressionCV

# n_jobs=-1 uses all cores to parallelize the cross-validation
lr_cv_model = LogisticRegressionCV(n_jobs=-1, random_state=42, Cs=3, cv=10, refit=False, class_weight="balanced")
lr_cv_model.fit(X_train, y_train.ravel())
lr_cv_predict_test = lr_cv_model.predict(X_test)

# performance metrics on the test set
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_cv_predict_test)))
print(metrics.confusion_matrix(y_test, lr_cv_predict_test, labels=[1, 0]))
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_cv_predict_test, labels=[1, 0]))
Output:
Accuracy: 0.7100
[[ 55  25]
 [ 42 109]]
Classification Report
precision recall f1-score support
1 0.57 0.69 0.62 80
0 0.81 0.72 0.76 151
avg / total 0.73 0.71 0.72 231
The performance with 10-fold cross-validation isn't bad at all!
This episode has grown quite long, and Bias-Variance got left out. In the next episode we will see what the Bias-Variance trade-off is and what impact it has.
Scikit-learn Algorithm Cheat Sheet
Scikit-learn has its own cheat sheet for choosing an algorithm based on your dataset. It is very effective.
Topics discussed in this episode
Model Evaluation through Test Data
Result interpretation
Increase accuracy
Confusion matrix
Recall
Precision
AUC (Area Under Curve)
ROC (Receiver Operating Characteristics) - Curve
Overfitting
Model hyper parameters
Overfitting minimization
K-Fold / N-Fold Cross-validation
Bias-Variance Trade Off
Trading a little training-set perfection for better overall performance


