Preparing data (data preprocessing) - 2
Changing the data frame
Most of the time, datasets contain missing data, and we have to handle it. Yes, we may get lucky and have no missing data at all, but if we do not take the necessary action, the program may crash.
Which columns should be omitted?
- Columns that will not be used
- Columns that exist but contain no data
- Duplicate columns: if the same column appears more than once, keep one and delete the rest
Often two columns may look different but actually contain the same thing. For example, one column says Length (meter) and another says Size (centimeter); at first glance they look like two different things because one label says Size and the other says Length. But as you can see, multiplying each Length value by 100 gives exactly the Size data. Spotting this kind of duplication by hand calculation is not feasible, and it is not an efficient method anyway. These extra columns only generate noise in the dataset. We will find such similar columns through statistical analysis (here, correlation).
What is a correlated column?
If the same data appears in a slightly different form, the columns are correlated. Length and Size in the example above are actually the same thing, only the unit differs; therefore they are correlated columns. Correlated columns:
- add little or no new information, and
- confuse learning algorithms.
A few words about linear regression
To understand the next example, we need some basics of linear regression.
Consider the following hypothetical dataset,
| House Size (sq ft) | Price (Tk in lac) |
|---|---|
| 1 | 5 |
| 2 | 10 |
| 3 | 15 |
| 4 | 20 |
[Graph: Price plotted against House Size; the points fall on a straight line]
If you are asked what the price of a 5 sq ft house will be, you can answer without hesitation: 25 lac.
How did you know?
Very simple: for every 1 sq ft increase, the price increases by 5 lac.
If we want to build a mathematical model, it will look a lot like this:

$y = 5x$

Or,

$f(x) = 5x$

where $y$ is the price, $x$ is the size, the coefficient is 5, and the function $f$ tells what the price will be for any given size.
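As code, this toy model is nothing more than the following (a trivial sketch, just to make the idea concrete):

```python
def price(size_sqft):
    # 5 lac per square foot, read off the table above
    return 5 * size_sqft

print(price(5))  # 25
```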
In reality the model is not so simple; there are many wrinkles. You multiply this feature by some alpha, that one by a beta, gamma, theta and whatnot, add everything up, and even then you may only get a value close to the truth, not exact.
Let us see the following dataset,
| House Size (sq ft) | No of rooms | Price (Tk in lac) |
|---|---|---|
| 1 | 3 | 10 |
| 2 | 3 | 12 |
| 3 | 4 | 14 |
| 4 | 4 | 17 |
| 5 | 5 | 22 |
[Graph: Price plotted against House Size and No of rooms]
Now if I ask you: if the size of the house is 6 sq ft, what will the price be? This time you will be in a lot of trouble, because the price no longer rises by a fixed amount per extra square foot. Taking the previous difference and adding it on will not give the next price; the problem is no longer that simple. The reason is the newly added feature: No of rooms.
Now if I were asked to build a mathematical model of this, I would also be in trouble. What single linear equation takes the inputs 1, 2, ..., 5 and gives 10, 12, ..., 22 respectively?
I cannot build an exact model, but maybe I can build an approximate model whose equation will look a lot like this:

$price \approx \alpha \cdot size + \beta \cdot rooms + b$
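To see what such an approximate model looks like in practice, here is a minimal sketch (it uses NumPy's least squares; this is an addition, not part of the original tutorial) fitted to the toy dataset above:

```python
import numpy as np

# Toy dataset: (size in sq ft, number of rooms) -> price in lac
X = np.array([[1, 3], [2, 3], [3, 4], [4, 4], [5, 5]], dtype=float)
y = np.array([10, 12, 14, 17, 22], dtype=float)

# Append a column of ones for the intercept, then solve least squares
X_b = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(coef)  # approximately [alpha, beta, b] of the model above
```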
Example of Correlated Column
Let's go back to the famous problem of House Price Prediction.
| House Area (Acre) | Size (kilo sq meter, approx.) | No of rooms | Price (Tk in lac) |
|---|---|---|---|
| 1 | 4 | 3 | 10 |
| 2 | 8 | 4 | 12 |
| 3 | 12 | 4 | 16 |
Without examining the dataset well, I sat down to predict the Price column with the linear regression formula, which has this shape:

$price = \theta_0 + \theta_1 \cdot area + \theta_2 \cdot size + \theta_3 \cdot rooms$
We saw in the case of linear regression that we multiply each feature (input variable) by a coefficient and then add them up to predict the output. Feeding in what is effectively the same column twice, as Area and Size here, will never help get the output price right.
It is easy to see that the two columns are the same here, because I made the example up :P. Jokes apart: when there are lots of columns, all with different names and different-looking data, but some are really just unit-converted copies of each other, finding them by inspection becomes very complicated. So we will take the help of an important topic from statistics: correlation.
Pearson's Correlation Co-efficient or Pearson's r
Calling the correlation function (corr()) of the Pandas library calculates the correlation according to the following formula. There will be a more detailed discussion of correlation later; be happy with this formula for now.

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
In this formula, $x$ is one variable, $y$ is another variable (isn't that too obvious?), and $\bar{x}$ and $\bar{y}$ are their respective means.
We need to find the value of $r$. From the value of $r$ we can understand how strongly the two variables are related. If $r = 1$, the two variables are perfectly linearly related, which is why the correlation of any variable with itself is always 1.
For further confirmation: if you calculate the correlation coefficient between Acre and kilo sq meter in the example above, you get $r = 1$.
Proof:
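Rather than grinding through the formula by hand, here is a minimal sketch that verifies the claim with Pandas (the column names are invented for this toy example):

```python
import pandas as pd

# The toy house-price dataset from above
df = pd.DataFrame({
    'area_acre': [1, 2, 3],        # House Area (Acre)
    'size_kilo_sqm': [4, 8, 12],   # Size (kilo sq meter), roughly area * 4
})

# Pearson's r between the two columns
r = df['area_acre'].corr(df['size_kilo_sqm'])
print(r)  # 1.0 -- perfectly correlated
```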
Now let's see how to find out whether any data is missing from any column of the dataset.
Find the null or blank part of the dataset
Open the notebook you created earlier and enter the following code,
print(data_frame.isnull().values.any())
How does isnull().values.any() work?
1. isnull() returns a dataframe of the same shape, but instead of the values, every empty cell is replaced by True and every non-empty cell by False.
2. .values turns that True/False dataframe into a NumPy array.
3. any() checks whether any value in the array is True, that is, whether any cell is empty.
There is no blank data in the pima-data.csv file, so running this statement prints False.
Deliberately deleting a cell and running data_frame.isnull().values.any() again
Here I intentionally deleted a cell of the pima-data.csv file, loaded it again with Pandas, and ran the code again.
It turns out that the output is now True, so some cell or other is empty.
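To reproduce this yourself, a minimal sketch (assuming pima-data.csv sits next to the notebook, as in the earlier episode):

```python
import pandas as pd

# Reload the file after deleting one cell by hand
data_frame = pd.read_csv('pima-data.csv')

# True means at least one cell is empty
print(data_frame.isnull().values.any())
```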
Creating Correlation Matrix Heatmap
So far we have learned a little about correlation, and we have seen how to find out whether any null values are hiding in the dataset. Now we will see how to generate a correlation matrix heatmap. Before that, let's talk about what a heatmap is.
Heatmap
According to Wikipedia,
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
That is, we generate a plot by replacing each numerical value with a color. That is a heatmap.
A correlation heatmap, therefore, is a plot where the correlation values are replaced with colors.
Correlation Heatmap
We have seen how to calculate the correlation between two variables.
Ask yourself: is it easier to compare a pile of floating-point values, or to compare colors? Of course colors are easier to compare.
All we have to do is take each variable and find its correlation with every variable (including itself). To do this we arrange the variables row- and column-wise:
|  | num_preg | glucose_conc | diastolic_bp | thickness | insulin | bmi | age |
|---|---|---|---|---|---|---|---|
| num_preg | 1 | corr_value | corr_value | corr_value | corr_value | corr_value | corr_value |
| glucose_conc | corr_value | 1 | corr_value | corr_value | corr_value | corr_value | corr_value |
| diastolic_bp | corr_value | corr_value | 1 | corr_value | corr_value | corr_value | corr_value |
| thickness | corr_value | corr_value | corr_value | 1 | corr_value | corr_value | corr_value |
| insulin | corr_value | corr_value | corr_value | corr_value | 1 | corr_value | corr_value |
| bmi | corr_value | corr_value | corr_value | corr_value | corr_value | 1 | corr_value |
| age | corr_value | corr_value | corr_value | corr_value | corr_value | corr_value | 1 |
As mentioned earlier, the correlation of a variable with itself is always 1, so every value along the diagonal of the table must be 1. And corr_value stands for whatever value the correlation between one variable and another turns out to be; since we will compute these values with the library, there is no need to calculate them by hand.
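For the curious, Pandas computes this entire matrix in one call; we will use exactly this inside the heatmap function below:

```python
# Pearson correlation between every pair of columns
correlation = data_frame.corr()
print(correlation)
```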
The important thing now is to choose the colors of the heatmap. No worries, for now we can work with the built-in color maps of the Matplotlib library. If you want, you can look at the documentation and pick a color map of your choice. For now, we will use the default.
Matplotlib Heat Map Color Guide
Matplotlib will set the color according to the following sequence when generating the heatmap.
Less Correlated to More Correlated
Blue -> Cyan -> Yellow -> Red -> Dark Red (Correlation 1)
Function to generate heatmap
Let's write a function to quickly generate a heatmap. The function will look like this:
import matplotlib.pyplot as plt

# Here size means the plot size (in inches)
def corr_heatmap(data_frame, size=11):
    # Getting the correlation matrix using Pandas
    correlation = data_frame.corr()
    # Dividing the plot into subplots so we can control the plot size
    fig, heatmap = plt.subplots(figsize=(size, size))
    # Plotting the correlation matrix as a heatmap
    heatmap.matshow(correlation)
    # Adding xticks and yticks labeled with the column names
    plt.xticks(range(len(correlation.columns)), correlation.columns)
    plt.yticks(range(len(correlation.columns)), correlation.columns)
    # Displaying the graph
    plt.show()
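One aside: the blue-to-dark-red sequence described above is the classic 'jet' colormap, which old Matplotlib versions used by default. Newer versions default to 'viridis', so if your colors look different, you can pass the colormap explicitly (an optional tweak, not part of the function above):

```python
# Inside corr_heatmap, force the classic blue -> cyan -> yellow -> red map
heatmap.matshow(correlation, cmap='jet')
```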
Why did I use subplot?
If I wished, I could have generated the heatmap with plt.matshow(correlation), but then I could not control the size of the figure. By assigning the plot to a subplot with a figsize, I can generate a plot of any convenient size.
What are xticks and yticks?
plt.xticks(range(len(correlation.columns)), correlation.columns) means: place a tick at each of the positions 0, 1, 2, ..., len(correlation.columns) - 1, so each block is 1 unit wide, and label each tick with the corresponding column name from correlation.columns.
The same applies to plt.yticks.
What does plt.show() do?
U kiddin' bro?
Plotting the heatmap with the corr_heatmap(data_frame, size) function
I wrote the function with such difficulty, and you won't use it? We can easily plot the heatmap with the following code snippet:
corr_heatmap(data_frame)
[Figure: a close view of the generated heatmap]
Notable
We have already seen that if two variables are the same, their correlation will be 1. Along the diagonal each variable is correlated with itself, so the diagonal blocks are dark red.
But look carefully: the correlation between skin and thickness is also 1 (dark red).
So skin and thickness are actually the same thing; only the unit differs. Can't believe it?
Do one thing: multiply each value of thickness by 0.0393701 and you will see that you get the values of skin. Since 1 millimeter = 0.0393701 inch, can you now tell which column is in which unit?
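A quick way to convince yourself, using the thickness and skin columns of our dataframe:

```python
# thickness is in millimeters; converting to inches should reproduce skin
print((data_frame['thickness'] * 0.0393701).head())
print(data_frame['skin'].head())
```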
Got the culprit, now for dataset cleaning
From the work above it is clear that we have two columns carrying the same information. One property of Tidy Data is that each column must be unique. So we keep one of the duplicates and remove the rest from the dataset.
I will delete the skin column here; you can delete skin or thickness, whichever you want. It is completely your choice.
# Deleting the 'skin' column completely
del data_frame['skin']
# Checking whether the action was successful
data_frame.head()
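If you prefer the Pandas idiom over Python's del, this alternative has the same effect:

```python
# drop() returns a new dataframe without the 'skin' column
data_frame = data_frame.drop('skin', axis=1)
```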
We were able to drop the duplicate column. But the work is not finished yet; the data still has to be molded. No worries, this is the last step of data preparation. So cheers!
Data Molding
Data type adjustment
Our dataset should be in a form that is suitable for all the algorithms we want to work with. Otherwise we would have to tweak the data separately for each algorithm, which is quite a hassle. So we will take the trouble once and for all, so that it does not keep causing headaches.
Data type checking
Let's check the datatypes once before data molding.
data_frame.head()
Run it and you will see some samples of the dataframe again. Notice that all the values here are float or integer type, but one column is still Boolean.
Data type changing
We will turn True into 1 and False into 0. This can be done with the following code snippet:
# Mapping the values
map_diabetes = {True: 1, False: 0}
# Applying the map to the data_frame
data_frame['diabetes'] = data_frame['diabetes'].map(map_diabetes)
# Let's see what we have done
data_frame.head()
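An equivalent one-liner, if you prefer; Booleans cast directly to integers:

```python
# True becomes 1, False becomes 0
data_frame['diabetes'] = data_frame['diabetes'].astype(int)
```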
Congratulations!
With this molded and clean dataset we can work on the algorithm we want.
But?
Data Rule # 3
Rare events are less likely to be predicted with high accuracy
This is natural: a rare event means there are few such examples in your dataset, and the fewer examples of an event you have, the worse its prediction will be. But it is better not to worry about that now. First get the ordinary predictions right, then deal with the rare events.
Some more analysis.
Checking True / False Ratio
If we want to see what percentage of people in this dataset are diabetic and what percentage are not, let's take out the notebook and write the code right away.
num_true = 0.0
num_false = 0.0

for item in data_frame['diabetes']:
    if item == True:
        num_true += 1
    else:
        num_false += 1

percent_true = (num_true / (num_true + num_false)) * 100
percent_false = (num_false / (num_true + num_false)) * 100

print("Number of True Cases: {0} ({1:2.2f}%)".format(num_true, percent_true))
print("Number of False Cases: {0} ({1:2.2f}%)".format(num_false, percent_false))
Output:
Number of True Cases: 268.0 (34.90%)
Number of False Cases: 500.0 (65.10%)
Actually, we can write the same thing in a Pythonic way in four lines.
# Pythonic Way
num_true = len(data_frame.loc[data_frame['diabetes'] == True])
num_false = len(data_frame.loc[data_frame['diabetes'] == False])
print("Number of True Cases: {0} ({1:2.2f}%)".format(num_true, (num_true / (num_true + num_false)) * 100))
print("Number of False Cases: {0} ({1:2.2f}%)".format(num_false, (num_false / (num_true + num_false)) * 100))
Data Rule # 4
Keep and check data manipulation history regularly
There are ways to do this:
- Keep the steps recorded in a Jupyter Notebook
- Use a version control system such as Git or SVN, with hosting services like BitBucket, GitHub, or GitLab
Summary
What did I do in these two episodes?
- Read the data with Pandas
- Got the idea of correlation
- Deleted the duplicate column
- Molded the data
- Checked the True/False ratio


