Preparing data (data preprocessing) - 2
Changing the data frame
Most of the time, datasets contain missing data, and we have to handle it. Yes, we may get lucky and have no missing data at all, but if we do not take the necessary action, the program may crash.
Which columns should be omitted?
- Columns that will not be used
- Columns that exist but contain no data
- Duplicate columns: if the same column appears more than once, keep one and delete the rest
Often two columns may look different but actually contain the same thing. For example, one column says Length (meter) and another says Size (centimeter); at first glance they look like two different things because one label says Size and the other says Length. But as you can see, multiplying each Length value by 100 gives exactly the Size data. Spotting this kind of duplication by hand calculation is not feasible, and it is not an efficient method anyway. These extra columns only generate noise in the dataset. We will find such similar columns through statistical analysis (here, correlation).
What is a correlated column?
If the same data appears in a slightly different form, the columns are correlated. Length and Size in the example above are actually the same thing, only the unit differs; therefore they are correlated columns. Correlated columns:
- add little or no new information, and
- confuse learning algorithms.
A few words about linear regression
To understand the next example, we need some basics of linear regression.
Consider the following hypothetical dataset,
| House Size (sq ft) | Price (Tk in lac) |
|---|---|
| 1 | 5 |
| 2 | 10 |
| 3 | 15 |
| 4 | 20 |
[Graph: Price plotted against House Size; the points fall on a straight line]
If you are asked what the price of a 5 sq ft house will be, you can answer without hesitation: 25 lac.
How did you know?
Very simple: for every 1 sq ft increase, the price increases by 5 lac.
If we want to build a mathematical model, it will look a lot like this:

$y = 5x$

Or,

$f(x) = 5x$

where $y$ is the price, $x$ is the size, the coefficient is 5, and the function $f$ tells what the price will be for any given size.
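As code, this toy model is nothing more than the following (a trivial sketch, just to make the idea concrete):

```python
def price(size_sqft):
    # 5 lac per square foot, read off the table above
    return 5 * size_sqft

print(price(5))  # 25
```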
In reality the model is not so simple; there are many wrinkles. You multiply this feature by some alpha, that one by a beta, gamma, theta and whatnot, add everything up, and even then you may only get a value close to the truth, not exact.
Let us see the following dataset,
| House Size (sq ft) | No of rooms | Price (Tk in lac) |
|---|---|---|
| 1 | 3 | 10 |
| 2 | 3 | 12 |
| 3 | 4 | 14 |
| 4 | 4 | 17 |
| 5 | 5 | 22 |
[Graph: Price plotted against House Size and No of rooms]
Now if I ask you: if the size of the house is 6 sq ft, what will the price be? This time you will be in a lot of trouble, because the price no longer rises by a fixed amount per extra square foot. Taking the previous difference and adding it on will not give the next price; the problem is no longer that simple. The reason is the newly added feature: No of rooms.
Now if I were asked to build a mathematical model of this, I would also be in trouble. What single linear equation takes the inputs 1, 2, ..., 5 and gives 10, 12, ..., 22 respectively?
I cannot build an exact model, but maybe I can build an approximate model whose equation will look a lot like this:

$price \approx \alpha \cdot size + \beta \cdot rooms + b$
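To see what such an approximate model looks like in practice, here is a minimal sketch (it uses NumPy's least squares; this is an addition, not part of the original tutorial) fitted to the toy dataset above:

```python
import numpy as np

# Toy dataset: (size in sq ft, number of rooms) -> price in lac
X = np.array([[1, 3], [2, 3], [3, 4], [4, 4], [5, 5]], dtype=float)
y = np.array([10, 12, 14, 17, 22], dtype=float)

# Append a column of ones for the intercept, then solve least squares
X_b = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(coef)  # approximately [alpha, beta, b] of the model above
```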
Example of Correlated Column
Let's go back to the famous problem of House Price Prediction.
| House Area (Acre) | Size (kilo sq meter, approx.) | No of rooms | Price (Tk in lac) |
|---|---|---|---|
| 1 | 4 | 3 | 10 |
| 2 | 8 | 4 | 12 |
| 3 | 12 | 4 | 16 |
Without examining the dataset well, I sat down to predict the Price column with the linear regression formula, which has this shape:

$price = \theta_0 + \theta_1 \cdot area + \theta_2 \cdot size + \theta_3 \cdot rooms$
We saw in the case of linear regression that we multiply each feature (input variable) by a coefficient and then add them up to predict the output. Feeding in what is effectively the same column twice, as Area and Size here, will never help get the output price right.
It is easy to see that the two columns are the same here, because I made the example up :P. Jokes apart: when there are lots of columns, all with different names and different-looking data, but some are really just unit-converted copies of each other, finding them by inspection becomes very complicated. So we will take the help of an important topic from statistics: correlation.
Pearson's Correlation Co-efficient or Pearson's r
Calling the correlation function (corr()) of the Pandas library calculates the correlation according to the following formula. There will be a more detailed discussion of correlation later; be happy with this formula for now.

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
In this formula, $x$ is one variable, $y$ is another variable (isn't that too obvious?), and $\bar{x}$ and $\bar{y}$ are their respective means.
We need to find the value of $r$. From the value of $r$ we can understand how strongly the two variables are related. If $r = 1$, the two variables are perfectly linearly related, which is why the correlation of any variable with itself is always 1.
For further confirmation: if you calculate the correlation coefficient between Acre and kilo sq meter in the example above, you get $r = 1$.
Proof:
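Rather than grinding through the formula by hand, here is a minimal sketch that verifies the claim with Pandas (the column names are invented for this toy example):

```python
import pandas as pd

# The toy house-price dataset from above
df = pd.DataFrame({
    'area_acre': [1, 2, 3],        # House Area (Acre)
    'size_kilo_sqm': [4, 8, 12],   # Size (kilo sq meter), roughly area * 4
})

# Pearson's r between the two columns
r = df['area_acre'].corr(df['size_kilo_sqm'])
print(r)  # 1.0 -- perfectly correlated
```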
Now let's see how to find out whether any data is missing from any column of the dataset.
Find the null or blank part of the dataset
Open the notebook you created earlier and enter the following code,
print(data_frame.isnull().values.any())
How does isnull().values.any() work?
1. isnull() returns a dataframe of the same shape, but instead of the values, every empty cell is replaced by True and every non-empty cell by False.
2. .values turns that True/False dataframe into a NumPy array.
3. any() checks whether any value in the array is True, that is, whether any cell is empty.
There is no blank data in the pima-data.csv file, so running this statement prints False.
Deliberately deleting a cell and running data_frame.isnull().values.any() again
Here I intentionally deleted a cell of the pima-data.csv file, loaded it again with Pandas, and ran the code again.
It turns out that the output is now True, so some cell or other is empty.
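To reproduce this yourself, a minimal sketch (assuming pima-data.csv sits next to the notebook, as in the earlier episode):

```python
import pandas as pd

# Reload the file after deleting one cell by hand
data_frame = pd.read_csv('pima-data.csv')

# True means at least one cell is empty
print(data_frame.isnull().values.any())
```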
Creating Correlation Matrix Heatmap
So far we have learned a little about correlation, and we have seen how to find out whether any null values are hiding in the dataset. Now we will see how to generate a correlation matrix heatmap. Before that, let's talk about what a heatmap is.
Heatmap
According to Wikipedia,
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
That is, we generate a plot by replacing each numerical value with a color. That is a heatmap.
A correlation heatmap, therefore, is a plot where the correlation values are replaced with colors.
Correlation Heatmap
We have seen how to calculate the correlation between two variables.
Ask yourself: is it easier to compare a pile of floating-point values, or to compare colors? Of course colors are easier to compare.
All we have to do is take each variable and find its correlation with every variable (including itself). To do this we arrange the variables row- and column-wise:
|  | num_preg | glucose_conc | diastolic_bp | thickness | insulin | bmi | age |
|---|---|---|---|---|---|---|---|
| num_preg | 1 | corr_value | corr_value | corr_value | corr_value | corr_value | corr_value |
| glucose_conc | corr_value | 1 | corr_value | corr_value | corr_value | corr_value | corr_value |
| diastolic_bp | corr_value | corr_value | 1 | corr_value | corr_value | corr_value | corr_value |
| thickness | corr_value | corr_value | corr_value | 1 | corr_value | corr_value | corr_value |
| insulin | corr_value | corr_value | corr_value | corr_value | 1 | corr_value | corr_value |
| bmi | corr_value | corr_value | corr_value | corr_value | corr_value | 1 | corr_value |
| age | corr_value | corr_value | corr_value | corr_value | corr_value | corr_value | 1 |
As mentioned earlier, the correlation of a variable with itself is always 1, so every value along the diagonal of the table must be 1. And corr_value stands for whatever value the correlation between one variable and another turns out to be; since we will compute these values with the library, there is no need to calculate them by hand.
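For the curious, Pandas computes this entire matrix in one call; we will use exactly this inside the heatmap function below:

```python
# Pearson correlation between every pair of columns
correlation = data_frame.corr()
print(correlation)
```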
The important thing now is to choose the colors of the heatmap. No worries, for now we can work with the built-in color maps of the Matplotlib library. If you want, you can look at the documentation and pick a color map of your choice. For now, we will use the default.
Matplotlib Heat Map Color Guide
Matplotlib will set the color according to the following sequence when generating the heatmap.
Less Correlated to More Correlated
Blue -> Cyan -> Yellow -> Red -> Dark Red (Correlation 1)
Function to generate heatmap
Let's write a function to quickly generate a heatmap. The function will look like this:
import matplotlib.pyplot as plt

# Here size means the plot size (in inches)
def corr_heatmap(data_frame, size=11):
    # Getting the correlation matrix using Pandas
    correlation = data_frame.corr()
    # Dividing the plot into subplots so we can control the plot size
    fig, heatmap = plt.subplots(figsize=(size, size))
    # Plotting the correlation matrix as a heatmap
    heatmap.matshow(correlation)
    # Adding xticks and yticks labeled with the column names
    plt.xticks(range(len(correlation.columns)), correlation.columns)
    plt.yticks(range(len(correlation.columns)), correlation.columns)
    # Displaying the graph
    plt.show()
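One aside: the blue-to-dark-red sequence described above is the classic 'jet' colormap, which old Matplotlib versions used by default. Newer versions default to 'viridis', so if your colors look different, you can pass the colormap explicitly (an optional tweak, not part of the function above):

```python
# Inside corr_heatmap, force the classic blue -> cyan -> yellow -> red map
heatmap.matshow(correlation, cmap='jet')
```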
Why did I use subplot?
If I wished, I could have generated the heatmap with plt.matshow(correlation), but then I could not control the size of the figure. By assigning the plot to a subplot with a figsize, I can generate a plot of any convenient size.
What are xticks and yticks?
plt.xticks(range(len(correlation.columns)), correlation.columns) means: place a tick at each of the positions 0, 1, 2, ..., len(correlation.columns) - 1, so each block is 1 unit wide, and label each tick with the corresponding column name from correlation.columns.
The same applies to plt.yticks.
What does plt.show() do?
U kiddin' bro?
Plotting the heatmap with the corr_heatmap(data_frame, size) function
I wrote the function with such difficulty, and you won't use it? We can easily plot the heatmap with the following code snippet:
corr_heatmap(data_frame)
[Figure: a close view of the generated heatmap]
Notable
We have already seen that if two variables are the same, their correlation will be 1. Along the diagonal each variable is correlated with itself, so the diagonal blocks are dark red.
But look carefully: the correlation between skin and thickness is also 1 (dark red).
So skin and thickness are actually the same thing; only the unit differs. Can't believe it?
Do one thing: multiply each value of thickness by 0.0393701 and you will see that you get the values of skin. Since 1 millimeter = 0.0393701 inch, can you now tell which column is in which unit?
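A quick way to convince yourself, using the thickness and skin columns of our dataframe:

```python
# thickness is in millimeters; converting to inches should reproduce skin
print((data_frame['thickness'] * 0.0393701).head())
print(data_frame['skin'].head())
```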
Got the culprit, now for dataset cleaning
From the work above it is clear that we have two columns carrying the same information. One property of Tidy Data is that each column must be unique. So we keep one of the duplicates and remove the rest from the dataset.
I will delete the skin column here; you can delete skin or thickness, whichever you want. It is completely your choice.
# Deleting the 'skin' column completely
del data_frame['skin']
# Checking whether the action was successful
data_frame.head()
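If you prefer the Pandas idiom over Python's del, this alternative has the same effect:

```python
# drop() returns a new dataframe without the 'skin' column
data_frame = data_frame.drop('skin', axis=1)
```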
We were able to drop the duplicate column. But the work is not finished yet; the data still has to be molded. No worries, this is the last step of data preparation. So cheers!
Data Molding
Data type adjustment
Our dataset should be in a form that is suitable for all the algorithms we want to work with. Otherwise we would have to tweak the data separately for each algorithm, which is quite a hassle. So we will take the trouble once and for all, so that it does not keep causing headaches.
Data type checking
Let's check the datatypes once before data molding.
data_frame.head()
Run it and you will see some samples of the dataframe again. Notice that all the values here are float or integer type, but one column is still Boolean.
Data type changing
We will turn True into 1 and False into 0. This can be done with the following code snippet:
# Mapping the values
map_diabetes = {True: 1, False: 0}
# Applying the map to the data_frame
data_frame['diabetes'] = data_frame['diabetes'].map(map_diabetes)
# Let's see what we have done
data_frame.head()
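An equivalent one-liner, if you prefer; Booleans cast directly to integers:

```python
# True becomes 1, False becomes 0
data_frame['diabetes'] = data_frame['diabetes'].astype(int)
```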
Congratulations!
With this molded and clean dataset we can work on the algorithm we want.
But?
Data Rule # 3
Rare events are less likely to be predicted with high accuracy
This is natural: a rare event means there are few such examples in your dataset, and the fewer examples of an event you have, the worse its prediction will be. But it is better not to worry about that now. First get the ordinary predictions right, then deal with the rare events.
Some more analysis.
Checking True / False Ratio
If we want to see what percentage of people in this dataset are diabetic and what percentage are not, let's take out the notebook and write the code right away.
num_true = 0.0
num_false = 0.0

for item in data_frame['diabetes']:
    if item == True:
        num_true += 1
    else:
        num_false += 1

percent_true = (num_true / (num_true + num_false)) * 100
percent_false = (num_false / (num_true + num_false)) * 100

print("Number of True Cases: {0} ({1:2.2f}%)".format(num_true, percent_true))
print("Number of False Cases: {0} ({1:2.2f}%)".format(num_false, percent_false))
Output:
Number of True Cases: 268.0 (34.90%)
Number of False Cases: 500.0 (65.10%)
Actually, we can write the same thing in a Pythonic way in four lines.
# Pythonic Way
num_true = len(data_frame.loc[data_frame['diabetes'] == True])
num_false = len(data_frame.loc[data_frame['diabetes'] == False])
print("Number of True Cases: {0} ({1:2.2f}%)".format(num_true, (num_true / (num_true + num_false)) * 100))
print("Number of False Cases: {0} ({1:2.2f}%)".format(num_false, (num_false / (num_true + num_false)) * 100))
Data Rule # 4
Keep and check data manipulation history regularly
There are ways to do this:
- Keep the steps recorded in a Jupyter Notebook
- Use a version control system such as Git or SVN, with hosting services like BitBucket, GitHub, or GitLab
Summary
What did I do in these two episodes?
- Read the data with Pandas
- Got the idea of correlation
- Deleted the duplicate column
- Molded the data
- Checked the True/False ratio


