Data preparation (data collection and preprocessing) - 1
Machine learning means dealing with data, so it is no surprise that most of the time in model building is spent collecting and processing data.
The model is built on the collected data: no matter how good your algorithm is, your predictive model will not be good if the data is poor. This is unanimously recognized, so at this stage of machine learning you have to be especially careful.
If the data preparation is done well, the model will be much easier to create: there will be less need for repeated tuning and re-cleaning. If you cannot prepare the data well, your model will not be good, and you will have to go back to the data again and again during model building.
So it is better to clean the data first and start model building from a manageable baseline.
Let's see what we can do in this episode.
Overview
Searching for data
Data inspection and removal of unnecessary parts (data cleaning)
Data Exploration
Convert to Tidy Data through data molding
Everything is done in Jupyter Notebook
Let's see first, what is Tidy Data?
Tidy Data
Tidy Data is a dataset that can be easily modeled, easily visualized, and has a specific structure.
Features of Tidy Data
Each variable will be a single column
Each observation will be a single row
Each observational unit will be a table
Taking the collected dataset in Tidy form is somewhat time-consuming.
In machine learning-based projects 50-60% of the time is spent on data collection, cleaning and organizing.
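To make the three Tidy Data rules concrete, here is a minimal sketch using pandas. The country/year data is entirely hypothetical; the point is how melt reshapes a "wide" table (one column per year) into a tidy one where each variable is a single column and each observation is a single row.

```python
import pandas as pd

# Untidy ("wide") form: one row per country, one column per year
# (hypothetical data, just for illustration)
untidy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [100, 200],
    "2020": [110, 210],
})

# Tidy form: each variable (country, year, value) is its own column,
# each observation is its own row
tidy = untidy.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```

After melting, the frame has 4 rows (2 countries × 2 years) and 3 columns, matching the Tidy Data rules above.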
Data collection
What are the good sources of data collection?
If you search on Google you will find data, but be careful: there is also nonsense, fake, and invalid data out there. Such data may be fine for testing, but for a serious project you must try to collect verified data.
Government database
Government databases are a really good source for data collection. Because here you can get fairly verified data. Some government databases also have good documentation to identify the data.
Professional or company data source
A very good source. Several professional societies share their databases. Twitter, for example, shares collections of tweets along with its own analysis of them, and financial data is available from free company APIs, such as the datasets Yahoo! provides.
The company you work for
The company you are in can also be a good source of data.
University Data Repository
Several universities offer datasets for free, such as the University of California, Irvine. They have their own data repository from where you can collect data.
Kaggle
If you work on machine learning, you cannot avoid hearing the name Kaggle. You can call it the Codeforces of data scientists. It hosts regular data analysis contests and is unmatched for high-grade datasets.
GitHub
Yes, GitHub also has a lot of data. You can check out this Awesome Dataset Collection
All of the above have been discussed
Sometimes it doesn't work with one source of data, then try all the sources. Then you have to integrate all the data and make Tidy data.
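Integrating several sources usually means stacking or joining data frames in pandas. A minimal sketch, assuming two hypothetical sources that happen to share the same columns:

```python
import pandas as pd

# Two hypothetical sources with the same columns
src1 = pd.DataFrame({"id": [1, 2], "glucose": [120, 95]})
src2 = pd.DataFrame({"id": [3], "glucose": [150]})

# Stack rows from both sources into one frame, dropping any
# records that appear in more than one source
combined = pd.concat([src1, src2], ignore_index=True).drop_duplicates()
print(combined.shape)  # (3, 2)
```

If the sources share a key column instead of a row layout, pd.merge on that key is the usual alternative to pd.concat.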
Where do I collect our selected problem dataset?
Pima Indian Diabetes Data
Data file
Dataset details
As mentioned earlier, we will collect the diabetes database from the UCI Machine Learning repository.
Some features of this dataset:
Female patients at least 21 years old
768 observations (rows)
There are 10 columns against each row
9 out of 10 columns are features, e.g. number of pregnancies, blood pressure, glucose, insulin level, etc.
The remaining 1 column is the label: whether the patient has diabetes (True / False)
Using this dataset we will find the solution to the problem.
Before that let's take a look at some data rules.
Data Rule # 1
The closer the dataset is to what you want to predict, the better.
Read aloud, this rule may sound obvious, and indeed it can be understood with common sense.
And in our case it holds: since we want to find out how likely a person is to have diabetes, this dataset is perfect for the job, because one column directly states whether the person tested is diabetic.
For many problems, however, you will not find the exact thing you want to predict in the dataset. Then you need to rearrange the dataset so that it matches, or comes close to, your target variable (the attribute you will predict, here whether or not the patient has diabetes).
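When the dataset lacks a direct target column, you often derive one from the columns it does have. A hedged sketch with made-up data: here there is no "diabetic" column, only a glucose reading, so we derive a proxy target (the 180 threshold is purely illustrative, not a medical claim).

```python
import pandas as pd

# Hypothetical raw data: no direct "diabetic" column, only a glucose reading
raw = pd.DataFrame({"patient_id": [1, 2, 3], "glucose": [190, 100, 210]})

# Derive a column close to what we actually want to predict
# (the threshold 180 is an illustrative assumption)
raw["high_glucose"] = raw["glucose"] > 180
print(raw["high_glucose"].tolist())  # [True, False, True]
```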
Data Rule # 2
No matter how beautiful the dataset looks, it will never be in exactly the format you want to work with.
So the next task after data collection is data preprocessing, which we will discuss today.
CSV (Comma Separated Value) data file download and instruction
If you visit the link of UCI, you will see that there are links to two files named .data and .name.
The values in the .data file are comma-separated but the file format is not .csv.
So for your convenience, I have uploaded the .csv files, where the values are given along with a header stating which column indicates which property.
CSV Pima dataset download (original)
CSV Pima dataset download (modified)
Note
original: whether the patient has diabetes is indicated with 1/0
modified: all 1/0 values have been replaced with TRUE/FALSE
Data Exploration with Pandas Library
Do you know anything about ipython notebook? If you don't know, take a look from here.
If you are on Windows, open cmd and launch the notebook with the following command:
ipython notebook
If that command does not work, try: jupyter notebook
Download the two files and put them on your PC.
Image source: https://ml.howtocode.dev/workflow/data_processing
Import the required library
Before starting work, we added the necessary libraries with the following code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Jupyter notebook's magic function for inline plotting (we don't want the plot in a separate window)
%matplotlib inline
Data load and review
pd.read_csv(r'file_path')
We imported the pandas library here as pd (using the as keyword), so to call any pandas function we do not need to write the word pandas in full; we just write pd.
If instead I had written
import pandas as PANDA
then I would call the function as PANDA.read_csv('file_path').
Now let's come to the read_csv function, from the function it is understood that its job is to read CSV files.
This function converts the CSV file to Pandas data frame format. We can make various changes with the Pandas Library.
read_csv('filePath') — in the argument I have given the path where the CSV file sits on my PC. In your case, you must give the path where the file is on your PC.
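A minimal, self-contained sketch of the load step. In practice you would pass your own file path to pd.read_csv; here the CSV is read from an in-memory string so the example runs anywhere, and the column names (num_preg, glucose_conc, diabetes) are illustrative stand-ins, not the exact headers of the downloaded file.

```python
import io
import pandas as pd

# In practice: data_frame = pd.read_csv(r"path/to/pima-data.csv")
# Here we read from an in-memory string so the sketch is self-contained
csv_text = "num_preg,glucose_conc,diabetes\n6,148,True\n1,85,False\n"
data_frame = pd.read_csv(io.StringIO(csv_text))
print(data_frame.shape)  # (2, 3)
```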
data_frame.shape
Since the data frame holds the data as a matrix (a 2D array), we read the shape attribute to see its row and column counts.
Output: rows - 768 (excluding the header) and columns - 10
data_frame.head(number)
By calling data_frame.head(3), we printed the first 3 rows of the data frame.
data_frame.tail(number)
By calling data_frame.tail(4), we printed the last 4 rows of the data frame.
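The three inspection calls above can be tried together on any frame. A small stand-in frame is used below (the real Pima frame has 768 rows and 10 columns), with made-up column names:

```python
import pandas as pd

# Small stand-in frame, just to demonstrate the inspection calls
df = pd.DataFrame({"glucose": range(10), "diabetes": [False] * 10})

print(df.shape)    # (10, 2) -> (rows, columns)
print(df.head(3))  # first 3 rows
print(df.tail(4))  # last 4 rows
```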
