Data preparation (data collection and preprocessing) - 1
Machine learning means dealing with data, so it is no surprise that most of the time in model building is spent collecting and processing data.
The model is built on the collected data: no matter how good your algorithm is, your predictive model will not be good if the data is poor. This is unanimously recognized, so at this stage of machine learning you have to be especially careful.
If the data preparation is done well, the model will be much easier to create: there will be less need for repeated tuning and re-cleaning. If you cannot prepare the data well, your model will not be good, and you will have to go back to the data again and again during model building.
So it is better to clean the data first and start model building from a manageable baseline.
Let's see what we can do in this episode.
Overview
Searching for data
Data inspection and removal of unnecessary parts (data cleaning)
Data Exploration
Convert to Tidy Data through data molding
Everything is done in Jupyter Notebook
Let's see first, what is Tidy Data?
Tidy Data
Tidy Data is a dataset that can be easily modeled, easily visualized, and has a specific structure.
Features of Tidy Data
Each variable will be a single column
Each observation will be a single row
Each observational unit will be a table
Taking the collected dataset in Tidy form is somewhat time-consuming.
In machine learning-based projects 50-60% of the time is spent on data collection, cleaning and organizing.
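To make the three Tidy Data rules concrete, here is a minimal sketch using pandas. The country/year data is entirely hypothetical; the point is how melt reshapes a "wide" table (one column per year) into a tidy one where each variable is a single column and each observation is a single row.

```python
import pandas as pd

# Untidy ("wide") form: one row per country, one column per year
# (hypothetical data, just for illustration)
untidy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [100, 200],
    "2020": [110, 210],
})

# Tidy form: each variable (country, year, value) is its own column,
# each observation is its own row
tidy = untidy.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```

After melting, the frame has 4 rows (2 countries × 2 years) and 3 columns, matching the Tidy Data rules above.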
Data collection
What are the good sources of data collection?
If you search on Google you will find data, but be careful: there is also nonsense, fake, and invalid data out there. Such data may be fine for testing, but for a serious project you must try to collect verified data.
Government database
Government databases are a really good source for data collection. Because here you can get fairly verified data. Some government databases also have good documentation to identify the data.
Professional or company data source
A very good source. Several professional societies share their databases. Twitter, for example, shares collections of tweets along with its own analysis of them, and financial data is available from free company APIs, such as the datasets Yahoo! provides.
The company you work for
The company you are in can also be a good source of data.
University Data Repository
Several universities offer datasets for free, such as the University of California, Irvine. They have their own data repository from where you can collect data.
Kaggle
If you work on machine learning, you cannot avoid hearing the name Kaggle. You can call it the Codeforces of data scientists. It hosts regular data analysis contests and is unmatched for high-grade datasets.
GitHub
Yes, GitHub also has a lot of data. You can check out this Awesome Dataset Collection
All of the above have been discussed
Sometimes it doesn't work with one source of data, then try all the sources. Then you have to integrate all the data and make Tidy data.
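Integrating several sources usually means stacking or joining data frames in pandas. A minimal sketch, assuming two hypothetical sources that happen to share the same columns:

```python
import pandas as pd

# Two hypothetical sources with the same columns
src1 = pd.DataFrame({"id": [1, 2], "glucose": [120, 95]})
src2 = pd.DataFrame({"id": [3], "glucose": [150]})

# Stack rows from both sources into one frame, dropping any
# records that appear in more than one source
combined = pd.concat([src1, src2], ignore_index=True).drop_duplicates()
print(combined.shape)  # (3, 2)
```

If the sources share a key column instead of a row layout, pd.merge on that key is the usual alternative to pd.concat.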
Where do I collect our selected problem dataset?
Pima Indian Diabetes Data
Data file
Dataset details
As mentioned earlier, we will collect the diabetes database from the UCI Machine Learning repository.
Some features of this dataset:
Female patients at least 21 years old
768 observations (rows)
There are 10 columns against each row
9 out of 10 columns are features, e.g. number of pregnancies, blood pressure, glucose, insulin level, etc.
The remaining 1 column is the label: whether the patient has diabetes (True / False)
Using this dataset we will find the solution to the problem.
Before that let's take a look at some data rules.
Data Rule # 1
The closer the dataset is to what you want to predict, the better.
Read aloud, this rule may sound obvious, and indeed it can be understood with common sense.
And in our case it holds: since we want to find out how likely a person is to have diabetes, this dataset is perfect for the job, because one column directly states whether the person tested is diabetic.
For many problems, however, you will not find the exact thing you want to predict in the dataset. Then you need to rearrange the dataset so that it matches, or comes close to, your target variable (the attribute you will predict, here whether or not the patient has diabetes).
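When the dataset lacks a direct target column, you often derive one from the columns it does have. A hedged sketch with made-up data: here there is no "diabetic" column, only a glucose reading, so we derive a proxy target (the 180 threshold is purely illustrative, not a medical claim).

```python
import pandas as pd

# Hypothetical raw data: no direct "diabetic" column, only a glucose reading
raw = pd.DataFrame({"patient_id": [1, 2, 3], "glucose": [190, 100, 210]})

# Derive a column close to what we actually want to predict
# (the threshold 180 is an illustrative assumption)
raw["high_glucose"] = raw["glucose"] > 180
print(raw["high_glucose"].tolist())  # [True, False, True]
```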
Data Rule # 2
No matter how beautiful the dataset looks, it will never be in exactly the format you want to work with.
So the next task after data collection is data preprocessing, which we will discuss today.
CSV (Comma Separated Value) data file download and instruction
If you visit the link of UCI, you will see that there are links to two files named .data and .name.
The values in the .data file are comma-separated but the file format is not .csv.
So for your convenience, I have uploaded the .csv files, where the values are given along with a header stating which column indicates which property.
CSV Pima dataset download (original)
CSV Pima dataset download (modified)
Note
original: whether the patient has diabetes is indicated with 1/0
modified: all 1/0 values have been replaced with TRUE/FALSE
Data Exploration with Pandas Library
Do you know anything about ipython notebook? If you don't know, take a look from here.
If you are on Windows, open cmd and launch the notebook with the following command:
ipython notebook
If that command does not work, try: jupyter notebook
Download the two files and put them on your PC.
Image source: https://ml.howtocode.dev/workflow/data_processing
Import the required library
Before starting work, we added the necessary libraries with the following code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Jupyter notebook's magic function for inline plotting (we don't want the plot in a separate window)
%matplotlib inline
Data load and review
pd.read_csv(r'file_path')
We imported the pandas library here as pd (using the as keyword), so to call any pandas function we do not need to write the word pandas in full; we just write pd.
If instead I had written
import pandas as PANDA
then I would call the function as PANDA.read_csv('file_path').
Now let's come to the read_csv function, from the function it is understood that its job is to read CSV files.
This function converts the CSV file to Pandas data frame format. We can make various changes with the Pandas Library.
read_csv('filePath') — in the argument I have given the path where the CSV file sits on my PC. In your case, you must give the path where the file is on your PC.
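A minimal, self-contained sketch of the load step. In practice you would pass your own file path to pd.read_csv; here the CSV is read from an in-memory string so the example runs anywhere, and the column names (num_preg, glucose_conc, diabetes) are illustrative stand-ins, not the exact headers of the downloaded file.

```python
import io
import pandas as pd

# In practice: data_frame = pd.read_csv(r"path/to/pima-data.csv")
# Here we read from an in-memory string so the sketch is self-contained
csv_text = "num_preg,glucose_conc,diabetes\n6,148,True\n1,85,False\n"
data_frame = pd.read_csv(io.StringIO(csv_text))
print(data_frame.shape)  # (2, 3)
```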
data_frame.shape
Since the data frame holds the data as a matrix (a 2D array), we read the shape attribute to see its row and column counts.
Output: rows - 768 (excluding the header) and columns - 10
data_frame.head(number)
By calling data_frame.head(3), we printed the first 3 rows of the data frame.
data_frame.tail(number)
By calling data_frame.tail(4), we printed the last 4 rows of the data frame.
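The three inspection calls above can be tried together on any frame. A small stand-in frame is used below (the real Pima frame has 768 rows and 10 columns), with made-up column names:

```python
import pandas as pd

# Small stand-in frame, just to demonstrate the inspection calls
df = pd.DataFrame({"glucose": range(10), "diabetes": [False] * 10})

print(df.shape)    # (10, 2) -> (rows, columns)
print(df.head(3))  # first 3 rows
print(df.tail(4))  # last 4 rows
```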
