Data is the lifeblood of machine learning, but quantity alone is not what matters; quality does. Proper and optimal data preprocessing is essential before you start developing machine learning models. In this piece, I will walk through the critical data preprocessing steps using Python and scikit-learn. By the end of this piece, you can start working on your own machine learning and data analysis projects with practical tips and tricks.
Table of Contents: Tips for Data Preprocessing Using Python and Scikit Learn
- Formulate the Question
- Importance of Data Source: In-house & External Scraping
- Cleaning Data Using Python
- Free Scraping API and Full Python script of data cleaning
- Data Science & Machine Learning Coursera Course Recommendation
One: Formulate the Question
First things first, I would suggest asking why you need to leverage machine learning at all. Most people I have met say they would like to do data analysis using machine learning. But data analysis is an approach, not the purpose. In terms of purpose, from my perspective there are basically three types on the business side:
- Find new, optimizable elements that can improve business and operational performance
- Automate operational tasks, with operational decisions made by machines
- Build and develop in-house niche AI models
Two: Importance of Data Source – In-house & External Scraping
By now you should have clearly aligned with your team on the purpose of the model. Then data discovery and gathering is the next big thing.
When evaluating a data source, quality is always the most important metric for deciding whether the dataset can be used to build a machine learning model. Poor-quality data would mislead the learning process, steer it in the wrong direction, and distort the results.
From my project experience, two kinds of sources reliably meet this quality bar: in-house data and scrapable data. I would advise against purchasing datasets from third parties.
If your company doesn't have sufficient in-house data, scraping is one of the best ways to gather data, because we can verify where the data comes from and collect exactly the data we need. We could leverage scraper providers like BuyfromLo, or develop a scraping app in-house.
www.buyfromlo.com/app
Three: Cleaning Data using Python
Once the dataset and data source questions have been resolved, it's time to discuss how to clean up and structure the data. Whether the data is scraped or in-house, it might not be formatted in a way that fits machine learning models, such as scikit-learn's linear regression. Here are the tips for checking and cleaning the data.
Step 1: Turn your dataset into a DataFrame
The DataFrame is the core data structure for machine learning work in Python. Whether you read the data from a CSV file or fetch a data list via an API, here are the lines of code, FYI.
Read CSV
import pandas as pd
dataset = pd.read_csv('planprice3.csv')
Data List
data_list = [1, 2, 3, 4]
dataset = pd.DataFrame(data_list)
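A bare list produces a single unnamed column (labelled 0). If your API returns records, a list of dictionaries gives you named columns directly. Here is a sketch; the field names are hypothetical stand-ins for whatever your API returns:

```python
import pandas as pd

# Hypothetical records, shaped like a typical API response
records = [
    {"country": "US", "amount": 120},
    {"country": "DE", "amount": 95},
]
dataset = pd.DataFrame(records)  # columns: 'country', 'amount'
```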
For the CSV dataset, sometimes the header is not the one you set, and the row you intended as the header ends up as the second row. Here is a way to resolve this.
dataset = pd.read_csv('planprice3.csv')
dataset.columns = dataset.iloc[0]  # promote the first row of data to the header
dataset = dataset[1:]  # drop that row from the body of the dataset
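Alternatively, pandas can do this in one call: the `header` parameter of `read_csv` takes a zero-based row index, so `header=1` tells it the real header sits on the second line. A small self-contained sketch using an in-memory file in place of the CSV:

```python
import io

import pandas as pd

# Simulated CSV whose real header is on the second line
raw = "exported report,\nCountry,Amount\nUS,120\nDE,95\n"
dataset = pd.read_csv(io.StringIO(raw), header=1)  # use line 2 as the header
```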
Step 2: Know what information the dataset covers
dataset.shape
(84359, 13)
This tells you the shape of your dataset. The first number is how many instances (rows) the dataset has, and the second is how many features, or so-called variables (dependent plus independent), it contains.
print(dataset.info())
This is a super useful method that tells you more details about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84359 entries, 1 to 84359
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Purchase Date 84359 non-null object
1 Campaign 84359 non-null object
2 Source Desc 84359 non-null object
3 Source Type 84359 non-null object
4 Country 84359 non-null object
5 Company Name 84358 non-null object
6 Amount ($) 84359 non-null object
7 Base Price ($) 84359 non-null object
8 Gender 84359 non-null object
9 Plan Type 84359 non-null object
10 Activity 84359 non-null object
11 Created By Name 84359 non-null object
12 Created by ID 84359 non-null object
dtypes: object(13)
memory usage: 8.4+ MB
If your dataset's columns are all numeric, you can also use this line to see the big picture of the numbers:
dataset.describe()
Step 3: Resolve value missing problem
As you can see above, the non-null counts are not all the same: Company Name has 84358 non-null values while the other columns have 84359. Missing values like this will trip up model training, so here is the simplest way to make things consistent.
dataset.dropna(inplace=True)
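Dropping a row is the simplest fix, but it also discards the twelve valid values in that row. An alternative, assuming the affected column is categorical like Company Name here, is to fill the gaps with a placeholder instead:

```python
import pandas as pd

# Toy dataset with one missing company name
dataset = pd.DataFrame({
    "Company Name": ["Acme", None, "Globex"],
    "Amount ($)": [100, 250, 75],
})

# Keep the row, replace the missing value with a placeholder category
dataset["Company Name"] = dataset["Company Name"].fillna("Unknown")
```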
Step 4: Turn all data into numbers
Machine learning models are not able to read strings; only numeric values are usable as inputs. In the real world, datasets contain plenty of strings, so we need to convert strings into numbers. Here are the scikit-learn import and code to get it done.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
label = le.fit_transform(dataset['Country'])  # map each country string to an integer code
dataset['Country1'] = label  # add the encoded column
dataset.drop("Country", axis=1, inplace=True)  # remove the original string column
If you need to convert more than one feature:
cols = ['Campaign', 'Source Desc', 'Source Type', 'Company Name', 'Gender', 'Plan Type', 'Activity']
dataset[cols] = dataset[cols].apply(le.fit_transform)
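One caveat: label encoding assigns arbitrary integers, so a model such as linear regression will treat the categories as if they were ordered. For nominal features, one-hot encoding is often the safer choice. A sketch using `pd.get_dummies` on a toy version of the Gender column from the example dataset:

```python
import pandas as pd

# Toy dataset with one categorical feature
dataset = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Amount ($)": [100, 250, 75],
})

# One binary column per category; drop_first avoids a redundant column
encoded = pd.get_dummies(dataset, columns=["Gender"], drop_first=True)
```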
Step 5: Convert number into Integer data type
Some values might look like numbers but actually be stored as strings. In that case, we need to convert them into an integer data type. Here is a code sample:
dataset['Amount ($)'] = dataset['Amount ($)'].astype(int)
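Note that `astype(int)` raises an error if any value carries currency symbols, thousands separators, or blanks. A more defensive conversion, assuming such formatting might appear in an Amount ($) column, strips the noise first and coerces anything unparseable to NaN instead of crashing:

```python
import pandas as pd

# Toy column with messy, string-typed amounts
dataset = pd.DataFrame({"Amount ($)": ["1,200", "$350", "75"]})

# Strip '$' and ',' then convert; unparseable values become NaN rather than raising
cleaned = dataset["Amount ($)"].str.replace(r"[$,]", "", regex=True)
dataset["Amount ($)"] = pd.to_numeric(cleaned, errors="coerce")
```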
Free Scraping API and Full Python script of data cleaning
If you are interested in Tips for Data Preprocessing Using Python and Scikit Learn, please subscribe to our newsletter and add the message ‘DS’ to receive the full scripts and a free scraping API token. We will send you the script when the up-to-date app script is live.
I hope you enjoyed reading Tips for Data Preprocessing Using Python and Scikit Learn. If you did, please support us by doing one of the things listed below, because it always helps out our channel.
- Support and Donate to our channel through PayPal (paypal.me/Easy2digital)
- Subscribe to my channel and turn on the notification bell Easy2Digital Youtube channel.
- Follow and like my page Easy2Digital Facebook page
- Share the article on your social network with the hashtag #easy2digital
- Sign up for our weekly newsletter to receive Easy2Digital's latest articles, videos, and discount codes
- Subscribe to our monthly membership through Patreon to enjoy exclusive benefits (www.patreon.com/louisludigital)