Tips for Data Preprocessing Using Python and Scikit Learn

Data is the lifeblood of machine learning, but quantity alone is not enough; quality matters just as much. Proper data preprocessing is therefore essential before you start developing machine learning models. In this piece, I walk through three critical data preprocessing steps using Python and Scikit-learn. By the end, you will be able to start your own machine learning and data analysis projects armed with practical tips and tricks.

Table of Contents: Tips for Data Preprocessing Using Python and Scikit Learn

One: Formulate the Question

First things first: I suggest asking why you need machine learning at all. Most people I have met say they want to do data analysis using machine learning, but data analysis is an approach, not a purpose. From a business perspective, purposes generally fall into three types:

  • Discover new, optimizable factors that improve business and operational performance.
  • Automate operational tasks by letting machines make routine operational decisions.
  • Build and develop in-house, niche AI models.

Two: Importance of Data Source – In-house & External Scraping

By now you should have aligned with your team on the purpose of the model. Data discovery and gathering is the next big step.

When evaluating a data source, quality is the most important metric for deciding whether a dataset can be used to build a machine learning model. Poor-quality data misleads the learning process, steering it in the wrong direction and distorting the results.

From my project experience, two types of source reliably meet the quality bar: in-house data and scrapable data. I would advise against purchasing datasets from third parties.

If your company doesn't have sufficient in-house data, scraping is one of the best ways to gather it, because you know exactly where the data comes from and can collect precisely what you need. You can leverage a scraper provider like BuyfromLo, or develop a scraping app in-house.

www.buyfromlo.com/api

www.buyfromlo.com/app
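If you go the in-house route, the core of a scraper is fetching a page and pulling the fields you need out of its HTML. Below is a minimal sketch using only Python's standard-library `html.parser`; the table structure and plan/price values are hypothetical, and in a real scraper the HTML string would come from an HTTP response rather than being inlined.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <td> cell in an HTML table."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# In a real scraper this would be e.g.
# urllib.request.urlopen(url).read().decode()
html = "<table><tr><td>Basic</td><td>10</td></tr><tr><td>Pro</td><td>25</td></tr></table>"

parser = PriceParser()
parser.feed(html)
# Group flat cells back into rows of two columns each.
rows = [parser.cells[i:i + 2] for i in range(0, len(parser.cells), 2)]
```

The resulting `rows` list can be handed straight to pandas with `pd.DataFrame(rows, columns=["Plan", "Price"])`, which leads directly into the cleaning steps below.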

Three: Cleaning Data using Python

Once the dataset and data source questions have been resolved, it's time to clean and structure the data. Whether the data is scraped or in-house, it might not be formatted to fit machine learning models, such as Scikit-learn's linear regression. Here are tips for checking and cleaning the data.

Step 1: Turn your dataset into a DataFrame

The DataFrame is the workhorse data structure of machine learning preprocessing. Whether you load data from a CSV file or fetch a data list via an API, here are the lines of code, FYI:

Read CSV

import pandas as pd

dataset = pd.read_csv('planprice3.csv')

Data List

data_list = [1, 2, 3, 4]

dataset = pd.DataFrame(data_list)

For CSV datasets, sometimes the first row is not the header you intended, and the real header ends up in the second row. Here is one way to resolve this:

dataset = pd.read_csv('planprice3.csv')

dataset.columns = dataset.iloc[0]

dataset = dataset[1:]
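An alternative is to tell `read_csv` which line holds the real header via its `header` parameter (rows are 0-indexed, so the second line is `header=1`). The sketch below uses an inline string standing in for `planprice3.csv`, with a stray export line above the real header:

```python
import io
import pandas as pd

# Inline stand-in for planprice3.csv: a stray first line,
# then the real header on the second line.
raw = io.StringIO(
    "export,2024-01-01\n"
    "Country,Amount ($)\n"
    "US,100\n"
    "UK,80\n"
)

# header=1 skips the stray line and uses the second line as column names.
dataset = pd.read_csv(raw, header=1)
```

This keeps the column dtypes inferred by pandas, whereas assigning `dataset.columns = dataset.iloc[0]` leaves every column as `object`.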

Step 2: Know what information the dataset covers

dataset.shape

(84359, 13)

This tells you the shape of your dataset: (84359, 13) means 84359 instances (rows) and 13 features, i.e. variables (dependent plus independent).

print(dataset.info())

This is a super useful method that tells you more details about the dataset:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 84359 entries, 1 to 84359

Data columns (total 13 columns):

 #   Column           Non-Null Count  Dtype 

---  ------           --------------  ----- 

 0   Purchase Date    84359 non-null  object

 1   Campaign         84359 non-null  object

 2   Source Desc      84359 non-null  object

 3   Source Type      84359 non-null  object

 4   Country          84359 non-null  object

 5   Company Name     84358 non-null  object

 6   Amount ($)       84359 non-null  object

 7   Base Price ($)   84359 non-null  object

 8   Gender           84359 non-null  object

 9   Plan Type        84359 non-null  object

 10  Activity         84359 non-null  object

 11  Created By Name  84359 non-null  object

 12  Created by ID    84359 non-null  object

dtypes: object(13)

memory usage: 8.4+ MB

If your features are all numeric, you can also use this line to see summary statistics at a glance:

dataset.describe()

Step 3: Resolve missing values

As you can see above, the non-null counts are not all the same: Company Name has 84358 values while every other feature has 84359. Most models cannot handle missing values, so here is a way to make things consistent:

dataset.dropna(inplace=True)
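Before dropping anything, it is worth counting how many values are actually missing per column: if only a handful of rows are affected, `dropna` is cheap, but if a whole feature is sparse you may prefer `fillna` instead. A small sketch on toy data (the column names mirror the dataset above, but the values are made up):

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "Country": ["US", "UK", None, "DE"],
    "Amount ($)": [100, 80, 90, np.nan],
})

# Count missing values per column before deciding what to do.
missing_per_column = dataset.isna().sum()

# Only two of four rows are incomplete, so dropping them is acceptable.
dataset.dropna(inplace=True)
```

Here `missing_per_column` shows one missing value in each column, and `dropna` removes the two incomplete rows.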

Step 4: Encode all data as numbers

Machine learning models cannot read strings; only numbers are usable as input. Real-world datasets, however, contain plenty of string columns, so we need to convert those strings into numbers. Here is the Scikit-learn library and code to get it done:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

label = le.fit_transform(dataset['Country'])

dataset['Country1'] = label

dataset.drop("Country", axis=1, inplace=True)

If you need to convert more than one feature:

dataset[['Campaign', 'Source Desc', 'Source Type', 'Company Name', 'Gender', 'Plan Type','Activity']] = dataset[['Campaign', 'Source Desc', 'Source Type', 'Company Name', 'Gender', 'Plan Type','Activity']].apply(le.fit_transform)
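Note that `.apply(le.fit_transform)` refits the same encoder on every column, so the fitted mappings are not kept afterwards; you cannot later invert the encoding or apply it to new data. If you only need the integer codes, an equivalent pandas-only sketch is to cast each column to a categorical type and take its codes (for string columns, categories are assigned in sorted order, matching `LabelEncoder`'s behaviour). The column values below are made up for illustration:

```python
import pandas as pd

dataset = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Plan Type": ["basic", "pro", "basic"],
})

for col in ["Gender", "Plan Type"]:
    # cat.codes assigns 0, 1, 2, ... over the sorted unique values.
    dataset[col] = dataset[col].astype("category").cat.codes
```

This keeps the whole encoding step inside pandas, with no extra dependency on Scikit-learn for this stage.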

Step 5: Convert numeric strings to an integer data type

Some values might look like numbers but actually be stored as strings. In this case, we need to convert them to an integer data type. Here is a code sample:

dataset['Amount ($)'] = dataset['Amount ($)'].astype(int)
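`astype(int)` raises an error if any value is not a clean number, for example "1,200" with a thousands separator or a placeholder like "n/a". A more forgiving sketch, assuming amounts may carry commas, strips the separators and uses `pd.to_numeric` with `errors="coerce"` so that bad values become NaN instead of crashing (the values below are hypothetical):

```python
import pandas as pd

dataset = pd.DataFrame({"Amount ($)": ["100", "1,200", "n/a"]})

# Remove thousands separators, then convert; unparseable values become NaN.
cleaned = dataset["Amount ($)"].str.replace(",", "", regex=False)
dataset["Amount ($)"] = pd.to_numeric(cleaned, errors="coerce")
```

Any NaN values this produces can then be handled with the Step 3 techniques before the final cast to int.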

Free Scraping API and Full Python script of data cleaning

If you are interested in Tips for Data Preprocessing Using Python and Scikit Learn, please subscribe to our newsletter and add the message 'DS' to receive the full scripts and a free scraping API token. We will send you the script when the up-to-date app script is live.

I hope you enjoyed reading Tips for Data Preprocessing Using Python and Scikit Learn. If you did, please support us by doing one of the things listed below, because it always helps our channel.

Data Science & Machine Learning Coursera Course Recommendation