Build a Pricing Prediction Model Using Python, Scikit-learn, and Linear Regression

In this piece, I will briefly walk you through how to predict a target price based on multiple variables that may be correlated with price changes. By the end of this piece, you will be able to apply this model to your own business cases, using Python and scikit-learn to generate a score that evaluates the price prediction.

Ingredients to prepare in advance: NumPy, Pandas, scikit-learn, Matplotlib, Seaborn, LinearRegression, StandardScaler, RandomForestRegressor

Loading Dataset

This article uses the California housing price dataset as an example. That said, you can use your own business data as the dataset. Just make sure the dataset has a reasonable number of rows, since more data generally produces a better prediction score.

As usual, we can use Pandas to load the dataset and call info() to get a quick overview of its condition.
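Here is a minimal sketch of that loading step. The file name housing.csv is an assumption; point read_csv() at wherever your copy of the California housing data lives.

import pandas as pd

# Load the California housing data; the file name "housing.csv" is an assumption
data = pd.read_csv('housing.csv')

# Column names, non-null counts, and dtypes at a glance
data.info()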

For better prediction results, one of the main principles is that every column should have the same number of non-null rows. As you can see from this sample, the total_bedrooms column contains some NA values. Therefore, we need to drop the NA rows first.

data.dropna(inplace=True)

Data Exploration

First things first, we need to pick a target variable to predict. In this case, the median house value is the target variable, because this experiment is essentially about supporting property purchase decisions. So we drop that column from the feature table and keep the target separately as a new variable in the script.

X = data.drop(['median_house_value'], axis=1)

y = data['median_house_value']

Then we can explore the correlation of each variable with the target variable and get a big-picture sense of whether the dataset makes sense.

Generally we don't need the whole dataset for this purpose. Here, we can again leverage a train/test split. We elaborated on this method in a previous chapter; if you are interested, please explore the other chapters on Easy2Digital.com.

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then we can join the training features and target back together and plot histograms of each column using the join() and hist() methods, as shown below.
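A minimal sketch of that step follows; the variable name train_data (the re-joined training set) is the same one reused by the log-transform code later in this article.

import matplotlib.pyplot as plt

# Re-attach the target to the training features so we can explore them together
train_data = X_train.join(y_train)

# Histogram of every numeric column
train_data.hist(figsize=(15, 8))
plt.show()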

Or we can show the correlations in a heatmap using the corr() method, which is more visual thanks to the contrast between dark and light colors.
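For example, the heatmap could be drawn with Seaborn as sketched below. The numeric_only=True argument is there because corr() only works on numeric columns (the ocean_proximity strings are excluded); it assumes a reasonably recent Pandas version.

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap of the numeric columns; annot=True prints the coefficient in each cell
plt.figure(figsize=(15, 8))
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()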

Data Preprocessing

We can see a bunch of feature variables. Furthermore, when we look at the histogram distributions above, some features look heavily skewed. So we can apply a log transform to see whether the feature distributions become more reasonable.

import numpy as np

# Log-transform the skewed count features; add 1 so that zero values don't break np.log
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)

train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)

train_data['population'] = np.log(train_data['population'] + 1)

train_data['households'] = np.log(train_data['households'] + 1)

The distributions look much more sensible after applying the log transform in this case. We add one inside the log just in case some of the feature values are zero, since log(0) is undefined.

The other critical part of data preprocessing is converting string columns into numbers, because machine learning is a number-driven process and models cannot handle strings directly.

In this dataset, we find that ocean_proximity is stored as strings. Thus, we can use the Pandas get_dummies() method to one-hot encode it and join the result back into the training data.

# One-hot encode ocean_proximity and drop the original string column, mirroring the test-set step below
train_data = train_data.join(pd.get_dummies(train_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)

Predict Using a Linear Regression Model

Now that the dataset is in place, we can import a model, scale the feature dataset, and test the model's accuracy for predicting the house value.

from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train, y_train = train_data.drop(['median_house_value'], axis=1), train_data['median_house_value']

X_train_s = scaler.fit_transform(X_train)

reg = LinearRegression()

reg.fit(X_train_s, y_train)

test_data = X_test.join(y_test)

test_data['total_rooms'] = np.log(test_data['total_rooms'] + 1)

test_data['total_bedrooms'] = np.log(test_data['total_bedrooms'] + 1)

test_data['population'] = np.log(test_data['population'] + 1)

test_data['households'] = np.log(test_data['households'] + 1)

test_data = test_data.join(pd.get_dummies(test_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)

X_test, y_test = test_data.drop(['median_house_value'], axis=1), test_data['median_house_value']

X_test_s = scaler.transform(X_test)

reg.score(X_test_s, y_test)
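Here score() returns the coefficient of determination (R²) on the held-out test set. Since the ingredient list above also mentions a random forest, here is a hedged sketch of swapping in scikit-learn's RandomForestRegressor for comparison; the default hyperparameters are used purely for illustration, not as tuned values.

from sklearn.ensemble import RandomForestRegressor

# Tree ensembles don't need feature scaling, but we reuse the scaled data for a like-for-like comparison
forest = RandomForestRegressor()
forest.fit(X_train_s, y_train)

# R² on the same held-out test set, comparable to the linear regression score above
forest.score(X_test_s, y_test)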

Full Python Script for Building a Price Prediction Model Based on Multiple Variables Using Python and Scikit-learn

If you are interested in Build a Pricing Prediction Model Using Python, Scikit-learn, and Linear Regression, please subscribe to our newsletter and add the message 'price prediction model'. We will send the script to your mailbox right away.

I hope you enjoyed reading Build a Pricing Prediction Model Using Python, Scikit-learn, and Linear Regression. If you did, please support us by doing one of the things listed below, because it always helps out our channel.

Data Science & Machine Learning Coursera Course Recommendation