Chapter 76 – Generate Feature Importance Using Scikit-learn and Random Forest

The random forest algorithm has been applied across a number of industries, allowing them to make better business decisions. Some use cases include high credit risk analysis and product recommendation for cross-sell purposes.

In this chapter, I will briefly walk you through several methods of generating feature importance using the classic red wine quality dataset. By the end, you will have a basic understanding of how to apply Random Forest feature importance to your own projects and compare the results across the different methods.

Red Wine Dataset and Train/Test Split

For any machine learning model, getting a proper dataset and preprocessing the data are critical. Kaggle is one of the most popular platforms for finding a suitable dataset. Here is the link for the red wine quality project:

https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

The first step is to load the data with Pandas and split it with Scikit-learn's train_test_split.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the red wine quality dataset (semicolon-separated CSV)
url = "winequality-red.csv"
wine_data = pd.read_csv(url, sep=";")

# Separate the features from the target column 'quality'
x = wine_data.drop('quality', axis=1)
y = wine_data['quality']

# Hold out half of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=50)

Built-in Feature Importance with Scikit-learn

Scikit-learn provides a built-in feature importance method for Random Forest models. According to the documentation, this method is based on the mean decrease in node impurity across all trees.

In a Random Forest, the features are like the questions in a guessing game. Some questions eliminate more possibilities than others. The assumption is that features that eliminate more possibilities quickly are more important, because they get you closer to the correct answer faster. It's very simple to get these feature importances with Scikit-learn:

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Train the forest and read off the impurity-based importances
rf = RandomForestRegressor(n_estimators=100, random_state=50)
rf.fit(x_train, y_train)

inbuilt_importances = pd.Series(rf.feature_importances_, index=x_train.columns)
inbuilt_importances.sort_values(ascending=True, inplace=True)

inbuilt_importances.plot.barh(color='black')
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance - Scikit-learn Built-in")
plt.show()

Built-in Scikit-learn Method with a Random Feature

The simplest way to extend this method is to add a random feature to the dataset and see whether the resulting importances deviate from the first run without the random feature.

If a real feature has lower importance than the random feature, it could indicate that its importance is just due to chance.

import numpy as np

def random_method():
    # Add a pure-noise feature to the training set
    x_train_random = x_train.copy()
    x_train_random["RANDOM"] = np.random.RandomState(42).randn(x_train.shape[0])

    # Retrain the forest with the random feature included
    rf_random = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_random.fit(x_train_random, y_train)

    importances_random = pd.Series(rf_random.feature_importances_, index=x_train_random.columns)
    importances_random.sort_values(ascending=True, inplace=True)

    importances_random.plot.barh(color='blue')
    plt.xlabel("Importance")
    plt.ylabel("Feature")
    plt.title("Feature Importance - Scikit-learn Built-in with Random Feature")
    plt.show()

Permutation Feature Importance

Permutation feature importance is another technique to estimate the importance of each feature in a Random Forest model by measuring the change in the model’s performance when the feature’s values are randomly shuffled.

One of the advantages of this method is that it can be used with any model, not just Random Forests, which makes the results between models more comparable.
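
As a minimal sketch, this can be computed with Scikit-learn's permutation_importance function from sklearn.inspection, reusing the rf model and the held-out test split from above (the n_repeats value here is an illustrative choice):

from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and measure the drop in test-set score
perm_result = permutation_importance(rf, x_test, y_test, n_repeats=10, random_state=50)

perm_importances = pd.Series(perm_result.importances_mean, index=x_test.columns)
perm_importances.sort_values(ascending=True, inplace=True)

perm_importances.plot.barh(color='green')
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance - Permutation")
plt.show()

Because the importance is measured on held-out data, features that the forest merely memorized during training tend to score lower here than in the impurity-based ranking.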

Random Forest Feature Importance with SHAP

SHAP is a method for interpreting the output of machine learning models based on game theory.

It provides a unified measure of feature importance that, like the permutation importance, can be applied to any model.

The main drawback of it is that it can be computationally expensive, especially for large datasets or complex models.
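
A minimal sketch of this approach, assuming the third-party shap package is installed (pip install shap), uses TreeExplainer on the rf model trained above:

import shap

# TreeExplainer is optimized for tree ensembles such as Random Forests
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(x_test)

# Rank features by their mean absolute SHAP value across the test set
shap.summary_plot(shap_values, x_test, plot_type="bar")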

Random Forest Path Feature Importance

Another way to understand how each feature contributes to the Random Forest predictions is to look at the decision tree paths that each instance takes.

It calculates the difference between the prediction value at the leaf node and the prediction values at the nodes that precede it to get the estimated contribution of each feature.
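
One common way to compute these path-based contributions is the third-party treeinterpreter package; the sketch below assumes it is installed (pip install treeinterpreter) and decomposes each prediction of the rf model into a bias term plus per-feature contributions:

from treeinterpreter import treeinterpreter as ti

# Decompose each test prediction into bias + sum of feature contributions
prediction, bias, contributions = ti.predict(rf, x_test.values)

# Average the absolute contributions across instances for a global ranking
path_importances = pd.Series(np.abs(contributions).mean(axis=0), index=x_test.columns)
path_importances.sort_values(ascending=True, inplace=True)

path_importances.plot.barh(color='red')
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance - Tree Paths")
plt.show()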

Full Python Script of the Feature Importance Generator

If you are interested in the full script for Chapter 76 – Generate Feature Importance Using Scikit-learn and Random Forest, please subscribe to our newsletter and add the message 'Chapter 75 + notion api'. We will send the script to your mailbox right away.

I hope you enjoyed reading Chapter 76 – Generate Feature Importance Using Scikit-learn and Random Forest. If you did, please support us by doing one of the things listed below, because it always helps out our channel.

Data Science & Machine Learning Coursera Course Recommendation