The random forest algorithm has been applied across many industries to support better business decisions. Use cases include credit risk analysis and product recommendation for cross-selling.
In this piece, I will briefly walk you through several methods of generating feature importance, using the classic red wine quality dataset. By the end of this chapter, you should have a basic understanding of how to apply Random Forest feature importance to your own projects and how to compare the results of the different methods.
Table of Contents: Generate the Object Feature Importance Using Scikit learn and Random Forest in Machine Learning
- Red wine dataset and data training split
- Built-in Feature Importance with Scikit-learn
- Built-in Scikit-learn Method with a Random Feature
- Permutation Feature Importance
- Random Forest Feature Importance with SHAP
- Random Forest Path Feature Importance
- Full Python Script of Feature importance generator
- Data Science & Machine Learning Coursera Course Recommendation
Red wine dataset and data training split
For any machine learning model, obtaining a suitable dataset and preprocessing it properly is critical. Kaggle is one of the most popular platforms for finding datasets. Here is the link for the red wine quality project.
The first step is to load the data with Pandas and split it into training and test sets with Scikit-learn's train_test_split.
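import pandas as pd
from sklearn.model_selection import train_test_split

url = "winequality-red.csv"
# sep=";" matches the semicolon-delimited UCI file; adjust it if your copy is comma-delimited
wine_data = pd.read_csv(url, sep=";")
# Separate the predictors from the target column
x = wine_data.drop('quality', axis=1)
y = wine_data['quality']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=50)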
Built-in Feature Importance with Scikit-learn
Scikit-learn provides a built-in feature importance method for Random Forest models. According to the documentation, this method is based on the decrease in node impurity.
A decision tree works a bit like a guessing game: each split asks a question about one feature, and some questions eliminate more possibilities than others. The assumption is that features whose questions eliminate more possibilities (that is, reduce node impurity the most) are more important, because they get the model closer to the correct answer faster. It’s very simple to get these feature importances with Scikit-learn:
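from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=50)
rf.fit(x_train, y_train)  # feature_importances_ only exists after the model is fitted
inbuilt_importances = pd.Series(rf.feature_importances_, index=x_train.columns)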
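To see what these numbers represent, here is a minimal self-contained sketch on synthetic data. The column names ("strong", "weak", "noise") and the `_demo` variables are made up for illustration and are not part of the wine dataset. It suggests that the forest-level importances sum to 1 and match the average of the per-tree impurity-based importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): "strong" drives the target,
# "weak" contributes a little, and "noise" is unrelated to it.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.randn(300, 3), columns=["strong", "weak", "noise"])
y_demo = 3.0 * X_demo["strong"] + 1.0 * X_demo["weak"] + 0.1 * rng.randn(300)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=50)
rf_demo.fit(X_demo, y_demo)

# Forest-level, impurity-based importances
demo_importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)

# Each fitted tree exposes its own normalized importances; average them
per_tree = np.array([tree.feature_importances_ for tree in rf_demo.estimators_])
mean_per_tree = per_tree.mean(axis=0)

print(demo_importances.sort_values(ascending=False))
print("sum of importances:", demo_importances.sum())
```

Because the values are normalized, they are best read as relative shares within one model rather than as absolute effect sizes.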
Built-in Scikit-learn Method with a Random Feature
The simplest way to extend this method is to add a column of random noise to the dataset, refit the model, and compare every real feature's importance against that random baseline.
If a real feature has lower importance than the random feature, it could indicate that its importance is just due to chance.
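import numpy as np
import matplotlib.pyplot as plt
X_train_random = x_train.copy()
X_train_random["RANDOM"] = np.random.RandomState(42).randn(x_train.shape[0])  # one random value per row
rf_random = RandomForestRegressor(n_estimators=100, random_state=42)
rf_random.fit(X_train_random, y_train)  # refit on the data that includes the random column
importances_random = pd.Series(rf_random.feature_importances_, index=X_train_random.columns)
importances_random.sort_values().plot(kind="barh")
plt.title("Feature Importance - Scikit Learn Built-in with random")
plt.show()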
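The comparison against the random column can also be done in code rather than by reading the chart. The following is a minimal sketch on synthetic data. The column names "signal" and "irrelevant" and the `suspect_features` list are hypothetical, chosen only to illustrate the rule that anything at or below the RANDOM baseline deserves scepticism.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): only "signal" drives the target
rng = np.random.RandomState(42)
X_base = pd.DataFrame(rng.randn(300, 2), columns=["signal", "irrelevant"])
y_base = 3.0 * X_base["signal"] + 0.1 * rng.randn(300)

# Add the random baseline column, as in the article
X_base["RANDOM"] = rng.randn(len(X_base))

rf_base = RandomForestRegressor(n_estimators=100, random_state=42)
rf_base.fit(X_base, y_base)
importances_base = pd.Series(rf_base.feature_importances_, index=X_base.columns)

# Flag real features whose importance does not exceed the random baseline
baseline = importances_base["RANDOM"]
real_features = importances_base.drop("RANDOM")
suspect_features = real_features[real_features <= baseline].index.tolist()

print(importances_base.sort_values(ascending=False))
print("At or below the random baseline:", suspect_features)
```

A single random column is itself noisy, so repeating the check with a few different seeds may give a steadier picture.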
Permutation Feature Importance
Permutation feature importance is another technique to estimate the importance of each feature in a Random Forest model by measuring the change in the model’s performance when the feature’s values are randomly shuffled.
One of the advantages of this method is that it can be used with any model, not just Random Forests, which makes the results between models more comparable.
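As a rough sketch of how this can look in practice, the snippet below uses `permutation_importance` from `sklearn.inspection` on synthetic data. The data, the `_p` variable names, and the choice of `n_repeats=10` are assumptions made for illustration. The same call should work with the wine data and the fitted forest from the earlier sections.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): column 2 is pure noise
rng = np.random.RandomState(0)
X_p = rng.randn(400, 3)
y_p = 3.0 * X_p[:, 0] + 1.5 * X_p[:, 1] + 0.1 * rng.randn(400)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_p, y_p, test_size=0.5, random_state=50
)
model_p = RandomForestRegressor(n_estimators=100, random_state=50)
model_p.fit(Xp_train, yp_train)

# Shuffle each column of the held-out data n_repeats times and record the
# drop in the model's score (R^2 by default for a regressor)
result = permutation_importance(
    model_p, Xp_test, yp_test, n_repeats=10, random_state=50
)
perm_importances = result.importances_mean

for i in np.argsort(perm_importances)[::-1]:
    print(f"feature_{i}: {perm_importances[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

Scoring on the held-out split rather than the training data means the importances reflect what the model relies on to generalize, not what it memorized.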
Random Forest Feature Importance with SHAP
SHAP (SHapley Additive exPlanations) is a method for interpreting the output of machine learning models based on Shapley values from cooperative game theory.
It provides a unified measure of feature importance that, like the permutation importance, can be applied to any model.
Its main drawback is that it can be computationally expensive, especially for large datasets or complex models.
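To make the game-theory idea and its cost concrete, here is a brute-force sketch of exact Shapley values written with the standard library only. It does not use the shap package. The `toy_model`, `instance`, and `baseline` values are hypothetical, and "removing" a feature is approximated by swapping in its baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average marginal contribution of each
    feature over every coalition of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                known = frozenset(subset)
                phi[i] += weight * (value_fn(known | {i}) - value_fn(known))
    return phi

# Hypothetical toy model standing in for a trained predictor
def toy_model(row):
    return 2.0 * row[0] + 1.0 * row[1] + 0.0 * row[2]

instance = [1.0, 3.0, 5.0]   # the prediction we want to explain
baseline = [0.0, 0.0, 0.0]   # reference values used for "missing" features

def value_fn(known):
    # Features in the coalition keep their real value, the rest use the baseline
    row = [instance[j] if j in known else baseline[j] for j in range(len(instance))]
    return toy_model(row)

phi = shapley_values(value_fn, n_features=3)
print("Shapley values:", phi)
print("Sum of values:", sum(phi))
print("Prediction minus baseline:", toy_model(instance) - toy_model(baseline))
```

The inner loops visit every subset of the other features, so the work grows exponentially with the number of features. That is the source of the computational cost noted above, and it is why practical tools rely on faster model-specific or sampling-based approximations.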
Random Forest Path Feature Importance
Another way to understand how each feature contributes to the Random Forest predictions is to look at the decision tree paths that each instance takes.
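This approach follows an instance from the root to its leaf and, at each split, takes the difference between the prediction value at the node it moves into and the value at the node before it. Each difference is attributed to the feature used for that split, and the sum of these differences is the estimated contribution of each feature to the final prediction.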
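The idea can be sketched directly on top of Scikit-learn's fitted trees. The helper below, `path_contributions`, is a hypothetical name written for this illustration, not a library function. It assumes a single-output RandomForestRegressor and uses synthetic data. For each tree it walks the decision path of one sample and credits every change in node value to the feature that was split on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def path_contributions(forest, x_row):
    """Return (bias, contributions) for one sample, averaged over all trees."""
    x_row = np.asarray(x_row, dtype=float).reshape(1, -1)
    contributions = np.zeros(x_row.shape[1])
    bias = 0.0
    for est in forest.estimators_:
        tree = est.tree_
        values = tree.value[:, 0, 0]              # mean target value stored at each node
        path = est.decision_path(x_row).indices   # node ids from root to leaf
        bias += values[path[0]]                   # root value: prediction before any split
        for parent, child in zip(path[:-1], path[1:]):
            # Credit the change in value to the feature used at the parent split
            contributions[tree.feature[parent]] += values[child] - values[parent]
    n_trees = len(forest.estimators_)
    return bias / n_trees, contributions / n_trees

# Synthetic data (an assumption for illustration): column 2 is pure noise
rng = np.random.RandomState(0)
X_path = rng.randn(300, 3)
y_path = 3.0 * X_path[:, 0] + 1.0 * X_path[:, 1] + 0.1 * rng.randn(300)

forest_path = RandomForestRegressor(n_estimators=50, random_state=50)
forest_path.fit(X_path, y_path)

bias, contribs = path_contributions(forest_path, X_path[0])
prediction = forest_path.predict(X_path[:1])[0]

print("bias:", bias)
print("contributions per feature:", contribs)
print("bias + contributions:", bias + contribs.sum(), "prediction:", prediction)
```

Unlike the global methods above, this gives a per-prediction breakdown: the bias plus the feature contributions should add up to the forest's prediction for that sample. Averaging the absolute contributions over many samples is one way to turn it into a global ranking.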
Full Python Script of Feature importance generator
If you are interested in the full script for Chapter 76 – Generate the Object Feature Importance Using Scikit learn and Random Forest, please subscribe to our newsletter and add the message ‘Chapter 75 + notion api’. We will send the script to your mailbox.