5.3. Feature Selection#

5.3.1. Calculate Variance Inflation Factor (VIF)#

How to detect Multicollinearity?

Multicollinearity is a statistical phenomenon that occurs when two or more predictor variables in a multiple regression model are highly correlated. This can lead to unstable and inconsistent coefficients, making it difficult to interpret the model’s results.

To measure multicollinearity, you can use the Variance Inflation Factor (VIF).

VIF is defined as the ratio of the variance of an estimated regression coefficient to the variance of the coefficient when the predictor variables are not correlated.

A high VIF value (common rules of thumb are VIF > 5 or VIF > 10) indicates that multicollinearity is present and may be a problem.

To calculate the VIF for a predictor variable, regress that predictor on all of the other predictor variables, and then calculate the VIF with the following formula:

VIF = 1 / (1 - R^2)

where R^2 is the coefficient of determination of that auxiliary regression.

You can repeat this process for each predictor variable and compare the VIF values to determine which predictor variables contribute to multicollinearity.
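
To make the formula concrete, here is a minimal sketch (using synthetic data and placeholder column names) that computes the VIF of each predictor by hand with scikit-learn's LinearRegression:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X_toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def vif(column, X):
    # regress the chosen predictor on all remaining predictors
    others = X.drop(columns=[column])
    r_squared = LinearRegression().fit(others, X[column]).score(others, X[column])
    return 1 / (1 - r_squared)

for col in X_toy.columns:
    print(col, round(vif(col, X_toy), 2))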

You could then drop the predictor variables with a high VIF and recalculate the VIF for the remaining ones to see how their values change.

Below you can see how to calculate VIF with statsmodels.

import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston is only available in scikit-learn < 1.2
from statsmodels.stats.outliers_influence import variance_inflation_factor

boston = load_boston()

X = pd.DataFrame(boston.data, columns=boston.feature_names)

# VIF for each predictor: regress column i on all remaining columns
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
'''
   Predictor        VIF
0       CRIM   2.100373
1         ZN   2.844013
2      INDUS  14.485758
3       CHAS   1.152952
4        NOX  73.894947
5         RM  77.948283
6        AGE  21.386850
7        DIS  14.699652
8        RAD  15.167725
9        TAX  61.227274
10   PTRATIO  85.029547
11         B  20.104943
12     LSTAT  11.102025
'''
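
To apply the drop-and-recompute idea from above, you could iteratively remove the predictor with the highest VIF until every remaining value falls below a threshold. A minimal sketch, continuing with the X DataFrame from the statsmodels example (the threshold of 10 is just one common rule of thumb):

threshold = 10
X_reduced = X.copy()

while True:
    # recompute the VIF for the remaining predictors
    vifs = pd.Series(
        [variance_inflation_factor(X_reduced.values, i) for i in range(X_reduced.shape[1])],
        index=X_reduced.columns,
    )
    if vifs.max() <= threshold:
        break
    # drop the predictor with the highest VIF and repeat
    X_reduced = X_reduced.drop(columns=[vifs.idxmax()])

print(X_reduced.columns.tolist())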

5.3.2. Check for new categories in test set with Deepchecks#

Always check if your test set has new categories when training a Machine Learning Model.

Some algorithms like CatBoost can handle unknown categories.

But as the number of unknown categories grows, it will harm your model's performance.

Instead, check the mismatch beforehand with Deepchecks’ CategoryMismatchTrainTest.

It will show you if there are new categories so you can handle them appropriately.

import pandas as pd
from deepchecks.tabular.checks.train_test_validation import CategoryMismatchTrainTest

checker = CategoryMismatchTrainTest()

X_train = pd.DataFrame([["A", "B", "C"], ["B", "B", "A"]], columns=["Col1", "Col2", "Col3"])
X_test = pd.DataFrame([["B", "C", "D"], ["D", "A", "B"]], columns=["Col1", "Col2", "Col3"])

checker.run(X_train, X_test)
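
How you handle the reported mismatch depends on your pipeline. One simple strategy (just an assumption, not part of Deepchecks) is to map categories that never appear in the training set to a placeholder before encoding:

# replace categories unseen during training with a placeholder value
for col in X_test.columns:
    known = set(X_train[col])
    X_test[col] = X_test[col].where(X_test[col].isin(known), other="UNKNOWN")

print(X_test)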

5.3.3. Get Permutation Importance with eli5#

Use the Permutation Importance method to obtain feature importances.

Permutation Importance calculates feature importance by randomly shuffling the values of a feature and observing how the model’s performance changes.

In comparison to Feature Importance, Permutation Importance works for every model (and not only for tree-based models).

With eli5, you can calculate Permutation Importance with ease.

show_weights() will show you the features that hurt the performance most when shuffled, so they are the most important ones.

import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris() 
X = iris.data 
target = iris.target 
names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(X, target)

svc = SVC().fit(X_train, y_train)
perm = PermutationImportance(svc).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=["Feature_1", "Feature_2", "Feature_3", "Feature_4"])
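
To see what happens under the hood, here is a rough manual sketch of the same idea, reusing svc, X_test and y_test from above (a single shuffle per feature, whereas eli5 averages over several iterations):

import numpy as np

# shuffle one feature at a time and measure how much the accuracy drops
baseline = svc.score(X_test, y_test)
rng = np.random.default_rng(0)

for i in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    rng.shuffle(X_shuffled[:, i])  # destroy the information in feature i
    print(f"Feature_{i + 1}: accuracy drop = {baseline - svc.score(X_shuffled, y_test):.3f}")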

5.3.4. Find the Most Predictive Variables for Your Target Variable#

You know about Correlation. But do you know the Predictive Power Score?

Predictive Power Score (PPS) is a data-type-agnostic score that can detect linear and non-linear relationships between two columns, with an output ranging from 0 to 1.

So a PPS of 1 means Column A predicts the values of Column B perfectly, while a PPS of 0 means Column A has no predictive power for Column B.

You can use it to identify which variables are most useful to predict the target variable.

In Python, you can use the ppscore library.

It can calculate the PPS of all the features in a dataframe against a target.

!pip install ppscore
import ppscore
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
ppscore.predictors(df, "target")
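
ppscore also lets you score a single pair of columns or compute the full matrix of all pairwise scores:

# PPS of a single column pair
ppscore.score(df, "petal width (cm)", "target")

# PPS matrix: every column against every other column
ppscore.matrix(df)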

5.3.5. Feature Selection at Scale with mrmr#

Do you want to do Feature Selection automatically?

Try mrmr.

mrmr (minimum-Redundancy-Maximum-Relevance) is a minimal-optimal feature selection algorithm at scale.

This means mrmr will find the smallest subset of relevant features your ML model needs.

mrmr supports common tools like Pandas, Polars and Spark.

See below how to select the best K features.

The output is a ranked list of the relevant features.

!pip install mrmr_selection
import pandas as pd
from sklearn.datasets import make_classification
from mrmr import mrmr_classif

X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)

selected_features = mrmr_classif(X=X, y=y, K=10)
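
mrmr_classif returns the names of the selected columns, so you can subset your DataFrame directly for training:

# keep only the mrmr-selected columns
X_selected = X[selected_features]
print(X_selected.shape)  # (1000, 10)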