5.3. Feature Selection#

5.3.1. Calculate Variance Inflation Factor (VIF)#

How to detect Multicollinearity?

Multicollinearity is a statistical phenomenon that occurs when two or more predictor variables in a multiple regression model are highly correlated. This can lead to unstable and inconsistent coefficients, making it difficult to interpret the model’s results.

To measure multicollinearity, you can use the Variance Inflation Factor (VIF).

VIF is defined as the ratio of the variance of an estimated regression coefficient to the variance of the coefficient when the predictor variables are not correlated.

A high VIF value (common rules of thumb are VIF > 5 or VIF > 10) indicates that multicollinearity is present and may be a problem.

To calculate the VIF for a predictor variable, regress that predictor on all of the other predictor variables, and then calculate the VIF with the following formula:

VIF = 1 / (1 - R^2)

where R^2 is the coefficient of determination of that auxiliary regression.

You can repeat this process for each predictor variable and compare the VIF values to determine which predictor variables contribute to multicollinearity.
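
To make the formula concrete, here is a minimal sketch (using synthetic data and placeholder column names) that computes the VIF of each predictor by hand with scikit-learn's LinearRegression:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X_toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def vif(column, X):
    # regress the chosen predictor on all remaining predictors
    others = X.drop(columns=[column])
    r_squared = LinearRegression().fit(others, X[column]).score(others, X[column])
    return 1 / (1 - r_squared)

for col in X_toy.columns:
    print(col, round(vif(col, X_toy), 2))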

You could then drop the predictor variables with a high VIF and recalculate the VIF for the remaining ones to see how their values change.

Below you can see how to calculate VIF with statsmodels.

import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston is only available in scikit-learn < 1.2
from statsmodels.stats.outliers_influence import variance_inflation_factor

boston = load_boston()

X = pd.DataFrame(boston.data, columns=boston.feature_names)

# VIF for each predictor: regress column i on all remaining columns
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
'''
   Predictor        VIF
0       CRIM   2.100373
1         ZN   2.844013
2      INDUS  14.485758
3       CHAS   1.152952
4        NOX  73.894947
5         RM  77.948283
6        AGE  21.386850
7        DIS  14.699652
8        RAD  15.167725
9        TAX  61.227274
10   PTRATIO  85.029547
11         B  20.104943
12     LSTAT  11.102025
'''
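
To apply the drop-and-recompute idea from above, you could iteratively remove the predictor with the highest VIF until every remaining value falls below a threshold. A minimal sketch, continuing with the X DataFrame from the statsmodels example (the threshold of 10 is just one common rule of thumb):

threshold = 10
X_reduced = X.copy()

while True:
    # recompute the VIF for the remaining predictors
    vifs = pd.Series(
        [variance_inflation_factor(X_reduced.values, i) for i in range(X_reduced.shape[1])],
        index=X_reduced.columns,
    )
    if vifs.max() <= threshold:
        break
    # drop the predictor with the highest VIF and repeat
    X_reduced = X_reduced.drop(columns=[vifs.idxmax()])

print(X_reduced.columns.tolist())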

5.3.2. Check for new categories in test set with Deepchecks#

Always check if your test set has new categories when training a Machine Learning Model.

Some algorithms like CatBoost can handle unknown categories.

But as the number of unknown categories grows, it will harm your model's performance.

Instead, check the mismatch beforehand with Deepchecks’ CategoryMismatchTrainTest.

It will show you if there are new categories so you can handle them appropriately.

import pandas as pd
from deepchecks.tabular.checks.train_test_validation import CategoryMismatchTrainTest

checker = CategoryMismatchTrainTest()

X_train = pd.DataFrame([["A", "B", "C"], ["B", "B", "A"]], columns=["Col1", "Col2", "Col3"])
X_test = pd.DataFrame([["B", "C", "D"], ["D", "A", "B"]], columns=["Col1", "Col2", "Col3"])

checker.run(X_train, X_test)
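
How you handle the reported mismatch depends on your pipeline. One simple strategy (just an assumption, not part of Deepchecks) is to map categories that never appear in the training set to a placeholder before encoding:

# replace categories unseen during training with a placeholder value
for col in X_test.columns:
    known = set(X_train[col])
    X_test[col] = X_test[col].where(X_test[col].isin(known), other="UNKNOWN")

print(X_test)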

5.3.3. Get Permutation Importance with eli5#

Use the Permutation Importance method to obtain feature importances.

Permutation Importance calculates feature importance by randomly shuffling the values of a feature and observing how the model’s performance changes.

In comparison to Feature Importance, Permutation Importance works for every model (and not only for tree-based models).

With eli5, you can calculate Permutation Importance with ease.

show_weights() will show you the features that hurt the performance most when shuffled, so they are the most important ones.

import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris() 
X = iris.data 
target = iris.target 
names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(X, target)

svc = SVC().fit(X_train, y_train)
perm = PermutationImportance(svc).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=["Feature_1", "Feature_2", "Feature_3", "Feature_4"])
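
To see what happens under the hood, here is a rough manual sketch of the same idea, reusing svc, X_test and y_test from above (a single shuffle per feature, whereas eli5 averages over several iterations):

import numpy as np

# shuffle one feature at a time and measure how much the accuracy drops
baseline = svc.score(X_test, y_test)
rng = np.random.default_rng(0)

for i in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    rng.shuffle(X_shuffled[:, i])  # destroy the information in feature i
    print(f"Feature_{i + 1}: accuracy drop = {baseline - svc.score(X_shuffled, y_test):.3f}")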

5.3.4. Find the Most Predictive Variables for Your Target Variable#

You know about Correlation. But do you know the Predictive Power Score?

Predictive Power Score (PPS) is a data-type-agnostic score that can detect linear and non-linear relationships between two columns, with an output ranging from 0 to 1.

So a PPS of 1 means Column A predicts the values of Column B perfectly, while a PPS of 0 means Column A has no predictive power for Column B.

You can use it to identify which variables are most useful to predict the target variable.

In Python, you can use the ppscore library.

It can calculate the PPS of all the features in a dataframe against a target.

!pip install ppscore
import ppscore
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
ppscore.predictors(df, "target")
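
ppscore also lets you score a single pair of columns or compute the full matrix of all pairwise scores:

# PPS of a single column pair
ppscore.score(df, "petal width (cm)", "target")

# PPS matrix: every column against every other column
ppscore.matrix(df)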

5.3.5. Feature Selection at Scale with mrmr#

Do you want to do Feature Selection automatically?

Try mrmr.

mrmr (minimum-Redundancy-Maximum-Relevance) is a minimal-optimal feature selection algorithm at scale.

This means mrmr will find the smallest subset of relevant features your ML model needs.

mrmr supports common tools like Pandas, Polars and Spark.

See below how to select the best K features.

The output is a ranked list of the relevant features.

!pip install mrmr_selection
import pandas as pd
from sklearn.datasets import make_classification
from mrmr import mrmr_classif

X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)

selected_features = mrmr_classif(X=X, y=y, K=10)
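
mrmr_classif returns the names of the selected columns, so you can subset your DataFrame directly for training:

# keep only the mrmr-selected columns
X_selected = X[selected_features]
print(X_selected.shape)  # (1000, 10)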