5.4. Feature Selection#
5.4.1. Calculate Variance Inflation Factor (VIF)#
How to detect Multicollinearity?
Multicollinearity is a statistical phenomenon that occurs when two or more predictor variables in a multiple regression model are highly correlated. This can lead to unstable and inconsistent coefficients, making it difficult to interpret the model's results.
To measure multicollinearity, you can use the Variance Inflation Factor (VIF).
VIF is defined as the ratio of the variance of an estimated regression coefficient to what its variance would be if the predictor variables were uncorrelated.
A high VIF value (VIF > 5 or > 10) indicates that multicollinearity is present and may be a problem.
To calculate the VIF for a predictor variable, you regress that variable on all of the other predictor variables and then compute:
VIF = 1 / (1 - R^2)
where R^2 is the coefficient of determination of that auxiliary regression.
You can repeat this process for each predictor variable and compare the VIF values to determine which predictor variables contribute to multicollinearity.
Now you could drop the predictor variables with a high VIF and recalculate the VIF for the remaining ones to see how their values change.
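As a minimal sketch of this procedure (the DataFrame X of predictors and the helper name vif_for_column are illustrative placeholders), you could compute the VIF of a single predictor with scikit-learn:
from sklearn.linear_model import LinearRegression

def vif_for_column(X, column):
    # Regress the chosen predictor on all remaining predictors
    others = X.drop(columns=[column])
    r_squared = LinearRegression().fit(others, X[column]).score(others, X[column])
    # Apply VIF = 1 / (1 - R^2)
    return 1 / (1 - r_squared)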
Below you can see how to calculate VIF with statsmodels.
import pandas as pd
from sklearn.datasets import load_boston
from statsmodels.stats.outliers_influence import variance_inflation_factor
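# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this example needs an older scikit-learn version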
boston = load_boston()
X = pd.DataFrame(boston.data, columns = boston.feature_names)
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
'''
Predictor VIF
0 CRIM 2.100373
1 ZN 2.844013
2 INDUS 14.485758
3 CHAS 1.152952
4 NOX 73.894947
5 RM 77.948283
6 AGE 21.386850
7 DIS 14.699652
8 RAD 15.167725
9 TAX 61.227274
10 PTRATIO 85.029547
11 B 20.104943
12 LSTAT 11.102025
'''
5.4.2. Check for new categories in test set with Deepchecks#
Always check if your test set has new categories when training a Machine Learning Model.
Some algorithms like CatBoost can handle unknown categories.
But as the number of unknown categories grows, your model's performance will suffer.
Instead, check the mismatch beforehand with Deepchecks' CategoryMismatchTrainTest.
It will show you if there are new categories so you can handle them appropriately.
import pandas as pd
from deepchecks.tabular.checks.train_test_validation import CategoryMismatchTrainTest
checker = CategoryMismatchTrainTest()
X_train = pd.DataFrame([["A", "B", "C"], ["B", "B", "A"]], columns=["Col1", "Col2", "Col3"])
X_test = pd.DataFrame([["B", "C", "D"], ["D", "A", "B"]], columns=["Col1", "Col2", "Col3"])
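# The test set contains categories unseen during training (e.g. "D" in Col1), which the check should flag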
checker.run(X_train, X_test)
5.4.3. Get Permutation Importance with eli5#
Use the Permutation Importance method to obtain feature importances.
Permutation Importance calculates feature importance by randomly shuffling the values of a feature and observing how the modelβs performance changes.
In contrast to the impurity-based feature importance of tree-based models, Permutation Importance works for every model.
With eli5, you can calculate Permutation Importance with ease.
show_weights() will show you the features whose shuffling hurts the performance the most, so they are the most important ones.
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
target = iris.target
names = iris.target_names
X_train, X_test, y_train, y_test = train_test_split(X, target)
svc = SVC().fit(X_train, y_train)
perm = PermutationImportance(svc).fit(X_test, y_test)
eli5.show_weights(perm, feature_names= ["Feature_1", "Feature_2", "Feature_3", "Feature_4"])
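show_weights() renders an HTML table in a notebook. If you work in a plain script instead, one option (a small sketch using eli5's text formatter) is to print a text version:
# Print a text version of the permutation importances outside a notebook
print(eli5.format_as_text(eli5.explain_weights(perm, feature_names=["Feature_1", "Feature_2", "Feature_3", "Feature_4"])))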
5.4.4. Find the Most Predictive Variables for Your Target Variable#
You know about Correlation. But do you know the Predictive Power Score?
Predictive Power Score (PPS) is a data-type-agnostic score that can detect linear and non-linear relationships between two columns, with an output ranging from 0 to 1.
A PPS of 1 means Column A perfectly predicts the values of Column B, while a PPS of 0 means it has no predictive power.
You can use it to identify which variables are most useful to predict the target variable.
In Python, you can use the ppscore library.
It can calculate the PPS of all the features in a dataframe against a target.
!pip install ppscore
import ppscore
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
ppscore.predictors(df, "target")
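If you are interested in the pairwise scores between all columns, and not just against the target, ppscore also offers a matrix function:
# Compute the PPS for every pair of columns in the dataframe
ppscore.matrix(df)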
5.4.5. Feature Selection at Scale with mrmr#
Do you want to do Feature Selection automatically?
Try mrmr.
mrmr (minimum-Redundancy-Maximum-Relevance) is a minimal-optimal feature selection algorithm at scale.
That means mrmr will find the smallest subset of relevant features your ML model needs.
mrmr supports common tools like Pandas, Polars and Spark.
See below how to select the best K features.
The output is a ranked list of the relevant features.
!pip install mrmr_selection
import pandas as pd
from sklearn.datasets import make_classification
from mrmr import mrmr_classif
X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)
selected_features = mrmr_classif(X=X, y=y, K=10)
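# selected_features is a ranked list of the 10 selected columns, most relevant first
print(selected_features)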