5.6. Outlier Detection

5.6.1. Ensembling for Outlier Detection

Because they are unsupervised, outlier detection methods often suffer from model instability.

So, why not combine various models?

Try PyOD!

PyOD is an easy-to-use library for outlier detection.

It includes more than 30 algorithms, such as density-based methods and ensembles.

PyOD also supports several ways of combining the outputs of multiple detectors:

  • Average of scores

  • Maximization of scores

  • Average of Maximum of scores

  • Maximum of Average of scores

  • Majority Vote

To combine multiple models in Python, consider the example below.

  • We define 3 outlier detectors.

  • We calculate the labels for every detector (0 = inlier, 1 = outlier).

  • We use the majority_vote() function to get the highest-voted label for each sample.

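!pip install pyod
import numpy as np
from pyod.models.combination import majority_vote
from pyod.models.knn import KNN
from pyod.models.abod import ABOD
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

# Generate a synthetic training set; the ground-truth labels are not needed.
X, _ = generate_data(train_only=True)

models = [KNN(), ABOD(), IForest()]

# One column of binary labels per detector (0 = inlier, 1 = outlier).
labels = np.zeros((X.shape[0], len(models)))

for i, model in enumerate(models):
    model.fit(X)
    # labels_ holds the binary labels assigned to the training data.
    labels[:, i] = model.labels_

# Highest-voted label per sample across the three detectors.
majority_vote(labels)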
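
To see what the voting step does, here is a small dependency-free sketch that picks the most common label in each row. The helper name majority_vote_sketch and the label matrix are hypothetical, and this is an illustration of the idea rather than PyOD's implementation.

```python
from collections import Counter


def majority_vote_sketch(labels):
    """Return the most common label in each row (one row per sample)."""
    return [Counter(row).most_common(1)[0][0] for row in labels]


# Hypothetical labels from 3 detectors for 4 samples (0 = inlier, 1 = outlier).
labels = [
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
]

print(majority_vote_sketch(labels))  # [0, 1, 0, 1]
```

With binary labels, using an odd number of detectors (as in the example above) avoids ties.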

5.6.2. Robust Outlier Detection with puncc

Outlier detection is notoriously hard.

But it doesn’t have to be.

puncc offers outlier detection powered by Conformal Prediction, which calibrates the detection threshold on held-out data.

This lets you set a maximum false detection rate, so false alarms are kept under control.

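!pip install puncc
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import IsolationForest
from deel.puncc.anomaly_detection import SplitCAD
from deel.puncc.api.prediction import BasePredictor

# Illustrative data only (an assumption, not part of the original example):
# a two-moons training set and uniformly sampled points to test.
dataset = make_moons(n_samples=1000, noise=0.05, random_state=0)[0]
new_data = np.random.default_rng(42).uniform(low=-2, high=3, size=(150, 2))

# score_samples returns lower values for more abnormal points, so we negate it
# in predict to obtain nonconformity scores (higher = more anomalous).
class ADPredictor(BasePredictor):
    def predict(self, X):
        return -self.model.score_samples(X)

# Wrap Isolation Forest in a predictor
if_predictor = ADPredictor(IsolationForest())

# Instantiate CAD on top of IF predictor
if_cad = SplitCAD(if_predictor, train=True)

# Fit the model on 70% of the data and calibrate the threshold on the rest
if_cad.fit(z=dataset, fit_ratio=0.7)

# Maximum false detection rate
alpha = 0.01

# Boolean array: True where a new point is flagged as anomalous
results = if_cad.predict(new_data, alpha=alpha)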
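
For intuition about how a calibrated threshold can bound false alarms, here is a dependency-free sketch of the general split-conformal idea using conformal p-values. It is not puncc's internal code: the function names and the calibration scores are hypothetical, and it assumes the calibration points are inliers that are exchangeable with future inliers.

```python
def conformal_p_value(calibration_scores, new_score):
    """Share of calibration scores at least as extreme as new_score (+1 correction)."""
    n_at_least = sum(1 for s in calibration_scores if s >= new_score)
    return (1 + n_at_least) / (len(calibration_scores) + 1)


def is_outlier(calibration_scores, new_score, alpha):
    """Flag the point when its conformal p-value does not exceed alpha."""
    return conformal_p_value(calibration_scores, new_score) <= alpha


# Hypothetical nonconformity scores of 19 held-out inliers (higher = more anomalous).
calibration_scores = [float(s) for s in range(1, 20)]
alpha = 0.1

print(conformal_p_value(calibration_scores, 25.0))  # 0.05
print(is_outlier(calibration_scores, 25.0, alpha))  # True
print(conformal_p_value(calibration_scores, 10.0))  # 0.55
print(is_outlier(calibration_scores, 10.0, alpha))  # False
```

Under that assumption, a new inlier is flagged with probability at most alpha. Note that the smallest achievable p-value is 1 / (n + 1) for n calibration points, so a rate such as alpha = 0.01 needs at least 99 calibration points before anything can be flagged.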