5.5. Model Training#

5.5.1. Compute Class Weights#

Class imbalance in machine learning can be handled in several ways.

One of them is adjusting the class weights.

By giving higher weights to the minority class and lower weights to the majority class, we rescale each sample's contribution to the loss function.

Misclassifying a minority-class sample then results in a higher loss because of its higher weight.

To incorporate class weights in TensorFlow, use scikit-learn's compute_class_weight function:

import numpy as np
import tensorflow as tf
from sklearn.utils import compute_class_weight

X, y = ...

# will return an array with weights for each class, e.g. [0.6, 0.6, 1.]
class_weights = compute_class_weight(
  class_weight="balanced",
  classes=np.unique(y),
  y=y
)

# to get a dictionary with {<class>: <weight>}
# (zipping with the labels also works when the classes are not 0..n-1)
class_weights = dict(zip(np.unique(y), class_weights))

model = tf.keras.Sequential(...)
model.compile(...)

# using class_weights in the .fit() method
model.fit(X, y, class_weight=class_weights, ...)
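
For intuition, the "balanced" heuristic can be reproduced by hand: each class receives the weight n_samples / (n_classes * class_count). The following is a minimal pure-Python sketch of that formula; the helper name balanced_class_weights and the toy labels are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# toy labels: 6x class 0, 3x class 1, 1x class 2
y_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(y_toy)
print(weights)
```

The rarest class ends up with the largest weight, so its errors count most in the loss.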

5.5.2. Reset TensorFlow/Keras Global State#

In TensorFlow/Keras, when you create multiple models in a loop, you should call tf.keras.backend.clear_session().

Keras manages a global state, which includes configurations and the current values (weights and biases) of the models.

So when you create models in a loop, the global state grows with every model created. Running del model will not clear the state because it only deletes the Python variable.

tf.keras.backend.clear_session() is the better option. It resets the global state and helps avoid clutter from old models.

See the first example below. Each iteration of this loop increases the size of the global state and your memory consumption.

In the second example, memory consumption stays constant because the state is cleared in every iteration.

import tensorflow as tf

def create_model():
  model = tf.keras.Sequential(...)
  return model

# without clearing session
for _ in range(20):
  model = create_model()
  
# with clearing session
for _ in range(20):
  tf.keras.backend.clear_session()
  model = create_model()

5.5.3. Find dirty labels with cleanlab#

Do you want to identify noisy labels in your dataset?

Try cleanlab for Python.

cleanlab is a data-centric AI package that automatically detects noisy labels and other dataset issues using confident learning algorithms, so that you can fix them.

It works with nearly every model possible:

  • XGBoost

  • scikit-learn models

  • Tensorflow

  • PyTorch

  • HuggingFace

  • etc.

!pip install cleanlab
import cleanlab
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

clf = RandomForestClassifier(n_estimators=100)

cl = cleanlab.classification.CleanLearning(clf)

label_issues = cl.find_label_issues(X, y)

print(label_issues.query('is_label_issue == True'))
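
To give a feel for the idea behind confident learning, here is a heavily simplified pure-Python sketch. It is not cleanlab's implementation: the function name, the toy probabilities, and the exact flagging rule are assumptions for illustration. Each class gets a threshold (the average predicted probability for that class among examples carrying that label), and an example is flagged when the model is confidently pointing at a different class than the given label.

```python
def find_label_issues_sketch(probs, labels):
    """Hypothetical, simplified confident-learning rule (not cleanlab's code)."""
    n_classes = len(probs[0])
    # Per-class threshold: mean confidence in class k among examples labelled k.
    # Assumes every class appears at least once in `labels`.
    thresholds = []
    for k in range(n_classes):
        members = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(members) / len(members))

    issues = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                issues.append(i)
    return issues

# toy out-of-sample predicted probabilities and the given (possibly noisy) labels
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]

suspects = find_label_issues_sketch(probs, labels)
print(suspects)
```

Only the last example is flagged: it is labelled 1, but the model is very confident it belongs to class 0.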

5.5.4. Evaluate your Classifier with sklearn's classification_report#

Would you like to evaluate your Machine Learning model quickly?

Try classification_report from scikit-learn.

With classification_report, you can quickly assess the performance of your model.

It summarizes Precision, Recall, F1-Score, and Support for each class.

# small example using scikit-learn's classification_report
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
"""
             precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
"""
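
To see where these numbers come from, the per-class metrics can be recomputed by hand from true positives, false positives, and false negatives. This is a small pure-Python sketch on the same toy labels; the helper per_class_metrics is a made-up name and is not part of scikit-learn.

```python
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

report = {cls: per_class_metrics(y_true, y_pred, cls) for cls in sorted(set(y_true))}
for cls, (p, r, f1, support) in report.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

Support is simply the number of true samples of each class, which is why it appears in the weighted average.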

5.5.5. Obtain Reproducible Optimization Results in Optuna#

Optuna is a powerful hyperparameter optimization framework that supports many machine learning frameworks, including TensorFlow, PyTorch, and XGBoost.

But you need to take care to get reproducible results from hyperparameter tuning.

To achieve reproducible results, you need to set the seed for your Sampler.

Below you can see how it is done for TPESampler.

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    ...
    
sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)

5.5.6. Find bad labels with doubtlab#

Do you want to find bad labels in your data?

Try doubtlab for Python.

With doubtlab, you can define reasons to doubt your labels and take a closer look.

Reasons to doubt your labels can be for example:

  • ProbaReason: When the confidence values are low for any label

  • WrongPredictionReason: When the model's prediction does not match the listed label

  • DisagreeReason: When two models disagree on a prediction

  • RelativeDifferenceReason: When the relative difference between label and prediction is too high

So, identify your noisy labels and fix them.

!pip install doubtlab
from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression()
model.fit(X, y)

# Define reasons to check
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model),
}

# Pass reasons to a DoubtEnsemble instance
doubt = DoubtEnsemble(**reasons)

# Returns DataFrame with reasoning
predicates = doubt.get_predicates(X, y)
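
The logic behind the first two reasons is simple enough to sketch without the library. The functions below are hypothetical stand-ins written in plain Python, not doubtlab's code, and the 0.55 confidence threshold is an assumed value chosen for illustration.

```python
def proba_reason(probas, max_proba=0.55):
    """Flag rows whose highest class probability is below max_proba."""
    return [max(row) < max_proba for row in probas]

def wrong_prediction_reason(probas, labels):
    """Flag rows whose most likely class differs from the given label."""
    return [row.index(max(row)) != label for row, label in zip(probas, labels)]

# toy predicted probabilities and the labels found in the dataset
probas = [[0.9, 0.1], [0.52, 0.48], [0.2, 0.8], [0.3, 0.7]]
labels = [0, 0, 1, 0]

low_confidence = proba_reason(probas)
wrong = wrong_prediction_reason(probas, labels)

# the more reasons fire for a row, the more it deserves a closer look
doubt_score = [int(a) + int(b) for a, b in zip(low_confidence, wrong)]
print(doubt_score)
```

Rows with a non-zero score are the ones worth inspecting first.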

5.5.7. Get notified when your model is finished with training#

Never stare at your screen, waiting for your model to finish training.

Try knockknock for Python.

knockknock is a library that notifies you when your training is finished.

You only need to add a decorator.

Currently, you can get a notification through 12 different channels like:

  • Email

  • Slack

  • Telegram

  • Discord

  • MS Teams

Use it for your next model training so you don't have to stay glued to your screen.

!pip install knockknock
from knockknock import email_sender

@email_sender(recipient_emails=["coolmail@python.com", "2coolmail@python.com"], sender_email="anothercoolmail@python.com")
def train_model(model, X, y):
    model.fit(X, y)
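
If you are curious how such a decorator can work, here is a minimal sketch using only the standard library. It is not knockknock's implementation: notify_when_done and its send parameter are hypothetical, and the message is handed to an arbitrary callable instead of being mailed.

```python
import functools
import time

def notify_when_done(send=print):
    """Hypothetical notification decorator; `send` is any callable taking a string."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # report crashes as well, then re-raise the original error
                send(f"{func.__name__} crashed: {exc!r}")
                raise
            elapsed = time.perf_counter() - start
            send(f"{func.__name__} finished in {elapsed:.2f}s")
            return result
        return wrapper
    return decorator

messages = []

@notify_when_done(send=messages.append)
def train_model():
    return sum(i * i for i in range(1000))  # placeholder for real training

result = train_model()
print(messages)
```

Swapping messages.append for a function that posts to a chat webhook or sends an email would turn this into a real notifier.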

5.5.8. Get Model Summary in PyTorch with torchinfo#

Do you want a Model summary in PyTorch?

Like in Keras with model.summary()?

Use torchinfo.

With torchinfo, you can get a model summary as you know it from Keras.

Just add one line of code.

!pip install torchinfo
import torch
from torchinfo import summary

class MyModel(torch.nn.Module):
  ...
  
model = MyModel()

BATCH_SIZE = 16
summary(model, input_size=(BATCH_SIZE, 1, 28, 28))
'''
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Net                                      [16, 10]                  --
├─Sequential: 1-1                        [16, 4, 7, 7]             --
│    └─Conv2d: 2-1                       [16, 4, 28, 28]           40
│    └─BatchNorm2d: 2-2                  [16, 4, 28, 28]           8
│    └─ReLU: 2-3                         [16, 4, 28, 28]           --
│    └─MaxPool2d: 2-4                    [16, 4, 14, 14]           --
│    └─Conv2d: 2-5                       [16, 4, 14, 14]           148
│    └─BatchNorm2d: 2-6                  [16, 4, 14, 14]           8
│    └─ReLU: 2-7                         [16, 4, 14, 14]           --
│    └─MaxPool2d: 2-8                    [16, 4, 7, 7]             --
├─Sequential: 1-2                        [16, 10]                  --
│    └─Linear: 2-9                       [16, 10]                  1,970
==========================================================================================
Total params: 2,174
Trainable params: 2,174
Non-trainable params: 0
Total mult-adds (M): 1.00
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 1.00
Params size (MB): 0.01
Estimated Total Size (MB): 1.06
==========================================================================================
'''
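
The parameter counts in such a summary can be verified with a little arithmetic. The sketch below assumes 3x3 convolution kernels with biases and a linear layer fed by 4 * 7 * 7 flattened features; these are inferences from the shapes and counts in the listing, not details given by the model code.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # one kernel per (input, output) channel pair, plus one bias per output channel
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def batchnorm2d_params(channels):
    # learnable scale and shift per channel
    return 2 * channels

def linear_params(in_features, out_features):
    # weight matrix plus one bias per output
    return in_features * out_features + out_features

layer_params = {
    "Conv2d: 2-1": conv2d_params(1, 4, 3),
    "BatchNorm2d: 2-2": batchnorm2d_params(4),
    "Conv2d: 2-5": conv2d_params(4, 4, 3),
    "BatchNorm2d: 2-6": batchnorm2d_params(4),
    "Linear: 2-9": linear_params(4 * 7 * 7, 10),
}
total_params = sum(layer_params.values())
print(layer_params, total_params)
```

ReLU and MaxPool2d have no learnable parameters, which is why they show "--" in the summary.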

5.5.9. Boost scikit-learn's performance with Intel Extension#

Scikit-learn is one of the most popular ML packages for Python.

But, to be honest, its algorithms are not the fastest ones.

With Intel's Extension for scikit-learn, scikit-learn-intelex, you can speed up training time for some popular algorithms like:

  • Support Vector Classifier/Regressor

  • Random Forest Classifier/Regressor

  • LASSO

  • DBSCAN

Just add two lines of code.

!pip install scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=100000,
    n_features=10,
    noise=0.5,
)

svr = SVR()

svr.fit(X, y)

5.5.10. Incorporate Domain Knowledge into XGBoost with Feature Interaction Constraints#

Want to incorporate your domain knowledge into XGBoost?

Try using Feature Interaction Constraints.

Feature Interaction Constraints allow you to control which features are allowed to interact with each other and which are not while building the trees.

For example, the constraint [0, 1] means that Feature_0 and Feature_1 are allowed to interact with each other but with no other variable. Similarly, [3, 5, 9] means that Feature_3, Feature_5, and Feature_9 are allowed to interact with each other but with no other variable.

With this in mind, you can define feature interaction constraints:

  • Based on domain knowledge, when you know that some feature interactions will lead to better results

  • Based on regulatory constraints in your industry/company where some features cannot interact with each other.

import xgboost as xgb

X, y = ...

dmatrix = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "interaction_constraints": [[0, 2], [1, 3, 4]]
}

model_with_constraints = xgb.train(params, dmatrix)
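
One way to read such constraints: all features used along a single root-to-leaf path must come from the same group. The pure-Python check below sketches that reading for non-overlapping groups. The helper path_allowed is hypothetical and is not part of XGBoost, whose handling of overlapping groups is more subtle.

```python
def path_allowed(path_features, constraints):
    """True if every feature on the path belongs to one common constraint group."""
    used = set(path_features)
    return any(used <= set(group) for group in constraints)

constraints = [[0, 2], [1, 3, 4]]

# splits on features 0 and 2 may share a branch
same_group = path_allowed([0, 2, 0], constraints)
# splits on features 1 and 4 may share a branch
other_group = path_allowed([1, 4], constraints)
# features 0 and 1 are in different groups, so they must not interact
mixed = path_allowed([0, 1], constraints)

print(same_group, other_group, mixed)
```

A tree built under these constraints would never contain the third kind of path.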

5.5.11. Powerful AutoML with FLAML#

Do you always hear about AutoML?

And want to try it out?

Use FLAML for Python.

FLAML (Fast and Lightweight AutoML) is an AutoML package developed by Microsoft.

It can do Model Selection, Hyperparameter tuning, and Feature Engineering automatically.

Thus, it removes the pain of choosing the best model and parameters so that you can focus more on your data.

By default, its estimator list contains only tree-based models like XGBoost, CatBoost, and LightGBM. But you can also add custom models.

A powerful library!

!pip install flaml
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")

5.5.12. Aspect-based Sentiment Analysis with PyABSA#

Traditional sentiment analysis focuses on determining the overall sentiment of a piece of text.

For example, the sentence:

"The food was bad and the staff was rude"

would output only a negative sentiment.

But what if I want to extract which aspects have a negative or positive sentiment?

That's the task of aspect-based sentiment analysis.

It aims to identify and extract the sentiment expressed towards specific aspects of a text.

For the sentence:

"The battery life is excellent but the camera quality is bad."

a model's output would be:

  • Battery life: positive

  • Camera quality: negative

With aspect-based sentiment analysis, you can understand the opinions and feelings expressed about specific aspects.

To do that in Python, use the package PyABSA.

It contains pre-trained models with an easy-to-use API for aspect-term extraction and sentiment classification.

PyABSA can be used for a variety of applications, such as:

  • Customer feedback analysis

  • Product reviews analysis

  • Social media monitoring

!pip install pyabsa==1.16.27
from pyabsa import ATEPCCheckpointManager

extractor = ATEPCCheckpointManager.get_aspect_extractor(
    checkpoint="multilingual",
    auto_device=False,
)

example = ["Location and food were excellent but staff was very unfriendly."]
result = extractor.extract_aspect(inference_source=example, pred_sentiment=True)

print(result)

5.5.13. Use XGBoost for Random Forests#

Are you still using Random Forests from sklearn?

XGBoost implements Random Forests too, and training is often much faster than with sklearn.

import numpy as np
from xgboost import XGBRFRegressor

xgbrf = XGBRFRegressor(n_estimators=100)

X = np.random.rand(100000, 10)
y = np.random.rand(100000)

xgbrf.fit(X, y)

5.5.14. Identify problematic images with cleanvision#

Is your Deep Learning model not performing well?

It's probably because of your data.

With cleanvision, you can detect issues in image data.

cleanvision is a relatively new data-centric AI package to find problems in your image dataset.

It can detect issues like:

  • Exact or Near Duplicates

  • Blurry Images

  • Odd Aspect Ratios

  • Irregularly Dark/Light images

  • Images lacking content

A good first step to try before applying crazy Vision Transformers.

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
!unzip -q image_files.zip
!pip install cleanvision
from cleanvision.imagelab import Imagelab

# Path to your dataset, you can specify your own dataset path
dataset_path = "./image_files/"

# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# Find issues
imagelab.find_issues()
# Get summary of issues with prevalence per issue
imagelab.issue_summary
# Visualize Top examples for blurry images
imagelab.visualize(issue_types=['blurry'])

5.5.15. Select the optimal Regularization Parameter#

How do you choose your Regularization Parameter?

Your model's complexity decreases with a higher Regularization Parameter (Alpha).

It shouldn't be too high or too low.

Yellowbrick's AlphaSelection can help you to find the best Alpha.

It takes your model and visualizes the Alpha/Error curve so you can see how the model's error responds to different alpha values.

Below you can see how to do it with scikit-learn's LassoCV.

import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import AlphaSelection

X, y = load_concrete()

# Create a list of alphas to cross-validate against
# (start slightly above 0, since alpha=0 removes the Lasso penalty entirely)
alphas = np.linspace(0.01, 10, 30)

model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
visualizer.show()

5.5.16. Decision Forests in TensorFlow#

Did you know there are Decision Forests from TensorFlow?

tensorflow_decision_forests implements decision forest models like Random Forest or GBDT for classification, regression, and ranking.

!pip install tensorflow_decision_forests
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_decision_forests as tfdf
dataset_path = tf.keras.utils.get_file(
      "adult.csv",
      "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/"
      "main/yggdrasil_decision_forests/test_data/dataset/adult.csv")

dataset_df = pd.read_csv(dataset_path)
test_indices = np.random.rand(len(dataset_df)) < 0.30
test_ds_pd = dataset_df[test_indices]
train_ds_pd = dataset_df[~test_indices]


train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label="income")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label="income")
model = tfdf.keras.GradientBoostedTreesModel(verbose=2)
model.fit(train_ds)

# summary() prints the report itself
model.summary()

5.5.17. AutoML with AutoGluon#

Do you always hear about AutoML?

And want to try it out?

Use AutoGluon for Python.

AutoGluon is a Python package from AWS.

It lets you perform AutoML on:

  • Tabular Data (Classification, Regression)

  • Time Series Data

  • Multimodal Data (Images + Text + Tabular)

Thus, it removes the pain of choosing the best model and the best parameters.

AutoGluon also offers utilities for EDA, like:

  • Detecting Covariate Shift

  • Target Variable Analysis

  • Feature Interaction Charts

See below for a quickstart for tabular data.

!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

predictor = TabularPredictor(label='class').fit(train_data, time_limit=240)
predictor.leaderboard(test_data)

5.5.18. Visualize Keras Models with visualkeras#

Do you want some cool visualization for your Deep Learning Models?

Try visualkeras.

visualkeras visualizes your Keras models (as an alternative to model.summary()).

!pip install visualkeras
import visualkeras

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

visualkeras.layered_view(model, legend=True, to_file='output.png').show()

5.5.19. Perform Multilabel Stratified KFold with iterative-stratification#

When doing Cross-Validation for classification, StratifiedKFold from scikit-learn is a common choice.

Stratification aims to guarantee that every fold represents all strata of the data.

But scikit-learn doesn't support stratifying multilabel data.

For this use case, try the iterative-stratification package.

It offers implementations for stratifying multilabel data in different ways.

See below how we can use MultilabelStratifiedKFold.

!pip install iterative-stratification
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

5.5.20. Interpret your Model with Shapash#

Nobody cares about your SOTA ML Model if nobody can understand its predictions.

Therefore, interpretability of ML Models is crucial in industry use cases.

To overcome this hurdle, use Shapash for Python.

Shapash offers several types of interpretability methods to understand your model's predictions like:

  • Feature Importance

  • Feature Contribution

  • LIME

  • SHAP

It also comes with an intuitive GUI to interact with.

!pip install shapash
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor

from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')

y_df=house_df['SalePrice'].to_frame()
X_df=house_df[house_df.columns.difference(['SalePrice'])]

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df=encoder.transform(X_df)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)

regressor = LGBMRegressor(n_estimators=100).fit(Xtrain,ytrain)

from shapash import SmartExplainer

xpl = SmartExplainer(
    model=regressor,
    preprocessing=encoder,  
    features_dict=house_dict  
)
xpl.compile(x=Xtest,
            y_target=ytest 
           )
app = xpl.run_app(title_story='House Prices', port=8020)

5.5.21. Validate Your Model and Data with Deepchecks#

Validating your Model and Data is crucial in ML.

Not testing them will cause huge problems in production.

To change that, use deepchecks.

deepchecks is an open-source package that offers suites of detailed validation checks.

It calculates and visualizes many things like:

  • Train/Test Performance

  • Predictive Power Score

  • Feature Drift

  • Label Drift

  • Weak Segments for your model

A powerful tool to consider for testing your models and datasets.

!pip install deepchecks
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Load Data
iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
label_col = 'target'
df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)

# Train Model
rf_clf = RandomForestClassifier(random_state=0)
rf_clf.fit(df_train.drop(label_col, axis=1), df_train[label_col])


ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test =  Dataset(df_test,  label=label_col, cat_features=[])

suite = full_suite()

suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)

5.5.22. Visualize High-Performing Parameters with Optuna#

Optuna released a new feature for detecting high-performing parameters.

Its plot_rank() function visualizes different parameters, with individual points representing individual trials.

Since the plot is interactive, you can also hover over it and dive deeper into analysing your hyperparameter optimization.

from sklearn.ensemble import RandomForestClassifier

import optuna


def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=50,
        criterion="gini",
        max_depth=trial.suggest_int('Mdpth', 2, 32, log=True),
        min_samples_split=trial.suggest_int('mspl', 2, 32, log=True),
        min_samples_leaf=trial.suggest_int('mlfs', 1, 32, log=True),
        min_weight_fraction_leaf=trial.suggest_float('mwfr', 0.0, 0.5),
        max_features=trial.suggest_int("Mfts", 1, 15),
        max_leaf_nodes=trial.suggest_int('Mnods', 4, 100, log=True),
        min_impurity_decrease=trial.suggest_float('mid', 0.0, 0.5),
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

# Get parameters sorted by the importance values
importances = optuna.importance.get_param_importances(study)
params_sorted = list(importances.keys())

# Plot
fig = optuna.visualization.plot_rank(study, params=params_sorted[:4])
fig.show()

5.5.23. Model Ensembling with combo#

Looking at the top solutions on Kaggle you will notice one thing:

There is usually some sort of combination of various ML models involved.

With combo for Python, you can combine

  • Multiple Classifiers

  • Multiple Anomaly Detection Models

  • Multiple Clustering Models

combo also offers multiple combination methods for every category.

!pip install combo
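from combo.models.cluster_comb import ClustererEnsemble
from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import load_iris

# example data and cluster count, chosen only for illustration
X, _ = load_iris(return_X_y=True)
n_clusters = 3
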
estimators = [KMeans(n_clusters=n_clusters),
              MiniBatchKMeans(n_clusters=n_clusters),
              AgglomerativeClustering(n_clusters=n_clusters)]

clf = ClustererEnsemble(estimators, n_clusters=n_clusters)
clf.fit(X)

aligned_labels = clf.aligned_labels_
predicted_labels = clf.labels_
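
For classifiers, two of the simplest combination methods are majority voting over predicted labels and averaging predicted scores. The sketch below shows both in plain Python. It does not use combo, and the function names and toy predictions are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all of the same length."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

def average_scores(scores):
    """scores: one list of positive-class probabilities per model."""
    return [sum(sample) / len(sample) for sample in zip(*scores)]

# label predictions of three hypothetical classifiers for four samples
preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
voted = majority_vote(preds)

# probability outputs of the same three classifiers for two samples
scores = [[0.2, 0.9], [0.4, 0.7], [0.6, 0.8]]
averaged = average_scores(scores)

print(voted, averaged)
```

With an odd number of models and binary labels there are no ties; otherwise a tie-breaking rule would be needed.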

5.5.24. Residual Plots with yellowbrick#

To analyze the variance of the error of your regression model, use ResidualsPlot from yellowbrick.

With Residual Plots, you can see how well-fitted your model is.

If the data points exhibit a random distribution along the horizontal axis, a linear regression model is typically suitable, whereas in cases of non-random dispersion, a non-linear model is a better choice.

See below how you can easily implement that with yellowbrick.

!pip install yellowbrick
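from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import ResidualsPlot

# load the example data and hold out a test set
X, y = load_concrete()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Lasso()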
visualizer = ResidualsPlot(model)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show() 
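
To see what a non-random residual pattern looks like, consider fitting a straight line to data that is actually quadratic. The sketch below uses a closed-form least-squares fit in plain Python with a made-up toy dataset, and involves neither yellowbrick nor scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = list(range(-5, 6))
ys = [x * x for x in xs]  # the true relationship is quadratic

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals are positive at both ends and negative in the middle. This U-shape is the kind of systematic pattern that suggests a non-linear model.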

5.5.25. Powerful and Distributed Hyperparameter Optimization with ray.tune#

Do you need hyperparameter tuning on steroids?

Try tune from ray.

tune performs distributed hyperparameter tuning with multi-GPU and multi-node support, utilizing all the hardware you have.

It supports the most popular ML libraries and integrates many other common hyperparameter optimization tools like Optuna or Hyperopt.

!pip install "ray[tune]"
import sklearn.datasets
import sklearn.metrics
import xgboost as xgb
from ray import train, tune
from sklearn.model_selection import train_test_split


def train_breast_cancer(config):
    data, labels = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.25)
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    results = {}
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        evals_result=results,
        verbose_eval=False,
    )
    accuracy = 1.0 - results["eval"]["error"][-1]
    train.report({"mean_accuracy": accuracy, "done": True})


config = {
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "min_child_weight": tune.choice([1, 2, 3]),
    "subsample": tune.uniform(0.5, 1.0),
}

tuner = tune.Tuner(
    train_breast_cancer,
    tune_config=tune.TuneConfig(
        num_samples=10,
    ),
    param_space=config,
)
results = tuner.fit()
print(results.get_best_result(metric="mean_accuracy", mode="max").config)

5.5.26. Use PyTorch with scikit-learn API with skorch#

PyTorch and scikit-learn are among the most popular libraries for ML/DL.

So, why not combine PyTorch with scikit-learn?

Try skorch!

skorch is a high-level library for PyTorch that provides a scikit-learn-compatible neural network module.

It allows you to use the simple scikit-learn interface for PyTorch.

Therefore you can integrate PyTorch models into scikit-learn workflows.

See below for an example.

!pip install skorch
from torch import nn
from skorch import NeuralNetClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=nn.ReLU()):
        super().__init__()

        self.dense = nn.Linear(20, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)  # applied in forward()
        self.output = nn.Linear(num_units, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense(X))
        X = self.dropout(X)
        X = self.softmax(self.output(X))
        return X

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    iterator_train__shuffle=True,
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('net', net),
])

pipe.fit(X, y)
y_proba = pipe.predict_proba(X)

5.5.27. Online ML with river#

Do you want ML models that learn on-the-fly from massive datasets?

Try river.

river is a library for online machine learning.

You can continuously update your model with streaming data without using the full dataset for training again.

It provides online implementations for many algorithms like KNN, Tree-based models and Recommender systems.

!pip install river
from river import compose
from river import linear_model
from river import metrics
from river import preprocessing
from river import datasets

dataset = datasets.Phishing()

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

metric = metrics.Accuracy()

for x, y in dataset:
    y_pred = model.predict_one(x)     
    metric.update(y, y_pred)           
    model.learn_one(x, y)              
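
The core idea of online learning is that state is updated one observation at a time, so the full dataset never has to be held in memory. As a minimal illustration, here is a running mean and variance (Welford's algorithm) in plain Python. The class RunningStats is a made-up example and not part of river, although its learn_one method mirrors the naming used above.

```python
class RunningStats:
    """Incrementally tracks mean and population variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.learn_one(value)  # each value is seen once and then discarded

print(stats.mean, stats.variance)
```

An online scaler can standardize each incoming sample with statistics maintained in this incremental way.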

5.5.28. SOTA Computer Vision Models with timm#

Do you want to use SOTA computer vision models?

Try timm.

timm (PyTorch Image Models) is a library which contains multiple computer vision models, layers, optimizers, etc.

It provides models like Vision Transformer, MobileNet, Swin Transformer, ConvNeXt, DenseNet, and more.

You only have to specify the model's name and whether you want its pretrained weights.

!pip install timm
import torch
import timm

print(timm.list_models())

model = timm.create_model('densenet121', pretrained=True)
output = model(torch.randn(2, 3, 224, 224))

5.5.29. Generate Guaranteed Prediction Intervals and Sets with MAPIE#

For quantifying uncertainties of your models, use MAPIE.

MAPIE (Model Agnostic Prediction Interval Estimator) takes your sklearn-/tensorflow-/pytorch-compatible model and generates prediction intervals or sets with guaranteed coverage.

!pip install mapie
from mapie.regression import MapieRegressor
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=59)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
regressor = LinearRegression()

mapie_regressor = MapieRegressor(regressor)
mapie_regressor.fit(X_train, y_train)

alpha = [0.05, 0.20]
y_pred, y_pis = mapie_regressor.predict(X_test, alpha=alpha)
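
The coverage guarantee comes from conformal prediction. One of its simplest variants, split conformal prediction, fits in a few lines of plain Python. The sketch below is not MAPIE's code and MAPIE's default method may differ; the synthetic data and the fixed predict function are assumptions standing in for a real fitted model.

```python
import math
import random

rng = random.Random(0)

def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def predict(x):
    return 2.0 * x  # stand-in for any already-fitted regression model

x_cal, y_cal = make_data(200)    # held-out calibration set
x_test, y_test = make_data(1000)

alpha = 0.1  # target miscoverage, i.e. roughly 90% coverage

# conformity scores: absolute residuals on the calibration set
scores = sorted(abs(y - predict(x)) for x, y in zip(x_cal, y_cal))
n = len(scores)
k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
q_hat = scores[min(k, n) - 1]         # clamped here; strictly, k > n means an infinite interval

intervals = [(predict(x) - q_hat, predict(x) + q_hat) for x in x_test]
coverage = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_test)) / len(y_test)
print(q_hat, coverage)
```

The empirical coverage on the test set should land close to 1 - alpha, regardless of which model produced the predictions.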

5.5.30. Extra Components For scikit-learn with scikit-lego#

scikit-learn is one of the most popular ML libraries.

While it's easy to write custom components, it would be nice to have all of them in a single place.

scikit-lego is a library that contains many such custom components like:

  • DebugPipeline, which adds debug information to pipelines

  • ImbalancedLinearRegression to punish over-/underestimation of a model

  • add_lags to add lag values to a DataFrame

  • ZeroInflatedRegressor which predicts zero or applies a regression based on a classifier

and many more!

!pip install scikit-lego
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...

5.5.31. Quantize your Models with torchao#

Quantizing your Deep Learning models has never been easier.

With torchao, you can quantize and sparsify your models with 1 line of code.

If you are unsure which method to use, you can even use the autoquant method to quantize your layers automatically.

!pip install torchao
import torch
import torchao

# `model` is any torch.nn.Module you have already created
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
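
To illustrate what quantization does to the numbers, here is a minimal pure-Python sketch of symmetric int8 quantization of a handful of weights. It only shows the basic idea and does not reflect how torchao works internally; the function names and the toy weights are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 0.731, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

Each weight is now stored as one small integer plus a shared scale, at the cost of a rounding error of at most half a quantization step.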