Model Training

5.6. Model Training#

5.6.1. Compute Class Weights#

There are several ways to handle class imbalance in machine learning.

One of them is adjusting the class weights.

By giving higher weights to the minority class and lower weights to the majority class, we reweight each sample’s contribution to the loss function.

Misclassifying a minority-class sample then results in a higher loss because of its higher weight.

To incorporate class weights in TensorFlow, use scikit-learn’s compute_class_weight function:

import numpy as np
import tensorflow as tf
from sklearn.utils import compute_class_weight

X, y = ...

# returns an array with one weight per class, e.g. [0.8, 0.8, 2.]
class_weights = compute_class_weight(
  class_weight="balanced",
  classes=np.unique(y),
  y=y
)

# convert to a {<class index>: <weight>} dictionary
# (assumes the labels are the integers 0, 1, ..., n_classes - 1)
class_weights = dict(enumerate(class_weights))

model = tf.keras.Sequential(...)
model.compile(...)

# using class_weights in the .fit() method
model.fit(X, y, class_weight=class_weights, ...)
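
For intuition, the "balanced" heuristic can be reproduced by hand: each class receives n_samples / (n_classes * class_count). The following is a minimal sketch in plain Python; the toy label list `y_toy` and the helper name `balanced_class_weights` are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels (illustrative only): 50 / 50 / 20 samples per class
y_toy = [0] * 50 + [1] * 50 + [2] * 20
weights = balanced_class_weights(y_toy)
print(weights)
```

The rarest class ends up with the largest weight, so its errors count more in the loss.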

5.6.2. Reset TensorFlow/Keras Global State#

In TensorFlow/Keras, when you create multiple models in a loop, you should call tf.keras.backend.clear_session() between iterations.

Keras manages a global state, which includes configurations and the current values (weights and biases) of the models.

So when you create models in a loop, the global state gets bigger with every created model. del model will not clear that state because it only deletes the Python variable.

tf.keras.backend.clear_session() is the better option. It resets the global state and helps avoid clutter from old models.

See the first example below. Each iteration of this loop increases the size of the global state and your memory consumption.

In the second example, the memory consumption stays constant because the state is cleared in every iteration.

import tensorflow as tf

def create_model():
  model = tf.keras.Sequential(...)
  return model

# without clearing session
for _ in range(20):
  model = create_model()
  
# with clearing session
for _ in range(20):
  tf.keras.backend.clear_session()
  model = create_model()
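
Why deleting the variable is not enough can be illustrated without TensorFlow. The sketch below is only a plain-Python analogy, not Keras itself: `_registry` and `ToyModel` are hypothetical stand-ins for a framework that keeps its own references to everything it creates.

```python
# Plain-Python analogy (not Keras): a module-level registry keeps every
# created object alive, even after the local name has been deleted.
_registry = []  # hypothetical stand-in for a framework's global state


class ToyModel:
    def __init__(self):
        _registry.append(self)  # the "framework" holds its own reference


for _ in range(3):
    model = ToyModel()
    del model  # removes only the name; the registry still holds the object

size_after_del = len(_registry)

_registry.clear()  # comparable in spirit to resetting the global state
size_after_clear = len(_registry)

print(size_after_del, size_after_clear)
```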

5.6.3. Find dirty labels with `cleanlab`#

Do you want to identify noisy labels in your dataset?

Try cleanlab for Python.

cleanlab is a data-centric AI package that automatically detects noisy labels and other dataset issues with confident learning algorithms, so that you can fix them.

It works with nearly every model possible:

XGBoost
scikit-learn models
TensorFlow
PyTorch
HuggingFace
etc.

!pip install cleanlab

import cleanlab
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

clf = RandomForestClassifier(n_estimators=100)

cl = cleanlab.classification.CleanLearning(clf)

label_issues = cl.find_label_issues(X, y)

print(label_issues.query('is_label_issue == True'))

5.6.4. Evaluate your Classifier with sklearn’s `classification_report`#

Would you like to evaluate your Machine Learning model quickly?

Try classification_report from scikit-learn.

With classification_report, you can quickly assess the performance of your model.

It summarizes Precision, Recall, F1-Score, and Support for each class.

# Minimal example using scikit-learn's classification_report
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

"""
             precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
"""

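The per-class rows of such a report can be checked by hand. The sketch below recomputes precision, recall, F1 and support for the same `y_true` and `y_pred` in plain Python; the helper name `per_class_metrics` is made up for illustration and is not part of scikit-learn.

```python
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]


def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, tp + fn


report = {cls: per_class_metrics(y_true, y_pred, cls) for cls in sorted(set(y_true))}
for cls, (p, r, f1, support) in report.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```
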
5.6.5. Obtain Reproducible Optimization Results in Optuna#

Optuna is a powerful hyperparameter optimization framework that supports many machine learning frameworks, including TensorFlow, PyTorch, and XGBoost.

But you need to take care if you want reproducible results from hyperparameter tuning.

To achieve reproducible results, you need to set the seed for your Sampler.

Below you can see how it is done for TPESampler.

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    ...
    
sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)
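
The effect of seeding can be seen even without Optuna. The following sketch uses a plain random search from the standard library as a stand-in for a sampler (it is not TPE, and `random_search` is a made-up helper): with the same seed, the same candidates are drawn and the same best value is found.

```python
import random


def objective(x):
    return (x - 2) ** 2


def random_search(seed, n_trials=50):
    rng = random.Random(seed)  # private generator, seeded explicitly
    candidates = [rng.uniform(-10, 10) for _ in range(n_trials)]
    return min(candidates, key=objective)


best_a = random_search(seed=42)
best_b = random_search(seed=42)  # same seed: identical search
best_c = random_search(seed=7)   # different seed: different candidates

print(best_a == best_b, best_a == best_c)
```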

5.6.6. Find bad labels with `doubtlab`#

Do you want to find bad labels in your data?

Try doubtlab for Python.

With doubtlab, you can define reasons to doubt your labels and take a closer look.

Reasons to doubt your labels can be for example:

ProbaReason: When the confidence values are low for any label
WrongPredictionReason: When a model cannot predict the listed label
DisagreeReason: When two models disagree on a prediction
RelativeDifferenceReason: When the relative difference between label and prediction is too high

So, identify your noisy labels and fix them.

!pip install doubtlab

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression()
model.fit(X, y)

# Define reasons to check
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model),
}

# Pass the reasons to a DoubtEnsemble instance
doubt = DoubtEnsemble(**reasons)

# Returns DataFrame with reasoning
predicates = doubt.get_predicates(X, y)
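
To make the idea of "reasons" concrete, here is a hand-rolled sketch of two of them in plain Python. It is not doubtlab's implementation: the probabilities, the labels and the 0.55 threshold are all assumptions chosen for this illustration.

```python
# Hypothetical predicted probabilities for 4 samples and 3 classes
proba = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
]
labels = [0, 0, 2, 2]  # the (possibly noisy) given labels

MAX_PROBA = 0.55  # threshold picked for this sketch


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


# Reason 1: the model is not confident about any class
low_confidence = [max(row) <= MAX_PROBA for row in proba]
# Reason 2: the model's prediction differs from the given label
wrong_prediction = [argmax(row) != label for row, label in zip(proba, labels)]

# Rank samples by how many reasons fire (ties keep their original order)
doubt_score = [int(a) + int(b) for a, b in zip(low_confidence, wrong_prediction)]
ranked = sorted(range(len(labels)), key=lambda i: -doubt_score[i])
print(ranked)
```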

5.6.7. Get notified when your model is finished with training#

Never stare at your screen, waiting for your model to finish training.

Try knockknock for Python.

knockknock is a library that notifies you when your training is finished.

You only need to add a decorator.

Currently, you can get a notification through 12 different channels like:

Email
Slack
Telegram
Discord
MS Teams

Use it for your future model training and stop being glued to your screen.

!pip install knockknock

from knockknock import email_sender

@email_sender(recipient_emails=["coolmail@python.com", "2coolmail@python.com"], sender_email="anothercoolmail@python.com")
def train_model(model, X, y):
    model.fit(X, y)
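
How such a decorator can work is sketched below with the standard library only. This is not knockknock's code: `notify_when_done` is a hypothetical helper, and the "channel" is just a list that collects messages where a real one would send an email or chat message.

```python
import functools
import time


def notify_when_done(notify):
    """Decorator factory: call notify(message) once the wrapped function ends."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                notify(f"{func.__name__} crashed: {exc!r}")
                raise
            elapsed = time.perf_counter() - start
            notify(f"{func.__name__} finished in {elapsed:.2f}s")
            return result
        return wrapper
    return decorator


messages = []  # stand-in channel for this sketch


@notify_when_done(messages.append)
def train_model():
    return sum(i * i for i in range(1000))  # placeholder for model.fit(X, y)


result = train_model()
print(messages)
```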

5.6.8. Get Model Summary in PyTorch with `torchinfo`#

Do you want a Model summary in PyTorch?

Like in Keras with model.summary()?

Use torchinfo.

With torchinfo, you can get a model summary as you know it from Keras.

Just add one line of code.

!pip install torchinfo

import torch
from torchinfo import summary

class MyModel(torch.nn.Module):
  ...
  
model = MyModel()

BATCH_SIZE = 16
summary(model, input_size=(BATCH_SIZE, 1, 28, 28))

'''
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Net                                      [16, 10]                  --
├─Sequential: 1-1                        [16, 4, 7, 7]             --
│    └─Conv2d: 2-1                       [16, 4, 28, 28]           40
│    └─BatchNorm2d: 2-2                  [16, 4, 28, 28]           8
│    └─ReLU: 2-3                         [16, 4, 28, 28]           --
│    └─MaxPool2d: 2-4                    [16, 4, 14, 14]           --
│    └─Conv2d: 2-5                       [16, 4, 14, 14]           148
│    └─BatchNorm2d: 2-6                  [16, 4, 14, 14]           8
│    └─ReLU: 2-7                         [16, 4, 14, 14]           --
│    └─MaxPool2d: 2-8                    [16, 4, 7, 7]             --
├─Sequential: 1-2                        [16, 10]                  --
│    └─Linear: 2-9                       [16, 10]                  1,970
==========================================================================================
Total params: 2,174
Trainable params: 2,174
Non-trainable params: 0
Total mult-adds (M): 1.00
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 1.00
Params size (MB): 0.01
Estimated Total Size (MB): 1.06
==========================================================================================
'''
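
The "Param #" column of the summary can be verified with simple arithmetic. The sketch below assumes 3x3 convolution kernels, which is inferred from the parameter counts rather than stated in the output; the helper functions are made up for illustration.

```python
def conv2d_params(in_ch, out_ch, kernel):
    # one kernel x kernel filter per (in, out) channel pair, plus one bias per output channel
    return out_ch * in_ch * kernel * kernel + out_ch


def batchnorm_params(channels):
    # one learnable scale and one learnable shift per channel
    return 2 * channels


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


layers = {
    "Conv2d 2-1": conv2d_params(1, 4, 3),
    "BatchNorm2d 2-2": batchnorm_params(4),
    "Conv2d 2-5": conv2d_params(4, 4, 3),
    "BatchNorm2d 2-6": batchnorm_params(4),
    "Linear 2-9": linear_params(4 * 7 * 7, 10),  # flattened [4, 7, 7] feature map
}
total = sum(layers.values())
print(layers, total)
```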

5.6.9. Boost scikit-learn’s performance with Intel Extension#

Scikit-learn is one of the most popular ML packages for Python.

But, to be honest, its algorithms are not the fastest ones.

With Intel’s Extension for scikit-learn, scikit-learn-intelex, you can speed up training time for some popular algorithms like:

Support Vector Classifier/Regressor
Random Forest Classifier/Regressor
LASSO
DBSCAN

Just add two lines of code.

!pip install scikit-learn-intelex

from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=100000,
    n_features=10,
    noise=0.5,
)

svr = SVR()

svr.fit(X, y)

5.6.10. Incorporate Domain Knowledge into XGBoost with Feature Interaction Constraints#

Want to incorporate your domain knowledge into XGBoost?

Try using Feature Interaction Constraints.

Feature Interaction Constraints let you control which features are allowed to interact with each other while the trees are built.

For example, the constraint [0, 1] means that Feature_0 and Feature_1 are allowed to interact with each other but with no other variable. Similarly, [3, 5, 9] means that Feature_3, Feature_5, and Feature_9 are allowed to interact with each other but with no other variable.

With this in mind, you can define feature interaction constraints:

Based on domain knowledge, when you know that some feature interactions will lead to better results
Based on regulatory constraints in your industry/company, where some features cannot interact with each other

import xgboost as xgb

X, y = ...

dmatrix = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    # features 0 and 2 may interact; features 1, 3 and 4 may interact
    "interaction_constraints": [[0, 2], [1, 3, 4]],
}

model_with_constraints = xgb.train(params, dmatrix)
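
To check which feature pairs a constraint list permits, a small helper can enumerate them. This is a simplified sketch that assumes non-overlapping groups, as in the example above; `allowed_pairs` and `may_interact` are hypothetical helpers and not part of XGBoost.

```python
from itertools import combinations


def allowed_pairs(constraints):
    """Return all feature-index pairs that share at least one constraint group."""
    pairs = set()
    for group in constraints:
        pairs.update(combinations(sorted(group), 2))
    return pairs


constraints = [[0, 2], [1, 3, 4]]
pairs = allowed_pairs(constraints)


def may_interact(i, j):
    return (min(i, j), max(i, j)) in pairs


print(sorted(pairs))
print(may_interact(0, 2), may_interact(0, 1))
```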

5.6.11. Powerful AutoML with `FLAML`#

Do you always hear about AutoML?

And want to try it out?

Use FLAML for Python.

FLAML (Fast and Lightweight AutoML) is an AutoML package developed by Microsoft.

It can do Model Selection, Hyperparameter tuning, and Feature Engineering automatically.

Thus, it removes the pain of choosing the best model and parameters so that you can focus more on your data.

By default, its estimator list consists mostly of tree-based models like XGBoost, CatBoost, and LightGBM. But you can also add custom models.

A powerful library!

!pip install flaml

from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")

5.6.12. Aspect-based Sentiment Analysis with `PyABSA`#

Traditional sentiment analysis focuses on determining the overall sentiment of a piece of text.

For example, the sentence:

“The food was bad and the staff was rude”

would yield only a single negative sentiment.

But what if you want to know which aspects carry a negative or a positive sentiment?

That’s the job of aspect-based sentiment analysis.

It aims to identify and extract the sentiment expressed towards specific aspects of a text.

For the sentence:

“The battery life is excellent but the camera quality is bad.”

a model’s output would be:

Battery life: positive
Camera quality: negative

With aspect-based sentiment analysis, you can understand the opinions and feelings expressed about specific aspects.

To do that in Python, use the package PyABSA.

It contains pre-trained models with an easy-to-use API for aspect-term extraction and sentiment classification.

PyABSA can be used for a variety of applications, such as:

Customer feedback analysis
Product reviews analysis
Social media monitoring

!pip install pyabsa==1.16.27

from pyabsa import ATEPCCheckpointManager

extractor = ATEPCCheckpointManager.get_aspect_extractor(
                  checkpoint="multilingual",
                  auto_device=False
)
                                                        
example = ["Location and food were excellent but staff was very unfriendly."]
result = extractor.extract_aspect(inference_source=example, pred_sentiment=True)

print(result)

5.6.13. Use XGBoost for Random Forests#

Are you still using Random Forests from sklearn?

XGBoost implements Random Forests too, and training is often faster than with sklearn.

import numpy as np
from xgboost import XGBRFRegressor

xgbrf = XGBRFRegressor(n_estimators=100)

X = np.random.rand(100000, 10)
y = np.random.rand(100000)

xgbrf.fit(X, y)

5.6.14. Identify problematic images with `cleanvision`#

Is your Deep Learning model not performing well?

A likely cause is your data.

With cleanvision, you can detect issues in image data.

cleanvision is a relatively new data-centric AI package to find problems in your image dataset.

It can detect issues like:

Exact or Near Duplicates
Blurry Images
Odd Aspect Ratios
Irregularly Dark/Light images
Images lacking content

A good first step to try before applying crazy Vision Transformers.

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'

!unzip -q image_files.zip

!pip install cleanvision

from cleanvision.imagelab import Imagelab

# Path to your dataset, you can specify your own dataset path
dataset_path = "./image_files/"

# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# Find issues
imagelab.find_issues()

# Get summary of issues with prevalence per issue
imagelab.issue_summary

# Visualize Top examples for blurry images
imagelab.visualize(issue_types=['blurry'])

5.6.15. Select the optimal Regularization Parameter#

How do you choose your Regularization Parameter?

Your model’s complexity decreases with a higher Regularization Parameter (Alpha).

It shouldn’t be too high or too low.

Yellowbrick’s AlphaSelection can help you to find the best Alpha.

It takes your model and visualizes the Alpha/Error curve so you can see how the model’s error responds to different alpha values.

Below you can see how to do it with scikit-learn’s LassoCV.

import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import AlphaSelection

X, y = load_concrete()

# Create a list of alphas to cross-validate against
# (kept strictly positive, since alpha=0 would disable the L1 penalty)
alphas = np.linspace(0.01, 10, 30)

model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
visualizer.show()
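
The claim that a higher alpha means a simpler model can be checked numerically. The sketch below swaps in ridge regression instead of Lasso, because ridge has a closed-form solution that needs only NumPy; the synthetic data and the helper `ridge_coefficients` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=100)


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alphas = [0.01, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:>7}: coefficient norm = {n:.3f}")
```

The coefficient norm shrinks as alpha grows, which is the reduction in model complexity that the Alpha/Error curve trades off against the error.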

5.6.16. Decision Forests in TensorFlow#

Did you know there are Decision Forests from TensorFlow?

tensorflow_decision_forests implements decision forest models like Random Forest or GBDT for classification, regression, and ranking.

!pip install tensorflow_decision_forests

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_decision_forests as tfdf

dataset_path = tf.keras.utils.get_file(
      "adult.csv",
      "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/"
      "main/yggdrasil_decision_forests/test_data/dataset/adult.csv")

dataset_df = pd.read_csv(dataset_path)
test_indices = np.random.rand(len(dataset_df)) < 0.30
test_ds_pd = dataset_df[test_indices]
train_ds_pd = dataset_df[~test_indices]


train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label="income")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label="income")

model = tfdf.keras.GradientBoostedTreesModel(verbose=2)
model.fit(train_ds)

# summary() prints the report itself and returns None
model.summary()

5.6.17. AutoML with `AutoGluon`#

Do you always hear about AutoML?

And want to try it out?

Use AutoGluon for Python.

AutoGluon is a Python package from AWS.

It lets you perform AutoML on:

Tabular Data (Classification, Regression)
Time Series Data
Multimodal Data (Images + Text + Tabular)

Thus, it removes the pain of choosing the best model and the best parameters.

AutoGluon also offers utilities for EDA, like:

Detecting Covariate Shift
Target Variable Analysis
Feature Interaction Charts

See below for a quickstart for tabular data.

!pip install autogluon

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

predictor = TabularPredictor(label='class').fit(train_data, time_limit=240)
predictor.leaderboard(test_data)

5.6.18. Visualize Keras Models with `visualkeras`#

Do you want some cool visualization for your Deep Learning Models?

Try visualkeras.

visualkeras visualizes your Keras models (as an alternative to model.summary()).

!pip install visualkeras

import visualkeras

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

visualkeras.layered_view(model, legend=True, to_file='output.png').show()

5.6.19. Perform Multilabel Stratified KFold with `iterative-stratification`#

When doing Cross-Validation for classification, StratifiedKFold from scikit-learn is a common choice.

Stratification aims to ensure that every fold represents all strata of the data.

But scikit-learn doesn’t support stratifying multilabel data.

For this use case, try the iterative-stratification package.

It offers implementations for stratifying multilabel data in different ways.

See below how we can use MultilabelStratifiedKFold.

!pip install iterative-stratification

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

5.6.20. Interpret your Model with `Shapash`#

Nobody cares about your SOTA ML model if nobody can understand its predictions.

Therefore, the interpretability of ML models is crucial in industry use cases.

To overcome this hurdle, use Shapash for Python.

Shapash offers several types of interpretability methods to understand your model’s predictions, like:

Feature Importance
Feature Contribution
LIME
SHAP

It also comes with an intuitive GUI to interact with.

!pip install shapash

import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor

from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')

y_df = house_df['SalePrice'].to_frame()
X_df = house_df[house_df.columns.difference(['SalePrice'])]

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df=encoder.transform(X_df)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)

regressor = LGBMRegressor(n_estimators=100).fit(Xtrain,ytrain)

from shapash import SmartExplainer

xpl = SmartExplainer(
    model=regressor,
    preprocessing=encoder,  
    features_dict=house_dict  
)

xpl.compile(x=Xtest,
            y_target=ytest 
           )

app = xpl.run_app(title_story='House Prices', port=8020)

5.6.21. Validate Your Model and Data with `Deepchecks`#

Validating your model and data is crucial in ML.

Skipping these tests can cause huge problems in production.

To avoid that, use deepchecks.

deepchecks is an open-source solution that offers a suite of detailed validation checks.

It will calculate and visualize a bunch of things like:

Train/Test Performance
Predictive Power Score
Feature Drift
Label Drift
Weak Segments for your model

A powerful tool to consider for testing your models and datasets.

!pip install deepchecks

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Load Data
iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
label_col = 'target'
df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)

# Train Model
rf_clf = RandomForestClassifier(random_state=0)
rf_clf.fit(df_train.drop(label_col, axis=1), df_train[label_col])


ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test =  Dataset(df_test,  label=label_col, cat_features=[])

suite = full_suite()

suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)

5.6.22. Visualize high-performance Features with `Optuna`#

Optuna released a new feature for detecting high-performing parameters.

Its plot_rank() function visualizes different parameters, with individual points representing individual trials.

Since the plot is interactive, you can also hover over it and dive deeper into analysing your hyperparameter optimization.

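from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import optuna

# Example data (an assumption for this snippet; use your own dataset instead)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)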


def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=50,
        criterion="gini",
        max_depth=trial.suggest_int('Mdpth', 2, 32, log=True),
        min_samples_split=trial.suggest_int('mspl', 2, 32, log=True),
        min_samples_leaf=trial.suggest_int('mlfs', 1, 32, log=True),
        min_weight_fraction_leaf=trial.suggest_float('mwfr', 0.0, 0.5),
        max_features=trial.suggest_int("Mfts", 1, 15),
        max_leaf_nodes=trial.suggest_int('Mnods', 4, 100, log=True),
        min_impurity_decrease=trial.suggest_float('mid', 0.0, 0.5),
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

# Get parameters sorted by the importance values
importances = optuna.importance.get_param_importances(study)
params_sorted = list(importances.keys())

# Plot
fig = optuna.visualization.plot_rank(study, params=params_sorted[:4])
fig.show()

5.6.23. Model Ensembling with `combo`#

Looking at the top solutions on Kaggle you will notice one thing:

There is usually some sort of combination of various ML models involved.

With combo for Python, you can combine:

Multiple Classifiers
Multiple Anomaly Detection Models
Multiple Clustering Models

combo also offers multiple combination methods for every category.

!pip install combo

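from combo.models.cluster_comb import ClustererEnsemble
from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for this snippet; use your own X instead)
n_clusters = 3
X, _ = make_blobs(n_samples=300, centers=n_clusters, random_state=0)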

estimators = [KMeans(n_clusters=n_clusters),
              MiniBatchKMeans(n_clusters=n_clusters),
              AgglomerativeClustering(n_clusters=n_clusters)]

clf = ClustererEnsemble(estimators, n_clusters=n_clusters)
clf.fit(X)

aligned_labels = clf.aligned_labels_
predicted_labels = clf.labels_
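
For classifiers, one of the simplest combination methods is a majority vote. The sketch below shows the idea in plain Python; it is not combo's implementation, and the three prediction lists are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one list of labels per model; returns the per-sample mode."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three classifiers for four samples
model_a = [0, 1, 1, 2]
model_b = [0, 1, 2, 2]
model_c = [1, 1, 2, 0]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```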

5.6.24. Residual Plots with `yellowbrick`#

To analyze the variance of the error of your regression model, use ResidualsPlot from yellowbrick.

With Residual Plots, you can see how well-fitted your model is.

If the residuals are randomly dispersed around the horizontal axis, a linear regression model is typically suitable, whereas a non-random pattern suggests that a non-linear model is the better choice.

See below how you can easily implement that with yellowbrick.

!pip install yellowbrick

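from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import ResidualsPlot

# Load the example data and hold out a test split
X, y = load_concrete()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Lasso()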
visualizer = ResidualsPlot(model)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show() 
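
What "random" versus "non-random" residuals means can be shown with a few lines of NumPy. In this sketch a straight line is fitted to two noise-free toy targets (both are assumptions made for illustration): one truly linear, one quadratic.

```python
import numpy as np

x = np.linspace(-3, 3, 61)


def linear_fit_residuals(x, y):
    """Fit y = slope * x + intercept and return the residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


res_linear = linear_fit_residuals(x, 2 * x + 1)  # linear target
res_curved = linear_fit_residuals(x, x ** 2)     # quadratic target

# Linear target: residuals vanish, no structure is left over
max_abs_linear = float(np.max(np.abs(res_linear)))
# Quadratic target: residuals follow x**2, a clear systematic pattern
corr_curved = float(np.corrcoef(res_curved, x ** 2)[0, 1])

print(max_abs_linear, corr_curved)
```

A residual plot makes exactly this kind of leftover structure visible at a glance.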

5.6.25. Powerful and Distributed Hyperparameter Optimization with `ray.tune`#

Do you need hyperparameter tuning on steroids?

Try tune from ray.

tune performs distributed hyperparameter tuning with multi-GPU and multi-node support, utilizing all the hardware you have.

It supports the most popular ML libraries and integrates many other common hyperparameter optimization tools like Optuna or Hyperopt.

!pip install "ray[tune]"

import sklearn.datasets
import sklearn.metrics
import xgboost as xgb
from ray import train, tune
from sklearn.model_selection import train_test_split


def train_breast_cancer(config):
    data, labels = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.25)
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    results = {}
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        evals_result=results,
        verbose_eval=False,
    )
    accuracy = 1.0 - results["eval"]["error"][-1]
    train.report({"mean_accuracy": accuracy, "done": True})


config = {
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "min_child_weight": tune.choice([1, 2, 3]),
    "subsample": tune.uniform(0.5, 1.0),
}

tuner = tune.Tuner(
    train_breast_cancer,
    tune_config=tune.TuneConfig(
        num_samples=10,
    ),
    param_space=config,
)
results = tuner.fit()
print(results.get_best_result(metric="mean_accuracy", mode="max").config)

5.6.26. Use PyTorch with scikit-learn API with `skorch`#

PyTorch and scikit-learn are two of the most popular libraries for ML/DL.

So, why not combine PyTorch with scikit-learn?

Try skorch!

skorch is a high-level library for PyTorch that provides a scikit-learn-compatible neural network module.

It allows you to use the simple scikit-learn interface for PyTorch.

Therefore you can integrate PyTorch models into scikit-learn workflows.

See below for an example.

!pip install skorch

from torch import nn
from skorch import NeuralNetClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=nn.ReLU()):
        super().__init__()

        self.dense = nn.Linear(20, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)  # used in forward(), so it must be defined here
        self.output = nn.Linear(num_units, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense(X))
        X = self.dropout(X)
        X = self.softmax(self.output(X))
        return X

net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    iterator_train__shuffle=True,
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('net', net),
])

pipe.fit(X, y)
y_proba = pipe.predict_proba(X)

5.6.27. Online ML with `river`#

Do you want ML models that learn on-the-fly from massive datasets?

Try river.

river is a library for online machine learning.

You can continuously update your model with streaming data without using the full dataset for training again.

It provides online implementations for many algorithms like KNN, Tree-based models and Recommender systems.

!pip install river

from river import compose
from river import linear_model
from river import metrics
from river import preprocessing
from river import datasets

dataset = datasets.Phishing()

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

metric = metrics.Accuracy()

for x, y in dataset:
    y_pred = model.predict_one(x)     
    metric.update(y, y_pred)           
    model.learn_one(x, y)              
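
The core idea of online learning, updating a model one observation at a time without revisiting old data, can be sketched with a running mean and variance (Welford's algorithm). The class below is a hypothetical illustration in plain Python; only the method name `learn_one` mirrors river's convention.

```python
import statistics


class RunningMeanVar:
    """Running mean and population variance, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0


stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningMeanVar()
for value in stream:
    stats.learn_one(value)  # constant memory, no access to earlier values

print(stats.mean, stats.variance)
```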

5.6.28. SOTA Computer Vision Models with `timm`#

Do you want to use SOTA computer vision models?

Try timm.

timm (PyTorch Image Models) is a library which contains multiple computer vision models, layers, optimizers, etc.

It provides models like Vision Transformer, MobileNet, Swin Transformer, ConvNeXt, DenseNet, and more.

You just have to define the name of the model and if you want to have the pretrained weights of it.

!pip install timm

import torch
import timm

print(timm.list_models())

model = timm.create_model('densenet121', pretrained=True)
output = model(torch.randn(2, 3, 224, 224))

5.6.29. Generate Guaranteed Prediction Intervals and Sets with `MAPIE`#

For quantifying uncertainties of your models, use MAPIE.

MAPIE (Model Agnostic Prediction Interval Estimator) takes your sklearn-/tensorflow-/pytorch-compatible model and generates prediction intervals or sets with guaranteed coverage.

!pip install mapie

from mapie.regression import MapieRegressor
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=59)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
regressor = LinearRegression()

mapie_regressor = MapieRegressor(regressor)
mapie_regressor.fit(X_train, y_train)

alpha = [0.05, 0.20]
y_pred, y_pis = mapie_regressor.predict(X_test, alpha=alpha)
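
Where such coverage guarantees come from can be sketched with split conformal prediction, one of the simplest conformal methods. The NumPy code below is an illustration under assumptions (synthetic data, a straight-line model, a separate calibration set); it is not MAPIE's implementation and not necessarily the method MapieRegressor uses by default.

```python
import math

import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = 2.0 * x + rng.normal(scale=1.0, size=n)
    return x, y


x_train, y_train = make_data(500)
x_cal, y_cal = make_data(500)      # held-out calibration set
x_test, y_test = make_data(2000)

# 1) Fit any point predictor on the training split
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def predict(x):
    return slope * x + intercept


# 2) Take a conservative quantile of the absolute calibration residuals
alpha = 0.1  # target miscoverage, i.e. roughly 90% coverage
scores = np.sort(np.abs(y_cal - predict(x_cal)))
k = math.ceil((len(scores) + 1) * (1 - alpha))
q = float(scores[k - 1])

# 3) The interval is the point prediction plus/minus that quantile
lower, upper = predict(x_test) - q, predict(x_test) + q
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"interval half-width: {q:.2f}, empirical coverage: {coverage:.3f}")
```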

5.6.30. Extra Components For scikit-learn with `scikit-lego`#

scikit-learn is one of the most popular ML libraries.

While it’s easy to write custom components, it would be nice to have all of them in a single place.

scikit-lego is such a library. It contains many custom components like:

DebugPipeline, which adds debug information to pipelines
ImbalancedLinearRegression to punish over-/underestimation of a model
add_lags to add lag values to a DataFrame
ZeroInflatedRegressor which predicts zero or applies a regression based on a classifier

and many more!

!pip install scikit-lego

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...

5.6.31. Quantize your Models with `torchao`#

Quantizing your Deep Learning models has never been easier.

With torchao, you can quantize and sparsify your models with 1 line of code.

If you are unsure which method to use, you can even use the autoquant method to quantize your layers automatically.

!pip install torchao

import torch
import torchao

# assumes `model` is an existing torch.nn.Module

model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
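
What quantization does to the weights can be illustrated with NumPy. The sketch below applies symmetric per-tensor int8 quantization, one common scheme; it is an assumption for illustration and not necessarily what autoquant selects for a given layer. The random weight matrix is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.2, size=(4, 8)).astype(np.float32)

# One scale for the whole tensor: map the largest magnitude to 127
scale = float(np.max(np.abs(weights))) / 127.0

# Quantize: round to the nearest integer step and store as int8
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to inspect the rounding error introduced by quantization
dequantized = q_weights.astype(np.float32) * scale
max_error = float(np.max(np.abs(weights - dequantized)))

print(f"float32 bytes: {weights.nbytes}, int8 bytes: {q_weights.nbytes}")
print(f"scale: {scale:.6f}, max abs error: {max_error:.6f}")
```

The int8 copy needs a quarter of the memory, and the rounding error per weight is bounded by about half a quantization step.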
