5.6. Model Training#
5.6.1. Compute Class Weights#
There are several ways to handle class imbalance in machine learning.
One of them is adjusting the class weights.
By giving higher weights to the minority class and lower weights to the majority class, we reweight the loss function.
Misclassifying the minority class then results in a higher loss due to the higher weight.
To incorporate class weights in TensorFlow, use scikit-learn's compute_class_weight function.
import numpy as np
import tensorflow as tf
from sklearn.utils import compute_class_weight
X, y = ...
# will return an array with weights for each class, e.g. [0.6, 0.6, 1.]
class_weights = compute_class_weight(
class_weight="balanced",
classes=np.unique(y),
y=y
)
# to get a dictionary with {<class>:<weight>}
class_weights = dict(enumerate(class_weights))
model = tf.keras.Sequential(...)
model.compile(...)
# using class_weights in the .fit() method
model.fit(X, y, class_weight=class_weights, ...)
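As a rough illustration of what class_weight="balanced" computes, the scikit-learn documentation describes the heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula in plain Python; the helper name balanced_class_weights and the toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Hypothetical imbalanced labels: four samples of class 0, two of class 1
y_example = [0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(y_example)
print(weights)  # {0: 0.75, 1: 1.5}
```

The minority class ends up with twice the weight of the majority class, and the resulting dictionary has the same {class: weight} shape that the class_weight argument of .fit() expects.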
5.6.2. Reset TensorFlow/Keras Global State#
In TensorFlow/Keras, when you create multiple models in a loop, you will need tf.keras.backend.clear_session().
Keras manages a global state, which includes configurations and the current values (weights and biases) of the models.
So when you create a model in a loop, the global state grows with every created model. To clear the state, del model will not work because it only deletes the Python variable.
tf.keras.backend.clear_session() is a better option. It resets the global state managed by Keras and helps avoid clutter from old models.
See the first example below. Each iteration of this loop increases the size of the global state and your memory usage.
In the second example, the memory consumption stays constant because the state is cleared in every iteration.
import tensorflow as tf
def create_model():
    model = tf.keras.Sequential(...)
    return model

# without clearing session
for _ in range(20):
    model = create_model()

# with clearing session
for _ in range(20):
    tf.keras.backend.clear_session()
    model = create_model()
5.6.3. Find dirty labels with cleanlab#
Do you want to identify noisy labels in your dataset?
Try cleanlab for Python.
cleanlab is a data-centric AI package that automatically detects noisy labels and other dataset issues using confident learning algorithms.
It works with nearly every model:
XGBoost
scikit-learn models
Tensorflow
PyTorch
HuggingFace
etc.
!pip install cleanlab
import cleanlab
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(n_estimators=100)
cl = cleanlab.classification.CleanLearning(clf)
label_issues = cl.find_label_issues(X, y)
print(label_issues.query('is_label_issue == True'))
5.6.4. Evaluate your Classifier with sklearn's classification_report#
Would you like to evaluate your Machine Learning model quickly?
Try classification_report from scikit-learn.
With classification_report, you can quickly assess the performance of your model.
It summarizes Precision, Recall, F1-Score, and Support for each class.
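To make these four columns concrete, here is a from-scratch sketch that computes them per class for a small set of hypothetical labels. The helper per_class_metrics is an assumption for illustration and is not part of scikit-learn; it returns 0.0 when a ratio would divide by zero.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Hypothetical toy labels
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
for label in sorted(set(y_true)):
    p, r, f1, s = per_class_metrics(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

Support is simply the number of true samples per class, which is why a class with very few samples can show extreme precision or recall values.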
# Small example using scikit-learn's classification_report
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
"""
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
"""
5.6.5. Obtain Reproducible Optimization Results in Optuna#
Optuna is a powerful hyperparameter optimization framework that supports many machine learning frameworks, including TensorFlow, PyTorch, and XGBoost.
But you need to take care if you want reproducible results from hyperparameter tuning.
To achieve reproducible results, you need to set the seed for your Sampler.
Below you can see how it is done for TPESampler.
import optuna
from optuna.samplers import TPESampler
def objective(trial):
    ...
sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)
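To see why a fixed seed makes the search repeatable, here is a minimal sketch of a seeded random search in plain Python. It is not Optuna's TPE algorithm; random_search and toy_objective are hypothetical names used only to show that a sampler with a fixed seed proposes the same candidates on every run.

```python
import random


def random_search(objective, n_trials, seed):
    """Minimize `objective` over [-10, 10] with a privately seeded RNG."""
    rng = random.Random(seed)  # independent of the global random state
    best_x, best_value = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(-10.0, 10.0)  # the "sampler" proposes a candidate
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


def toy_objective(x):
    return (x - 2.0) ** 2


run_a = random_search(toy_objective, n_trials=100, seed=42)
run_b = random_search(toy_objective, n_trials=100, seed=42)
print(run_a == run_b)  # True
```

Note that a seeded sampler only helps if the objective itself is deterministic and the trials run sequentially; parallel execution can change the order in which results reach the sampler.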
5.6.6. Find bad labels with doubtlab#
Do you want to find bad labels in your data?
Try doubtlab for Python.
With doubtlab, you can define reasons to doubt your labels and take a closer look.
Reasons to doubt your labels can be, for example:
ProbaReason: When the confidence values are low for any label
WrongPredictionReason: When a model cannot predict the listed label
DisagreeReason: When two models disagree on a prediction
RelativeDifferenceReason: When the relative difference between label and prediction is too high
So, identify your noisy labels and fix them.
!pip install doubtlab
from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
model = LogisticRegression()
model.fit(X, y)
# Define reasons to check
reasons = {
'proba': ProbaReason(model=model),
'wrong_pred': WrongPredictionReason(model=model),
}
# Pass reasons to DoubtLab instance
doubt = DoubtEnsemble(**reasons)
# Returns DataFrame with reasoning
predicates = doubt.get_predicates(X, y)
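The idea behind a confidence-based reason can be sketched without the library. The snippet below flags samples whose highest predicted class probability falls below a threshold. It only illustrates the concept; flag_low_confidence, the threshold value, and the probabilities are assumptions and do not reproduce doubtlab's implementation.

```python
def flag_low_confidence(probas, max_proba=0.55):
    """Return 1.0 for rows whose highest class probability is below max_proba."""
    return [1.0 if max(row) < max_proba else 0.0 for row in probas]


# Hypothetical predicted class probabilities for three samples
probas = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.10, 0.50, 0.40],  # uncertain prediction
]
print(flag_low_confidence(probas))  # [0.0, 1.0, 1.0]
```

Several such flag columns, one per reason, can then be summed per row to rank the samples that deserve a manual look first.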
5.6.7. Get notified when your model is finished with training#
Never stare at your screen, waiting for your model to finish training.
Try knockknock for Python.
knockknock is a library that notifies you when your training is finished.
You only need to add a decorator.
Currently, you can get a notification through 12 different channels like:
Email
Slack
Telegram
Discord
MS Teams
Use it for your future model training and don't stay glued to your screen.
!pip install knockknock
from knockknock import email_sender
@email_sender(recipient_emails=["coolmail@python.com", "2coolmail@python.com"], sender_email="anothercoolmail@python.com")
def train_model(model, X, y):
    model.fit(X, y)
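The decorator pattern behind this kind of notification can be sketched in a few lines. The example below is a hypothetical stand-in, not knockknock's code: notify_when_done takes any callable as its channel (here a list's append method instead of an email or chat message) and reports when the wrapped function finishes or crashes.

```python
import functools
import time


def notify_when_done(send):
    """Decorator factory: call `send(message)` once the wrapped function ends."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                send(f"{func.__name__} crashed: {exc!r}")
                raise  # keep the original error visible to the caller
            elapsed = time.perf_counter() - start
            send(f"{func.__name__} finished in {elapsed:.2f}s")
            return result
        return wrapper
    return decorator


messages = []  # stand-in for a real notification channel


@notify_when_done(messages.append)
def train_model():
    return sum(i * i for i in range(1000))  # placeholder for real training


train_model()
print(messages)
```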
5.6.8. Get Model Summary in PyTorch with torchinfo#
Do you want a model summary in PyTorch, like model.summary() in Keras?
Use torchinfo.
With torchinfo, you can get a model summary as you know it from Keras.
Just add one line of code.
!pip install torchinfo
import torch
from torchinfo import summary
class MyModel(torch.nn.Module):
    ...
model = MyModel()
BATCH_SIZE = 16
summary(model, input_size=(BATCH_SIZE, 1, 28, 28))
'''
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Net                                      [16, 10]                  --
├─Sequential: 1-1                        [16, 4, 7, 7]             --
│    └─Conv2d: 2-1                       [16, 4, 28, 28]           40
│    └─BatchNorm2d: 2-2                  [16, 4, 28, 28]           8
│    └─ReLU: 2-3                         [16, 4, 28, 28]           --
│    └─MaxPool2d: 2-4                    [16, 4, 14, 14]           --
│    └─Conv2d: 2-5                       [16, 4, 14, 14]           148
│    └─BatchNorm2d: 2-6                  [16, 4, 14, 14]           8
│    └─ReLU: 2-7                         [16, 4, 14, 14]           --
│    └─MaxPool2d: 2-8                    [16, 4, 7, 7]             --
├─Sequential: 1-2                        [16, 10]                  --
│    └─Linear: 2-9                       [16, 10]                  1,970
==========================================================================================
Total params: 2,174
Trainable params: 2,174
Non-trainable params: 0
Total mult-adds (M): 1.00
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 1.00
Params size (MB): 0.01
Estimated Total Size (MB): 1.06
==========================================================================================
'''
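The Param # column can be checked by hand with the standard parameter-count formulas. The sketch below assumes 3x3 convolution kernels with bias (an assumption inferred from the counts, since the model definition is elided) and reproduces the total of 2,174.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # one kernel per (input, output) channel pair, plus one bias per output channel
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def batchnorm2d_params(num_features):
    return 2 * num_features  # learnable scale and shift per channel


def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights plus biases


layer_params = {
    "Conv2d: 2-1": conv2d_params(1, 4, 3),
    "BatchNorm2d: 2-2": batchnorm2d_params(4),
    "Conv2d: 2-5": conv2d_params(4, 4, 3),
    "BatchNorm2d: 2-6": batchnorm2d_params(4),
    "Linear: 2-9": linear_params(4 * 7 * 7, 10),  # flattened [4, 7, 7] input
}
total_params = sum(layer_params.values())
print(layer_params)
print(total_params)  # 2174
```

ReLU and MaxPool2d have no learnable parameters, which is why they show "--" in the summary.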
5.6.9. Boost scikit-learn's performance with Intel Extension#
Scikit-learn is one of the most popular ML packages for Python.
But, to be honest, its algorithms are not the fastest ones.
With Intel's Extension for scikit-learn, scikit-learn-intelex, you can speed up training time for some popular algorithms like:
Support Vector Classifier/Regressor
Random Forest Classifier/Regressor
LASSO
DBSCAN
Just add two lines of code.
!pip install scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.svm import SVR
from sklearn.datasets import make_regression
X, y = make_regression(
n_samples=100000,
n_features=10,
noise=0.5)
svr = SVR()
svr.fit(X, y)
5.6.10. Incorporate Domain Knowledge into XGBoost with Feature Interaction Constraints#
Want to incorporate your domain knowledge into XGBoost?
Try using Feature Interaction Constraints.
Feature Interaction Constraints allow you to control which features are allowed to interact with each other and which are not while building the trees.
For example, the constraint [0, 1] means that Feature_0 and Feature_1 are allowed to interact with each other but with no other variable. Similarly, [3, 5, 9] means that Feature_3, Feature_5, and Feature_9 are allowed to interact with each other but with no other variable.
With this in mind, you can define feature interaction constraints:
Based on domain knowledge, when you know that some feature interactions will lead to better results
Based on regulatory constraints in your industry/company where some features cannot interact with each other
import xgboost as xgb
X, y = ...
dmatrix = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "interaction_constraints": [[0, 2], [1, 3, 4]],
}
model_with_constraints = xgb.train(params, dmatrix)
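One simplified way to read such a constraint list, assuming the groups do not overlap, is that two features may appear together in a tree branch only if some group contains both. The helper may_interact below is a hypothetical illustration of that reading and not how XGBoost enforces constraints internally.

```python
def may_interact(feature_a, feature_b, constraints):
    """True if at least one constraint group contains both feature indices."""
    return any(
        feature_a in group and feature_b in group
        for group in constraints
    )


constraints = [[0, 2], [1, 3, 4]]
print(may_interact(0, 2, constraints))  # True
print(may_interact(1, 4, constraints))  # True
print(may_interact(0, 1, constraints))  # False
```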
5.6.11. Powerful AutoML with FLAML#
Do you always hear about AutoML and want to try it out?
Use FLAML for Python.
FLAML (Fast and Lightweight AutoML) is an AutoML package developed by Microsoft.
It can do Model Selection, Hyperparameter tuning, and Feature Engineering automatically.
Thus, it removes the pain of choosing the best model and parameters so that you can focus more on your data.
By default, its estimator list contains mainly tree-based models like XGBoost, CatBoost, and LightGBM. But you can also add custom models.
A powerful library!
!pip install flaml
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
5.6.12. Aspect-based Sentiment Analysis with PyABSA#
Traditional sentiment analysis focuses on determining the overall sentiment of a piece of text.
For example, the sentence:
"The food was bad and the staff was rude"
would output only a negative sentiment.
But what if you want to extract which aspects have a negative or positive sentiment?
That's the job of aspect-based sentiment analysis.
It aims to identify and extract the sentiment expressed towards specific aspects of a text.
For the sentence:
"The battery life is excellent but the camera quality is bad."
a model's output would be:
Battery life: positive
Camera quality: negative
With aspect-based sentiment analysis, you can understand the opinions and feelings expressed about specific aspects.
To do that in Python, use the package PyABSA.
It contains pre-trained models with an easy-to-use API for aspect-term extraction and sentiment classification.
PyABSA can be used for a variety of applications, such as:
Customer feedback analysis
Product reviews analysis
Social media monitoring
!pip install pyabsa==1.16.27
from pyabsa import ATEPCCheckpointManager
extractor = ATEPCCheckpointManager.get_aspect_extractor(
checkpoint="multilingual",
auto_device=False
)
example = ["Location and food were excellent but staff was very unfriendly."]
result = extractor.extract_aspect(inference_source=example, pred_sentiment=True)
print(result)
5.6.13. Use XGBoost for Random Forests#
Are you still using Random Forests from sklearn?
XGBoost implements Random Forests too, and it is often faster than sklearn's implementation.
import numpy as np
from xgboost import XGBRFRegressor
xgbrf = XGBRFRegressor(n_estimators=100)
X = np.random.rand(100000, 10)
y = np.random.rand(100000)
xgbrf.fit(X, y)
5.6.14. Identify problematic images with cleanvision#
Your Deep Learning model doesn't perform?
It's probably because of your data.
With cleanvision, you can detect issues in image data.
cleanvision is a relatively new data-centric AI package to find problems in your image dataset.
It can detect issues like:
Exact or Near Duplicates
Blurry Images
Odd Aspect Ratios
Irregularly Dark/Light images
Images lacking content
A good first step to try before applying crazy Vision Transformers.
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
!unzip -q image_files.zip
!pip install cleanvision
from cleanvision.imagelab import Imagelab
# Path to your dataset, you can specify your own dataset path
dataset_path = "./image_files/"
# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)
# Find issues
imagelab.find_issues()
# Get summary of issues with prevalence per issue
imagelab.issue_summary
# Visualize Top examples for blurry images
imagelab.visualize(issue_types=['blurry'])
5.6.15. Select the optimal Regularization Parameter#
How do you choose your Regularization Parameter?
Your model's complexity decreases with a higher Regularization Parameter (Alpha).
It shouldn't be too high or too low.
Yellowbrick's AlphaSelection can help you to find the best Alpha.
It takes your model and visualizes the Alpha/Error curve so you can see how the model's error responds to different alpha values.
Below you can see how to do it with scikit-learn's LassoCV.
import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import AlphaSelection
X, y = load_concrete()
# Create a list of alphas to cross-validate against
alphas = np.linspace(0, 10, 30)
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
visualizer.show()
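To see how a larger alpha reduces model complexity, consider a Lasso with a single feature and no intercept. For the objective (1 / (2n)) * sum((y - w*x)^2) + alpha * |w|, the solution has a closed form via soft-thresholding. The sketch below uses hypothetical data and a hypothetical helper lasso_1d; it is not what LassoCV runs internally.

```python
def lasso_1d(xs, ys, alpha):
    """Closed-form Lasso coefficient for one feature without intercept."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n  # feature/target correlation term
    z = sum(x * x for x in xs) / n                # feature scale
    if rho > alpha:
        return (rho - alpha) / z
    if rho < -alpha:
        return (rho + alpha) / z
    return 0.0  # the penalty removes the feature entirely


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x
coefficients = [lasso_1d(xs, ys, alpha) for alpha in [0.0, 1.0, 5.0, 20.0]]
print(coefficients)
```

With alpha = 0 the coefficient is the unpenalized 2.0, it shrinks as alpha grows, and at alpha = 20 it is exactly zero. An Alpha/Error curve shows where this shrinkage starts to hurt the fit.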
5.6.16. Decision Forests in TensorFlow#
Did you know there are Decision Forests from TensorFlow?
tensorflow_decision_forests implements decision forest models like Random Forest or GBDT for classification, regression, and ranking.
!pip install tensorflow_decision_forests
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_decision_forests as tfdf
dataset_path = tf.keras.utils.get_file(
"adult.csv",
"https://raw.githubusercontent.com/google/yggdrasil-decision-forests/"
"main/yggdrasil_decision_forests/test_data/dataset/adult.csv")
dataset_df = pd.read_csv(dataset_path)
test_indices = np.random.rand(len(dataset_df)) < 0.30
test_ds_pd = dataset_df[test_indices]
train_ds_pd = dataset_df[~test_indices]
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label="income")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label="income")
model = tfdf.keras.GradientBoostedTreesModel(verbose=2)
model.fit(train_ds)
print(model.summary())
5.6.17. AutoML with AutoGluon#
Do you always hear about AutoML and want to try it out?
Use AutoGluon for Python.
AutoGluon is a Python package from AWS.
It lets you perform AutoML on:
Tabular Data (Classification, Regression)
Time Series Data
Multimodal Data (Images + Text + Tabular)
Thus, it removes the pain of choosing the best model and best parameter.
AutoGluon also offers utilities for EDA, like:
Detecting Covariate Shift
Target Variable Analysis
Feature Interaction Charts
See below for a quickstart for tabular data.
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data, time_limit=240)
predictor.leaderboard(test_data)
5.6.18. Visualize Keras Models with visualkeras#
Do you want some cool visualization for your Deep Learning models?
Try visualkeras.
visualkeras visualizes your Keras models (as an alternative to model.summary()).
!pip install visualkeras
import visualkeras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
visualkeras.layered_view(model, legend=True, to_file='output.png').show()
5.6.19. Perform Multilabel Stratified KFold with iterative-stratification#
When doing cross-validation for classification, StratifiedKFold from scikit-learn is a common choice.
Stratification aims to guarantee that every fold represents all strata of the data.
But scikit-learn doesn't support stratifying multilabel data.
For this use case, try the iterative-stratification package.
It offers implementations for stratifying multilabel data in different ways.
See below how we can use MultilabelStratifiedKFold.
!pip install iterative-stratification
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
5.6.20. Interpret your Model with Shapash#
Nobody cares about your SOTA ML model if nobody can understand its predictions.
Therefore, interpretability of ML models is a crucial point in industry use cases.
To overcome this hurdle, use Shapash for Python.
Shapash offers several types of interpretability methods to understand your model's predictions, like:
Feature Importance
Feature Contribution
LIME
SHAP
It also comes with an intuitive GUI to interact with.
!pip install shapash
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
y_df=house_df['SalePrice'].to_frame()
X_df=house_df[house_df.columns.difference(['SalePrice'])]
categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']
encoder = OrdinalEncoder(
cols=categorical_features,
handle_unknown='ignore',
return_df=True).fit(X_df)
X_df=encoder.transform(X_df)
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
regressor = LGBMRegressor(n_estimators=100).fit(Xtrain,ytrain)
from shapash import SmartExplainer
xpl = SmartExplainer(
model=regressor,
preprocessing=encoder,
features_dict=house_dict
)
xpl.compile(x=Xtest,
y_target=ytest
)
app = xpl.run_app(title_story='House Prices', port=8020)
5.6.21. Validate Your Model and Data with Deepchecks#
Validating your model and data is crucial in ML.
Not testing them will cause huge problems in production.
To change that, use deepchecks.
deepchecks is an open-source solution which offers a suite of detailed validation methods.
It will calculate and visualize a bunch of things like:
Train/Test Performance
Predictive Power Score
Feature Drift
Label Drift
Weak Segments for your model
A powerful tool to consider for testing your models and datasets.
!pip install deepchecks
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite
# Load Data
iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
label_col = 'target'
df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)
# Train Model
rf_clf = RandomForestClassifier(random_state=0)
rf_clf.fit(df_train.drop(label_col, axis=1), df_train[label_col])
ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test = Dataset(df_test, label=label_col, cat_features=[])
suite = full_suite()
suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)
5.6.22. Visualize high-performance Features with Optuna#
Optuna released a new feature for detecting high-performing parameters.
Its plot_rank() function visualizes different parameters, with individual points representing individual trials.
Since the plot is interactive, you can also hover over it and dive deeper into analysing your hyperparameter optimization.
from sklearn.ensemble import RandomForestClassifier
import optuna
def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=50,
        criterion="gini",
        max_depth=trial.suggest_int('Mdpth', 2, 32, log=True),
        min_samples_split=trial.suggest_int('mspl', 2, 32, log=True),
        min_samples_leaf=trial.suggest_int('mlfs', 1, 32, log=True),
        min_weight_fraction_leaf=trial.suggest_float('mwfr', 0.0, 0.5),
        max_features=trial.suggest_int("Mfts", 1, 15),
        max_leaf_nodes=trial.suggest_int('Mnods', 4, 100, log=True),
        min_impurity_decrease=trial.suggest_float('mid', 0.0, 0.5),
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
# Optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)
# Get parameters sorted by the importance values
importances = optuna.importance.get_param_importances(study)
params_sorted = list(importances.keys())
# Plot
fig = optuna.visualization.plot_rank(study, params=params_sorted[:4])
fig.show()
5.6.23. Model Ensembling with combo#
Looking at the top solutions on Kaggle, you will notice one thing:
There is usually some sort of combination of various ML models involved.
With combo for Python, you can combine:
Multiple Classifiers
Multiple Anomaly Detection Models
Multiple Clustering Models
combo also offers multiple combination methods for every category.
!pip install combo
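from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import load_iris
from combo.models.cluster_comb import ClustererEnsemble
# Example data (an assumption; the original snippet left X and n_clusters undefined)
X, _ = load_iris(return_X_y=True)
n_clusters = 3
estimators = [KMeans(n_clusters=n_clusters),
              MiniBatchKMeans(n_clusters=n_clusters),
              AgglomerativeClustering(n_clusters=n_clusters)]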
clf = ClustererEnsemble(estimators, n_clusters=n_clusters)
clf.fit(X)
aligned_labels = clf.aligned_labels_
predicted_labels = clf.labels_
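For classifiers, the simplest combination method is majority voting. The sketch below shows the idea in plain Python with hypothetical predictions from three models; majority_vote is an illustrative helper and not combo's API.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models, sample by sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three classifiers for four samples
preds_a = [0, 1, 1, 0]
preds_b = [0, 1, 0, 0]
preds_c = [1, 1, 1, 0]
print(majority_vote([preds_a, preds_b, preds_c]))  # [0, 1, 1, 0]
```

Clusterers cannot be combined this directly because their label numbering is arbitrary, which is why the labels have to be aligned across models first.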
5.6.24. Residual Plots with yellowbrick#
To analyze the variance of the error of your regression model, use ResidualsPlot from yellowbrick.
With residual plots, you can see how well-fitted your model is.
If the residuals are randomly dispersed around the horizontal axis, a linear regression model is typically suitable, whereas in cases of non-random dispersion, a non-linear model is a better choice.
See below how you can easily implement that with yellowbrick.
!pip install yellowbrick
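from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import ResidualsPlot
# Load and split the data (split ratio and seed are assumptions)
X, y = load_concrete()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = Lasso()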
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
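What a non-random residual pattern looks like can be shown with a few numbers. The sketch below fits a straight line by ordinary least squares to deliberately quadratic toy data (an assumption for illustration) and prints the residuals that a residual plot would display.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]  # deliberately non-linear target
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. Such a U-shape is the kind of systematic pattern that suggests a linear model is missing structure in the data.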
5.6.25. Powerful and Distributed Hyperparameter Optimization with ray.tune#
Do you need hyperparameter tuning on steroids?
Try tune from ray.
tune performs distributed hyperparameter tuning with multi-GPU and multi-node support, utilizing all the hardware you have.
It supports the most popular ML libraries and integrates many other common hyperparameter optimization tools like Optuna or Hyperopt.
!pip install "ray[tune]"
import sklearn.datasets
import sklearn.metrics
import xgboost as xgb
from ray import train, tune
from sklearn.model_selection import train_test_split
def train_breast_cancer(config):
    data, labels = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.25)
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    results = {}
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        evals_result=results,
        verbose_eval=False,
    )
    accuracy = 1.0 - results["eval"]["error"][-1]
    train.report({"mean_accuracy": accuracy, "done": True})
config = {
"objective": "binary:logistic",
"eval_metric": ["logloss", "error"],
"min_child_weight": tune.choice([1, 2, 3]),
"subsample": tune.uniform(0.5, 1.0),
}
tuner = tune.Tuner(
train_breast_cancer,
tune_config=tune.TuneConfig(
num_samples=10,
),
param_space=config,
)
results = tuner.fit()
print(results.get_best_result(metric="mean_accuracy", mode="max").config)
5.6.26. Use PyTorch with scikit-learn API with skorch#
PyTorch and scikit-learn are two of the most popular libraries for ML/DL.
So, why not combine PyTorch with scikit-learn?
Try skorch!
skorch is a high-level library for PyTorch that provides a scikit-learn-compatible neural network module.
It allows you to use the simple scikit-learn interface for PyTorch.
Therefore, you can integrate PyTorch models into scikit-learn workflows.
See below for an example.
!pip install skorch
from torch import nn
from skorch import NeuralNetClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=nn.ReLU()):
        super().__init__()
        self.dense = nn.Linear(20, num_units)
        self.nonlin = nonlin
        # forward() uses dropout, so it has to be defined (the rate 0.5 is an assumption)
        self.dropout = nn.Dropout(0.5)
        self.output = nn.Linear(num_units, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense(X))
        X = self.dropout(X)
        X = self.softmax(self.output(X))
        return X
net = NeuralNetClassifier(
MyModule,
max_epochs=10,
lr=0.1,
iterator_train__shuffle=True,
)
pipe = Pipeline([
('scale', StandardScaler()),
('net', net),
])
pipe.fit(X, y)
y_proba = pipe.predict_proba(X)
5.6.27. Online ML with river#
Do you want ML models that learn on-the-fly from massive datasets?
Try river.
river is a library for online machine learning.
You can continuously update your model with streaming data without using the full dataset for training again.
It provides online implementations for many algorithms like KNN, tree-based models, and recommender systems.
!pip install river
from river import compose
from river import linear_model
from river import metrics
from river import preprocessing
from river import datasets
dataset = datasets.Phishing()
model = compose.Pipeline(
preprocessing.StandardScaler(),
linear_model.LogisticRegression()
)
metric = metrics.Accuracy()
for x, y in dataset:
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)
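The loop above follows a "predict first, then learn" pattern: each sample is used for evaluation before the model has seen its label. A minimal from-scratch sketch of the same pattern is shown below. OnlineLogisticRegression and the toy stream are assumptions for illustration; the method names merely mirror the ones used above and the class is not river's implementation.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one sample at a time with SGD."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature name -> weight
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict_one(self, x):
        return self.predict_proba_one(x) >= 0.5

    def learn_one(self, x, y):
        error = self.predict_proba_one(x) - float(y)  # gradient of the log loss
        self.bias -= self.lr * error
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v


# Hypothetical stream: the label is True exactly when the feature is positive
stream = [({"x": 1.0 if i % 2 == 0 else -1.0}, i % 2 == 0) for i in range(200)]

model = OnlineLogisticRegression()
correct = 0
for x, y in stream:
    y_pred = model.predict_one(x)       # 1. predict before seeing the label
    correct += int(y_pred == y)         # 2. update the running metric
    model.learn_one(x, y)               # 3. only then learn from the sample
accuracy = correct / len(stream)
print(f"progressive accuracy: {accuracy:.3f}")
```

Because the model never stores the stream, its memory use stays constant no matter how many samples arrive.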
5.6.28. SOTA Computer Vision Models with timm#
Do you want to use SOTA computer vision models?
Try timm.
timm (PyTorch Image Models) is a library which contains multiple computer vision models, layers, optimizers, etc.
It provides models like Vision Transformer, MobileNet, Swin Transformer, ConvNeXt, DenseNet, and more.
You just have to pass the name of the model and specify whether you want its pretrained weights.
!pip install timm
import torch
import timm
print(timm.list_models())
model = timm.create_model('densenet121', pretrained=True)
output = model(torch.randn(2, 3, 224, 224))
5.6.29. Generate Guaranteed Prediction Intervals and Sets with MAPIE#
For quantifying uncertainties of your models, use MAPIE.
MAPIE (Model Agnostic Prediction Interval Estimator) takes your sklearn-/tensorflow-/pytorch-compatible model and generates prediction intervals or sets with guaranteed coverage.
!pip install mapie
from mapie.regression import MapieRegressor
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=59)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
regressor = LinearRegression()
mapie_regressor = MapieRegressor(regressor)
mapie_regressor.fit(X_train, y_train)
alpha = [0.05, 0.20]
y_pred, y_pis = mapie_regressor.predict(X_test, alpha=alpha)
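Coverage guarantees of this kind typically come from conformal prediction. The sketch below shows the basic split conformal recipe in plain Python: take absolute residuals on a held-out calibration set, pick the ceil((n + 1) * (1 - alpha))-th smallest one, and use it as the half-width of every interval. The function name and calibration data are assumptions, and the code is not MAPIE's implementation.

```python
import math


def split_conformal_halfwidth(calib_true, calib_pred, alpha):
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    scores = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile rank
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical calibration targets and model predictions
calib_pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
calib_true = [1.0, 2.25, 2.75, 4.5, 4.5, 6.75, 8.0]

q = split_conformal_halfwidth(calib_true, calib_pred, alpha=0.25)
new_prediction = 10.0
interval = (new_prediction - q, new_prediction + q)
print(q, interval)  # 0.75 (9.25, 10.75)
```

A smaller alpha asks for higher coverage and therefore selects a larger residual, which widens the intervals.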
5.6.30. Extra Components For scikit-learn with scikit-lego#
scikit-learn is one of the most popular ML libraries.
While it's easy to write custom components, it would be nice to have all of them in a single place.
scikit-lego is such a library which contains many custom components like:
DebugPipeline, which adds debug information to pipelines
ImbalancedLinearRegression to punish over-/underestimation of a model
add_lags to add lag values to a DataFrame
ZeroInflatedRegressor, which predicts zero or applies a regression based on a classifier
and many more!
!pip install scikit-lego
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier
...
pipeline = Pipeline([
("scale", StandardScaler()),
("random_noise", RandomAdder()),
("model", GMMClassifier())
])
...
5.6.31. Quantize your Models with torchao#
Quantizing your Deep Learning models has never been easier.
With torchao, you can quantize and sparsify your models with 1 line of code.
If you are unsure which method to use, you can even use the autoquant method to quantize your layers automatically.
!pip install torchao
import torch
import torchao
# `model` is assumed to be an existing torch.nn.Module
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
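To give an intuition for what quantization does to the weights, here is a minimal sketch of symmetric int8 quantization in plain Python. It is a conceptual illustration under simplifying assumptions (one scale per tensor, hypothetical helper names and weights) and not what torchao does internally.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical float weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes)     # [50, -127, 0, 100]
print(restored)
```

Each weight is now stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per value.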