5.7. Time Series
5.7.1. Check Seasonality automatically with darts
Seasonality describes a pattern that repeats regularly over time.
Identifying and understanding the seasonality in time series can boost the performance of your model.
But you don’t have to find the seasonality effect and period by yourself.
Instead, you can use check_seasonality() from darts.
It checks whether the time series is seasonal and also returns the period, which is inferred from the autocorrelation function (ACF).
In the example below, it returns a seasonal period of 12, because the Air Passengers dataset has a monthly frequency with a yearly cycle.
!pip install darts
from darts.utils.statistics import check_seasonality
from darts.datasets import AirPassengersDataset
ts = AirPassengersDataset().load()
# Returns a tuple: whether the series is seasonal, and the inferred period
is_seasonal, period = check_seasonality(ts)
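To build intuition for how a period can be read off the autocorrelation function, here is a minimal NumPy sketch. It is a simplification and not darts' actual implementation, which also checks whether the autocorrelation at the candidate lag is statistically significant. The helper `acf` and the synthetic series are assumptions made for illustration.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


# Synthetic series: a 12-step cycle plus a little noise (made-up data)
rng = np.random.default_rng(0)
t = np.arange(144)
series = np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=t.size)

r = acf(series, max_lag=24)

# Candidate periods are the lags where the ACF has a local maximum
candidates = [k for k in range(2, 24) if r[k] > r[k - 1] and r[k] > r[k + 1]]
# Pick the candidate with the strongest autocorrelation
period = max(candidates, key=lambda k: r[k])
print("inferred period:", period)
```

On real data with a trend, the series usually needs detrending first, otherwise the trend dominates the ACF.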
5.7.2. Cross-validation for Time Series Data with TimeSeriesSplit
How do you cross-validate time series data?
Standard K-fold cross-validation is not suitable for it.
With K-fold, you partition the data into k folds, then train and evaluate the model k times, each time using a different fold as the test set and the rest of the data as the training set.
This leads to issues because the model is trained on data from both before and after the test data, so information from the future leaks into training.
This can result in overfitting or biased estimates of model performance.
Instead, use TimeSeriesSplit from scikit-learn.
TimeSeriesSplit ensures that the model is only trained on past values and tested on future data.
This gives you a more accurate and less biased assessment of the model’s performance.
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.ensemble import GradientBoostingRegressor
X, y = ...
model = GradientBoostingRegressor()
ts_cv = TimeSeriesSplit(n_splits=3)
scores = cross_validate(model, X, y, cv=ts_cv, scoring='neg_mean_squared_error')
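To see the difference between the two splitters, the sketch below prints the index sets each one produces on a toy array of 12 time-ordered observations. The toy data and the `kfold_leaks` / `ts_leaks` flags are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (toy data)

kfold_splits = list(KFold(n_splits=4).split(X))
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))

print("KFold:")
for train_idx, test_idx in kfold_splits:
    print("  train:", train_idx, "test:", test_idx)

print("TimeSeriesSplit:")
for train_idx, test_idx in ts_splits:
    print("  train:", train_idx, "test:", test_idx)

# A split "leaks" if any training index lies after the start of the test fold
kfold_leaks = any(tr.max() > te.min() for tr, te in kfold_splits)
ts_leaks = any(tr.max() > te.min() for tr, te in ts_splits)
print("KFold trains on future data:", kfold_leaks)
print("TimeSeriesSplit trains on future data:", ts_leaks)
```

With TimeSeriesSplit the training window grows with each split, while the test fold always comes after it.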
5.7.3. More Cross-Validation with tscv
How do you cross-validate time series data?
Standard K-fold cross-validation is not suitable, because each model is trained on data from both before and after its test fold.
This leaks future information into training and can result in overfitting or biased estimates of model performance.
Instead, use the tscv package for Python.
tscv offers three classes that split your data correctly, each leaving a gap between the training and test sets:
GapLeavePOut
GapKFold
GapRollForward
This gives you a more accurate and less biased assessment of the model’s performance.
!pip install tscv
from tscv import GapRollForward
cv = GapRollForward(min_train_size=3, gap_size=1, max_test_size=2)
for train, test in cv.split(range(10)):
    print("train:", train, "test:", test)
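The idea behind a roll-forward split with a gap can be sketched in plain Python. The generator below, `gap_roll_forward_sketch`, is a hypothetical helper written for illustration. It is not tscv's implementation, and its splits are not guaranteed to match what GapRollForward returns for the same settings.

```python
def gap_roll_forward_sketch(n_samples, min_train_size, gap_size, test_size):
    """Yield (train, test) index lists with a gap between them (illustrative)."""
    train_end = min_train_size
    while train_end + gap_size + test_size <= n_samples:
        train = list(range(0, train_end))
        test_start = train_end + gap_size  # skip gap_size samples after training
        test = list(range(test_start, test_start + test_size))
        yield train, test
        train_end += test_size  # roll the training window forward


splits = list(
    gap_roll_forward_sketch(n_samples=10, min_train_size=3, gap_size=1, test_size=2)
)
for train, test in splits:
    print("train:", train, "test:", test)
```

The skipped samples keep observations that are strongly correlated with the end of the training window out of the test set.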
5.7.4. Time Series Forecasting with Machine Learning with mlforecast
Do you want to perform powerful time series forecasting?
Try mlforecast by Nixtla.
mlforecast lets you run machine learning models for time series forecasting, even distributed on remote clusters with Ray or Spark.
Feature engineering, support for exogenous variables, and probabilistic forecasting are also included.
!pip install mlforecast
import lightgbm as lgb
from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression
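mlf = MLForecast(
    models=[LinearRegression(), lgb.LGBMRegressor()],
    lags=[1, 12],  # use the values 1 and 12 steps back as features
    freq='M'
)
# df is assumed to be a long-format DataFrame with unique_id, ds and y columns
mlf.fit(df)
# Forecast the next 12 periods
mlf.predict(12)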
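To make the `lags=[1, 12]` argument more concrete, the sketch below builds the same kind of lag features by hand with pandas and fits a linear model on them. It only illustrates the idea and is not how mlforecast works internally. The synthetic data and the `lag1` / `lag12` column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative long-format data: one series with a trend and a 12-month cycle
t = np.arange(48)
df = pd.DataFrame(
    {
        "unique_id": "series_1",
        "ds": pd.date_range("2015-01-01", periods=48, freq="MS"),
        "y": 10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12),
    }
)

# Lag features, computed per series so values never cross series boundaries
features = df.copy()
for lag in (1, 12):
    features[f"lag{lag}"] = features.groupby("unique_id")["y"].shift(lag)

# Rows without a full lag history cannot be used for training
train = features.dropna()

model = LinearRegression().fit(train[["lag1", "lag12"]], train["y"])
print(train.head())
```

The first 12 rows are dropped because they have no value 12 steps back, which is why longer lags need longer histories.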
5.7.5. Lightning Fast Time Series Forecasting with statsforecast
Do you want to perform lightning-fast time series forecasting?
Try statsforecast by Nixtla.
statsforecast lets you run statistical models on your time series data.
According to Nixtla, it’s up to 20x faster than existing libraries like pmdarima and statsmodels.
!pip install statsforecast
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF
sf = StatsForecast(
    models=[AutoARIMA(season_length=12)],  # 12 = yearly cycle in monthly data
    freq='M'
)
sf.fit(df)
# Forecast 12 months ahead with 95% prediction intervals
sf.predict(h=12, level=[95])
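To illustrate what `season_length` and the horizon `h` mean, here is a minimal NumPy sketch of a seasonal naive forecast, one of the simplest statistical baselines. It is not statsforecast code. The function `seasonal_naive` and the toy history are assumptions made for illustration.

```python
import numpy as np


def seasonal_naive(y, season_length, h):
    """Forecast h steps ahead by repeating the last observed season."""
    y = np.asarray(y, dtype=float)
    last_season = y[-season_length:]
    reps = -(-h // season_length)  # ceiling division
    return np.tile(last_season, reps)[:h]


history = np.arange(24, dtype=float)  # two illustrative "years" of monthly values
forecast = seasonal_naive(history, season_length=12, h=14)
print(forecast)
```

A baseline like this is a useful yardstick: a more complex model such as AutoARIMA should beat it to be worth its extra cost.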
5.7.6. Time Series with Polars Backend with functime
Fast time series forecasting with functime.
functime is a Python library for time series forecasting and feature extraction, built with Polars.
Since it uses lazy Polars dataframes, functime speeds up forecasting and feature engineering.
Backtesting, cross-validation splitters, and metrics are included too.
It even comes with an LLM agent to analyze and describe your forecasts.
Check it out!
!pip install functime
import polars as pl
from functime.cross_validation import train_test_split
from functime.forecasting import linear_model
from functime.metrics import mase
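# y is assumed to be a Polars DataFrame of panel data (entity, time, value)
y_train, y_test = y.pipe(train_test_split(test_size=3))
# Option 1: fit, then predict
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train)
y_pred = forecaster.predict(fh=3)
# Option 2: fit and predict in a single call
y_pred = linear_model(freq="1mo", lags=24)(y=y_train, fh=3)
# Score the forecasts
scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)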
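The metric used above, MASE (mean absolute scaled error), compares the forecast error to the in-sample error of a naive forecast. The NumPy sketch below follows the standard textbook definition for a single series. `mase_sketch` and its `sp` (seasonal period) argument are assumptions for illustration and may differ in detail from functime's implementation.

```python
import numpy as np


def mase_sketch(y_true, y_pred, y_train, sp=1):
    """MAE of the forecast divided by the in-sample MAE of a naive forecast."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    mae_forecast = np.mean(np.abs(y_true - y_pred))
    # Naive forecast: predict each training value with the value sp steps earlier
    mae_naive = np.mean(np.abs(y_train[sp:] - y_train[:-sp]))
    return mae_forecast / mae_naive


score = mase_sketch(y_true=[8, 10], y_pred=[9, 8], y_train=[1, 2, 4, 7])
print(score)
```

A score below 1 means the forecast has a smaller average error than the naive forecast had on the training data.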
5.7.7. Time Series Forecasting with Deep Learning with neuralforecast
Do you want to perform powerful time series forecasting?
Try neuralforecast by Nixtla.
neuralforecast lets you run deep learning models for time series forecasting, such as N-BEATS or N-HiTS.
Support for exogenous variables and probabilistic forecasting is also included.
Check the example below!
!pip install neuralforecast
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS
from neuralforecast.utils import AirPassengersDF
Y_df = AirPassengersDF
Y_train_df = Y_df[Y_df.ds<='1959-12-31']
Y_test_df = Y_df[Y_df.ds>'1959-12-31']
horizon = 12
models = [
    NBEATS(input_size=2 * horizon, h=horizon, max_steps=50),
    NHITS(input_size=2 * horizon, h=horizon, max_steps=50),
]
nf = NeuralForecast(models=models, freq='M')
nf.fit(df=Y_train_df)
Y_hat_df = nf.predict().reset_index()
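The relationship between `input_size` and `h` can be pictured as sliding a window over the training series: each window supplies `input_size` past values as model input and the next `h` values as the target. The NumPy sketch below is a conceptual illustration only, not neuralforecast's internal sampling logic, and the stand-in series is made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

horizon = 12
input_size = 2 * horizon

# Stand-in for 11 years of monthly training data (132 points)
series = np.arange(132, dtype=float)

# Every window holds input_size inputs followed by horizon targets
windows = sliding_window_view(series, input_size + horizon)
inputs = windows[:, :input_size]
targets = windows[:, input_size:]

print("inputs:", inputs.shape, "targets:", targets.shape)
```

A longer `input_size` gives the model more context per sample but leaves fewer training windows for a series of the same length.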
5.7.8. Efficient Preprocessing and Feature Engineering with temporian
temporian is a Python library for preprocessing and feature engineering temporal data to feed into ML libraries like XGBoost, Scikit-learn or PyTorch.
It handles various types of temporal data, such as single- and multivariate data or flat- and multi-index data.
!pip install temporian
import temporian as tp
sales = tp.from_csv("sales.csv")
sales_per_store = sales.add_index("store")
days = sales_per_store.tick_calendar(hour=22)
work_days = (days.calendar_day_of_week() <= 5).filter()
daily_revenue = sales_per_store["revenue"].moving_sum(
    tp.duration.days(1),
    sampling=work_days,
)
5.7.9. Change Point Detection with ruptures
Change point detection has never been easier in Python than with ruptures.
ruptures is a library that provides methods for off-line change point detection and for displaying the results.
It offers multiple exact and approximate detection methods.
!pip install ruptures
import matplotlib.pyplot as plt
import ruptures as rpt
# Generate signal
n_samples, dim, sigma = 1000, 3, 4
n_breakpoints = 4
signal, bkps = rpt.pw_constant(n_samples, dim, n_breakpoints, noise_std=sigma)
# Detection
algo = rpt.Pelt(model="rbf").fit(signal)
result = algo.predict(pen=10)
# Display
rpt.display(signal, bkps, result)
plt.show()
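To show what a change point detector optimizes, here is a minimal NumPy sketch that finds a single change in the mean by brute force: it tries every split position and keeps the one with the lowest total squared error. This is not the Pelt algorithm used above and not ruptures code. The helpers `sse` and `best_single_split` and the synthetic signal are assumptions made for illustration.

```python
import numpy as np


def sse(segment):
    """Sum of squared deviations from the segment mean."""
    return float(np.sum((segment - segment.mean()) ** 2))


def best_single_split(signal, min_size=2):
    """Return the index that best splits the signal into two constant-mean parts."""
    candidates = range(min_size, len(signal) - min_size + 1)
    return min(candidates, key=lambda k: sse(signal[:k]) + sse(signal[k:]))


# Synthetic signal: the mean jumps from 0 to 5 at index 50 (made-up data)
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(50), np.full(50, 5.0)])
signal = signal + rng.normal(scale=0.5, size=signal.size)

split = best_single_split(signal)
print("estimated change point:", split)
```

Brute force works for a single change point, but the search space grows quickly with several of them, which is where dedicated algorithms such as those in ruptures come in.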
5.7.10. Probabilistic Machine Learning with skpro
Use supervised probabilistic prediction like a pro with skpro.
skpro is a scikit-learn-like library for probabilistic predictions and evaluations.
It supports tabular regressors, survival prediction, and reductions that turn scikit-learn regressors into probabilistic skpro regressors.
!pip install skpro
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from skpro.regression.residual import ResidualDouble
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, _ = train_test_split(X, y)
reg_mean = RandomForestRegressor()
reg_resid = LinearRegression()
reg_proba = ResidualDouble(reg_mean, reg_resid)
reg_proba.fit(X_train, y_train)
y_pred_proba = reg_proba.predict_proba(X_new)
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])
y_pred_var = reg_proba.predict_var(X_new)
y_pred_mean = reg_proba.predict(X_new)
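The idea of turning two ordinary regressors into one probabilistic regressor can be sketched with scikit-learn alone: one model predicts the mean, and a second model predicts the size of the first model's residuals. The code below is a simplified illustration under a normality assumption, not skpro's ResidualDouble implementation. The synthetic data and all variable names are assumptions.

```python
from statistics import NormalDist

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data whose noise grows with x (made-up, heteroscedastic)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(size=200) * (0.5 + 0.3 * X[:, 0])

# Step 1: model the conditional mean
mean_model = LinearRegression().fit(X, y)

# Step 2: model the absolute residuals as a rough proxy for the spread
abs_resid = np.abs(y - mean_model.predict(X))
scale_model = LinearRegression().fit(X, abs_resid)

X_new = np.array([[1.0], [5.0], [9.0]])
mean = mean_model.predict(X_new)
scale = np.maximum(scale_model.predict(X_new), 1e-8)  # keep the scale positive

# Central 90% interval, treating the predictive distribution as normal
z = NormalDist().inv_cdf(0.95)
lower, upper = mean - z * scale, mean + z * scale
print(np.column_stack([lower, mean, upper]))
```

Because the second model depends on the features, the interval width can vary from one prediction to the next instead of being constant.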