Hello :) Today is Day 230!
A quick summary of today:
- Protecting against temporal data leaks and skrub’s TableVectorizer
- Embeddings in scikit-learn pipelines
- A little about time series
- Sparse data pipelines
- GMM for outlier detection
Protecting against temporal data leaks and skrub’s TableVectorizer
- protecting against data leaks - when creating time series models, it is crucial that the model learns patterns only from past data and doesn't 'cheat' by peeking at future information. However, it is easy to accidentally introduce data leakage, where the model unintentionally uses future data during training. To help data scientists avoid this mistake, Vincent (the probable instructor) has developed a library that safeguards against such errors by ensuring the model only ever learns from past data (for the plain scikit-learn way to guard against this, see the TimeSeriesSplit sketch after this list)
- TableVectorizer for quick feature preprocessing - this comes from skrub (a library that seems to be maintained by some of the scikit-learn developers). We can include it in a pipeline together with a model and immediately get a working pipeline. The TableVectorizer automatically determines the data types and what kind of preprocessing to apply to them - ordinal encoding, one-hot encoding, scaling, etc. It is a good tool to get us started. As Vincent said, some of his scikit-learn colleagues take a new dataset, put it through a TableVectorizer and a HistGradientBoostingClassifier (or Regressor), and get a pretty good model that can serve as an initial baseline (a sketch of this recipe follows right after this list)
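A minimal sketch of that quick-baseline recipe, assuming a pandas DataFrame df with a target column named "target" (both names are made up here):

from skrub import TableVectorizer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# TableVectorizer infers per-column preprocessing; the boosted trees handle the rest
baseline = make_pipeline(TableVectorizer(), HistGradientBoostingClassifier())
scores = cross_val_score(baseline, df.drop(columns="target"), df["target"], cv=5)
print(scores.mean())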
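And on the leakage point above - this is not the library Vincent mentioned, but plain scikit-learn's TimeSeriesSplit is one simple guard against temporal leakage during cross-validation, since every training fold ends before its test fold begins:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# toy ordered data: the row order plays the role of time
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # training indices always come strictly before the test indices
    print("train up to", train_idx.max(), "| test", test_idx.min(), "to", test_idx.max())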
Embeddings in scikit-learn pipelines
Vincent has been maintaining an embedding library, embetter, that is built on scikit-learn classes so it integrates directly with other scikit-learn components.
Taken from the GitHub repo - with just a few lines of code we can get a text sentiment classifier going:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
It is not here to replace PyTorch or TensorFlow - those are far more powerful libraries. Its purpose is to make setup easy so you can get a prototype going quickly.
A little about time series
For periodic features, scikit-learn has the SplineTransformer:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, SplineTransformer

# NOTE: `datetime_feats` is a helper (not shown here) that turns the datetime
# column into a numeric feature such as day-of-year
pipe = make_pipeline(
    FunctionTransformer(datetime_feats),
    SplineTransformer(n_knots=12, extrapolation="periodic")
)
With 12 knots we get the below. The data is daily weather data.
Then we use this as a feature pipeline and add a Ridge on top:
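A minimal sketch of what that combined model might look like (assuming X is the raw datetime data and y the daily weather target used throughout this example):

from sklearn.linear_model import Ridge

# spline features on the date column, linear model on top
mod_pipe = make_pipeline(pipe, Ridge())
mod_pipe.fit(X, y)
preds = mod_pipe.predict(X)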
To get something with quantiles, we can use the QuantileRegressor:
from sklearn.linear_model import QuantileRegressor

feat_pipe = make_pipeline(
    FunctionTransformer(datetime_feats),
    SplineTransformer(n_knots=6, extrapolation="periodic")
)

mod_pipe_q_lower = make_pipeline(feat_pipe, QuantileRegressor(quantile=0.1, alpha=0.001))
mod_pipe_q_upper = make_pipeline(feat_pipe, QuantileRegressor(quantile=0.9, alpha=0.001))

mod_pipe_q_lower.fit(X, y)
mod_pipe_q_upper.fit(X, y)
Finally, Vincent (the instructor) talked about being careful when using lag features, for fear of data leakage.
Sparse data pipelines
Dense Matrices
- matrices where most elements are non-zero
- requires O(m * n) memory for an m x n matrix
- optimized for general operations, but inefficient for large matrices with many zeros
Sparse Matrices
- matrices where most elements are zero
- represented by scipy.sparse (e.g., CSR, CSC, COO formats)
- only stores the non-zero elements and their positions, leading to significant memory savings (a small sketch below illustrates the difference)
- efficient for operations involving non-zero elements; some operations may require conversion to dense format
Use Cases
- Dense: Use when the matrix is small or fully populated.
- Sparse: Use for large matrices with many zeros (e.g., term-document matrices, graph algorithms).
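A tiny sketch of the memory difference (the shape and density here are made up for illustration):

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# a 2000 x 2000 float64 matrix where roughly 0.1% of entries are non-zero
dense = np.zeros((2000, 2000))
rows = rng.integers(0, 2000, 4000)
cols = rng.integers(0, 2000, 4000)
dense[rows, cols] = 1.0

csr = sparse.csr_matrix(dense)  # stores only the non-zero values plus their positions
print(dense.nbytes)                                              # 32,000,000 bytes
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # a few tens of KB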
An example use case for sparse matrices is text. And scikit-learn uses sparse matrices out of the box:
from sklearn.feature_extraction.text import CountVectorizer
from joblib import dump
from pathlib import Path

# `texts` is a list of raw documents (not shown here)
out = CountVectorizer().fit_transform(texts)  # .todense()
dump(out, 'tmp.pickle')
Path('tmp.pickle').stat().st_size
The size we get is about 2.2 million bytes. If we use a dense matrix instead (by adding .todense() to the code), we get about 1.3 billion bytes. So we can clearly see how much memory sparse matrices save when we have to write something to disk.
Expanding on GMMs
Yesterday I learned about using PCA for outlier detection; today it is Gaussian Mixture Models.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklego.mixture import GMMOutlierDetector

# two noisy moons as the "normal" data; U is a uniform cloud of candidate points to score
n = 1000
X = make_moons(n)[0] + np.random.normal(0, 0.12, (n, 2))  # noise centred at 0
X = StandardScaler().fit_transform(X)
U = np.random.uniform(-2, 2, (10000, 2))

mod = GMMOutlierDetector(n_components=16, threshold=0.95).fit(X)

plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=mod.score_samples(X), s=8)
plt.title("likelihood of points given mixture of 16 gaussians");
plt.subplot(122)
plt.scatter(U[:, 0], U[:, 1], c=mod.predict(U), s=8)
plt.title("outlier selection");
Depending on the threshold, more or fewer points get flagged as outliers.
In the scikit-lego library maintained by Vincent there is also a GMMClassifier.
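A minimal sketch of how that might be used (the dataset and n_components here are just for illustration):

from sklego.mixture import GMMClassifier
from sklearn.datasets import make_moons

Xc, yc = make_moons(1000, noise=0.1)

# fits one Gaussian mixture per class and classifies by the most likely mixture
clf = GMMClassifier(n_components=4).fit(Xc, yc)
clf.predict(Xc[:5])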
The streams are an hour long each, so it takes some time to watch them. Nevertheless, it's quite interesting!
That is all for today!
See you tomorrow :)