(Day 230) Watching more educational videos from probabl

Ivan Ivanov · August 18, 2024

Hello :) Today is Day 230!

A quick summary of today:

  • Protecting against temporal data leaks and skrub’s TableVectorizer
  • Embeddings in scikit-learn pipelines
  • A little about time series
  • Sparse data pipelines
  • GMM for outlier detection

Protecting against temporal data leaks and skrub’s TableVectorizer

  • Protecting against data leaks - when building time series models, it's crucial that the model learns patterns only from past data and doesn't 'cheat' by accessing future information. However, it is easy to accidentally introduce leakage, where the model unintentionally uses future data during training. To help data scientists avoid this mistake, Vincent (the probabl instructor) has developed a library that safeguards against such errors by ensuring the model only learns from past data

  • TableVectorizer for quick feature preprocessing - this comes from skrub (a library maintained by some scikit-learn developers). We can drop it into a pipeline with a model and get a working baseline: the TableVectorizer automatically determines column data types and decides how to process them - ordinal encoding, one-hot encoding, scaling, etc. It is a good tool to get started. As Vincent said, some of his scikit-learn colleagues take a new dataset, put it through a TableVectorizer and a HistGradientBoostingClassifier (or Regressor), and get a pretty good model that serves as an initial checkpoint - see the sketch below
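
A minimal sketch of that baseline, assuming a pandas dataframe df with a target column (the names here are placeholders, not from the stream):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# TableVectorizer inspects each column (numeric, categorical, datetime, text)
# and picks a sensible encoding, so the raw dataframe can be passed straight in
baseline = make_pipeline(
    TableVectorizer(),
    HistGradientBoostingClassifier()
)

baseline.fit(df.drop(columns="target"), df["target"])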

Embeddings in scikit-learn pipelines

Vincent maintains an embedding library, embetter, that is built on scikit-learn classes, so it integrates directly with other scikit-learn components.

Taken from the GitHub repo - with just a few lines of code we can get a text sentiment classifier started:

import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the `text` column from a dataframe
# which then get fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

It is not meant to replace PyTorch or TensorFlow - those are far more powerful libraries. Its purpose is to make it easy to set up and get a prototype going.

A little about time series

For periodic features, scikit-learn provides the SplineTransformer:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, SplineTransformer

# datetime_feats is a user-defined helper; an assumed day-of-year extractor is
# sketched here (the stream's exact implementation may differ)
def datetime_feats(dataf):
    return np.array(dataf.index.dayofyear).reshape(-1, 1)

pipe = make_pipeline(
    FunctionTransformer(datetime_feats),
    SplineTransformer(n_knots=12, extrapolation="periodic")
)

With 12 knots we get the plot below; the data is daily weather data.

image

Then, using this as a feature pipeline and adding a Ridge on top, we get the fit below (a minimal sketch of that combination follows the plot):

image
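
A minimal sketch of that combination, assuming the weather data is available as X (features) and y (target):

from sklearn.linear_model import Ridge

feat_pipe = make_pipeline(
    FunctionTransformer(datetime_feats),
    SplineTransformer(n_knots=12, extrapolation="periodic")
)

mod_pipe = make_pipeline(feat_pipe, Ridge())
mod_pipe.fit(X, y)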

To get something with quantiles, we can use the QuantileRegressor:

from sklearn.linear_model import QuantileRegressor

feat_pipe = make_pipeline(
    FunctionTransformer(datetime_feats),
    SplineTransformer(n_knots=6, extrapolation="periodic")
)

# one model for the lower band (10th percentile) and one for the upper (90th)
mod_pipe_q_lower = make_pipeline(feat_pipe, QuantileRegressor(quantile=0.1, alpha=0.001))
mod_pipe_q_upper = make_pipeline(feat_pipe, QuantileRegressor(quantile=0.9, alpha=0.001))

mod_pipe_q_lower.fit(X, y)
mod_pipe_q_upper.fit(X, y)

image

Finally, Vincent talked about being careful when using lag features, because they can easily introduce data leakage - a small illustration of the idea follows.
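
My own sketch of the idea (not from the stream): lag features should only use values that were already available at prediction time, e.g. via pandas' shift:

import pandas as pd

# toy series ordered by time (hypothetical data)
df = pd.DataFrame({"y": [3, 5, 4, 6, 7]})

# shift(1) uses the previous time step's value - safe
df["y_lag_1"] = df["y"].shift(1)

# shift(-1) would pull the *next* value into the current row - that is leakage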

Sparse data pipelines

Dense Matrices

  • matrices where most elements are non-zero
  • requires O(m * n) memory for an m x n matrix
  • optimized for general operations, but inefficient for large matrices with many zeros

Sparse Matrices

  • matrices with most elements being zero
  • represented by scipy.sparse (e.g., CSR, CSC, COO formats)
  • only stores non-zero elements and their positions, leading to significant memory savings (see the quick sketch below)
  • efficient for operations involving non-zero elements; some operations may require conversion to dense format

Use Cases

  • Dense: Use when the matrix is small or fully populated.
  • Sparse: Use for large matrices with many zeros (e.g., term-document matrices, graph algorithms).
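
To make the memory difference concrete, here is a quick scipy.sparse sketch (my own, not from the stream):

import numpy as np
from scipy import sparse

# a 1000 x 1000 matrix with a single non-zero entry
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0

csr = sparse.csr_matrix(dense)

print(dense.nbytes)                                              # 8,000,000 bytes (float64)
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # a few thousand bytes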

A typical use case for sparse matrices is text, and scikit-learn uses sparse matrices out of the box:

from sklearn.feature_extraction.text import CountVectorizer
from joblib import dump
from pathlib import Path

# `texts` is assumed to be a list of raw text documents
out = CountVectorizer().fit_transform(texts)  # add .todense() to compare with a dense matrix
dump(out, 'tmp.pickle')
Path('tmp.pickle').stat().st_size

The size we get is roughly 2.2 million bytes. If we use a dense matrix instead (by adding .todense() to the code), we get about 1.3 billion bytes. The memory efficiency of sparse matrices is clear when we have to save something to disk.

Expanding on GMMs

Yesterday I learned about using PCA for outlier detection; today, Gaussian Mixture Models.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

from sklego.mixture import GMMOutlierDetector

# two-moons data with Gaussian noise, standardised
# (the nonzero noise mean is removed by the StandardScaler)
n = 1000
X = make_moons(n)[0] + np.random.normal(n, 0.12, (n, 2))
X = StandardScaler().fit_transform(X)

# a uniform grid of candidate points to label as inlier/outlier
U = np.random.uniform(-2, 2, (10000, 2))

mod = GMMOutlierDetector(n_components=16, threshold=0.95).fit(X)

plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=mod.score_samples(X), s=8)
plt.title("likelihood of points given mixture of 16 gaussians")

plt.subplot(122)
plt.scatter(U[:, 0], U[:, 1], c=mod.predict(U), s=8)
plt.title("outlier selection")

image

Depending on the threshold:

image

In the scikit-lego library maintained by Vincent there is also a GMMClassifier - a quick sketch of how it might be used is below.
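
A minimal sketch, assuming labelled two-moons data (my own example, not from the stream):

from sklearn.datasets import make_moons
from sklego.mixture import GMMClassifier

X, y = make_moons(1000, noise=0.12)

# fits a Gaussian mixture per class and predicts the most likely class
clf = GMMClassifier(n_components=4).fit(X, y)
clf.predict(X[:5])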


The streams are about an hour long each, so it takes some time to watch them. Nevertheless, it's quite interesting!

That is all for today!

See you tomorrow :)