(Day 357) MLOps + data drift + starting some summaries for the final blog post

Ivan Ivanov · December 23, 2024

Hello :) Today is Day 357!

A quick summary of today:

  • a bit more of MadeWithML’s MLOps course
  • reading a paper - Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
  • blog posts word viz
  • starting a random Apache Flink course on udemy

Continued with the MadeWithML course

Distributed Data Processing

Using a single machine is fine when the dataset can fit in memory. But for big data we need a distributed framework such as Ray, Spark, Dask, or Modin.

The course chooses to use Ray.

image

After importing a dataset with Ray and splitting it, running the cell below for a second time shows the logical steps in the distributed computation:

image

(all of the code is on MadeWithML’s repo)
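As a rough idea of what working with Ray Data looks like, here is a minimal sketch (not the course's exact code; the file path is a placeholder):

import ray

ray.init()  # start Ray locally (or connect to an existing cluster)

# Read a CSV into a distributed Dataset; "data.csv" is a placeholder path
ds = ray.data.read_csv("data.csv")

# Split into train/validation subsets
train_ds, val_ds = ds.train_test_split(test_size=0.2)

print(train_ds.count(), val_ds.count())
print(train_ds.take(1))  # materialize a single row to inspect it

Because Ray Data executes lazily, printing a dataset (or re-running a cell) surfaces the logical plan of the computation rather than immediately materializing all the data.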

Distributed training

Intuition

  1. Start with a random (chance) model
  2. Develop a rule-based approach using if-else statements, regular expressions, etc
  3. Slowly add complexity by addressing limitations and motivating representations and model architectures
  4. Weigh tradeoffs (performance, latency, size, etc.) between performant baselines
  5. Revisit and iterate on baselines as your dataset grows and new model architectures are developed

Distributed training

With the rapid increase in data and model sizes, we need to be able to distribute our training across multiple machines in order to train in a reasonable amount of time. And we want to be able to do this without having to:

  • set up a cluster by individually provisioning compute resources
  • write complex code to distribute our training across multiple machines
  • worry about communication and resource utilization between our different distributed compute resources
  • worry about fault tolerance and recovery for our large training workloads

Thankfully, Ray Train helps us with all of that.

With distributed training, there is a head node responsible for orchestrating the training process, while the worker nodes are responsible for training the model and communicating results back to the head node. From a user's perspective, Ray abstracts away all of this complexity, and we can simply define our training functionality with minimal changes to our code (as if we were training on a single machine).
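A minimal sketch of what that looks like with Ray Train (simplified, not the course's exact code; the training function body and the config values are placeholders):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # placeholder for the actual per-worker training logic (model, data, epochs);
    # Ray runs this function on every worker and handles coordination for us
    ...

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 10, "lr": 1e-4},  # passed into the function above
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),  # how many workers / whether to use GPUs
)
result = trainer.fit()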

After following the setup for using one of the BERT models, the training can be seen on Ray’s UI:

image

Experiment Tracking

Experiment tracking is the process of managing all the different experiments and their components, such as parameters, metrics, models, and other artifacts. It enables us to:

  • organize all the necessary components of a specific experiment; it's important to have everything in one place and know where it is so we can use it later
  • reproduce past results (easily) using saved experiments
  • log iterative improvements across time, data, ideas, teams, etc.

The prime example is MLflow - it's amazing and free!
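For reference, basic MLflow tracking looks roughly like this (a generic sketch, not the course's code; the experiment name, metric values, and artifact path are made up):

import mlflow

mlflow.set_experiment("llm-finetuning")  # hypothetical experiment name

with mlflow.start_run():
    # log the hyperparameters of this run
    mlflow.log_params({"lr": 1e-4, "dropout_p": 0.5})
    # log metrics per epoch so runs can be compared later
    for epoch, val_loss in enumerate([0.9, 0.7, 0.6]):
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    # log artifacts (e.g. a saved model file) alongside the run
    mlflow.log_artifact("model.pt")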

Hyperparam tuning

Using Ray Tune (alternatives include Optuna and Hyperopt).

I haven't used this lib before, so here is some of the setup that looks interesting:

# Imports (Ray Tune with the HyperOpt search backend)
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.hyperopt import HyperOptSearch

# Hyperparameters to start the search from
initial_params = [{"train_loop_config": {"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 3}}]
search_alg = HyperOptSearch(points_to_evaluate=initial_params)
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=2)  # at most 2 trials in parallel

# Parameter space to sample from
param_space = {
    "train_loop_config": {
        "dropout_p": tune.uniform(0.3, 0.9),
        "lr": tune.loguniform(1e-5, 5e-4),
        "lr_factor": tune.uniform(0.1, 0.9),
        "lr_patience": tune.uniform(1, 10),
    }
}

# Scheduler to prune unpromising trials early
scheduler = AsyncHyperBandScheduler(
    max_t=train_loop_config["num_epochs"],  # max epochs (<time_attr>) per trial; train_loop_config is defined earlier
    grace_period=5,  # min epochs (<time_attr>) per trial before it can be stopped
)

# Finally the tune config
tune_config = tune.TuneConfig(
    metric="val_loss",
    mode="min",
    search_alg=search_alg,
    scheduler=scheduler,
    num_samples=num_runs,  # number of trials; num_runs is defined earlier
)

# Tuner (trainer and run_config come from the earlier Ray Train setup)
tuner = Tuner(
    trainable=trainer,
    run_config=run_config,
    param_space=param_space,
    tune_config=tune_config,
)

# Run the search
results = tuner.fit()

And we can get a DataFrame of the results with results.get_dataframe(), or the best result with results.get_best_result(metric="val_loss", mode="min").
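For example (using Ray Tune's ResultGrid/Result API):

# all trials as a pandas DataFrame (one row per trial)
results_df = results.get_dataframe()

# best trial by validation loss; .config and .metrics hold its
# hyperparameters and final reported metrics
best_result = results.get_best_result(metric="val_loss", mode="min")
print(best_result.config)
print(best_result.metrics)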

Evaluation

  • the usual - precision, recall, F1
  • fine-grained - the above, but per class
  • confusion matrices
  • confident learning: this is something I had not seen before, and it is interesting.

The cleanlab package helps with this, letting us focus on labeling quality. It identifies noisy labels (with calibration), which can then be properly relabeled and used for training. Here is a picture from the cleanlab docs:

image
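A rough sketch of what using cleanlab for this might look like (my own toy example; in practice pred_probs would be out-of-sample probabilities from cross-validation on the real dataset):

import numpy as np
from cleanlab.filter import find_label_issues

# toy example: 4 samples, 2 classes; labels are the given (possibly noisy) labels,
# pred_probs are out-of-sample predicted probabilities
labels = np.array([0, 0, 1, 1])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.8, 0.2]])

# indices of examples whose given label looks inconsistent with the predictions,
# ranked so the most suspicious ones come first
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_idx)  # candidates to inspect and possibly relabel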

  • slicing

Slicing breaks a dataset into subsets to evaluate model performance on specific groups, uncovering hidden issues and biases. Key slices include target/predicted classes, features, metadata, and priority groups. It helps ensure fairness and identify underperforming areas (see the sketch at the end of this section).

To address bias (like the overgeneralization of CNNs for vision or transformers for NLP), teams can:

  • collect more diverse data
  • oversample incorrect predictions
  • mask features contributing to bias

Slicing functions define these subsets, enabling targeted improvements for fair and robust models.

  • interpretability with libs like SHAP and LIME
  • behavioral testing: the process of testing input data and expected outputs while treating the model as a black box
  • online evaluation

A/B test:

image

Canary tests:

image

Shadow tests:

image
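Coming back to slicing, here is a minimal sketch of evaluating per-slice performance with pandas and scikit-learn (my own toy example; the columns, values, and slice definitions are made up):

import pandas as pd
from sklearn.metrics import f1_score

# df holds per-example targets and predictions plus metadata we can slice on
df = pd.DataFrame({
    "text_len": [12, 250, 300, 8, 15],
    "y_true":   [0, 1, 1, 0, 1],
    "y_pred":   [0, 1, 0, 0, 0],
})

# define slices as boolean masks over the dataset
slices = {
    "short_text": df["text_len"] < 20,
    "long_text": df["text_len"] >= 20,
}

# compare overall vs per-slice performance to surface hidden failure modes
print("overall f1:", f1_score(df["y_true"], df["y_pred"], zero_division=0))
for name, mask in slices.items():
    subset = df[mask]
    print(name, f1_score(subset["y_true"], subset["y_pred"], zero_division=0))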

Serving

When choosing a framework we would want to consider:

  • is it Pythonic, or do we need to learn another framework to deploy our models
  • is it framework agnostic
  • is (auto)scaling easy
  • can we combine multiple models into the service
  • is it integrated with popular API frameworks like FastAPI
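As an illustration of the FastAPI-integration point, here is a minimal Ray Serve + FastAPI sketch (my own example, not the course's code; the model is stubbed out with a lambda):

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Predictor:
    def __init__(self):
        # load the model once per replica (stubbed out here)
        self.model = lambda text: {"label": "positive"}

    @app.post("/predict")
    async def predict(self, text: str):
        return self.model(text)

# deploy the service; Ray Serve handles replicas and (auto)scaling
serve.run(Predictor.bind())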

Logging

Logging is the process of tracking and recording key events that occur in our applications for the purpose of inspection, debugging, etc. Logs are a whole lot more powerful than print statements because they allow us to send specific pieces of information to specific locations with custom formatting, shared interfaces, etc. This makes logging a key component in being able to surface insightful information from the internal processes of our application.

There are 3 things to know:

  • Logger: emits the log messages from our application
  • Handler: sends log records to a specific location
  • Formatter: formats and styles the log records
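The three pieces fit together like this (a standard Python logging setup, not specific to the course):

import logging
import sys

# Logger: emits messages from our application
logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

# Handler: sends log records to a location (here, stdout)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)

# Formatter: controls how each record is rendered
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)

logger.info("training started")
logger.debug("this message is filtered out by the handler's INFO level")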

Documentation

Found this VS Code extension for automatic Python docstring generation. Nowadays Copilot might help with that as well, but its results might not always be consistent, whereas this extension produces a consistent, popular format.

image

Testing

image

The main types of tests are:

  • unit: tests on individual components that each have a single responsibility (ex. function that filters a list)
  • integration: tests on the combined functionality of individual components (ex. data processing)
  • system: tests on the design of a system for expected outputs given inputs (ex. training, inference, etc)
  • acceptance: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT)
  • regression: tests based on errors we’ve seen before to ensure new changes don’t reintroduce them

How should we test?

Using the Arrange Act Assert methodology (also mentioned in the Effective Machine Learning Teams book):

  • arrange: set up the different inputs to test on
  • act: apply the inputs on the component we want to test
  • assert: confirm that we received the expected output
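A tiny pytest example of the pattern (my own toy example):

# test_preprocess.py -- run with `pytest`
def filter_short_items(items, min_length=3):
    return [item for item in items if len(item) >= min_length]

def test_filter_short_items():
    # arrange: set up the inputs
    items = ["a", "ab", "abc", "abcd"]

    # act: call the component under test
    result = filter_short_items(items, min_length=3)

    # assert: confirm the expected output
    assert result == ["abc", "abcd"]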

What should we test?

  • inputs: data types, format, length, edge cases (min/max, small/large, etc.)
  • outputs: data types, formats, exceptions, intermediary and final outputs

Best practices

  • atomic: when creating functions and classes, we need to ensure that they have a single responsibility so that we can easily test them. If not, we’ll need to split them into more granular components
  • compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on
  • reuse: we should maintain central repositories where core functionality is tested at the source and reused across many projects. This significantly reduces testing efforts for each new project's code base
  • regression: we want to account for new errors we come across with a regression test so we can ensure we don’t reintroduce the same errors in the future
  • coverage: we want to ensure 100% coverage for our codebase. This doesn’t mean writing a test for every single line of code but rather accounting for every single line
  • automate: in the event we forget to run our tests before committing to a repository, we want to auto run tests when we make changes to our codebase. We’ll learn how to do this locally using pre-commit hooks and remotely via GitHub actions in subsequent lessons

Monitoring

System health

image

Performance

image

Drift

  • data drift

image

  • target drift
  • concept drift

image

Locating drift

The constraints we need to consider:

  • reference window: the set of points to compare production data distributions with to identify drift
  • test window: the set of points to compare with the reference window to determine if drift has occurred
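A minimal sketch of comparing a reference window with a test window for a single numeric feature, using a two-sample KS test from scipy (the data, window sizes, and threshold are arbitrary illustrations):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# reference window: e.g. the training data (or a recent "known good" slice)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)

# test window: incoming production data, here simulated with a shifted mean
test = rng.normal(loc=0.5, scale=1.0, size=500)

stat, p_value = ks_2samp(reference, test)
if p_value < 0.01:  # arbitrary significance threshold
    print(f"drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")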

Honestly, this is such a great resource and definitely one that I will refer to in the future when doing a project!


Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

I found this paper from the MadeWithML reference list and decided to have a look ~

The paper addresses the issue of ‘silent failures’ in ML systems.

The research uses a statistical two-sample testing framework to detect shifts, but since applying such tests directly to high-dimensional data is computationally expensive and has low statistical power, the authors first apply various dimensionality reduction techniques.

image

  • dimensionality reduction: transforming both source (training) and target (new) data into lower-dimensional representations using various methods
  • two-sample testing: applying appropriate statistical hypothesis tests to compare the distributions of the source and target data in the reduced space

Dimensionality Reduction Techniques

Different techniques are explored:

  • no reduction: a baseline using raw features
  • PCA: a linear method capturing maximum variance
  • Random Projections (SRP): projecting data onto random lower-dimensional subspaces
  • Autoencoders (TAE and UAE): trained (TAE) and untrained (UAE) autoencoders for non-linear representation learning
  • Label Classifiers (BBSDs and BBSDh): inspired by Black Box Shift Detection (BBSD), using softmax outputs (BBSDs) or hard predictions (BBSDh) of a pre-trained label classifier
  • Domain Classifier (Classif): explicitly training a classifier to distinguish between source and target data

Statistical Hypothesis Tests

Depending on the data type and dimensionality, different statistical tests are applied:

  • multivariate: Maximum Mean Discrepancy (MMD) for multi-dimensional representations
  • univariate: Kolmogorov-Smirnov (KS) test for continuous data and Chi-Squared test for discrete data, applied across individual dimensions with Bonferroni correction
  • Binomial test: assessing the accuracy of the domain classifier against random chance
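Putting the two steps together, here is a rough sketch of a PCA + univariate-KS pipeline (my own simplified version, not the paper's code; the paper's best detector, BBSDs, applies the same kind of univariate tests to a classifier's softmax outputs instead of PCA components):

import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
source = rng.normal(size=(2000, 784))        # "training" data
target = rng.normal(size=(2000, 784)) + 0.3  # "new" data with a simulated mean shift

# 1) dimensionality reduction: fit on source, apply to both
pca = PCA(n_components=32).fit(source)
source_low, target_low = pca.transform(source), pca.transform(target)

# 2) two-sample testing: univariate KS test per dimension,
#    Bonferroni-corrected by dividing alpha by the number of dimensions
alpha = 0.05 / source_low.shape[1]
p_values = np.array([
    ks_2samp(source_low[:, d], target_low[:, d]).pvalue
    for d in range(source_low.shape[1])
])
print("shift detected:", (p_values < alpha).any())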

Key Findings: Effectiveness of BBSD and Shift Characteristics

BBSD with Softmax Outputs (BBSDs): across various simulated dataset shifts, BBSDs consistently demonstrated superior performance in detecting shifts, even when the label shift assumption was not met.

We note that BBSDs being the best overall method for detecting shift is good news for ML practitioners. When building black-box models with the main purpose of classification, said model can be easily extended to also double as a shift detector.

Certain shifts like Gaussian noise, image transformations, and adversarial examples were easily detected, whereas others like class imbalance (knock-out) proved more challenging.

The ability to detect shifts increased with the intensity of the shift and the proportion of affected data points.

Creating a wordcloud

I decided to try to make a word cloud of my most used words throughout this blog. I set up the code using the WordCloud Python library, and here is what came out:
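The core of it is just a few lines (a simplified sketch; the actual text collection and stopword handling is omitted, and the text here is a dummy string):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# `text` would be all blog posts concatenated into one string
text = "mlops ray tune drift flink mlflow model data training " * 50

wc = WordCloud(width=1200, height=600, background_color="white",
               stopwords=STOPWORDS).generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("wordcloud.png", dpi=200)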

image

I will re-run the code on day 366 to get a final view.

This is the distribution of categories (tags from every post) as well:

image

The thing with these tag categories is that if, say, I studied something related to MLOps, then as part of that I might have used some model, which could also bring a tag like GNN or traditional-machine-learning (a tag signifying the type of model used) ~ so it's a bit tricky, and it makes me wonder whether it's worth including this picture (an updated version) in the final blog post…

Starting an Apache Flink course on Udemy

I found this course and, because of my geolocation, the price was ~7 GBP, so I decided to check it out. It's apparently the most comprehensive Flink course on Udemy.

It covers:

  • introduction to Flink
  • transformation operations of the DataSet API
  • DataStream API operations
  • windows in Flink
  • triggers & evictors
  • watermarks and late elements
  • state, checkpointing and fault tolerance

Left to cover:

  • interacting with real-time data using Kafka
  • case study 1 - Twitter data analysis using Flink
  • case study 2 - bank real-time fraud detection
  • case study 3 - stock data processing in real-time
  • Table & SQL API - relational APIs of Flink
  • Gelly API for graph processing

As the content is 100% behind a paywall, I cannot share anything ~ but I watched about half of it today. I think the free Flink/streaming courses on Confluent were a bit better, but I will see how this course goes until the end ~ The most exciting part is the case studies 😆 but I am watching everything to recap and double-check my knowledge


That is all for today!

See you tomorrow :)