Hello :) Today is Day 317!
A quick summary of today:
- read two more chapters of Chip Huyen’s book
- streamed for 2hrs
- read a bit of the sklearn docs
Designing ML systems - Chapter 8 - Data distribution shifts and monitoring
Causes of ML system failures
Software system failures
- dependency failure
- deployment failure
- hardware failures
- downtime or crashing
ML-specific failures
- production data differing from train data
- edge cases
- degenerate feedback loops
Data distribution shifts
Types
- covariate shift (when P(X) changes but P(Y|X) remains the same; this refers to the first decomposition of the joint distribution, P(X, Y) = P(Y|X)P(X))
- label shift (when P(Y) changes but P(X|Y) remains the same; this refers to the second decomposition, P(X, Y) = P(X|Y)P(Y))
- concept drift (when P(Y|X) changes but P(X) remains the same; this also refers to the first decomposition)
Detecting shifts involves monitoring accuracy metrics, but when ground truth labels are delayed, examining distributions like P(X), P(Y), P(X|Y), and P(Y|X) can help.
Industry tools often compare summary statistics like mean, variance, or skewness to detect shifts, but more sophisticated two-sample tests (e.g., the Kolmogorov–Smirnov test or maximum mean discrepancy, MMD) offer statistical confirmation of distribution differences. Temporal and spatial shifts add complexity, requiring time-series analysis techniques to detect changes across different time windows.
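To make the two-sample idea concrete, here's a minimal sketch (my own, not from the book) that compares one feature's training distribution against a recent production window with scipy's Kolmogorov–Smirnov test; the synthetic data and the 0.01 threshold are just placeholders.

```python
# Minimal covariate-shift check on a single feature with a two-sample KS test.
# The data is synthetic and the significance threshold is a judgment call.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # reference (training) window
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)   # current production window

statistic, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible covariate shift: KS={statistic:.3f}, p={p_value:.2e}")
else:
    print("no significant shift detected")
```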
To address shifts, companies may periodically retrain models, use extensive datasets to enhance robustness, or adapt models to new distributions. Feature selection should balance predictive power and stability, as frequently shifting features (like app ranking) can require frequent retraining. Tailoring models to specific markets can also minimize unnecessary updates. Robust systems should be designed for adaptability, and not all performance issues require ML fixes; human errors remain a common cause.
Monitoring and observability
ML-specific metrics
There are generally 4 artifacts to monitor: a model’s accuracy-related metrics, predictions, features, and raw inputs.
The more transformations an artifact has gone through, the more likely its changes are to be caused by errors in one of those transformations
To monitor ML model performance, tracking user feedback and predictions is crucial. User feedback like clicks, upvotes, and completion rates offers insights into model accuracy. For instance, in a recommendation system, a stable click-through rate but dropping completion rates may indicate performance issues. Feedback can also guide the labeling process, as seen with Google Translate’s upvote/downvote feature.
Prediction monitoring helps detect distribution shifts, signaling input data changes. Consistent abnormal predictions can indicate immediate issues, unlike slower metrics like accuracy.
Feature monitoring validates data consistency but has challenges, such as resource costs and ‘alert fatigue’ due to frequent, non-critical alerts. Versioning schemas helps differentiate between genuine data shifts and schema mismatches.
Finally, raw input monitoring—typically managed by data platforms—is essential to detect early-stage data issues but is often inaccessible to ML teams due to data processing structures. These metrics collectively provide a comprehensive view of ML model health.
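As a rough illustration of the summary-statistics style of feature monitoring (again my own sketch), this compares per-feature mean, spread, and null rate between a reference window and the latest window; the parquet paths and the 3-sigma drift rule are made-up placeholders.

```python
# Compare per-feature summary statistics between a reference window and the
# latest production window; file paths and the drift rule are placeholders.
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    stats = df.agg(["mean", "std", "min", "max"]).T
    return stats.assign(null_rate=df.isna().mean())

reference = pd.read_parquet("features_train.parquet")      # hypothetical path
current = pd.read_parquet("features_last_hour.parquet")    # hypothetical path

report = summarize(reference).join(summarize(current), lsuffix="_ref", rsuffix="_cur")
# flag features whose mean moved by more than 3 reference standard deviations
drifted = report[(report["mean_cur"] - report["mean_ref"]).abs() > 3 * report["std_ref"]]
print("possibly drifted features:", drifted.index.tolist())
```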
Monitoring toolbox
- logs
- dashboards
- alerts
Observability
Monitoring centers around metrics, and metrics are usually aggregated. Observability allows more fine-grained metrics, so that you can know not only when a model’s performance degrades but also for what types of inputs, for which subgroups of users, or over what period of time the model degrades
In ML, observability encompasses interpretability. Interpretability helps us understand how an ML model works, and observability helps us understand how the entire ML system, which includes the ML model, works. For example, when a model’s performance degrades over the last hour, being able to interpret which feature contributes the most to all the wrong predictions made over the last hour will help with figuring out what went wrong with the system and how to fix it.
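The stream notes below mention SHAP, so here's a hedged sketch of that last-hour question: assume `model` is an already-fitted tree model, `X_wrong` is a DataFrame of the past hour's mispredicted inputs, and the SHAP values come back as a samples × features array (true for regression and many binary tree models).

```python
# Which features dominate the recent wrong predictions? Attribute them with SHAP.
# `model` and `X_wrong` are assumed to exist; shape assumptions noted above.
import numpy as np
import shap

explainer = shap.Explainer(model)        # unified SHAP interface
explanation = explainer(X_wrong)         # SHAP values for the mispredicted slice
mean_abs = np.abs(explanation.values).mean(axis=0)
for i in np.argsort(mean_abs)[::-1][:5]:
    print(f"{X_wrong.columns[i]}: mean |SHAP| = {mean_abs[i]:.4f}")
```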
Chapter 9 - Continual learning and test in production
Continual Learning
The idea is for a model to update itself as new data comes in. However, if the model updated itself on every single new data point, it could suffer from catastrophic forgetting, and updating that often would also waste compute.
Companies that employ continual learning in production update their models in micro-batches, e.g., after every 512 or 1,024 examples (the exact number is task dependent). The updated model shouldn’t be deployed until it has been evaluated.
Stateless retraining vs. Stateful training
Continual learning isn’t about the retraining frequency, but the manner in which the model is retrained. Most companies do stateless retraining—the model is trained from scratch each time. Continual learning additionally allows stateful training—the model continues training on new data. Stateful training is also known as fine-tuning or incremental learning.
Stateful training sounds cool, but how does this work if I want to add a new feature or another layer to my model? To answer this, two types of model updates need to be differentiated:
- model iteration: a new feature is added to an existing model architecture or the model architecture is changed
- data iteration: the model architecture and features remain the same, but you refresh this model with new data
As of today, stateful training is mostly applied for data iteration, as changing your model architecture or adding a new feature still requires training the resulting model from scratch
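As a minimal sketch of stateful training for data iteration (my example, using an sklearn-style model), `SGDClassifier.partial_fit` keeps the learned weights and only updates them on each fresh micro-batch; the 1,024-example batch size and the synthetic data are placeholders.

```python
# Stateful training: keep the model's weights and continue fitting on new data.
# Synthetic micro-batches stand in for fresh production data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])                                # must be declared for partial_fit

for step in range(10):                                    # each loop = one fresh micro-batch
    X_new = rng.normal(size=(1024, 5))
    y_new = (X_new[:, 0] + 0.05 * step > 0).astype(int)   # slowly drifting labelling rule
    model.partial_fit(X_new, y_new, classes=classes)      # stateful update, no retrain from scratch

# stateless retraining would instead call model.fit() on the full history every time
```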
Why continual learning?
CL enables fast, frequent model updates, addressing several challenges in machine learning:
- adaptation to data shifts: when sudden changes occur, such as an unexpected surge in demand for ride-sharing, continual learning allows models to adjust quickly, improving user experience and preventing revenue loss
- handling rare events: CL helps adapt to infrequent but impactful events, like Black Friday, by learning from fresh data throughout the day, enhancing performance without requiring extensive historical data
- solving continuous cold start: unlike traditional cold start, CL allows models to adapt to both new users and changing user behaviors. For instance, if a user switches devices, the model can still make relevant recommendations, as seen in TikTok’s success with personalized content within minutes of app usage
CL expands on batch learning by offering real-time adaptability for various complex use cases. With advancing MLOps tools, it could soon be as accessible and practical as batch learning, making it a compelling choice for modern ML applications.
CL challenges
- fresh data access challenge
- evaluation challenge
- algorithm challenge (mainly affects matrix-based and tree-based models that need to be updated very fast, e.g., hourly)
4 stages of CL
- Manual, stateless retraining
- Automated retraining
- Automated, stateful training
- Continual learning
How often to update your models?
- experiment with different time windows: train models on data from different periods and test them on current data; comparing performance reveals how much fresher data improves results (see the sketch after this list)
- fine-grained experiments: adjust timeframes (months, weeks, days) to determine optimal update frequency. For instance, Facebook shifted ad CTR prediction retraining from weekly to daily due to a measurable 1% gain
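A sketch of that experiment under assumptions: `df` is a DataFrame with a date column, a label column, and the feature columns listed in `FEATURES`; the cutoff date, window lengths, and model choice are all arbitrary.

```python
# Train the same model on windows of different ages and score each on current data.
# `df`, the column names, and the cutoff date are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

FEATURES = ["f1", "f2", "f3"]                    # placeholder feature columns
CUTOFF = pd.Timestamp("2024-11-01")              # start of the "current" test period

test = df[df["date"] >= CUTOFF]
for months_back in (1, 3, 6):
    start = CUTOFF - pd.DateOffset(months=months_back)
    train = df[(df["date"] >= start) & (df["date"] < CUTOFF)]
    model = GradientBoostingClassifier().fit(train[FEATURES], train["label"])
    auc = roc_auc_score(test["label"], model.predict_proba(test[FEATURES])[:, 1])
    print(f"trained on the last {months_back} month(s): AUC = {auc:.3f}")
```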
Test in production
Shadow deployment
- deploy the candidate model in parallel with the existing model
- for each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user
- log the predictions from the new model for analysis purposes
Once the new model’s results have been found satisfactory, we can replace the existing model with it
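A toy sketch of the routing logic above; `existing_model`, `candidate_model`, and the logging setup are assumed to exist, and a real system would usually score the shadow asynchronously so it can't slow down serving.

```python
# Score every request with both models, serve only the existing model's answer,
# and log the candidate's answer for offline comparison.
import logging

shadow_log = logging.getLogger("shadow")

def handle_request(features):
    served = existing_model.predict([features])[0]      # the prediction the user sees
    try:
        shadowed = candidate_model.predict([features])[0]
        shadow_log.info("served=%s candidate=%s", served, shadowed)
    except Exception:
        shadow_log.exception("candidate failed")        # the shadow path must never break serving
    return served
```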
A/B testing
- deploy the candidate model alongside the existing model
- a percentage of traffic is routed to the new model for predictions; the rest is routed to the existing model. It’s common for both variants to serve prediction traffic at the same time. However, there are cases where one model’s predictions might affect another model’s predictions—e.g., in ride-sharing’s dynamic pricing, a model’s predicted prices might influence the number of available drivers and riders, which, in turn, influence the other model’s predictions. In those cases, you might have to run your variants alternately, e.g., serve model A one day and then serve model B the next day
- monitor and analyze the predictions and user feedback, if any, from both models to determine whether the difference in the two models’ performance is statistically significant
Randomization is essential: traffic must be evenly randomized between models to avoid biases like device differences (e.g., mobile vs. desktop) that can skew results.
Sufficient sample size is crucial for statistical confidence. The test’s outcome should be validated through hypothesis testing, typically with two-sample tests to determine statistical significance. However, even statistically significant results aren’t foolproof, as repeated tests may yield different results. Non-significant results can indicate minimal model differences, making either model acceptable. For more on A/B testing, multiple variant testing (A/B/C or A/B/C/D), and related statistics, readers are advised to explore additional literature.
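For the significance check, a common choice (my sketch, with made-up numbers) is a two-proportion z-test on a metric like click-through rate:

```python
# Two-proportion z-test for an A/B test on click-through rate; counts are made up.
from statsmodels.stats.proportion import proportions_ztest

clicks = [530, 584]              # conversions observed for model A and model B
impressions = [10_000, 10_000]   # traffic routed to each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("significant" if p_value < 0.05 else "not significant: either model is acceptable")
```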
Canary release
Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody
- deploy the candidate model alongside the existing model; the candidate is called the canary
- a portion of the traffic is routed to the candidate
- if its performance is satisfactory, increase the traffic to the candidate. If not, abort the canary and route all the traffic back to the existing model
- stop when either the canary serves all the traffic or when the canary is aborted
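A rough sketch of what that ramp-up loop could look like; `set_traffic_split` and `get_error_rates` are hypothetical hooks into the serving platform, and the step sizes, waiting time, and tolerance are arbitrary.

```python
# Ramp the canary's traffic share in steps; abort if it underperforms the existing model.
# set_traffic_split() and get_error_rates() are hypothetical platform functions.
import time

RAMP_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]     # fraction of traffic sent to the canary

for fraction in RAMP_STEPS:
    set_traffic_split(canary=fraction)
    time.sleep(3600)                             # let the canary serve for a while
    canary_err, existing_err = get_error_rates()
    if canary_err > existing_err * 1.05:         # tolerance is a judgment call
        set_traffic_split(canary=0.0)            # abort: route everything back
        break
else:
    print("canary now serves all traffic")
```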
Interleaving experiments
Imagine there are 2 recommender systems - A and B, and we want to evaluate which one is better. Each time, a model recommends 10 items users might like. With A/B testing, you’d divide your users into two groups: one group is exposed to A and the other group is exposed to B. Each user will be exposed to the recommendations made by one model.
Instead, the idea behind interleaving experiments is to expose the user to recommendations from both models and see which one they prefer.
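A simplified, team-draft-style sketch of how the two ranked lists could be merged while remembering which model owns each slot (a click at position i then credits `owners[i]`); this is my own toy version, not the exact algorithm from the book.

```python
# Merge two models' ranked lists into one feed, tracking which model owns each slot.
import random

def interleave(ranked_a, ranked_b, k=10):
    merged, owners, seen = [], [], set()
    pools = [(list(ranked_a), "A"), (list(ranked_b), "B")]
    random.shuffle(pools)                        # random first pick to reduce position bias
    while len(merged) < k and any(pool for pool, _ in pools):
        for pool, owner in pools:
            while pool:                          # skip items the other model already placed
                item = pool.pop(0)
                if item not in seen:
                    merged.append(item)
                    owners.append(owner)
                    seen.add(item)
                    break
            if len(merged) >= k:
                break
    return merged, owners

feed, owners = interleave(["x1", "x2", "x3"], ["x2", "x9", "x4"])
print(list(zip(feed, owners)))                   # a click at position i credits owners[i]
```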
Bandits
Bandit algorithms, originating from gambling, are used to balance exploration (testing new options) and exploitation (leveraging known best options). In model evaluation, a multi-armed bandit approach routes traffic based on models’ current performance, unlike stateless A/B testing, which simply divides traffic randomly. This approach requires online prediction capability, rapid feedback loops, and mechanisms to update performance based on feedback, making bandits data-efficient but complex to implement.
Basic bandit strategies include ε-greedy (favoring the top model most of the time), Thompson Sampling (selects based on probability of optimality), and Upper Confidence Bound (adds an ‘exploration bonus’ for uncertainty). Contextual bandits extend these principles, adjusting recommendations by learning from user feedback, which is challenging due to partial feedback and model dependencies.
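As a toy illustration of the epsilon-greedy strategy (my sketch, with simulated click feedback), a router keeps running reward averages per model and mostly sends traffic to the current best one:

```python
# Epsilon-greedy routing between two model variants with simulated click feedback.
import random

class EpsilonGreedyRouter:
    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in model_names}
        self.rewards = {m: 0.0 for m in model_names}

    def choose(self):
        if random.random() < self.epsilon:                    # explore occasionally
            return random.choice(list(self.counts))
        return max(self.counts, key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def update(self, model, reward):
        self.counts[model] += 1
        self.rewards[model] += reward

router = EpsilonGreedyRouter(["model_a", "model_b"])
for _ in range(1_000):
    chosen = router.choose()
    clicked = random.random() < (0.12 if chosen == "model_b" else 0.10)  # fake feedback
    router.update(chosen, 1.0 if clicked else 0.0)
print(router.counts)                                          # model_b should get most traffic
```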
Although powerful, bandit algorithms are rare outside big tech due to complexity. Good evaluation pipelines are essential to ensure reliable model deployment, reducing failure risks by defining tests, thresholds, and automation for continuous evaluation, similar to CI/CD in software development.
Stream
- first stream
- did some db setup for Zach Wilson’s data eng camp starting this Friday
- read SHAP’s intro docs page
- second stream
After stream I read:
- Pipelines and composite estimators
- Feature extraction
- Preprocessing data
- Imputation of missing values
- Random projection
That is all for today!
See you tomorrow :)