(Day 316) Reading more of the infamous Designing ML systems book

Ivan Ivanov · November 12, 2024

Hello :) Today is Day 316!

A quick summary of today:

  • read more of Chip Huyen’s book
  • streamed

Designing ML systems - chapter 5 - Feature engineering

Common feature engineering operations

  • handling missing values (MNAR/MAR/MCAR)
    • delete them
    • impute them
  • scaling
  • discretisation (binning continuous values into buckets; a quick sklearn sketch of a few of these operations follows this list)
  • encoding categorical features
  • feature crossing
  • discrete and continuous positional embeddings
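
To make a few of these concrete, here is a minimal scikit-learn sketch (the dataframe and column names are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

# toy data with missing values and a categorical column (hypothetical columns)
df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "income": [40_000, 55_000, None, 72_000],
    "city": ["sofia", "plovdiv", "sofia", "varna"],
})

# handling missing values: impute numeric columns with the median
num = SimpleImputer(strategy="median").fit_transform(df[["age", "income"]])

# scaling: zero mean, unit variance (in practice, fit on the train split only)
scaled = StandardScaler().fit_transform(num)

# discretisation: bin age into 3 equal-width buckets
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(num[:, [0]])

# encoding categorical features: one-hot
city_onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# feature crossing: combine the age bucket with the city into a single feature
age_bin = pd.Series(bins[:, 0].astype(int).astype(str), index=df.index)
df["age_bin_x_city"] = age_bin + "_" + df["city"]
```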

Data leakage

Common causes for data leakage

  • splitting time-correlated data randomly instead of by time
  • scaling before splitting (see the sketch after this list)
  • filling in missing data with stats from the test split
  • poor handling of data duplication before splitting
  • group leakage
  • leakage from data generation process
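
As a concrete example of the scaling-before-splitting cause, here's a small sketch (on synthetic data) of the right order of operations: split first, then fit the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = (X[:, 0] > 0.5).astype(int)

# Leaky version: the scaler sees the mean/std of the test rows too
# X_leaky = StandardScaler().fit_transform(X)

# Correct version: split first, fit on train only, reuse those statistics for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# For time-correlated data, don't use a random split at all:
# train on the earliest period and test on the most recent one.
```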

Engineering good features

  • the more features you have, the more opportunities there are for data leakage
  • too many features can cause overfitting
  • too many features can increase memory required to serve a model, which, in turn, might require you to use a more expensive machine/instance to serve your model
  • too many features can increase inference latency when doing online prediction, especially if these features need to be extracted from raw data
  • useless features become technical debt; whenever the data pipeline changes, all the affected features need to be adjusted accordingly

Feature importance

  • the model’s built-in feature importance method
  • SHAP (a small sketch follows this list)
  • InterpretML
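
A quick hedged sketch of the SHAP route, assuming the shap and xgboost packages are installed (the dataset is just a convenient sklearn toy):

```python
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

# the model's own (gain-based) importances
print(sorted(zip(model.feature_importances_, X.columns), reverse=True)[:5])

# SHAP: per-feature contribution to each individual prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global view built from per-sample attributions
```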

Feature generalisation

This ensures that an ML model can make reliable predictions on unseen data by using features that generalise well beyond the training set. Not all features have equal generalisation power. For instance, identifiers specific to each instance (like a comment ID) don’t generalise, while identifiers linked to broader attributes (like usernames) might.

Measuring feature generalisation relies on intuition, domain knowledge, and statistical understanding. Key factors include coverage—the percentage of data samples containing the feature—and value distribution. High coverage suggests better generalisability, while low coverage, especially if values are missing randomly, can reduce utility. However, if missingness is systematic (i.e. the feature only appears in cases with positive labels), even low-coverage features may add value.

Features should ideally have consistent coverage and distributions across both training and testing splits. Variances here might indicate that the splits don’t share the same distribution, potentially hinting at data leakage. Overlapping distributions also help, as seen in time-based features like HOUR_OF_THE_DAY for traffic models, which often cover the same range across training and test sets. More specific features, like IS_RUSH_HOUR, generalise better but may lose critical detail provided by broader features, highlighting a trade-off between generalisation and specificity.
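
A quick way to eyeball coverage per split (the toy dataframes below stand in for real train/test splits):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_df = pd.DataFrame({"hour_of_day": rng.integers(0, 24, 500),
                         "comment_id": np.arange(500),
                         "rating": rng.choice([1.0, np.nan], 500, p=[0.7, 0.3])})
test_df = pd.DataFrame({"hour_of_day": rng.integers(0, 24, 100),
                        "comment_id": np.arange(500, 600),
                        "rating": rng.choice([1.0, np.nan], 100, p=[0.4, 0.6])})

def coverage(df: pd.DataFrame) -> pd.Series:
    """Share of rows in which each feature is present (non-null)."""
    return df.notna().mean()

report = pd.DataFrame({"train": coverage(train_df), "test": coverage(test_df)})
report["gap"] = (report["train"] - report["test"]).abs()
print(report.sort_values("gap", ascending=False))  # big gaps hint the splits differ
```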

Chapter 6 - Model development and offline evaluation

Model development and training

6 tips for model selection

  • avoid the state-of-the-art trap
  • start with the simplest models
  • avoid human biases in selecting models
  • evaluate good performance now vs. good performance later
  • evaluate trade-offs
  • understand our model’s assumptions (prediction assumption, IID, smoothness, tractability, boundaries, conditional independence, normally distributed data)

Ensembles

Ensembles are a popular solution in competitions like Kaggle and provide competitive performance. They are less favoured in production, though, because they are more complex to deploy and harder to maintain. Still, they remain common for tasks where a small performance boost can lead to a huge financial gain, such as predicting click-through rate for ads.

Ensembles consist of methods such as boosting, bagging and stacking, shown in the pictures below, respectively.

[images: boosting, bagging, and stacking diagrams]
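
A compact scikit-learn sketch of the three flavours on a synthetic dataset (the estimator choices here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier, StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)      # bootstrap samples + aggregate
boosting = GradientBoostingClassifier(random_state=0)             # sequential error-correcting learners
stacking = StackingClassifier(                                    # meta-learner on top of base models
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=3).mean())
```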

Experiment tracking and Versioning

  • experiment tracking involves monitoring various aspects of the training process, such as loss curves, performance metrics, and system resource usage
  • versioning focuses on recording all details of each experiment, including code, data, and hyperparameters, to ensure reproducibility and facilitate comparisons (a minimal logging sketch follows this list)
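
As a minimal example of what tracking plus versioning can look like in code, here's a hedged MLflow-style sketch (assumes mlflow is installed; the loop is a stand-in for real training):

```python
import mlflow

params = {"lr": 3e-4, "batch_size": 64}           # hypothetical hyperparameters

with mlflow.start_run():
    mlflow.log_params(params)                      # versioning: record the exact config
    for epoch in range(5):
        train_loss = 1.0 / (epoch + 1)             # stand-in for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)  # tracking: loss curve
```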

Challenges of data versioning

  • handling large datasets can be challenging due to storage and processing limitations
  • defining and managing data differences adds complexity
  • legal concerns: issues such as GDPR introduce additional complications in data versioning, particularly concerning user data privacy

Debugging models

  • silent model failures: models may fail without clear indicators, making issues hard to detect
  • time-consuming validation: validating bug fixes can be a lengthy process
  • component interplay: the intricate interaction among different ML components adds complexity

Common causes of failures

  • theoretical constraints
  • implementation errors
  • poor hyperparameter choices
  • data-related issues
  • inadequate feature selection

Debugging techniques

  1. start with a simple model and gradually add complexity to isolate issues
  2. overfit a small data batch, which helps verify model functionality quickly (sketched below)
  3. set a random seed to ensure reproducibility for consistent results
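
The three tips combined, as a small PyTorch sketch (the model and the batch are toy placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                               # tip 3: fix the seed for reproducibility

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))   # tip 1: start simple
xb, yb = torch.randn(16, 10), torch.randint(0, 2, (16,))                # one tiny, fixed batch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):                            # tip 2: overfit this single batch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

print(loss.item())                                 # if this doesn't approach 0, something is broken
```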

Distributed training

  • data parallelism: splits the data across multiple machines for training; the book discusses synchronous and asynchronous stochastic gradient descent (SGD) and their respective challenges, along with batch size and learning rate trade-offs (a toy sketch of synchronous gradient averaging follows this list)
  • model parallelism: trains different model components on separate machines
    • pipeline parallelism: enhances parallel execution in model parallelism for efficiency
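
This isn't real multi-machine code, just a NumPy toy showing the arithmetic of synchronous data parallelism: each "worker" computes a gradient on its own shard, the gradients are averaged (the all-reduce step), and every worker applies the same update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                                        # shared model parameters
X, y = rng.normal(size=(1024, 5)), rng.normal(size=1024)

def grad(w, Xs, ys):
    """Gradient of mean squared error on one worker's shard."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

n_workers, lr = 4, 0.01
shards = np.array_split(np.arange(len(X)), n_workers)

for step in range(100):
    # synchronous SGD: every worker computes a gradient on its shard...
    grads = [grad(w, X[idx], y[idx]) for idx in shards]
    # ...the gradients are averaged (the "all-reduce"), and all workers apply the same update
    w -= lr * np.mean(grads, axis=0)
```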

AutoML

AutoML aims to automate portions of the machine learning workflow. Two forms are discussed:

  • soft AutoML: focuses on hyperparameter tuning (sketched below)
  • hard AutoML: encompasses architecture search and learned optimizers
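
Soft AutoML in its simplest form is just automated hyperparameter search; here's a scikit-learn sketch (the search space is chosen arbitrarily):

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 3e-1),
        "max_depth": randint(2, 6),
        "n_estimators": randint(50, 300),
    },
    n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```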

Four phases of ML model dev

  1. non-ML solutions: establish a baseline and gain insights into the problem
  2. simple ML models: initial exploration with simpler models
  3. optimization: enhance model performance
  4. complex models: consider complex models if the use case requires it

Model offline evaluation

To assess the effectiveness of ML models, companies need clear evaluation metrics. Without them, determining a model’s accuracy or comparing models becomes challenging, as seen in a case where a company couldn’t measure undetected intrusions by their ML system for drone surveillance. Although not having precise evaluation criteria doesn’t doom an ML project, it hinders finding the best solution and convincing stakeholders to adopt ML. Collaboration with business teams to develop relevant metrics can bridge this gap. Ideally, evaluation methods used in development (where ground truth labels are available) should align with those in production, though this alignment can be difficult due to the lack of labels in production.

Baselines

Evaluation metrics, by themselves, mean little. When evaluating a model, it’s essential to know the baseline it is being evaluated against. The exact baselines vary from one use case to another, but here are five baselines that might be useful across use cases:

  • random baseline (if our model were to predict at random)
  • simple heuristic
  • zero rule baseline (always predicting the most common class; see the sketch after this list)
  • human baseline
  • existing solutions
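
The random and zero rule baselines are one-liners with scikit-learn's DummyClassifier; on an imbalanced toy dataset the zero rule baseline already looks deceptively good, which is exactly why baselines matter:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

random_baseline = DummyClassifier(strategy="uniform", random_state=0)   # predict at random
zero_rule = DummyClassifier(strategy="most_frequent")                   # always the majority class

for name, model in [("random", random_baseline), ("zero rule", zero_rule)]:
    # the zero rule scores ~0.9 accuracy here, so a "90% accurate" model adds nothing
    print(name, cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```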

Evaluating methods

  • perturbation tests
  • invariance tests (change only sensitive information in the inputs, like race; the outputs should stay the same)
  • directional expectation tests (certain changes to the inputs should cause predictable changes in the outputs)
  • model calibration
  • confidence measurement
  • slice-based evaluation

Slice-based evaluation involves breaking down data into subsets (slices) and assessing model performance on each slice individually. This approach highlights performance disparities that may not appear in overall metrics, like accuracy or F1 score. For instance, a model might perform well overall but show significant bias against a minority subgroup if its accuracy is much lower on that slice.

Ignoring slice-based metrics can lead to biased models and overlook areas for improvement. Slice-based evaluation also helps in cases where specific subsets are more critical, such as paid users in churn prediction. Without slicing, important trends, as illustrated by Simpson’s paradox, can be concealed, where combined data shows one trend but individual groups reveal the opposite.
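
A tiny sketch of what slicing looks like in practice (the dataframe and the user_tier column are made up):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# hypothetical evaluation frame: true labels, model predictions, and a column to slice on
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "user_tier": ["paid", "free", "paid", "free", "free", "paid", "free", "paid"],
})

overall = accuracy_score(df["y_true"], df["y_pred"])
per_slice = df.groupby("user_tier").apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(overall)
print(per_slice)   # a large gap between slices can hide behind a good overall number
```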

To perform effective slice-based evaluation, three main approaches are recommended:

  • heuristics-based slicing uses domain knowledge to identify relevant data dimensions
  • error analysis reviews misclassified examples to find common patterns
  • automated slice finders use algorithms to identify and rank slice candidates

Chapter 7 - Model deployment and Prediction service

Production is a spectrum. For some teams, production means generating nice plots in notebooks to show to the business team. For other teams, production means keeping your models up and running for millions of users a day.

ML deployment myths

  • you only deploy one or two ML models at a time
  • if we don’t do anything, model performance remains the same
  • you won’t need to update your models as much
  • most MLEs don’t need to worry about scale

Batch vs Online prediction

  • batch prediction, which uses only batch features
  • online prediction that uses only batch features (e.g. precomputed embeddings)
  • online prediction that uses both batch features and streaming features; also known as streaming prediction

Imagine we work at DoorDash; we might need the following features to estimate the delivery time:

  • batch features - the mean prep time of a particular restaurant in the past
  • streaming features - in the last 10 minutes, how many other orders the restaurant has and how many delivery people are available

|  | Batch prediction (asynchronous) | Online prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodical, such as every four hours | As soon as requests come |
| Useful for | Processing accumulated data when you don’t need immediate results (such as recommender systems) | When predictions are needed as soon as a data sample is generated (such as fraud detection) |
| Optimized for | High throughput | Low latency |

Model compression

  • low-rank factorisation: replace high-dimensional tensors with lower-dimensional tensors
  • knowledge distillation: a small model (student) is trained to mimic a larger model or ensemble of models (teacher)
  • pruning parameters
  • quantization: reduces a model’s size by using fewer bits to represent its parameters (see the sketch after this list)
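
As a small example of the last item, here's a hedged PyTorch sketch of post-training dynamic quantization (the layer sizes are arbitrary):

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module, path: str = "tmp.pt") -> float:
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

print(size_mb(model), size_mb(quantized))   # the int8 copy should be roughly 4x smaller
```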

ML on the Cloud and on the Edge

As hardware becomes more powerful, ML models will move toward online prediction and onto edge devices.

Streamed

I will continue reading the SHAP and InterpretML docs in the next few streams because they seem interesting.


That is all for today!

See you tomorrow :)