(Day 316) Reading more of the infamous Designing ML systems book

Ivan Ivanov · November 12, 2024

Hello :) Today is Day 316!

A quick summary of today:

  • read more of Chip Huyen’s book
  • streamed

Designing ML systems - chapter 5 - Feature engineering

Common feature engineering operations

  • handling missing values (MNAR/MAR/MCAR)
    • delete them
    • impute them
  • scaling
  • discretisation (binning continuous values into buckets; a quick sklearn sketch of a few of these operations follows this list)
  • encoding categorical features
  • feature crossing
  • discrete and continuous positional embeddings
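
To make a few of these concrete, here is a minimal scikit-learn sketch (the dataframe and column names are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

# toy data with missing values and a categorical column (hypothetical columns)
df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "income": [40_000, 55_000, None, 72_000],
    "city": ["sofia", "plovdiv", "sofia", "varna"],
})

# handling missing values: impute numeric columns with the median
num = SimpleImputer(strategy="median").fit_transform(df[["age", "income"]])

# scaling: zero mean, unit variance (in practice, fit on the train split only)
scaled = StandardScaler().fit_transform(num)

# discretisation: bin age into 3 equal-width buckets
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(num[:, [0]])

# encoding categorical features: one-hot
city_onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# feature crossing: combine the age bucket with the city into a single feature
age_bin = pd.Series(bins[:, 0].astype(int).astype(str), index=df.index)
df["age_bin_x_city"] = age_bin + "_" + df["city"]
```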

Data leakage

Common causes for data leakage

  • splitting time-correlated data randomly instead of by time
  • scaling before splitting (see the sketch after this list)
  • filling in missing data with stats from the test split
  • poor handling of data duplication before splitting
  • group leakage
  • leakage from data generation process
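
As a concrete example of the scaling-before-splitting cause, here's a small sketch (on synthetic data) of the right order of operations: split first, then fit the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = (X[:, 0] > 0.5).astype(int)

# Leaky version: the scaler sees the mean/std of the test rows too
# X_leaky = StandardScaler().fit_transform(X)

# Correct version: split first, fit on train only, reuse those statistics for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# For time-correlated data, don't use a random split at all:
# train on the earliest period and test on the most recent one.
```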

Engineering good features

  • the more features you have, the more opportunities there are for data leakage
  • too many features can cause overfitting
  • too many features can increase memory required to serve a model, which, in turn, might require you to use a more expensive machine/instance to serve your model
  • too many features can increase inference latency when doing online prediction, especially if these features need to be extracted from raw data
  • useless features become technical debt; whenever the data pipeline changes, all the affected features need to be adjusted accordingly

Feature importance

  • the model’s built-in feature importance method
  • SHAP (a small sketch follows this list)
  • InterpretML
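
A quick hedged sketch of the SHAP route, assuming the shap and xgboost packages are installed (the dataset is just a convenient sklearn toy):

```python
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

# the model's own (gain-based) importances
print(sorted(zip(model.feature_importances_, X.columns), reverse=True)[:5])

# SHAP: per-feature contribution to each individual prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # global view built from per-sample attributions
```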

Feature generalisation

This ensures that an ML model can make reliable predictions on unseen data by using features that generalise well beyond the training set. Not all features have equal generalisation power. For instance, identifiers specific to each instance (like a comment ID) don’t generalise, while identifiers linked to broader attributes (like usernames) might.

Measuring feature generalisation relies on intuition, domain knowledge, and statistical understanding. Key factors include coverage—the percentage of data samples containing the feature—and value distribution. High coverage suggests better generalisability, while low coverage, especially if values are missing randomly, can reduce utility. However, if missingness is systematic (i.e. the feature only appears in cases with positive labels), even low-coverage features may add value.

Features should ideally have consistent coverage and distributions across both training and testing splits. Variances here might indicate that the splits don’t share the same distribution, potentially hinting at data leakage. Overlapping distributions also help, as seen in time-based features like HOUR_OF_THE_DAY for traffic models, which often cover the same range across training and test sets. More specific features, like IS_RUSH_HOUR, generalise better but may lose critical detail provided by broader features, highlighting a trade-off between generalisation and specificity.
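
A quick way to eyeball coverage per split (the toy dataframes below stand in for real train/test splits):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_df = pd.DataFrame({"hour_of_day": rng.integers(0, 24, 500),
                         "comment_id": np.arange(500),
                         "rating": rng.choice([1.0, np.nan], 500, p=[0.7, 0.3])})
test_df = pd.DataFrame({"hour_of_day": rng.integers(0, 24, 100),
                        "comment_id": np.arange(500, 600),
                        "rating": rng.choice([1.0, np.nan], 100, p=[0.4, 0.6])})

def coverage(df: pd.DataFrame) -> pd.Series:
    """Share of rows in which each feature is present (non-null)."""
    return df.notna().mean()

report = pd.DataFrame({"train": coverage(train_df), "test": coverage(test_df)})
report["gap"] = (report["train"] - report["test"]).abs()
print(report.sort_values("gap", ascending=False))  # big gaps hint the splits differ
```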

Chapter 6 - Model development and offline evaluation

Model development and training

6 tips for model selection

  • avoid the state-of-the-art trap
  • start with the simplest models
  • avoid human biases in selecting models
  • evaluate good performance now vs. good performance later
  • evaluate trade-offs
  • understand our model’s assumptions (prediction assumption, IID, smoothness, tractability, boundaries, conditional independence, normally distributed data)

Ensembles

Ensembles are a popular solution in competitions like Kaggle and provide competitive performance. They are less favoured in production, though, because they are more complex to deploy and harder to maintain. Still, they remain common for tasks where a small performance boost can lead to a huge financial gain, such as predicting click-through rate for ads.

Ensembles consist of methods such as boosting, bagging and stacking, shown in the pictures below, respectively.

[images: boosting, bagging, and stacking diagrams]
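
A compact scikit-learn sketch of the three flavours on a synthetic dataset (the estimator choices here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier, StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)      # bootstrap samples + aggregate
boosting = GradientBoostingClassifier(random_state=0)             # sequential error-correcting learners
stacking = StackingClassifier(                                    # meta-learner on top of base models
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=3).mean())
```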

Experiment tracking and Versioning

  • experiment tracking involves monitoring various aspects of the training process, such as loss curves, performance metrics, and system resource usage
  • versioning focuses on recording all details of each experiment, including code, data, and hyperparameters, to ensure reproducibility and facilitate comparisons (a minimal logging sketch follows this list)
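
As a minimal example of what tracking plus versioning can look like in code, here's a hedged MLflow-style sketch (assumes mlflow is installed; the loop is a stand-in for real training):

```python
import mlflow

params = {"lr": 3e-4, "batch_size": 64}           # hypothetical hyperparameters

with mlflow.start_run():
    mlflow.log_params(params)                      # versioning: record the exact config
    for epoch in range(5):
        train_loss = 1.0 / (epoch + 1)             # stand-in for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)  # tracking: loss curve
```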

Challenges of data versioning

  • handling large datasets can be challenging due to storage and processing limitations
  • defining and managing data differences adds complexity
  • legal concerns: issues such as GDPR introduce additional complications in data versioning, particularly concerning user data privacy

Debugging models

  • silent model failures: models may fail without clear indicators, making issues hard to detect
  • time-consuming validation: validating bug fixes can be a lengthy process
  • component interplay: the intricate interaction among different ML components adds complexity

Common causes of failures

  • theoretical constraints
  • implementation errors
  • poor hyperparameter choices
  • data-related issues
  • inadequate feature selection

Debugging techniques

  1. start with a simple model and gradually add complexity to isolate issues
  2. overfit a small data batch, which helps verify model functionality quickly (sketched below)
  3. set a random seed to ensure reproducibility for consistent results
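
The three tips combined, as a small PyTorch sketch (the model and the batch are toy placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                               # tip 3: fix the seed for reproducibility

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))   # tip 1: start simple
xb, yb = torch.randn(16, 10), torch.randint(0, 2, (16,))                # one tiny, fixed batch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):                            # tip 2: overfit this single batch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

print(loss.item())                                 # if this doesn't approach 0, something is broken
```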

Distributed training

  • data parallelism: splits the data across multiple machines for training; the book discusses synchronous and asynchronous stochastic gradient descent (SGD) and their respective challenges, along with batch size and learning rate trade-offs (a toy sketch of synchronous gradient averaging follows this list)
  • model parallelism: trains different model components on separate machines
    • pipeline parallelism: enhances parallel execution in model parallelism for efficiency
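
This isn't real multi-machine code, just a NumPy toy showing the arithmetic of synchronous data parallelism: each "worker" computes a gradient on its own shard, the gradients are averaged (the all-reduce step), and every worker applies the same update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                                        # shared model parameters
X, y = rng.normal(size=(1024, 5)), rng.normal(size=1024)

def grad(w, Xs, ys):
    """Gradient of mean squared error on one worker's shard."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

n_workers, lr = 4, 0.01
shards = np.array_split(np.arange(len(X)), n_workers)

for step in range(100):
    # synchronous SGD: every worker computes a gradient on its shard...
    grads = [grad(w, X[idx], y[idx]) for idx in shards]
    # ...the gradients are averaged (the "all-reduce"), and all workers apply the same update
    w -= lr * np.mean(grads, axis=0)
```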

AutoML

AutoML aims to automate portions of the machine learning workflow. Two forms are discussed:

  • soft AutoML: focuses on hyperparameter tuning (sketched below)
  • hard AutoML: encompasses architecture search and learned optimizers
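
Soft AutoML in its simplest form is just automated hyperparameter search; here's a scikit-learn sketch (the search space is chosen arbitrarily):

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 3e-1),
        "max_depth": randint(2, 6),
        "n_estimators": randint(50, 300),
    },
    n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```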

Four phases of ML model dev

  1. non-ML solutions: establish a baseline and gain insights into the problem
  2. simple ML models: initial exploration with simpler models
  3. optimization: enhance model performance
  4. complex models: consider complex models if the use case requires it

Model offline evaluation

To assess the effectiveness of ML models, companies need clear evaluation metrics. Without them, determining a model’s accuracy or comparing models becomes challenging, as seen in a case where a company couldn’t measure undetected intrusions by their ML system for drone surveillance. Although not having precise evaluation criteria doesn’t doom an ML project, it hinders finding the best solution and convincing stakeholders to adopt ML. Collaboration with business teams to develop relevant metrics can bridge this gap. Ideally, evaluation methods used in development (where ground truth labels are available) should align with those in production, though this alignment can be difficult due to the lack of labels in production.

Baselines

Evaluation metrics, by themselves, mean little. When evaluating a model, it’s essential to know the baseline it is being evaluated against. The exact baselines vary from one use case to another, but here are five baselines that might be useful across use cases:

  • random baseline (if our model were to predict at random)
  • simple heuristic
  • zero rule baseline (always predicting the most common class; see the sketch after this list)
  • human baseline
  • existing solutions
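
The random and zero rule baselines are one-liners with scikit-learn's DummyClassifier; on an imbalanced toy dataset the zero rule baseline already looks deceptively good, which is exactly why baselines matter:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

random_baseline = DummyClassifier(strategy="uniform", random_state=0)   # predict at random
zero_rule = DummyClassifier(strategy="most_frequent")                   # always the majority class

for name, model in [("random", random_baseline), ("zero rule", zero_rule)]:
    # the zero rule scores ~0.9 accuracy here, so a "90% accurate" model adds nothing
    print(name, cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```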

Evaluating methods

  • perturbation tests
  • invariance tests (change only sensitive information in the inputs, like race; the outputs should stay the same)
  • directional expectation tests (certain changes to the inputs should cause predictable changes in the outputs)
  • model calibration
  • confidence measurement
  • slice-based evaluation

Slice-based evaluation involves breaking down data into subsets (slices) and assessing model performance on each slice individually. This approach highlights performance disparities that may not appear in overall metrics, like accuracy or F1 score. For instance, a model might perform well overall but show significant bias against a minority subgroup if its accuracy is much lower on that slice.

Ignoring slice-based metrics can lead to biased models and overlook areas for improvement. Slice-based evaluation also helps in cases where specific subsets are more critical, such as paid users in churn prediction. Without slicing, important trends, as illustrated by Simpson’s paradox, can be concealed, where combined data shows one trend but individual groups reveal the opposite.
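
A tiny sketch of what slicing looks like in practice (the dataframe and the user_tier column are made up):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# hypothetical evaluation frame: true labels, model predictions, and a column to slice on
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "user_tier": ["paid", "free", "paid", "free", "free", "paid", "free", "paid"],
})

overall = accuracy_score(df["y_true"], df["y_pred"])
per_slice = df.groupby("user_tier").apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(overall)
print(per_slice)   # a large gap between slices can hide behind a good overall number
```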

To perform effective slice-based evaluation, three main approaches are recommended:

  • heuristics-based slicing uses domain knowledge to identify relevant data dimensions
  • error analysis reviews misclassified examples to find common patterns
  • automated slice finders use algorithms to identify and rank slice candidates

Chapter 7 - Model deployment and Prediction service

Production is a spectrum. For some teams, production means generating nice plots in notebooks to show to the business team. For other teams, production means keeping your models up and running for millions of users a day.

ML deployment myths

  • you only deploy one or two ML models at a time
  • if we don’t do anything, model performance remains the same
  • you won’t need to update your models as much
  • most MLEs don’t need to worry about scale

Batch vs Online prediction

  • batch prediction, which uses only batch features
  • online prediction that uses only batch features (e.g. precomputed embeddings)
  • online prediction that uses both batch features and streaming features; also known as streaming prediction

Imagine we work at DoorDash; we might need the following features to estimate the delivery time:

  • batch features - the mean prep time of a particular restaurant in the past
  • streaming features - in the last 10 minutes, how many other orders the restaurant has and how many delivery people are available

|  | Batch prediction (asynchronous) | Online prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodical, such as every four hours | As soon as requests come |
| Useful for | Processing accumulated data when you don’t need immediate results (such as recommender systems) | When predictions are needed as soon as a data sample is generated (such as fraud detection) |
| Optimized for | High throughput | Low latency |

Model compression

  • low-rank factorisation: replace high-dimensional tensors with lower-dimensional tensors
  • knowledge distillation: a small model (student) is trained to mimic a larger model or ensemble of models (teacher)
  • pruning parameters
  • quantization: reduces a model’s size by using fewer bits to represent its parameters (see the sketch after this list)
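
As a small example of the last item, here's a hedged PyTorch sketch of post-training dynamic quantization (the layer sizes are arbitrary):

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module, path: str = "tmp.pt") -> float:
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

print(size_mb(model), size_mb(quantized))   # the int8 copy should be roughly 4x smaller
```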

ML on the Cloud and on the Edge

As hardware becomes more powerful, ML models will move toward online prediction and onto edge devices.

Streamed

I will continue reading the SHAP and InterpretML docs in the next few streams because they seem interesting.


That is all for today!

See you tomorrow :)