Hello :) Today is Day 316!
A quick summary of today:
- read more of Chip Huyen’s book
- streamed
Designing ML systems - chapter 5 - Feature engineering
Common feature engineering operations
- handling missing values (MNAR/MAR/MCAR)
- delete them
- impute them
- scaling
- discretisation (binning)
- encoding categorical features
- feature crossing
- discrete and continuous positional embeddings
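Most of these operations have off-the-shelf implementations. Here is a minimal sketch with scikit-learn and pandas; the column names and toy data are made up for illustration.

```python
# A minimal sketch of several of the operations above with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "income": [40_000, 85_000, None, 52_000],
    "city": ["london", "paris", "london", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # scaling
])
binned = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("bin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),  # discretisation
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # encode categorical features
])

preprocess = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("bin", binned, ["age"]),
    ("cat", categorical, ["city"]),
])
X = preprocess.fit_transform(df)
print(X)
```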
Data leakage
Common causes for data leakage
- splitting time-correlated data randomly instead of by time
- scaling before splitting
- filling in missing data with stats from the test split
- poor handling of data duplication before splitting
- group leakage
- leakage from data generation process
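Two of these causes (random splits of time-correlated data and scaling before splitting) are easy to avoid in code. A minimal sketch on toy data:

```python
# A minimal sketch: split time-correlated data by time, then fit
# preprocessing on the train split only. The column names are made up.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "timestamp": pd.date_range("2022-01-01", periods=100, freq="D"),
    "feature_a": range(100),
    "label": [i % 2 for i in range(100)],
})

df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]   # last 20% of time as test

scaler = StandardScaler()
X_train = scaler.fit_transform(train[["feature_a"]])   # fit on train only
X_test = scaler.transform(test[["feature_a"]])         # reuse train statistics
```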
Engineering good features
- the more features you have, the more opportunities there are for data leakage
- too many features can cause overfitting
- too many features can increase memory required to serve a model, which, in turn, might require you to use a more expensive machine/instance to serve your model
- too many features can increase inference latency when doing online prediction, especially if we need to extract these features from raw data
- useless features become technical debt; whenever the data pipeline changes, all the affected features need to be adjusted accordingly
Feature importance
- the model’s built-in feature importance method
- SHAP
- InterpretML
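A minimal sketch of the SHAP option, assuming a tree-based scikit-learn model; the dataset and model here are just stand-ins.

```python
# A minimal sketch of SHAP-based feature importance for a tree model.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # fast, exact explainer for tree models
shap_values = explainer.shap_values(X)    # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)         # global importance / beeswarm plot
```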
Feature generalisation
This ensures that an ML model can make reliable predictions on unseen data by using features that generalise well beyond the training set. Not all features have equal generalisation power. For instance, identifiers specific to each instance (like a comment ID) don’t generalise, while identifiers linked to broader attributes (like usernames) might.
Measuring feature generalisation relies on intuition, domain knowledge, and statistical understanding. Key factors include coverage—the percentage of data samples containing the feature—and value distribution. High coverage suggests better generalisability, while low coverage, especially if values are missing randomly, can reduce utility. However, if missingness is systematic (i.e. the feature only appears in cases with positive labels), even low-coverage features may add value.
Features should ideally have consistent coverage and distributions across both training and testing splits. Variances here might indicate that the splits don’t share the same distribution, potentially hinting at data leakage. Overlapping distributions also help, as seen in time-based features like HOUR_OF_THE_DAY for traffic models, which often cover the same range across training and test sets. Coarser features, like IS_RUSH_HOUR, generalise better than finer-grained ones like HOUR_OF_THE_DAY, but they lose the detail the finer features provide, highlighting a trade-off between generalisation and specificity.
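Coverage and cross-split distribution checks are straightforward to compute. A minimal sketch with pandas; the hour_of_day column and the toy splits are made up for illustration.

```python
# A minimal sketch of feature coverage and cross-split distribution checks.
import pandas as pd

train = pd.DataFrame({"hour_of_day": [7, 8, None, 17, 18, 9]})
test = pd.DataFrame({"hour_of_day": [8, None, 16, 18]})

def coverage(df: pd.DataFrame, col: str) -> float:
    """Fraction of rows where the feature is present (not null)."""
    return df[col].notna().mean()

print("train coverage:", coverage(train, "hour_of_day"))
print("test coverage:", coverage(test, "hour_of_day"))

# Large gaps between these summaries can hint at distribution shift or
# at a problem with how the splits were made.
print(train["hour_of_day"].describe())
print(test["hour_of_day"].describe())
```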
Chapter 6 - Model development and offline evaluation
Model development and training
6 tips for model selection
- avoid the state-of-the-art trap
- start with the simplest models
- avoid human biases in selecting models
- evaluate good performance now vs. good performance later
- evaluate trade-offs
- understand your model’s assumptions (prediction assumption, IID, smoothness, tractability, boundaries, conditional independence, normally distributed)
Ensembles
Ensembles are a popular solution in competitions like Kaggle and provide competitive performance. However, they are less favoured in production because they are more complex to deploy and harder to maintain. They are still common for tasks where a small performance boost can lead to a huge financial gain, such as predicting click-through rate for ads.
Ensembles are built with methods such as boosting, bagging, and stacking, sketched in code below.
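A minimal sketch of the three flavours with scikit-learn’s built-in estimators; the base models and data are arbitrary choices.

```python
# A minimal sketch of boosting, bagging, and stacking with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Boosting: fit weak learners sequentially, each correcting its predecessor.
boosting = GradientBoostingClassifier().fit(X, y)

# Bagging: fit the same base learner (a decision tree by default) on
# bootstrap samples and average the predictions.
bagging = BaggingClassifier(n_estimators=50).fit(X, y)

# Stacking: a meta-learner combines the predictions of several base models.
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),
).fit(X, y)
```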
Experiment tracking and Versioning
- experiment tracking involves monitoring various aspects of the training process, such as loss curves, performance metrics, and system resource usage
- versioning focuses on recording all details of each experiment, including code, data, and hyperparameters, to ensure reproducibility and facilitate comparisons
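A minimal sketch of experiment tracking with MLflow, which is just one of many tools; the run name, parameters, and metric values below are made up for illustration.

```python
# A minimal sketch of logging an experiment run with MLflow.
import mlflow

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_param("learning_rate", 0.01)     # hyperparameters for reproducibility
    mlflow.log_param("batch_size", 32)
    for epoch in range(3):
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)  # loss curve
    mlflow.log_metric("val_accuracy", 0.87)     # final performance metric
```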
Challenges of data versioning
- handling large datasets can be challenging due to storage and processing limitations
- defining and managing data differences adds complexity
- legal concerns: issues such as GDPR introduce additional complications in data versioning, particularly concerning user data privacy
Debugging models
- silent model failures: models may fail without clear indicators, making issues hard to detect
- time-consuming validation: validating bug fixes can be a lengthy process
- component interplay: the intricate interaction among different ML components adds complexity
Common causes of failures
- theoretical constraints
- implementation errors
- poor hyperparameter choices
- data-related issues
- inadequate feature selection
Debugging techniques
- start with a simple model and gradually add complexity to isolate issues
- overfit a small data batch which helps verify model functionality quickly
- set a random seed to ensure reproducibility for consistent results
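The "overfit a small data batch" check is quick to script. A minimal sketch in PyTorch, with an arbitrary toy model and fake data:

```python
# A minimal sketch of overfitting a single small batch in PyTorch.
# If the loss cannot be driven towards zero on a handful of samples,
# something in the model, loss, or training loop is likely broken.
import torch
import torch.nn as nn

torch.manual_seed(0)                      # fixed seed for reproducibility

x = torch.randn(8, 16)                    # one tiny batch of fake data
y = torch.randint(0, 2, (8,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                   # repeatedly fit the same batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss on the small batch: {loss.item():.4f}")  # should be ~0
```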
Distributed training
- data parallelism: splits data across multiple machines for training, discussing synchronous and asynchronous stochastic gradient descent (SGD) and their respective challenges, along with batch size and learning rate trade-offs
- model parallelism: trains different model components on separate machines
- pipeline parallelism: enhances parallel execution in model parallelism for efficiency
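A minimal sketch of the data parallelism flavour in PyTorch; `nn.DataParallel` is the simplest single-process option, while `DistributedDataParallel` is the recommended (but more involved) approach for serious multi-GPU or multi-node training.

```python
# A minimal sketch of single-process data parallelism in PyTorch.
# nn.DataParallel splits each batch across the visible GPUs and gathers
# the outputs; the model itself is an arbitrary example.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # replicate the model on each GPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```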
AutoML
AutoML aims to automate portions of the machine learning workflow. Two forms are discussed:
- soft AutoML: focuses on hyperparameter tuning
- hard AutoML: encompasses architecture search and learned optimizers
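The soft flavour is the easiest to try. A minimal sketch of automated hyperparameter search with scikit-learn; the model and search space are arbitrary examples.

```python
# A minimal sketch of "soft AutoML": automated hyperparameter search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,       # sample 10 configurations from the space above
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```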
Four phases of ML model dev
- non-ML solutions: establish a baseline and gain insights into the problem
- simple ML models: initial exploration with simpler models
- optimization: enhance model performance
- complex models: consider complex models if the use case requires it
Model offline evaluation
To assess the effectiveness of ML models, companies need clear evaluation metrics. Without them, determining a model’s accuracy or comparing models becomes challenging, as seen in a case where a company couldn’t measure undetected intrusions by their ML system for drone surveillance. Although not having precise evaluation criteria doesn’t doom an ML project, it hinders finding the best solution and convincing stakeholders to adopt ML. Collaboration with business teams to develop relevant metrics can bridge this gap. Ideally, evaluation methods used in development (where ground truth labels are available) should align with those in production, though this alignment can be difficult due to the lack of labels in production.
Baselines
Evaluation metrics, by themselves, mean little. When evaluating a model, it’s essential to know the baseline it’s being evaluated against. The exact baselines vary from one use case to another, but here are five baselines that might be useful across use cases:
- random baseline (if our model were to predict at random)
- simple heuristic
- zero rule baseline (predicting the most common class)
- human baseline
- existing solutions
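The random and zero rule baselines are one-liners with scikit-learn’s `DummyClassifier`. A minimal sketch on a made-up imbalanced dataset:

```python
# A minimal sketch of the random baseline and the zero rule baseline.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

random_baseline = DummyClassifier(strategy="uniform", random_state=0)  # predict at random
zero_rule = DummyClassifier(strategy="most_frequent")                  # predict the majority class

for name, clf in [("random", random_baseline), ("zero rule", zero_rule)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```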
Evaluation methods
- perturbation tests
- invariance tests (i.e. keep the inputs the same but change the sensitive information, like race)
- directional expectation tests (certain changes to the inputs should cause predictable changes in the outputs)
- model calibration
- confidence measurement
- slice-based evaluation
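As an example of a directional expectation test, consider a housing price model: increasing the living area while holding everything else fixed should not decrease the predicted price. A minimal sketch on synthetic data where the "area" feature is constructed to have a positive effect:

```python
# A minimal sketch of a directional expectation test on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Feature 0 ("living area") is built to have a strong positive effect on the target.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

sample = X[:1].copy()
baseline_pred = model.predict(sample)[0]

perturbed = sample.copy()
perturbed[0, 0] += 2.0                     # increase the "living area" feature
new_pred = model.predict(perturbed)[0]

# Increasing the area should not decrease the predicted price.
assert new_pred >= baseline_pred, "directional expectation violated"
print(baseline_pred, "->", new_pred)
```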
Slice-based evaluation involves breaking down data into subsets (slices) and assessing model performance on each slice individually. This approach highlights performance disparities that may not appear in overall metrics, like accuracy or F1 score. For instance, a model might perform well overall but show significant bias against a minority subgroup if its accuracy is much lower on that slice.
Ignoring slice-based metrics can lead to biased models and overlook areas for improvement. Slice-based evaluation also helps in cases where specific subsets are more critical, such as paid users in churn prediction. Without slicing, important trends, as illustrated by Simpson’s paradox, can be concealed, where combined data shows one trend but individual groups reveal the opposite.
To perform effective slice-based evaluation, three main approaches are recommended:
- heuristics-based slicing uses domain knowledge to identify relevant data dimensions
- error analysis reviews misclassified examples to find common patterns
- automated slice finders use algorithms to identify and rank slice candidates
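A minimal sketch of slice-based evaluation with pandas, computing the same metric per slice; the slices, labels, and predictions below are made up for illustration.

```python
# A minimal sketch: evaluate the same metric per slice, not just overall.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "user_type": ["paid", "free", "paid", "free", "free", "paid"],
    "label":     [1, 0, 1, 1, 0, 0],
    "pred":      [1, 0, 0, 0, 0, 0],
})

print("overall accuracy:", accuracy_score(results["label"], results["pred"]))
for slice_name, group in results.groupby("user_type"):
    print(slice_name, "accuracy:", accuracy_score(group["label"], group["pred"]))
```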
Chapter 7 - Model deployment and Prediction service
Production is a spectrum. For some teams, production means generating nice plots in notebooks to show to the business team. For other teams, production means keeping your models up and running for millions of users a day.
ML deployment myths
- you only deploy one or two ML models at a time
- if we don’t do anything, model performance remains the same
- you won’t need to update your models as much
- most MLEs don’t need to worry about scale
Batch vs Online prediction
- batch prediction, which uses only batch features
- online prediction that uses only batch features (i.e. precomputed embeddings)
- online prediction that uses both batch features and streaming features; also known as streaming prediction
Imagine we work at DoorDash; we might need the following features to estimate the delivery time (see the sketch after the comparison table below):
- batch features - the mean prep time of a particular restaurant in the past
- streaming features - how many other orders the restaurant has received in the last 10 minutes, and how many delivery people are available
| | Batch prediction (asynchronous) | Online prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodical, such as every four hours | As soon as requests come |
| Useful for | Processing accumulated data when you don’t need immediate results (such as recommender systems) | When predictions are needed as soon as a data sample is generated (such as fraud detection) |
| Optimized for | High throughput | Low latency |
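A minimal sketch of how the DoorDash-style request above might assemble both feature types at prediction time; the in-memory "stores", feature names, and the commented-out model are hypothetical placeholders.

```python
# A minimal sketch: online prediction that combines precomputed batch
# features with streaming features computed at request time.
from datetime import datetime, timedelta

BATCH_FEATURES = {"restaurant_42": {"mean_prep_time_min": 18.5}}   # precomputed offline
RECENT_ORDERS = [datetime.now() - timedelta(minutes=m) for m in (2, 5, 8, 25)]


def get_features(restaurant_id: str) -> dict:
    batch = BATCH_FEATURES[restaurant_id]                      # batch feature lookup
    window_start = datetime.now() - timedelta(minutes=10)
    orders_last_10_min = sum(t >= window_start for t in RECENT_ORDERS)  # streaming feature
    return {**batch, "orders_last_10_min": orders_last_10_min}


features = get_features("restaurant_42")
# prediction = model.predict([list(features.values())])        # hypothetical model call
print(features)
```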
Model compression
- low-rank factorisation: replace high-dimensional tensors with lower-dimensional tensors
- knowledge distillation: a small model (student) is trained to mimic a larger model or ensemble of models (teacher)
- pruning: remove redundant nodes or set the least useful parameters to zero, making the model sparser
- quantization: reduces a model’s size by using fewer bits to represent its parameters
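A minimal sketch of the last technique, post-training dynamic quantization in PyTorch; the model here is an arbitrary stand-in.

```python
# A minimal sketch of dynamic quantization: weights of the listed layer
# types are stored as int8 and dequantized on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x).shape)   # same interface as the original, smaller weights
```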
ML on the Cloud and on the Edge
As hardware becomes more powerful and efficient, more ML workloads will move to online prediction and to the edge
Streamed
- tuning the decision threshold for class prediction
- validation curves
- shap package readme
- InterpretML’s homepage
I will continue reading the SHAP and InterpretML docs in the next few streams because they seem interesting.
That is all for today!
See you tomorrow :)