(Day 282) Chapter 5 from MLE by Andriy Burkov

Ivan Ivanov · October 9, 2024

Hello :) Today is Day 282!

A quick summary of today:

  • continued reading the MLE book while travelling

Chapter 5 - Supervised Model Training (part 1)

Model training (or modeling) is the fourth stage in the machine learning project life cycle

(image: the machine learning project life cycle diagram)

Model training is one of the most overrated activities in machine learning. On average, a machine learning engineer spends only 5 - 10% of their time on modeling, if at all. Successful data collection, preparation, and feature engineering are more important

In this first part, the topics covered are: learning preparation, choosing the learning algorithm, a shallow learning strategy, assessing model performance, bias-variance tradeoff, regularization, the concept of the machine learning pipeline, and hyperparameter tuning.

Before you start working on the model

Before working on the model, we should validate schema conformity, define an achievable level of performance, choose a performance metric, and make several other decisions.

Validate schema conformity

Ensure the data conforms to the schema, as defined by the schema file

Even if we initially prepared the data, it’s likely that the original and current datasets are no longer identical. Several factors could explain this difference, including:

  • an error in the method used to persist the data, whether to a hard drive or a database
  • an error in the method used to read the data from where it was stored
  • changes made by someone else to the data or schema without informing us

These schema errors must be detected, identified, and corrected, similar to how programming errors are handled. If necessary, the entire data collection and preparation pipeline should be rerun from scratch.
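
To make this concrete, here is a minimal sketch of a schema-conformity check in Python. The schema format below (a dict mapping column names to an expected dtype and allowed range) is my own assumption for illustration; the book does not prescribe a specific schema-file format.

```python
import pandas as pd

# Hypothetical schema: column name -> (expected dtype, allowed min, allowed max)
SCHEMA = {
    "age":    ("int64",   0,   120),
    "income": ("float64", 0.0, None),
}

def validate_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of human-readable schema violations (empty list = conforms)."""
    problems = []
    for col, (dtype, lo, hi) in schema.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected dtype {dtype}, got {df[col].dtype}")
        if lo is not None and (df[col] < lo).any():
            problems.append(f"{col}: values below allowed minimum {lo}")
        if hi is not None and (df[col] > hi).any():
            problems.append(f"{col}: values above allowed maximum {hi}")
    return problems

df = pd.DataFrame({"age": [25, 31], "income": [40000.0, 52000.0]})
print(validate_schema(df, SCHEMA))  # [] means the data conforms to the schema
```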

Define an achievable performance level

This is very important as it gives us an idea of when to stop trying to improve the model

  • if a human can label examples without too much effort, math, or complex logic derivations, then you can hope to achieve human-level performance with your model
  • if the information needed to make a labeling decision is fully contained in the features, you can expect to have near-zero error
  • if the input feature vector has a high number of signals (such as pixels in an image, or words in a document), you can expect to come close to near-zero error
  • if you have a computer program solving the same classification or regression problem, you can expect your model to perform at least as well. Often the machine learning model performance can improve as more labeled data comes in
  • if you observe a similar, but different system, you can expect to get a similar, but different machine learning model performance

Choose a performance metric

The metrics depend on the project. It is recommended to choose one, and only one, performance metric before we start working on the model. Then, compare different models and track the overall progress by using this one metric.

Choose the right baseline

A baseline is a model or an algorithm that provides a reference point for comparison.

Having a baseline gives an analyst confidence that the machine-learning-based solution works. If the value of the performance metric for the machine learning model is better than the value obtained using the baseline, then machine learning provides value.

A baseline doesn’t have to be the result of any learning algorithm. It can be a rule-based or heuristic algorithm, a simple statistic applied to the training data, or something else.

The two most commonly used baseline algorithms are:

  • random prediction: makes a prediction by randomly choosing a label from the collection of labels assigned to the training examples. For classification, this corresponds to randomly picking one of the classes in the problem; for regression, it means randomly selecting one of the unique target values in the training data
  • zero rule: predicts the most common class in classification or the average value in regression

Baselines can also be more complex, like linear models or simple neural networks, and can even involve human performance, such as using services like Amazon Mechanical Turk for predictions. Baselines help guide model development by providing a standard to measure against.
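
As a small sketch, scikit-learn's "dummy" estimators implement exactly these two baselines (the toy data below is made up for illustration):

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

X = [[0], [1], [2], [3], [4], [5]]          # toy features; a dummy model ignores them
y_class = [0, 0, 0, 0, 1, 1]                # imbalanced labels
y_reg = [1.0, 2.0, 2.0, 3.0, 10.0, 12.0]    # toy regression targets

# Random prediction: sample a label according to the training class frequencies
random_baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X, y_class)

# Zero rule: always predict the most common class / the average target value
zero_rule_clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
zero_rule_reg = DummyRegressor(strategy="mean").fit(X, y_reg)

print(zero_rule_clf.predict([[9]]))   # [0]  -- the majority class
print(zero_rule_reg.predict([[9]]))   # [5.] -- the mean of y_reg
```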

Split data into three sets

  1. training set
  2. validation set
  3. test set

Key considerations when splitting the data:

  • the validation and test sets should come from the same statistical distribution, resembling the data expected in production
  • distribution shifts can occur if training data differs from the validation/test sets. We have to be mindful of this when selecting data, as it can lead to overly optimistic performance estimates and result in suboptimal models being deployed
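
A minimal sketch of a three-way split with scikit-learn, assuming a simple 70/15/15 split on i.i.d. data (for time-series or grouped data the split has to respect that structure instead; the random data below is just a stand-in for the prepared dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared feature matrix and labels.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve off 70% for training, then split the remaining 30% in half.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```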

Preconditions for supervised learning

  1. You have a labeled dataset
  2. You have split the dataset into three subsets: training, validation, and test
  3. Examples in the validation and test sets are statistically similar
  4. You engineered features and filled missing values using only the training data
  5. You converted all examples into numerical feature vectors
  6. You have selected a performance metric that returns a single number
  7. You have a baseline

Representing labels for ML

  • multiclass classification: use one-hot encoding (OHE)
  • multilabel classification (e.g., in an image we identify both cats and dogs): use a bag-of-words-style binary vector (BOW) over the labels
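
As a quick sketch, scikit-learn's LabelBinarizer and MultiLabelBinarizer produce exactly these two representations (the label sets below are made up):

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multiclass: one label per example -> one-hot encoding
ohe = LabelBinarizer()
print(ohe.fit_transform(["cat", "dog", "bird"]))
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]]   (columns follow ohe.classes_: ['bird', 'cat', 'dog'])

# Multilabel: a set of labels per example -> bag-of-words-style binary vector
mlb = MultiLabelBinarizer()
print(mlb.fit_transform([{"cat", "dog"}, {"dog"}, set()]))
# [[1 1]
#  [0 1]
#  [0 0]]     (columns follow mlb.classes_: ['cat', 'dog'])
```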

Selecting the learning algorithm

Main properties of a learning algorithm

  • explainability
  • in-memory vs out-of-memory
  • number of features and examples
  • nonlinearity of the data
  • training speed
  • prediction speed

Algorithm spot-checking

Shortlisting candidate learning algorithms for a given problem is sometimes called algorithm spot-checking. For the most effective spot-checking, it is recommended to:

  • select algorithms based on different principles (sometimes called orthogonal), such as instance-based algorithms, kernel-based, shallow learning, deep learning, ensembles
  • try each algorithm with 3 - 5 different values of the most sensitive hyperparameters
  • use the same training/validation split for all experiments
  • if the learning algorithm is not deterministic (such as the learning algorithms for neural networks and random forests), run several experiments, and then average the results
  • once the project is over, note which algorithms performed the best, and use this information when working on a similar problem in the future

If we don’t know our problem well, it’s better to try as many orthogonal approaches as possible, rather than spending a lot of time on the single most promising one. It is generally a better idea to spend time experimenting with new algorithms and libraries than trying to squeeze the maximum out of the one we have the most experience with. If we don’t have time to carefully spot-check algorithms, one simple ‘hack’ is to find an efficient implementation of a learning algorithm, or a model, that most recent papers claim to beat on a problem similar to ours, and use it to solve our problem.
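
A minimal sketch of spot-checking a few orthogonal algorithm families on the same training/validation split (the particular algorithms and the random data here are placeholders for illustration, not the book's prescription):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression (linear)": LogisticRegression(max_iter=1000),
    "SVM (kernel-based)":           SVC(),
    "kNN (instance-based)":         KNeighborsClassifier(),
    "random forest (ensemble)":     RandomForestClassifier(random_state=0),
}

# Same training/validation split for every candidate, as recommended above.
for name, model in candidates.items():
    score = model.fit(X_tr, y_tr).score(X_val, y_val)
    print(f"{name}: validation accuracy = {score:.3f}")
```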

There is this nice ‘cheat-sheet’ that was also in ‘The 100-page ML book’

(image: algorithm selection cheat-sheet)

Building a pipeline

(image: machine learning pipeline diagram)

(It’s a relatively short part haha)
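
Since the book only sketches the idea here, below is a minimal example of what an end-to-end pipeline can look like in scikit-learn (the preprocessing steps and data are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# A pipeline chains the data transformations and the model into one object,
# so the same preprocessing is applied at training and prediction time.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
    ("model",  LogisticRegression(max_iter=1000)),
])

X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])
pipe.fit(X, y)
print(pipe.predict([[3.0, 4.0]]))
```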

Assessing model performance

Performance metrics for regression

  • mean squared error (MSE)
  • median absolute error (MAE)
  • almost correct predictions error rate (ACPER)
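
A quick sketch of these regression metrics. MSE and median absolute error come straight from scikit-learn; for ACPER I'm assuming the definition as the share of predictions that fall within a chosen percentage threshold of the true value (the threshold of 10% and the toy numbers are my own for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 195.0, 290.0, 480.0])

print(mean_squared_error(y_true, y_pred))
print(median_absolute_error(y_true, y_pred))

# ACPER, assuming it counts predictions that land within a chosen
# percentage threshold of the true value (here 10%).
threshold = 0.10
within = np.abs(y_pred - y_true) <= threshold * np.abs(y_true)
print(within.mean())  # 0.75 -- three of the four predictions are "almost correct"
```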

Performance metrics for classification

  • precision-recall
  • accuracy
  • cost-sensitive accuracy
  • area under the ROC curve (AUC)
  • F-score

A new one I learned is:

  • Cohen’s kappa statistic: it is a performance metric that applies to both multiclass and imbalanced learning problems. The advantage of this metric over accuracy is that Cohen’s kappa tells you how much better your classification model is performing, compared to a classifier that randomly guesses a class according to the frequency of each class

κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement (the proportion of examples on which the model and the true labels agree), and p_e is the expected agreement, i.e., the agreement we would expect by chance given the frequency of each class

The value of Cohen’s kappa is always less than or equal to 1. Values of 0 or less indicate that the model has a problem. While there is no universally accepted way to interpret the values of Cohen’s kappa, it’s usually considered that values between 0.61 and 0.80 indicate that the model is good, and values 0.81 or higher suggest that the model is very good.
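
scikit-learn implements this metric directly; a quick sketch on made-up, imbalanced toy labels shows why it is more informative than accuracy here:

```python
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Imbalanced toy labels: a model that mostly predicts the majority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))     # 0.9  -- looks strong on its own
print(cohen_kappa_score(y_true, y_pred))  # ~0.62 -- agreement beyond chance is more modest
```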

Performance metrics for ranking

Precision and recall can be naturally applied to the ranking problem. It’s convenient to think of these two metrics as measuring the quality of document search results. Precision is the proportion of relevant documents among all the returned documents. Recall is the proportion of all relevant documents (those that should have been returned) that the search engine actually returned.

The drawback of measuring the quality of ranking models with precision and recall is that these metrics treat all retrieved documents equally. When a human looks at search results, the few top-most results matter more than the results shown at the bottom of the list.

Discounted Cumulative Gain (DCG) is a metric that evaluates the relevance of documents in a ranked search result, with higher relevance scores discounted at lower positions.

  • Cumulative Gain (CG) sums the relevance of all results without considering position
  • DCG adjusts this by reducing the gain for results appearing lower in the ranking, assuming that highly relevant documents are more useful when they appear earlier

There is also the Normalized Discounted Cumulative Gain (nDCG) which is calculated by comparing the DCG of the actual result to the DCG of the ideal ranking (IDCG). This normalization makes nDCG a useful metric for comparing performance across different queries.

The metric is commonly used in search engine performance evaluation and model comparison, particularly in ranking tasks.
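
A small sketch computing DCG and nDCG by hand for one query, using the simple rel/log2(position+1) discount; the relevance scores are made up, and scikit-learn's ndcg_score offers the same calculation out of the box:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of the position."""
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(positions + 1))

# Graded relevance of the documents in the order the ranker returned them.
ranked_relevances = [3, 2, 0, 1]

# Ideal ordering: the same documents sorted by relevance, best first.
ideal_relevances = sorted(ranked_relevances, reverse=True)

ndcg = dcg(ranked_relevances) / dcg(ideal_relevances)
print(round(ndcg, 3))
```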

Hyperparameter tuning

  • grid search: the simplest hyperparameter tuning technique, used when the number of hyperparameters and the range of their values are not too large. The technique consists of discretizing the values of each hyperparameter and then evaluating every combination of the discrete values (see the sketch after this list)
  • random search: here, we provide a statistical distribution for each hyperparameter from which values are randomly sampled
  • coarse-to-fine search: analysts often use a combination of grid search and random search called coarse-to-fine search. This technique first uses a coarse random search to find the regions of high potential, and then a fine grid search in those regions to find the best hyperparameter values
  • other techniques: Bayesian techniques, which use past evaluation results to choose the next values to evaluate; gradient-based techniques; evolutionary optimization techniques; and other algorithmic hyperparameter tuning methods
  • cross-validation: used in combination with the techniques above when there isn’t enough data for a separate validation set
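
A minimal sketch of grid search and random search with scikit-learn; the model, hyperparameter ranges, and random data are placeholders. Both searchers use cross-validation internally, which is also what you would lean on when the dataset is too small for a fixed validation set:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X = np.random.rand(300, 5)
y = np.random.randint(0, 2, size=300)

# Grid search: evaluate every combination of the discretized hyperparameter values
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Random search: sample values from distributions for a fixed budget of trials
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0,
)
rand.fit(X, y)
print(rand.best_params_)
```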

Shallow model training

Shallow models make predictions based directly on the values in the input feature vector. Most popular machine learning algorithms produce shallow models. The only kind of deep model in common use is the deep neural network

  1. Define a performance metric P
  2. Shortlist learning algorithms
  3. Choose a hyperparameter tuning strategy T
  4. Pick a learning algorithm A
  5. Pick a combination H of hyperparameter values for algorithm A using strategy T
  6. Use the training set and train a model M using algorithm A parametrized with hyperparameter values H
  7. Use the validation set and calculate the value of metric P for model M.
  8. Decide:
    • if there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T and go back to step 6
    • otherwise, pick a different learning algorithm A and go back to step 5, or proceed to step 9 if there are no more learning algorithms to try
  9. Return the model for which the value of metric P is maximized
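
A compact sketch of this strategy as nested loops over algorithms and hyperparameter combinations, keeping the model with the best validation score; the algorithms, grids (a plain grid strategy T), and random data are illustrative only:

```python
import itertools
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(400, 8)
y = np.random.randint(0, 2, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Shortlisted algorithms A with small grids of hyperparameter values H (strategy T = grid)
search_space = {
    LogisticRegression:     {"C": [0.1, 1.0, 10.0], "max_iter": [1000]},
    RandomForestClassifier: {"n_estimators": [50, 200], "max_depth": [None, 5]},
}

best_model, best_score = None, -np.inf
for algo, grid in search_space.items():
    for values in itertools.product(*grid.values()):
        hyperparams = dict(zip(grid.keys(), values))
        model = algo(**hyperparams).fit(X_tr, y_tr)          # step 6: train M with A and H
        score = accuracy_score(y_val, model.predict(X_val))  # step 7: metric P on validation
        if score > best_score:
            best_model, best_score = model, score

print(best_model, best_score)  # step 9: the model that maximizes P
```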

Bias-Variance tradeoff

(figures from the book illustrating the bias-variance tradeoff)

Regularization

L1 (lasso)

C·|w| + (1/N)·Σ_i (f_{w,b}(x_i) − y_i)²   (minimized over w and b, with |w| = Σ_j |w_j|)

where C is a hyperparameter that controls the importance of regularization. If we set C to zero, the model becomes a standard non-regularized linear regression model. On the other hand, if we set C to a high value, the learning algorithm will try to set most w_j to values close or equal to zero in order to minimize the objective. The model becomes very simple, which can lead to underfitting

L2 (ridge)

C·||w||² + (1/N)·Σ_i (f_{w,b}(x_i) − y_i)²   (minimized over w and b, with ||w||² = Σ_j w_j²)

In practice, L1 regularization produces a sparse model, provided the value of the hyperparameter C is large enough. This is a model in which most parameters are exactly zero. So, as discussed in the previous chapter, L1 implicitly performs feature selection by deciding which features are essential for prediction and which are not. This property of L1 regularization is useful when we want to increase model explainability. However, if our goal is to maximize model performance on the holdout data, then L2 usually gives better results.

Elastic net

If we combine L1 and L2, we get a method called elastic net regularization.
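
A minimal sketch of the three regularized linear regressions in scikit-learn (note that scikit-learn names the regularization strength alpha rather than C; the synthetic data below has only two informative features, so the L1 sparsity effect is visible):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 features matter

lasso = Lasso(alpha=0.05).fit(X, y)                      # L1: drives many weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)                       # L2: shrinks weights but keeps them non-zero
enet  = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)   # combination of L1 and L2

print(np.sum(lasso.coef_ == 0), "of", len(lasso.coef_), "Lasso weights are exactly zero")
print(np.sum(ridge.coef_ == 0), "of", len(ridge.coef_), "Ridge weights are exactly zero")
```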

For neural networks, there are also dropout and batch normalization, as well as non-mathematical methods with a regularizing effect, like data augmentation and early stopping.


I could not stream yesterday and today as I was travelling. Today I also had time to read while riding on the bus/train. I am home now and tomorrow I will stream

That is all for today!

See you tomorrow :)