Hello :) Today is Day 283!
A quick summary of today:
- read ch. 6 and 7 of the MLE book by Andriy Burkov
- worked on the movie review LLM project and streamed
Chapter 6 - Supervised model training (part 2)
Deep model training strategy
We can train a model from scratch, or use a pre-trained model
For images: VGG16/19, InceptionV3, ResNet50
For text: BERT, ELMo
We can use pre-trained models in two ways:
1) use its learned parameters to initialize your own model
2) use the pre-trained model as a feature extractor for your model
Performance metric and cost function
Performance metrics depend on the project. Popular cost functions are MSE (for regression), and categorical or binary cross-entropy (for classification).
Parameter-initialisation strategies
- all to 1
- all to 0
- random normally distributed values
- random uniformly distributed values
- Xavier normal - params are initialised to values sampled from a truncated normal distribution, centred on 0, with standard deviation equal to sqrt(2/(in+out)), where ‘in’ is the number of units in the preceding layer to which the current unit is connected, and ‘out’ is the number of units in the subsequent layer to which the current unit is connected
- Xavier uniform - params are initialised to values sampled from a uniform distribution within [-limit, limit], where ‘limit’ is sqrt(6/(in+out))
There are other methods as well
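As a concrete illustration, here is a minimal numpy sketch of the two Xavier variants described above; the truncated-normal approximation and the function names are my own:

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng=None):
    # Truncated normal centred on 0 with std = sqrt(2 / (in + out));
    # values beyond two standard deviations are resampled.
    rng = np.random.default_rng(0) if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    w = rng.normal(0.0, std, size=(fan_in, fan_out))
    bad = np.abs(w) > 2 * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=bad.sum())
        bad = np.abs(w) > 2 * std
    return w

def xavier_uniform(fan_in, fan_out, rng=None):
    # Uniform within [-limit, limit], where limit = sqrt(6 / (in + out)).
    rng = np.random.default_rng(0) if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_normal(fan_in=256, fan_out=128)  # weights for a 256 -> 128 layer
```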
Optimisation algorithms
- gradient descent
- stochastic GD
- mini-batch SGD
Learning rate decay schedules
Learning rate decay consists of gradually reducing the value of the learning rate alpha as the epochs progress. Consequently, the parameter updates become finer
- time-based
- step-based
- exponential-based
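The three schedule families boil down to simple formulas; a minimal sketch, assuming an initial rate alpha0 and decay constants chosen by the analyst:

```python
import math

def time_based(alpha0, k, epoch):
    # alpha shrinks as 1 / (1 + k * epoch)
    return alpha0 / (1.0 + k * epoch)

def step_based(alpha0, drop, epochs_per_drop, epoch):
    # alpha is multiplied by `drop` every `epochs_per_drop` epochs
    return alpha0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential(alpha0, k, epoch):
    # alpha decays smoothly as exp(-k * epoch)
    return alpha0 * math.exp(-k * epoch)

for epoch in range(5):
    print(epoch, round(time_based(0.1, 0.5, epoch), 4), round(exponential(0.1, 0.5, epoch), 4))
```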
Regularisation
- L1 and L2 (not neural net specific)
- dropout
- early stopping
- batch-norm
Network size search and hyperparam tuning
Stacking models
In model stacking there is also the risk of data leakage.
To avoid data leakage, we can create a synthetic training set following a process similar to cross-validation. First, split all training data into ten or more blocks. The more blocks the better, but the process of training the model will be slower.
Temporarily exclude one block from the training data, and train the base models on the remaining blocks. Then apply the base models to the examples in the excluded block. Obtain the predictions, and build the synthetic training examples for the excluded block by using the predictions from the base models. Repeat the same process for each of the remaining blocks, and you will end up with the training set for the stacking model. The new synthetic training set will be of the same size as that of the original training set.
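A minimal scikit-learn sketch of that out-of-fold procedure, assuming two illustrative base classifiers and arrays X, y; the stacking model is then trained on the synthetic features:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def out_of_fold_features(base_models, X, y, n_blocks=10):
    # For each block, train the base models on the remaining blocks and predict
    # on the excluded block, so no base model scores an example it was trained on.
    synthetic = np.zeros((len(X), len(base_models)))
    for train_idx, hold_idx in KFold(n_splits=n_blocks, shuffle=True, random_state=0).split(X):
        for j, model in enumerate(base_models):
            model.fit(X[train_idx], y[train_idx])
            synthetic[hold_idx, j] = model.predict_proba(X[hold_idx])[:, 1]
    return synthetic

# Illustrative usage with arrays X of shape (n, d) and y of shape (n,):
# base = [LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)]
# Z = out_of_fold_features(base, X, y)      # same size as the original training set
# stacker = LogisticRegression().fit(Z, y)  # the stacking model
```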
Dealing with distribution shift
Types of shift
- covariate shift - shift in the values of features
- prior probability shift - shift in the values of the target
- concept drift - shift in the relationship between the features and the labels
We may know our data is affected by a distribution shift, but we don’t usually know what type of shift it is.
If we have a large test set compared to the train set, we can randomly pick examples from it and transfer them to the training data. However, often we have a very high number of training examples and relatively few test examples. In that case, a more effective approach is to use adversarial validation.
Adversarial validation
Adversarial validation is a technique used to distinguish between training and test sets. The assumption is that feature vectors in both training and test examples contain the same number of features and represent similar information. The procedure is as follows (a code sketch follows the list):
- Split the train set:
- divide the original training set into two subsets: Training Set 1 and Training Set 2
- Transform train set 1:
- add the original label of each example in Training Set 1 as an additional feature
- assign a new label “Training” to each example
- Transform the test set:
- add the original label of each example in the test set as an additional feature
- assign a new label “Test” to each example
- Create a synthetic train set:
- merge the Modified Training Set 1 and the Modified Test Set to form a new Synthetic Training Set
- this new set is used to solve a binary classification problem, distinguishing between “Training” and “Test” examples
- Train a binary classifier:
- train a binary classifier on the Synthetic Training Set to return a prediction score
- the classifier will predict whether an example is from the training set or test set
- Apply the binary classifier:
- use the binary classifier on Training Set 2 and identify the examples predicted as “Test” with high certainty. These examples serve as validation data for the original problem
- remove examples from Training Set 1 that the model predicts as “Training” with the highest certainty
- Final training:
- use the remaining examples in Training Set 1 as the training data for your original problem
- Experimentation:
- experiment to find the ideal way to split the original training set and how many examples to use for training and validation
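A minimal end-to-end sketch of the steps above, assuming numpy arrays for Training Set 1 (X1, y1), Training Set 2 (X2, y2) and the test set (Xt, yt); the names, the classifier choice, and the cut-offs are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def adversarial_validation(X1, y1, X2, y2, Xt, yt, n_val=1000, drop_frac=0.10):
    # Append the original label as an extra feature, then label the origin:
    # 0 = "Training", 1 = "Test".
    A = np.column_stack([X1, y1])
    B = np.column_stack([Xt, yt])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(np.vstack([A, B]), np.r_[np.zeros(len(A)), np.ones(len(B))])

    # Training Set 2 examples that look most like "Test" become the validation set.
    test_like = clf.predict_proba(np.column_stack([X2, y2]))[:, 1]
    val_idx = np.argsort(test_like)[-n_val:]

    # Drop the Training Set 1 examples predicted as "Training" with highest certainty.
    train_like = clf.predict_proba(A)[:, 0]
    keep_idx = np.argsort(train_like)[: int((1 - drop_frac) * len(A))]
    return val_idx, keep_idx
```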
Handling imbalanced datasets
- class weighting: some algorithms allow the data analyst to provide weights for each class (see the sketch after this list). The loss in the cost function is typically multiplied by the weight. The data analyst may, for example, provide greater weight to the minority class. This makes it harder for the learning algorithm to disregard examples of the minority class, because that would result in a much higher cost than without class weighting
- ensemble of resampled datasets: the analyst randomly splits the majority-class examples into H subsets and creates H training sets (each subset combined with all the minority-class examples). After training H models, the analyst makes predictions by averaging (or taking the majority vote of) the outputs of the H models
- other: if SGD is used, the class imbalance can be tackled in several ways. First, we can have different learning rates for different classes: a lower value for the examples of the majority class, and a higher value otherwise. Second, we can make several consecutive updates of the model parameters each time we encounter an example of a minority class
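A minimal sketch of class weighting with scikit-learn, assuming a binary problem where class 1 is the minority; the weights and arrays are illustrative:

```python
from sklearn.linear_model import LogisticRegression

# Explicit weights: errors on class 1 cost ten times more than errors on class 0.
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0}, max_iter=1000)

# Or let scikit-learn derive weights inversely proportional to class frequencies.
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)

# clf.fit(X, y); clf_balanced.fit(X, y)   # X, y are illustrative training arrays
```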
Model calibration
A model is considered well-calibrated when the predicted score for an input example corresponds to the true probability of that example belonging to a particular class. For instance, in a binary classifier, if the model predicts a score of 0.8, then approximately 80% of the examples with that score should belong to the positive class.
Most machine learning models are not well-calibrated. Calibration plots help visualize this. In binary classification, these plots show predicted scores on the X-axis (grouped in bins) and the fraction of positive examples in each bin on the Y-axis.
For multiclass classification, a one-versus-rest approach is used, where multiple binary classifiers are trained, each focusing on distinguishing one class from the rest. Calibration plots are generated for each class.
In a well-calibrated model, the calibration plot aligns closely with the diagonal line, meaning predicted probabilities match actual outcomes. Logistic regression is often well-calibrated, while models like support vector machines and random forests tend to show a sigmoid shape in their calibration plots, indicating poorer calibration.
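A minimal sketch of a calibration plot with scikit-learn, assuming a fitted binary classifier `model` with predict_proba and held-out arrays X_val, y_val (all illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

scores = model.predict_proba(X_val)[:, 1]
frac_pos, mean_score = calibration_curve(y_val, scores, n_bins=10)

plt.plot(mean_score, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfectly calibrated")
plt.xlabel("mean predicted score (per bin)")
plt.ylabel("fraction of positives")
plt.legend()
plt.show()
```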
Troubleshooting and error analysis
Reasons for poor model behaviour
If the model performs poorly on the train data, common reasons are:
- the model architecture or the learning algorithm is not expressive enough (try a more advanced learning algorithm, an ensemble method, or a deeper neural network)
- you regularize too much (reduce regularization)
- you have chosen suboptimal values for hyperparameters (tune hyperparameters)
- the features you engineered don’t have enough predictive power (add more informative features)
- you don’t have enough data for the model to generalize (try to get more data, use data augmentation, or transfer learning)
- you have a bug in your code (debug the code that defines and trains the model)
If the model performs well on the train data, but poorly on the holdout data, common reasons are:
- you don’t have enough data for generalisation
- the model is under-regularised
- the train data distribution is different from the holdout data distribution
- you have chosen suboptimal values for hyperparams
- your features have low predictive power
Iterative model refinement
- Train the model using the best values of hyperparameters identified so far
- Test the model by applying it to a small subset of the validation set (100−300 examples)
- Find the most frequent error patterns on that small validation set. Remove those examples from the validation set, because your model will now overfit to them
- Generate new features, or add more training data to fix the observed error patterns
- Repeat until no frequent error patterns are observed (most errors look dissimilar)
Iterative model refinement is a simplified version of error analysis
Error analysis
Errors can be:
- uniform, and appear with the same rate in all use cases, or
- focused, and appear more frequently in certain types of use cases
To fix an error pattern, we can use one or a combination of techniques:
- preprocessing the input
- data augmentation
- labeling more training examples
- engineering new features that would allow the learning algorithm to distinguish between ‘hard’ cases
Using sliced metrics
If the model will be applied to different segments of the use cases, it should be separately tested for each segment. For example, if you want to predict the solvency of borrowers, you would want your model to be equally accurate for both male and female borrowers. To achieve that, you can split your validation data into several subsets, one subset per segment. Then compute the performance metric by separately applying your model to each subset. Alternatively, you can separately evaluate the model on each class by applying precision and recall metrics.
Fixing wrong labels
When humans label the training examples, the assigned labels can be wrong
Here is a simple way to identify the examples that have wrong labels. Apply the model to the training data from which it was built, and analyze the examples for which it made a different prediction as compared to the labels provided by humans. If you see that some predictions are indeed correct, change those labels.
If time is limited, we can examine cases that were close to the decision boundary.
Troubleshooting DL
Common signs and the most likely causes of problems when trying to get a neural network model to overfit a single minibatch:
| Sign | Probable causes |
|---|---|
| Error goes up | Flipped the sign of the loss function or gradient; learning rate too high; softmax taken over wrong dimension |
| Error explodes | Numerical issue; learning rate too high |
| Error oscillates | Data or labels corrupted (e.g. zeroed or incorrectly shuffled); learning rate too high |
| Error plateaus | Learning rate too low; gradient not flowing through the whole model; too much regularization; incorrect input to loss function; data or labels corrupted |
Best practices
- deliver a good model
- trust popular open source implementations
- optimise a business-specific performance measure
- upgrade from scratch
- avoid correction cascades
- use model cascading with caution (using the output of one model as the input to another)
- write efficient code, compile, and parallelise
- test on both newer and older data
- more data beats cleverer algorithms
- new data beats cleverer features
- embrace tiny progress
- facilitate reproducibility
Chapter 7 - Model evaluation
Depending on the model’s applicative domain and the organization’s business goals and constraints, model evaluation may include the following tasks:
- Estimate legal risks of putting the model in production. For example, some model predictions may indirectly communicate confidential information. Cyber attackers or competitors may attempt to reverse engineer the model’s training data. Additionally, when used for prediction, some features, such as age, gender, or race, might result in the organization being considered biased or even discriminatory
- Study the main properties of the distributions of the training data versus the production data. Distribution shift is detected by comparing the statistical distributions of examples, features, and labels in the training and production data. A significant difference between the two indicates a need to update the training data and retrain the model
- Evaluate the performance of the model. Before the model is deployed in production, its predictive performance must be evaluated on the external data, that is, data not used for training. The external data must include both historical and online examples from the production environment. The context of evaluation on the real-time, online data should closely resemble the production environment
- Monitor the performance of the deployed model. The model’s performance may degrade over time. It is important to be able to detect this and, either upgrade the model by adding new data, or train an entirely different model. Model monitoring must be a carefully designed automated process, and might include a human in the loop
Offline and Online evaluation
An offline model evaluation happens when the model is being trained by the analyst. The analyst tries out different features, models, algorithms, and hyperparameters. Tools like confusion matrix and various performance metrics, such as precision, recall, and AUC, allow comparing candidate models, and guide the model training in the right direction.
Online model evaluation is testing and comparing models in production by using online data. The two types of evaluation differ both in the data they use and in where they sit within a machine learning system.
Why do we evaluate both offline and online? The offline model evaluation reflects how well the analyst succeeded in finding the right features, learning algorithm, model, and values of hyperparameters. In other words, the offline model evaluation reflects how good the model is from an engineering standpoint.
Online evaluation, on the other hand, focuses on measuring business outcomes, such as customer satisfaction, average online time, open rate, and click-through rate. This information may not be reflected in historical data, but it’s what the business really cares about. Furthermore, offline evaluation doesn’t allow us to test the model in some conditions that can be observed online, such as connection and data loss, and call delays.
The performance results obtained on the historical data will hold after deployment only if the distribution of the data remains the same over time. In practice, however, it’s not always the case. Typical examples of a distribution shift include the ever-changing interests of the user of a mobile or online application, instability of financial markets, climate change, or wear of a mechanical system whose properties the model is intended to predict. As a consequence, the model must be continuously monitored once deployed in production. When a distribution shift happens, the model must be updated with new data and re-deployed. One way of doing such monitoring is to compare the performance of the model on online and historical data. If the performance on online data becomes significantly worse, as compared to historical, it’s time to retrain the model.
A common scenario is to monitor user behavior in response to different versions of the model. One popular technique used in this scenario is A/B testing. We split the users of a system into two groups, A and B. The two groups are served the old and the new models, respectively. Then we apply a statistical significance test to decide whether the performance of the new model is better than the old one.
Multi-armed bandit (MAB) is another popular technique for online model evaluation. Similar to A/B testing, it identifies the best-performing model by exposing the candidate models to a fraction of users. It then gradually exposes the best model to more users, while continuing to gather performance statistics until they are reliable.
A/B testing
A/B testing is a widely used statistical technique for evaluating models in production, commonly applied to online systems like websites and mobile apps. It helps answer questions like, “Does the new model (mB) perform better than the existing model (mA)?” In A/B testing, live traffic is split into two groups: Group A (control) uses the old model, while Group B (experiment) uses the new model. Performance is compared through statistical hypothesis testing to determine if the new model improves business metrics, such as user engagement or sales.
The test uses a null hypothesis (no significant change) and an alternative hypothesis (significant change) to measure whether the new model has a statistically significant impact on a specific metric. Though A/B testing involves a variety of tests depending on the metric being measured, the core principle of splitting traffic and assessing statistical significance remains constant.
G-Test
The G-test is a statistical method used in A/B testing, particularly when evaluating metrics based on yes/no questions. This test is suitable for any scenario where only two possible answers exist, such as “Did the user buy the recommended product?” or “Did the user renew the subscription?”
In the G-test, users are split into two groups: A (control) and B (experiment). Group A uses the old model, while group B uses the new model. The actions of each user are recorded as either “yes” or “no,” and the data is placed into a contingency table. The G-test compares the observed results with the expected results assuming both models (A and B) are equivalent.
The formula for the G-test measures the difference between the two groups, and if the calculated G-value is high enough, it suggests that one model is likely performing better than the other. The significance of the difference is determined by the p-value. If the p-value is less than a threshold (e.g., 0.05), the null hypothesis (both models perform equally) is rejected, indicating a significant performance difference between the two models.
To calculate the p-value in Python, you can use scipy.stats.chi2; in R, the equivalent function is pchisq. The test is considered valid when there are enough “yes” and “no” results in both groups, typically at least 100 of each per group.
In cases where there are more than two models or multiple possible answers (e.g., “yes,” “no,” “maybe”), the G-test can still be applied, but the complexity increases, and interpreting the results becomes more challenging. For simplicity, it’s often recommended to compare just two models with binary outcomes.
If sample sizes are small, Monte Carlo simulations can be used as an alternative to approximate the p-value.
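A minimal sketch of the G-test with scipy, using illustrative counts; scipy exposes it through chi2_contingency with lambda_="log-likelihood":

```python
from scipy.stats import chi2_contingency

table = [[180, 820],   # group A (old model): 180 "yes", 820 "no"
         [230, 770]]   # group B (new model): 230 "yes", 770 "no"

# lambda_="log-likelihood" turns the chi-squared test into the G-test.
g, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
print(f"G = {g:.3f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the models likely perform differently.")
else:
    print("No significant evidence of a performance difference.")
```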
Z-Test
The second formulation of A/B testing uses the Z-test when the question is “How many?” or “How much?”—for example, how much time a user spent on a website or how much money a user spent in a month.
In this test, users are again split into two groups: A (control) using the old model, and B (experiment) using the new model. The goal is to determine if users in group B spend significantly more time on the website than those in group A. The null hypothesis assumes no difference between the two groups, while the alternative hypothesis assumes that users of version B spend more time than those of version A.
To calculate the Z-test value, you first compute the sample mean and sample variance for both groups. The Z-value indicates how likely it is that the difference between A and B is significant. If the calculated Z is large, it suggests a performance difference between the two models. Under the null hypothesis (if A and B are equivalent), Z follows a normal distribution.
To determine statistical significance, the p-value is computed, which represents the probability of observing a Z-value as extreme as the one calculated, assuming no difference between the models. For example, a Z-value of 2.64 corresponds to a one-tailed p-value of about 0.004, well below the usual 0.05 threshold, so such a difference would be very unlikely if the models were equivalent. If the p-value is less than the chosen significance level (commonly 0.05), the null hypothesis is rejected, indicating that the new model performs better.
However, if the p-value is greater than or equal to the significance level, the null hypothesis is not rejected, meaning there’s no evidence that the new model is better. It’s important to note that not rejecting the null hypothesis is not the same as proving the models are identical; it simply means no significant evidence was found. Continuing to gather evidence until the p-value drops below the threshold is not scientifically sound.
In practice, a significance level of 0.05 or 0.01 is commonly used, though the appropriate value may vary depending on the application. A lower threshold requires more evidence to reject the null hypothesis.
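A minimal two-sample Z-test sketch, assuming two illustrative arrays of per-user session times for group A (old model) and group B (new model):

```python
import numpy as np
from scipy.stats import norm

def z_test(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Difference of sample means divided by the standard error of the difference.
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (b.mean() - a.mean()) / se
    p_one_tailed = 1.0 - norm.cdf(z)   # H1: group B spends more time
    return z, p_one_tailed

rng = np.random.default_rng(0)
a = rng.normal(10.0, 3.0, size=5000)   # minutes on site, old model
b = rng.normal(10.2, 3.0, size=5000)   # minutes on site, new model
z, p = z_test(a, b)
print(f"Z = {z:.2f}, one-tailed p = {p:.4f}")
```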
Warnings
The two tests above are commonly used, but may not always be suitable for real-world business problems. Cassie Kozyrkov, Chief Decision Scientist at Google, points out that these tests often demonstrate that two models are different but fail to quantify the practical significance of that difference, i.e., whether the difference is “at least x.” This is crucial when the cost or risk of replacing an old model with a new one is high. In such cases, a more tailored test is needed, and consulting a statistician is recommended.
It’s also important to ensure the accuracy of the A/B test’s programming implementation. Mistakes in the code could invalidate the model evaluation, as broken tests don’t reveal their faults. Additionally, measurements for groups A and B should be taken simultaneously, as user behavior varies with time of day, day of the week, and other factors like country, internet speed, or browser version. This ensures the integrity of the experiment.
Multi-Armed Bandit (MAB)
Multi-armed bandit (MAB) is a more advanced and efficient method for online model evaluation compared to A/B testing. One major drawback of A/B testing is that it requires a large number of test results, meaning many users are exposed to suboptimal models for extended periods. MAB addresses this by minimizing the exposure to suboptimal models while still ensuring sufficient data is gathered to evaluate both models reliably.
This creates the exploration-exploitation dilemma: we need to explore enough to understand each model’s performance (exploration) while routing users to the better-performing model as often as possible (exploitation).
In probability theory, the multi-armed bandit problem involves allocating limited resources (users) between competing options (models or “arms”) to maximize the expected reward (a business performance metric like average session duration or purchase rate). The UCB1 (Upper Confidence Bound) algorithm solves this by dynamically choosing which arm to “play” based on past performance and how well the algorithm knows each arm’s performance.
The UCB1 algorithm:
- Tracks how often each model (arm) is used and the average reward.
- Plays the arm with the highest UCB value, a balance between its past reward and how much uncertainty remains about its performance.
The algorithm is mathematically proven to converge to the optimal solution, eventually favoring the best-performing model most of the time.
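A minimal UCB1 sketch, assuming each arm is a candidate model and that `reward(arm)` (illustrative) returns the observed business metric for one user routed to that model:

```python
import math
import random

def ucb1(n_arms, n_rounds, reward):
    plays = [0] * n_arms
    total = [0.0] * n_arms
    # Play every arm once so the averages are defined.
    for arm in range(n_arms):
        total[arm] += reward(arm)
        plays[arm] += 1
    for t in range(n_arms, n_rounds):
        # Pick the arm with the highest upper confidence bound: average reward
        # plus an exploration bonus that shrinks as the arm is played more.
        ucb = [total[a] / plays[a] + math.sqrt(2 * math.log(t) / plays[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: ucb[a])
        total[arm] += reward(arm)
        plays[arm] += 1
    return plays, [total[a] / plays[a] for a in range(n_arms)]

# Illustrative simulation: arm 1 has the higher true conversion rate.
plays, means = ucb1(2, 10_000, lambda a: float(random.random() < (0.10, 0.12)[a]))
print(plays, means)   # most plays should end up on arm 1
```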
Statistical bounds on the model performance
When reporting model performance, it is important to include not only the metric value but also the statistical bounds (or statistical interval). The terminology varies in different contexts, where “confidence interval” and “credible interval” are sometimes used interchangeably, but they have distinct meanings in frequentist and Bayesian statistics. For simplicity, in this context, a statistical interval can be understood as indicating that there is a specific probability (e.g., 95%) that the estimated parameter lies within the defined bounds.
Techniques for Establishing Statistical Bounds
- Statistical interval for classification error:
- for classification models, the error ratio (err) can be reported with a statistical interval. Given a test set of size N, a 99% statistical interval is [err - δ, err + δ], where δ is determined by the size of the test set and a value z_N that corresponds to the desired confidence level (e.g., 2.58 for 99%)
- Bootstrapping statistical interval:
- bootstrapping is a method used for reporting statistical intervals applicable to both classification and regression. This technique involves creating multiple random samples (with replacement) from the dataset, training a model, and computing the performance metric for each sample. By sorting the resulting metrics, analysts can establish a statistical interval based on the desired confidence level (e.g., 80%, 95%, or 99%)
- Bootstrapping prediction interval for regression:
- for regression models, bootstrapping can also be employed to compute prediction intervals for specific input feature vectors. Similar to the previous method, multiple regression models are built using bootstrap samples from the training set. Predictions for the input feature vector are obtained from these models, and a tight interval is established around these predictions to indicate where the predicted value is expected to lie with a specified confidence level.
In practice, confidence levels of 95% or 99% are commonly used, with the number of bootstrap samples typically set to 100 to ensure a reliable estimation of model performance.
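A minimal bootstrapping sketch for a statistical interval on a classifier’s accuracy, assuming illustrative numpy arrays and a scikit-learn estimator; this is one way to read the procedure above:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def bootstrap_interval(estimator, X_train, y_train, X_test, y_test,
                       n_samples=100, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        # Draw a bootstrap sample (with replacement), train a fresh model on it,
        # and record the performance metric on the held-out test set.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = clone(estimator).fit(X_train[idx], y_train[idx])
        scores.append(model.score(X_test, y_test))
    lo, hi = np.quantile(scores, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

# e.g. low, high = bootstrap_interval(LogisticRegression(max_iter=1000),
#                                     X_train, y_train, X_test, y_test)  # 95% interval
```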
Evaluation of test set adequacy
The test examples used to evaluate the model must also be designed in such a way that they allow the discovery of the model’s defective behavior before the model reaches production.
Neuron coverage
When we evaluate a neural network, especially one to be used in a mission-critical scenario, such as a self-driving car or a space rocket, our test set must have good coverage. Neuron coverage of a test set for a neural network model is defined as the ratio of the units (neurons) activated by the examples from the test set, to the total number of units. A good test set has close to 100% neuron coverage.
A technique for building such a test set is to start with a set of unlabeled examples, and all units of the model uncovered. Then, iteratively, we
- randomly pick an unlabeled example i and label it,
- send the feature vector xi to the input of the model,
- observe which units in the model were activated by xi
- if the prediction was correct, mark those units as covered,
- go back to the first step; continue iterating until the neuron coverage becomes close to 100%.
A unit is considered activated when its output is above a certain threshold. For ReLU, it’s usually zero; for a logistic sigmoid, it’s 0.5.
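A minimal sketch of measuring neuron coverage for a PyTorch classifier with forward hooks, assuming ReLU activations, one example per batch, integer labels, and illustrative names:

```python
import torch

def neuron_coverage(model, examples, threshold=0.0):
    covered, current, sizes = {}, {}, {}

    def hook(module, inputs, output):
        # Record which units of this activation layer fired above the threshold.
        current[id(module)] = output.detach().flatten() > threshold

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    with torch.no_grad():
        for x, y in examples:          # x has shape (1, ...), y is an integer label
            current.clear()
            pred = model(x).argmax(dim=-1)
            for key, mask in current.items():
                sizes[key] = mask.numel()
                if int(pred) == int(y):  # mark units as covered only on correct predictions
                    covered[key] = covered.get(key, torch.zeros_like(mask)) | mask
    for h in handles:
        h.remove()
    total = sum(sizes.values())
    active = sum(int(mask.sum()) for mask in covered.values())
    return active / max(total, 1)
```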
Mutation testing
In software engineering, mutation testing is done by making small but crucial changes to the tested code, like changing a + to a -, or removing an else branch.
In modelling, to create a mutant statistical model, instead of modifying the code, we modify the training data. If the model is deep, we can also randomly remove or add a layer, or remove or replace an activation function. The training data can be modified by:
- adding duplicated examples
- falsifying the labels of some examples
- removing some examples
- adding random noise to the values of some features
We say that we kill a mutant if at least one test example gets a wrong prediction by that mutant statistical model.
Evaluation of model properties
When we measure the quality of the model according to some performance metric, such as accuracy or AUC, we evaluate its correctness property. Besides this commonly evaluated property of the model, it can be appropriate to evaluate other properties of the model, such as robustness and fairness.
Robustness
This refers to the stability of a machine learning model’s performance when noise is added to the input data. A robust model’s performance degrades proportionally to the level of noise introduced.
To measure robustness, the following approach is used:
- modify input features randomly by replacing values with zeros, ensuring the Euclidean distance between the original and modified inputs is within a set threshold (δ)
- the model is considered ε-robust to a δ-perturbation if the difference in model predictions on the original and perturbed inputs remains within ε
To determine robustness (sketched in code after this list):
- perturb the test set using δ-perturbations
- gradually increase δ until the model’s prediction difference exceeds an acceptable threshold (ε̂). The model with the greatest δ before exceeding ε̂ is considered the most robust and suitable for production deployment
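A minimal sketch of that procedure, assuming a fitted model whose predict() returns a numeric score and an illustrative test matrix X_test:

```python
import numpy as np

def perturb(x, delta, rng):
    # Zero out randomly chosen features while keeping the Euclidean distance
    # between the original and perturbed vectors within delta.
    x_p = x.copy()
    for j in rng.permutation(len(x)):
        candidate = x_p.copy()
        candidate[j] = 0.0
        if np.linalg.norm(x - candidate) <= delta:
            x_p = candidate
    return x_p

def max_prediction_shift(model, X_test, delta, seed=0):
    rng = np.random.default_rng(seed)
    X_pert = np.array([perturb(x, delta, rng) for x in X_test])
    return np.max(np.abs(model.predict(X_test) - model.predict(X_pert)))

# Increase delta until max_prediction_shift(...) exceeds the acceptable ε̂;
# the model that tolerates the largest delta is the most robust.
```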
Fairness
Fairness in ML involves ensuring that models do not exhibit bias against protected attributes such as race, gender, or age. Biases can arise from human-collected data and can lead to unfair models.
Two key concepts of fairness are:
- Demographic Parity: ensures that the model gives positive predictions (e.g., university acceptance, loan approval) at equal rates across different groups. For example, the model should predict a positive outcome for women and men at the same rate. Mathematically, it requires the probability of a positive prediction to be the same for different groups based on a sensitive attribute
- Equal Opportunity: ensures that qualified individuals from different groups have the same chance of receiving a positive prediction. This means that, for those who qualify, the model should predict a positive outcome at the same rate, regardless of group membership. It focuses on equalizing the true positive rate (TPR) across different groups.
Despite excluding protected attributes from the feature set, models may still exhibit bias due to correlations with other features, making fairness a complex and domain-specific challenge.
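A minimal sketch of checking the two fairness notions above, assuming binary numpy arrays y_true, y_pred and a binary sensitive attribute `group` (all illustrative):

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # Difference in positive-prediction rates between the two groups.
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    # Difference in true positive rates, computed among truly positive examples.
    tpr0 = y_pred[(group == 0) & (y_true == 1)].mean()
    tpr1 = y_pred[(group == 1) & (y_true == 1)].mean()
    return abs(tpr0 - tpr1)

# Values close to 0 suggest the model satisfies the corresponding fairness notion.
```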
PEFT LLMs for movie review
Today I worked on the movie review project, and I talked about it more in-depth on stream.
A quick summary of evaluations:
Ratings:
rmse_llama3.1_8b 1.581458
rmse_llama3.2_3b 1.693664
Text:
llama_3.1_8b: {'rouge1': 0.7066951125849534, 'rouge2': 0.3931419071408453, 'rougeL': 0.6032890188826542, 'rougeLsum': 0.7037531656430065}
llama_3.2_3b : {'rouge1': 0.6371560616916654, 'rouge2': 0.2941605420908827, 'rougeL': 0.511882508004283, 'rougeLsum': 0.6224696637390135}
The models are on my huggingface
That is all for today!
See you tomorrow :)