(Day 305) More of the LLM Engineer's Handbook

Ivan Ivanov · November 1, 2024

Hello :) Today is Day 305!

A quick summary of today:

  • read chapters 5 and 6 of the handbook

LLM Engineer’s Handbook Chapter 5 - Supervised Fine-Tuning

Overview of the post-training data pipeline covered in this chapter

image

Creating our own instruction dataset

To create a high-quality instruction dataset, we need to address two main issues:

  • the unstructured nature of our data
  • the limited number of articles we can crawl

The unstructured nature comes from the fact that we are dealing with raw text instead of pairs of instructions and answers. To address this, an LLM is used to perform backtranslation and rephrasing. Backtranslation refers to starting from the expected answer (the output) and generating its corresponding instruction. However, using a chunk of text like a paragraph as an answer might not always be appropriate. This is why we want to rephrase the raw text to ensure we’re outputting properly formatted, high-quality answers. Additionally, we can ask the model to follow the author’s writing style to stay close to the original paragraph. While this process involves extensive prompt engineering, it can be automated and used at scale.

The second issue regarding the limited number of samples is quite common in real-world use cases. To address this problem, we will divide our articles into chunks and generate three instruction-answer pairs for each chunk. This will multiply the number of samples we create while maintaining diversity in the final dataset.
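This is driven by prompt engineering. I won't copy the book's prompt here, but a rough sketch of what such a backtranslation/rephrasing prompt could look like for one chunk (the wording below is mine, not the authors'):

```python
def build_instruction_prompt(chunk: str, n_pairs: int = 3) -> str:
    """Build a prompt asking an LLM to backtranslate a raw text chunk into
    instruction-answer pairs, rephrased in the author's writing style."""
    return (
        f"Below is an extract from a blog article:\n\n{chunk}\n\n"
        f"Generate {n_pairs} instruction-answer pairs based only on this extract. "
        "Each answer must rephrase the relevant part of the extract into a clear, "
        "well-formatted response that keeps the author's writing style. "
        "Return the result as JSON."
    )

print(build_instruction_prompt("RAG combines retrieval with generation so that ..."))
```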

LLMs are not reliable when it comes to producing structured output. To ensure properly structured results, we can employ structured generation techniques. Structured generation is an effective method to force an LLM to follow a predefined template, such as JSON, pydantic classes, or regular expressions.
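For example, with pydantic (v2) the expected template can be written as a class and the LLM's raw JSON output validated against it. The schema below is just an illustration of the idea, not the book's exact code:

```python
from pydantic import BaseModel, Field


class InstructionAnswerPair(BaseModel):
    instruction: str = Field(description="A question a reader could ask about the chunk")
    answer: str = Field(description="A well-formatted answer in the author's style")


class InstructionAnswerSet(BaseModel):
    pairs: list[InstructionAnswerPair]


# Pretend this JSON string came back from the LLM.
raw_output = (
    '{"pairs": [{"instruction": "What is backtranslation?", '
    '"answer": "It means generating the instruction that matches a given answer."}]}'
)

# Parsing fails loudly if the output does not follow the template.
parsed = InstructionAnswerSet.model_validate_json(raw_output)
print(parsed.pairs[0].instruction)
```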

Synthetic data generation pipeline from raw text to instruction dataset

image

After the Python implementation, the authors share the generated dataset on Hugging Face.

image

Exploring SFT and its techniques

‘When to fine-tune?’ - basic chart:

image

Parameter-efficient fine-tuning techniques

Differences between full fine-tuning, LoRA, and QLoRA:

image

This part gave a very nice overview of the parameters used when doing (LoRA) fine-tuning, which was interesting as I see all of these parameters when using unsloth for the company reviewer LLM. They also show how to do LoRA fine-tuning, and it is the same code as I use haha.

LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to make fine-tuning large language models (LLMs) more memory-efficient and accessible, especially for users with limited computational resources.

  • LoRA enables efficient fine-tuning by introducing trainable, low-rank matrices A and B to model layers, preserving the original parameters while reducing memory and speeding up training. LoRA focuses primarily on attention layers but can be extended to other parts, like the feed-forward layers. With LoRA, tuning a model with 7B parameters can be done on a single GPU with 14–18 GB of VRAM (a config sketch follows this list).

  • QLoRA combines LoRA with 4-bit quantization to further minimize memory usage, making it possible to fine-tune even large models on modest GPUs. QLoRA’s double quantization and paged optimizers reduce memory by up to 75%, though with a 30% training speed penalty compared to LoRA.
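A minimal sketch of how this typically looks with Hugging Face transformers, peft, and bitsandbytes; the checkpoint name, rank, and target modules below are illustrative, not necessarily what the book uses:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA part: load the base model in 4-bit NF4 with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # any causal LM checkpoint works here
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA part: attach low-rank adapter matrices A and B to selected projections.
lora_config = LoraConfig(
    r=16,              # rank of the adapter matrices
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```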

Key Components of Training LLMs with LoRA and QLoRA:

  1. hyperparams: LoRA’s rank r and scaling factor alpha control the size and impact of the added matrices, with rank values from 8 to 256 offering flexibility between task diversity and efficiency
  2. training params (typical values are sketched in the config example after this list):
    • learning rate and scheduler: a crucial factor, often in the range of 1e-5, adjusted over time using linear or cosine decay
    • batch size and gradient accumulation: effective batch sizes depend on GPU memory, with larger sizes increasing stability
    • max sequence length and packing: length caps affect memory and performance, with packing combining short sequences into full-length training examples
    • number of epochs: typically 2–5, balancing overfitting and underfitting
    • optimizers: AdamW is widely used; paged or quantized versions help manage memory on constrained GPUs
  3. regularization techniques: weight decay (0.01-0.1) helps prevent overfitting, while gradient checkpointing reduces memory load by recomputing intermediate activations instead of storing them, at a small cost in training speed
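To put rough numbers on these, here is a hedged example using transformers' TrainingArguments (all values are illustrative, not the book's exact settings; max sequence length and packing live on the SFT trainer/config itself in trl, for example, rather than on TrainingArguments):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sft-output",
    learning_rate=1e-5,              # small LR, decayed over training
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
    num_train_epochs=3,              # typically 2-5
    optim="paged_adamw_8bit",        # paged/quantized AdamW for constrained GPUs
    weight_decay=0.01,
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # if the GPU supports bfloat16
    logging_steps=10,
)
```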

Chapter 6 - Fine-Tuning with Preference Alignment

SFT struggles to capture the nuance of human preferences and the long tail of potential interactions that a model might encounter. This limitation has led to the development of more advanced techniques for aligning AI systems with human preferences: preference alignment.

Preference alignment addresses the shortcomings of SFT by incorporating direct human or AI feedback into the training process. This method allows a more nuanced understanding of human preferences, especially in complex scenarios where simple supervised learning falls short. While numerous techniques exist for preference alignment, this chapter will primarily focus on Direct Preference Optimization (DPO) for simplicity and efficiency.

Understanding preference datasets

Preference datasets lack the standardization of instruction datasets due to varying data requirements across different training algorithms. Preference data comprises a collection of responses to a given instruction, ranked by humans or language models.

The structure of DPO datasets is straightforward: each instruction is paired with one preferred answer and one rejected answer. The objective is to train the model to generate the preferred response rather than the rejected one.

image
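Concretely, a single DPO sample is just a triple; trl's DPOTrainer, for instance, expects prompt, chosen, and rejected columns. The content below is made up for illustration:

```python
dpo_sample = {
    "prompt": "Explain what an instruction dataset is.",
    "chosen": (
        "An instruction dataset pairs instructions with high-quality answers and is "
        "used to fine-tune a language model to follow user requests."
    ),
    "rejected": "idk, it's just some data i guess.",
}
```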

In preference datasets, the rejected response is as important as the chosen one. Without the rejected response, the dataset would be a simple instruction set. Rejected responses represent the behavior we aim to eliminate from the model. This provides a lot of flexibility and allows us to use preference datasets in many contexts. Here is a list of examples where preference datasets are more beneficial to use compared to using SFT alone:

  • chatbots
  • content moderation
  • summarization
  • code generation
  • creative writing
  • translation

DPO datasets typically require fewer samples than instruction datasets to significantly impact model behavior. As with instruction datasets, the required sample count depends on model size and task complexity. Larger models are more sample-efficient and thus require less data, while complex tasks demand more examples to capture the desired behavior. Once again, data quality is crucial, and a large number of preference pairs is generally beneficial.

DPO datasets can be created using various methods, each with its own trade-offs between quality, cost, and scalability. These methods can be tailored to specific applications and require varying degrees of human feedback. The book divides them into 4 methods:

  • human-generated, human-evaluated
  • human-generated, LLM-evaluated
  • LLM-generated, human-evaluated
  • LLM-generated, LLM-evaluated

Note that evaluation is not always required and preferences can emerge naturally from the generation process. For instance, it is possible to use a high-quality model to generate preferred outputs and a lower-quality or intentionally flawed model to produce less preferred alternatives. This creates a clear distinction in the preference dataset, allowing more effective training of AI systems to recognize and emulate high-quality outputs. The Intel/orca_dpo_pairs dataset available on the Hugging Face Hub was created with this process.
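A toy sketch of that generation-only setup; strong_model and weak_model are placeholders for any two generation functions (e.g., a large instruct model vs. a small base model), not anything from the book's code:

```python
def build_preference_pair(instruction, strong_model, weak_model):
    """Create one DPO-style sample without any explicit evaluation step."""
    return {
        "prompt": instruction,
        "chosen": strong_model(instruction),    # assumed higher-quality output
        "rejected": weak_model(instruction),    # assumed lower-quality output
    }
```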

Here is the preference dataset created by the authors.

image

Preference alignment

RLHF

At its core, RLHF works by iteratively improving both a reward model and a policy:

  • reward model learning
  • policy optimisation
  • iterative improvement

A key innovation in RLHF is its approach to handling the high cost of human feedback. Rather than requiring constant human oversight, RLHF allows for asynchronous and sparse feedback. The learned reward model serves as a proxy for human preferences, enabling the RL algorithm to train continuously without direct human input for every action.

The Proximal Policy Optimization (PPO) algorithm is one of the most famous RLHF algorithms. Here, the reward model is used to score the text that is generated by the trained model. This reward is regularized by an additional Kullback–Leibler (KL) divergence factor, ensuring that the distribution of tokens stays similar to that of the model before training (the frozen model).

image
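In my own notation (not copied from the book), the KL-regularized objective that PPO optimizes looks like this, where \pi_\theta is the trained policy, \pi_ref the frozen reference model, r_\phi the reward model, and \beta the regularization strength:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```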

While RLHF has proven effective for aligning AI systems with human preferences, it faces challenges due to its iterative nature and reliance on a separate reward model, which can be computationally expensive and potentially unstable.

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, offers a simpler alternative to the traditional Reinforcement Learning from Human Feedback (RLHF) methods for training language models. Rather than relying on separate reward models and complex RL algorithms, DPO reformulates preference learning by deriving a closed-form expression for the optimal policy under RLHF’s typical reward-maximizing objective with a KL-divergence constraint. This reformulation allows DPO to handle preference learning directly, without separate reward models or reinforcement learning.

image
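For reference, the DPO loss from the Rafailov et al. paper (again in my notation), where y_w is the preferred and y_l the rejected answer:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
```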

In practice, DPO functions as a binary cross-entropy loss on the model’s output probabilities, encouraging higher probabilities for preferred responses and lower for non-preferred ones, while remaining close to a reference model, controlled by a tunable beta parameter. The approach enables simple gradient descent optimization, streamlining the process and eliminating the need for sampling or other RL complexities. This simplicity reduces engineering complexity and computational cost, especially when using adapters like LoRA or QLoRA, which allow one model to be trained without separating frozen and trained models, saving VRAM.
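A minimal PyTorch sketch of that loss, assuming the summed log-probabilities of each response have already been computed under both the trained policy and the frozen reference model (libraries like trl handle all of this internally):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO as a binary cross-entropy on implicit rewards (log-prob ratios)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin (chosen - rejected) to be positive while staying near the reference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```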

DPO can match RLHF’s performance but is often more stable and less sensitive to hyperparameters, making it accessible to smaller teams lacking deep RL expertise. Although RLHF methods have a higher performance ceiling for large-scale applications, DPO offers most benefits with less computational and engineering effort. Both methods benefit from synthetic data, which enables iterative improvements in model training.

However, DPO does have limitations. It still requires paired preference data and lacks some of the theoretical guarantees of RL approaches. While RLHF may be more advantageous for complex or dynamic tasks, DPO serves as an effective, lower-cost alternative for preference alignment in many applications.

scikit-learn official certification 🥳

I saw this morning that probabl (one of the companies run by scikit-learn devs) announced an official certification program - here

The link takes you to the 1st level of the certification (there are 3 overall). And I think I will review the materials suggested to pass this 1st level on stream. Here they are:

image

  • Data Preprocessing
    • Loading parquet datasets
    • Visualizing data with basic plotting techniques: scatterplot, boxplot
    • Identifying wrongly encoded predictive columns (e.g., float encoded as string)
    • Handling missing values using imputation (SimpleImputer)
    • Correct choice of feature scaling: StandardScaler, MinMaxScaler
    • Encoding categorical data: OrdinalEncoder, OneHotEncoder
    • Combining preprocessing steps with ColumnTransformer
  • Model Building and Evaluation
    • Splitting datasets into training and testing sets using train_test_split
    • Training ML models using the fit() method
    • Making predictions using the predict() method
    • Evaluating model performance with common metrics: accuracy, precision, recall, F1 score, confusion matrix, mean squared error, R-squared
    • Interpreting scores with respect to dummy models
  • Model Selection and Validation
    • Understanding and implementing cross-validation techniques: KFold, ShuffleSplit
    • Learning and validation curves
    • Performing hyperparameter tuning using GridSearchCV, RandomizedSearchCV
    • Stability of learned coefficients across splits
  • Interpretation of Results & Communication
    • Visualizing model results using basic plotting techniques (matplotlib, seaborn)
    • Interpreting and communicating model outputs and performance metrics to non-technical stakeholders
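Not part of the certification materials themselves, just a quick sketch of how a few of these pieces fit together (the columns and data are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 55, 38, 47],  # has a missing value
    "city": ["Sofia", "Plovdiv", "Sofia", "Varna", "Varna", "Sofia", "Plovdiv", "Sofia"],
    "bought": [0, 1, 0, 1, 0, 1, 1, 0],
})
X, y = df[["age", "city"]], df["bought"]

# Impute + scale the numeric column, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```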

That is all for today!

See you tomorrow :)