Hello :) Today is Day 306!
A quick summary of today:
- read a bit more of the LLM Engineer’s Handbook
- started covering sklearn’s MOOC
LLM Engineer’s Handbook Chapter 7 - Evaluating LLMs
Model evaluation
In model evaluation, the objective is to assess the capabilities of a single model without any prompt engineering, RAG pipeline, and so on.
General-purpose LLM metrics are:
- train loss
- validation loss
- perplexity (see the short sketch after this list)
- gradient norm
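To make the perplexity bullet concrete: it’s just the exponential of the mean per-token cross-entropy (i.e. the validation loss). A tiny sketch with made-up per-token losses:

```python
import math

# Perplexity is the exponential of the mean negative log-likelihood per token.
token_nlls = [2.1, 1.7, 3.0, 2.4]   # made-up per-token losses (nats)
mean_nll = sum(token_nlls) / len(token_nlls)
perplexity = math.exp(mean_nll)
print(f"validation loss: {mean_nll:.3f} -> perplexity: {perplexity:.2f}")
```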
After pre-training, it is common to run a suite of evaluations on the base model. This suite can include internal and public benchmarks. Here’s a non-exhaustive list of common public pre-training evaluations:
- MMLU (knowledge)
- HellaSwag (reasoning)
- ARC-C (reasoning)
- Winogrande (reasoning)
- PIQA (reasoning)
Domain-specific LLM evaluations are:
- Open Medical-LLM Leaderboard
- BigCodeBench Leaderboard
- Hallucinations Leaderboard
- Enterprise Scenarios Leaderboard
Task-specific LLM evaluations are:
- for summarization: ROUGE
- for classification: accuracy, precision, recall, F1 (see the scikit-learn sketch after this list)
- LLM-as-a-judge
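For the classification metrics above, scikit-learn covers everything out of the box. A minimal sketch with made-up labels, not tied to any particular benchmark:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up predictions for a 3-class classification task.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```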
RAG evaluation
The evaluation of RAG systems goes beyond assessing a standalone LLM. It requires examining the entire system’s performance, including:
- retrieval accuracy
- integration quality
- factuality and relevance
Below, two methods for evaluating how well RAG systems incorporate external information into their responses are covered: Ragas and ARES.
RAGAS
Retrieval-Augmented Generation Assessment (Ragas) is an open-source toolkit designed to provide developers with a comprehensive set of tools for RAG evaluation and optimization. It’s built around the idea of metrics-driven development (MDD), a product development approach that relies on data to make well-informed decisions and involves the ongoing monitoring of essential metrics over time to gain insights into an application’s performance.
Ragas provides a suite of LLM-assisted evaluation metrics designed to objectively measure different aspects of RAG system performance. These metrics include:
- faithfulness: this metric measures the factual consistency of the generated answer against the given context. It works by breaking down the answer into individual claims and verifying whether each claim can be inferred from the provided context. The faithfulness score is calculated as the ratio of verifiable claims to the total number of claims in the answer (see the sketch after this list).
- answer relevancy: this metric evaluates how pertinent the generated answer is to the given prompt. It uses an innovative approach where an LLM is prompted to generate multiple questions based on the answer and then calculates the mean cosine similarity between these generated questions and the original question. This method helps identify answers that may be factually correct but off-topic or incomplete
- context precision: this metric evaluates whether all the ground-truth relevant items present in the contexts are ranked appropriately. It considers the position of relevant information within the retrieved context, rewarding systems that place the most pertinent information at the top
- context recall: this metric measures the extent to which the retrieved context aligns with the annotated answer (ground truth). It analyzes each claim in the ground truth answer to determine whether it can be attributed to the retrieved context, providing insights into the completeness of the retrieved information
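Here is a conceptual sketch of the faithfulness score described above. The two helpers stand in for the LLM prompts Ragas actually uses (claim extraction and entailment checking) and are purely hypothetical:

```python
# Conceptual faithfulness score: supported claims / total claims in the answer.

def split_into_claims(answer: str) -> list[str]:
    # Ragas prompts an LLM for this; a naive stand-in splits on sentences.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported_by(claim: str, context: str) -> bool:
    # Ragas asks an LLM to judge entailment; here we fake it with substring matching.
    return claim.lower() in context.lower()

def faithfulness(answer: str, context: str) -> float:
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    supported = sum(is_supported_by(c, context) for c in claims)
    return supported / len(claims)

context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It was completed in 1889. It is made of gold."
print(faithfulness(answer, context))  # 2 of 3 claims supported
```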
ARES
ARES (an automated evaluation framework for RAG systems) is a comprehensive tool designed to evaluate RAG systems. It offers an automated process that combines synthetic data generation with fine-tuned classifiers to assess various aspects of RAG performance, including context relevance, answer faithfulness, and answer relevance.
The ARES framework operates in three main stages: synthetic data generation, classifier training, and RAG evaluation. Each stage is configurable, allowing users to tailor the evaluation process to their specific needs and datasets.
Chapter 8 - Inference Optimization
Model optimization strategies
Below, several optimization strategies commonly used to speed up inference and reduce Video Random-Access Memory (VRAM) usage are discussed: implementing a (static) KV cache, continuous batching, speculative decoding, and optimized attention mechanisms.
KV cache
LLMs generate text token by token, which is slow because each new prediction depends on the entire previous context. For example, to predict the 100th token in a sequence, the model needs the context of tokens 1 through 99. When predicting the 101st token, it again needs the information from tokens 1 through 99, plus token 100. This repeated computation is particularly inefficient.
The key-value (KV) cache addresses this issue by storing key-value pairs produced by self-attention layers. Instead of recalculating these pairs for each new token, the model retrieves them from the cache, significantly speeding up the generation.
The size of the cache is roughly 2 × batch_size × sequence_length × number_of_layers × hidden_size × precision_in_bytes, the factor of 2 accounting for both the keys and the values.
Because the cache grows with each generation step, its size is dynamic, which prevents us from taking advantage of torch.compile, a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. The static KV cache solves this issue by pre-allocating the KV cache to a maximum size, which allows us to combine it with torch.compile for up to a 4x speedup in the forward pass.
Efficiently managing the KV cache is essential, as it can quickly exhaust available GPU memory and limit the batch sizes that can be processed.
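As a back-of-the-envelope check, here is the cache size for a 7B-class configuration (all numbers below are illustrative assumptions, not a specific model’s config):

```python
# KV cache size = 2 (keys and values) * batch * seq_len * layers * heads * head_dim * bytes.
batch_size = 1
seq_len = 4096
n_layers = 32
n_heads = 32
head_dim = 128        # hidden_size = n_heads * head_dim = 4096
bytes_per_value = 2   # float16

kv_cache_bytes = 2 * batch_size * seq_len * n_layers * n_heads * head_dim * bytes_per_value
print(f"KV cache: {kv_cache_bytes / 1024**3:.2f} GiB")  # 2.00 GiB for this config
```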
Continuous batching
Continuous batching improves the efficiency of inference by processing new requests as soon as previous ones finish, rather than waiting for all requests in a batch to complete. This method avoids idle time, keeping the hardware fully utilized. It works by evicting completed requests from the batch and immediately replacing them with new ones. This approach is especially useful for decoder-only models with varying input and output lengths, which can otherwise lead to underutilization. Most inference frameworks, such as Hugging Face’s TGI, vLLM, and NVIDIA TensorRT-LLM, support continuous batching natively.
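A toy way to see why this matters (not a real serving engine, all numbers made up): with varying output lengths, static batching wastes steps waiting for the longest request in each batch, while continuous batching keeps every slot busy:

```python
# Toy comparison of static vs continuous batching when requests need
# different numbers of output tokens.
output_lengths = [12, 3, 7, 20, 5, 9, 4, 15]   # tokens to generate per request
num_slots = 4                                   # sequences the GPU batch can hold

# Static batching: each group of num_slots requests runs until its longest member ends.
static_steps = sum(max(output_lengths[i:i + num_slots])
                   for i in range(0, len(output_lengths), num_slots))

# Continuous batching: as soon as a request finishes, a queued request takes its slot.
active = list(output_lengths[:num_slots])       # remaining tokens per active request
queue = list(output_lengths[num_slots:])
continuous_steps = 0
while active:
    step = min(active)                          # decode until the shortest active request ends
    continuous_steps += step
    active = [r - step for r in active if r > step]
    while queue and len(active) < num_slots:
        active.append(queue.pop(0))

print(f"static batching: {static_steps} decode steps, "
      f"continuous batching: {continuous_steps} decode steps")
```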
Speculative decoding
Another powerful optimization technique is speculative decoding, also called assisted generation. The key insight is that even with continuous batching, the token-by-token generation process fails to fully saturate the parallel processing capabilities of the accelerator. Speculative decoding aims to use this spare compute capacity to predict multiple tokens simultaneously, using a smaller proxy model.
The general approach is:
- apply a smaller model, like a distilled or pruned version of the main model, to predict multiple token completions in parallel. This could be 5–10 tokens predicted in a single step
- feed these speculative completions into the full model to validate which predictions match what the large model would have generated
- retain the longest matching prefix from the speculative completions and discard any incorrect tokens
The result is that, if the small model approximates the large model well, multiple tokens can be generated in a single step. This avoids running the expensive large model for several iterations. The degree of speedup depends on the quality of the small model’s predictions – a 90% match could result in a 3–4x speedup.
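Here is a conceptual sketch of that accept/verify loop for the greedy case. `draft_model` and `target_model` are hypothetical callables mapping a token sequence to the next token; a real implementation verifies all k draft positions with a single forward pass of the large model (and uses rejection sampling when decoding with temperature):

```python
def speculative_step(prefix, draft_model, target_model, k=5):
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    # 2-3. Keep the longest prefix of draft tokens the large model agrees with;
    # on the first mismatch, keep the large model's own token instead.
    accepted, ctx = [], list(prefix)
    for token in draft:
        target_token = target_model(ctx)
        if target_token != token:
            accepted.append(target_token)
            break
        accepted.append(token)
        ctx.append(token)
    return accepted

# Toy usage: a "draft" that agrees with the "target" only on short contexts.
target = lambda ctx: (len(ctx) * 7) % 11
draft = lambda ctx: (len(ctx) * 7) % 11 if len(ctx) < 4 else 0
print(speculative_step([3, 1], draft, target, k=5))  # several tokens accepted in one step
```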
Optimized attention mechanisms
PagedAttention and FlashAttention-2 are optimized attention mechanisms that address memory inefficiencies in traditional Transformers, especially for longer sequences.
PagedAttention (Kwon, Li, et al., 2023) tackles memory challenges by using a block-based approach, inspired by virtual memory. It partitions the key-value (KV) cache, enabling efficient memory utilization and reducing the need for contiguous memory allocation. This approach supports batching more sequences and allows memory sharing across sequences, enhancing parallel processing. Results show memory overhead reduction up to 55% and throughput improvement by 2.2x, with implementations in libraries like vLLM, TGI, and TensorRT-LLM.
FlashAttention-2 (Dao, 2023) optimizes runtime and memory by dividing input/output matrices into blocks stored in GPU SRAM, which minimizes data transfers. It also employs ‘online softmax’, allowing block-wise computation of attention scores without large intermediate matrices, lowering memory requirements from quadratic to linear for backpropagation.
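A single-query NumPy sketch of the online softmax idea (real FlashAttention also tiles over queries, keeps blocks in GPU SRAM, and scales by the square root of the head dimension, which this toy version skips):

```python
import numpy as np

# Attention for one query, computed block by block over the keys/values while keeping
# only a running max, a running normalizer, and a running weighted sum — never the
# full row of attention scores.
def online_softmax_attention(q, K, V, block_size=128):
    running_max, normalizer = -np.inf, 0.0
    output = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block_size):
        k_blk, v_blk = K[start:start + block_size], V[start:start + block_size]
        scores = k_blk @ q                          # attention logits for this block
        blk_max = max(running_max, scores.max())
        correction = np.exp(running_max - blk_max)  # rescale earlier partial results
        weights = np.exp(scores - blk_max)
        normalizer = normalizer * correction + weights.sum()
        output = output * correction + weights @ v_blk
        running_max = blk_max
    return output / normalizer

# Check against the naive computation on random data.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 32))
naive = (lambda s: (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V)(K @ q)
print(np.allclose(online_softmax_attention(q, K, V), naive))  # True
```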
Model parallelism
Model parallelism allows us to distribute the memory and compute requirements of LLMs across multiple GPUs. This enables the training and inference of models too large to fit on a single device, while also improving performance in terms of throughput (tokens per second).
There are three main approaches to model parallelism, each involving splitting the model weights and computation in different ways: data parallelism, pipeline parallelism, and tensor parallelism.
Although these approaches were originally developed for training, we can reuse them for inference by focusing on the forward pass only.
Data parallelism
This is the simplest type of model parallelism. It involves making copies of the model and distributing these replicas across different GPUs, with each replica processing a different slice of the incoming data.
Pipeline parallelism
This is a strategy for distributing the computational load of training and running large neural networks across multiple GPUs.
Tensor parallelism
This is another popular technique to distribute the computation of LLM layers across multiple devices. In contrast to pipeline parallelism, which splits the model layer by layer, tensor parallelism splits the weight matrices found in individual layers. This enables simultaneous computations, significantly reducing memory bottlenecks and increasing processing speed.
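A conceptual NumPy sketch of what tensor parallelism does to one linear layer: split the weight matrix column-wise, let each “device” compute its shard, and gather the partial outputs. Everything runs on one CPU here; real implementations shard across GPUs and use communication collectives:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1024))          # a batch of activations
W = rng.normal(size=(1024, 4096))       # the full weight matrix of one layer

W_shard_0, W_shard_1 = np.split(W, 2, axis=1)   # each "GPU" holds half the columns

y_shard_0 = x @ W_shard_0               # computed on device 0
y_shard_1 = x @ W_shard_1               # computed on device 1
y = np.concatenate([y_shard_0, y_shard_1], axis=1)  # gather the partial outputs

print(np.allclose(y, x @ W))            # True: identical to the unsharded computation
```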
Model quantization
(side note: this book definitely feels like an LLM engineer’s handbook - especially after covering yesterday and today’s chapters. It covers crucial information for anyone working with an LLM)
Model quantization reduces the precision of neural network weights and activations to lower bit formats, minimizing memory usage and speeding up large language models (LLMs) without significantly compromising performance. For example, large models (over 30B parameters) can maintain quality with 2- or 3-bit precision.
There are two primary methods: Post-Training Quantization (PTQ), a simpler approach that doesn’t require retraining, and Quantization-Aware Training (QAT), which incorporates quantization during training to maintain better accuracy.
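To make PTQ less abstract, here is a minimal absmax int8 quantize/dequantize sketch. It is not GPTQ or GGUF (those use per-group scales, smarter rounding, and error compensation), just the core idea:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric 'absmax' quantization: map the largest magnitude to +/-127.
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q, scale):
    return w_q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
w_q, scale = quantize_int8(w)
w_hat = dequantize(w_q, scale)
print(f"memory: {w.nbytes / 1e6:.0f} MB -> {w_q.nbytes / 1e6:.0f} MB, "
      f"mean abs error: {np.abs(w - w_hat).mean():.5f}")
```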
Quantization techniques
- GGUF and llama.cpp: these enable CPU inference with optional GPU offloading, running on diverse hardware. llama.cpp, a popular open-source library, also supports optimizations like FlashAttention-2
- GPTQ and EXL2: both tailored for GPU, they enhance speed over llama.cpp. GPTQ uses a refined quantization algorithm, while EXL2 adds flexibility by mixing quantization levels (2–8 bits) to minimize errors and optimize memory, allowing large models (up to 70B parameters) to run on single GPUs
Chapter 9 - The RAG inference pipeline
Understanding the LLM Twin’s RAG inference pipeline and Advanced RAG feature techniques
- User query: We begin with the user who makes a query, such as “Write an article about…”
- Query expansion: We expand the initial query to generate multiple queries that reflect different aspects or interpretations of the original user query. Thus, instead of one query, we will use xN queries. By diversifying the search terms, the retrieval module increases the likelihood of capturing a comprehensive set of relevant data points. This step is crucial when the original query is too narrow or vague.
- Self-querying: We extract useful metadata from the original query, such as the author’s name. The extracted metadata will be used as filters for the vector search operation, eliminating redundant data points from the query vector space (making the search more accurate and faster).
- Filtered vector search: We embed each query and perform a similarity search to find each search’s top K data points. We execute xN searches corresponding to the number of expanded queries. We call this step a filtered vector search as we leverage the metadata extracted from the self-query step as query filters.
- Collecting results: For each search operation, we get up to K results closest to that specific expanded query interpretation. We then aggregate the results of all the xN searches, ending up with a list of N x K results containing a mix of article, post, and repository chunks. The results include a broader set of potentially relevant chunks, offering multiple relevant angles based on the original query’s different facets.
- Reranking: To keep only the top K most relevant results from the list of N x K potential items, we must filter the list further. We will use a reranking algorithm that scores each chunk based on its relevance and importance relative to the initial user query. We will leverage a neural cross-encoder model to compute the score, a value between 0 and 1, where 1 means the result is entirely relevant to the query. Ultimately, we sort the N x K results based on the score and pick the top K items. Thus, the output is a ranked list of K chunks, with the most relevant data points situated at the top.
- Build the prompt and call the LLM: We map the final list of the most relevant K chunks to a string used to build the final prompt. We create the prompt using a prompt template, the retrieved context, and the user’s query. Ultimately, the augmented prompt is sent to the LLM (hosted on AWS SageMaker and exposed as an API endpoint).
- Answer: Finally, we wait for the answer to be generated. After the LLM processes the prompt, the RAG logic finishes by sending the generated response to the user.
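To tie the eight steps together, here is a toy, end-to-end runnable sketch. Every component is a trivial stand-in (canned query expansions, keyword matching instead of embeddings and a cross-encoder) rather than the book’s actual code:

```python
DOCS = [
    {"author": "Paul", "text": "Vector databases store embeddings for similarity search."},
    {"author": "Paul", "text": "RAG augments prompts with retrieved context."},
    {"author": "Maxime", "text": "Fine-tuning adapts a base model to a task."},
]

def expand_query(query, n=3):
    # 2. Query expansion: an LLM call in the real pipeline; canned variations here.
    return [query, f"{query} overview", f"{query} examples"][:n]

def extract_metadata(query):
    # 3. Self-querying: an LLM extracts filters such as the author's name.
    return {"author": "Paul"} if "Paul" in query else {}

def vector_search(query, filters, k=2):
    # 4. Filtered vector search: keyword matching stands in for embedding similarity.
    pool = [d for d in DOCS if all(d.get(f) == v for f, v in filters.items())]
    return [d for d in pool
            if any(w.lower() in d["text"].lower() for w in query.split())][:k]

def rerank(query, chunks, keep=2):
    # 6. Reranking: a cross-encoder in the real pipeline; word overlap here.
    scores = {c["text"]: sum(w.lower() in c["text"].lower() for w in query.split())
              for c in chunks}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

def rag_answer(user_query):
    queries = expand_query(user_query)                                  # 2. expansion
    filters = extract_metadata(user_query)                              # 3. self-querying
    results = [c for q in queries for c in vector_search(q, filters)]   # 4-5. N x K chunks
    context = "\n".join(rerank(user_query, results))                    # 6. keep top K
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"  # 7. build prompt
    return prompt                                                       # 8. send to the LLM

print(rag_answer("What did Paul write about RAG"))
```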
Most of this chapter was just code from the repo, and it covers the above 8 steps. Here is the code.
Streamed
I covered the 1st module - The predictive modeling pipeline. I don’t know how I only just found out about this 😆. It is such a great tool for someone just entering the field. This was just the 1st module and it was so comprehensive.
These are the main takeaways from it and what someone can learn:
- to create a scikit-learn predictive model;
- about the scikit-learn API to train and test a predictive model;
- to process numerical data, notably using a Pipeline;
- to process categorical data, notably using a OneHotEncoder and an OrdinalEncoder;
- to handle and process mixed data types (i.e., numerical and categorical data), notably using a ColumnTransformer.
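A tiny illustration of those pieces on a made-up dataset (not the MOOC’s data): numerical and categorical columns preprocessed with a ColumnTransformer inside a Pipeline, then cross-validated:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up mixed-type data: two numerical columns, one categorical, one binary target.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 61, 44],
    "hours_per_week": [40, 50, 38, 45, 60, 35, 20, 42],
    "education": ["bachelor", "master", "phd", "bachelor",
                  "master", "bachelor", "phd", "master"],
    "high_income": [0, 0, 1, 1, 1, 0, 1, 0],
})
X, y = df.drop(columns="high_income"), df["high_income"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours_per_week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])
model = Pipeline([("preprocess", preprocessor), ("classifier", LogisticRegression())])

scores = cross_val_score(model, X, y, cv=2)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```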
I will continue with this before taking the exam. Might be overkill ~ but it’s cool.
That is all for today!
See you tomorrow :)