Hello :) Today is Day 303!
A quick summary of today:
- using a powerful GPU to fine-tune models
- reading more of the LLM Engineer's Handbook
Using my professor’s A6000 GPU
Today, I met the professor who is leading the overall project; it has several parts, and mine is developing the company-reviewer LLM I have been talking about. Until now I have been using the resources available online, but the client came in for a meeting today, so I went to meet them and also got to use my professor's second PC, which has a 48 GB A6000 GPU 🤯
I had to set up conda and Unsloth. Unsloth is not really supported on Windows (there is a way to do it, but it is not optimal), so no problem: I installed WSL with Ubuntu and set everything up there.
Because I am using WSL, I faced some weird issues; it turns out other WSL users hit them as well. There is a workaround, but it means I am not utilising Unsloth's full potential. The issue is quite recent and there have been no updates on a fix. Nonetheless, I first ran a Llama 3.1 8B fine-tune with a 4K context, and the results looked subpar even at a glance. The run took 1 hour; on a T4 it would have taken 7-8 hrs 🤯
Since that did not succeed, I started an 8K-context Llama 3.1 8B run. It trained for ~1.5 hrs and the results look promising. I saved the outputs in a DataFrame, and from a manual review (human evaluation) they seem good and follow the output format that we need.
Here are the results (alongside the gpt ones):
| Model | Text RougeLSum | Ratings RMSE |
|---|---|---|
| gpt4o-mini | 0.8768 | 0.5924 |
| gpt4o-mini-fewshot | 0.8499 | 0.6797 |
| llama-3.1-8b-8K | 0.7827 | 0.8580 |
I call it 8K because that’s the max sequence length I used when training the model. Here is a link to it on HuggingFace.
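For reference, the setup was roughly the standard Unsloth + TRL recipe, something like the sketch below (the dataset path and LoRA hyperparameters are placeholders, not my exact script, and some argument names shift a bit between trl versions):

```python
# Minimal Unsloth QLoRA fine-tuning sketch (placeholder dataset path and
# hyperparameters - not the exact training script I ran).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 8192  # the "8K" context used for this run

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # 4-bit quantisation so the 8B model trains comfortably on the A6000
)

# Attach LoRA adapters - only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset of examples rendered into a single `text` field.
dataset = load_dataset("json", data_files="company_reviews_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```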
While the models were training, I read the chapter below.
LLM Engineer's Handbook - Chapter 4 - RAG Feature Pipeline
The book explains a bit about RAG and its benefits, and shows the below ‘vanilla’ framework:
- ingestion pipeline: a batch or streaming pipeline used to populate the vector DB
- retrieval pipeline: a module that queries the vector DB and retrieves relevant entries to the user’s input
- generation pipeline: the layer that uses the retrieved data to augment the prompt and an LLM to generate answers
How are these 3 connected?
- On the backend side, the ingestion pipeline runs either on a schedule or constantly to populate the vector DB with external data
- On the client side, the user asks a question
- The question is passed to the retrieval module, which preprocesses the user’s input and queries the vector DB
- The generation pipelines use a prompt template, user input, and retrieved context to create the prompt
- The prompt is passed to an LLM to generate the answer
- The answer is shown to the user
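Stitched together, the retrieval and generation side of that flow is roughly the sketch below; `embed_fn`, `search_fn`, and `llm_complete_fn` are stand-ins for whatever embedding model, vector DB client, and LLM client you actually use:

```python
# Sketch of the 'vanilla' RAG inference flow: retrieve context, build the
# prompt from a template, and ask the LLM. The three callables are stand-ins.
from typing import Callable

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def rag_answer(
    question: str,
    embed_fn: Callable[[str], list[float]],              # embedding model
    search_fn: Callable[[list[float], int], list[str]],  # vector DB query
    llm_complete_fn: Callable[[str], str],               # LLM call
    top_k: int = 5,
) -> str:
    # Retrieval pipeline: embed the user input and query the vector DB.
    query_embedding = embed_fn(question)
    chunks = search_fn(query_embedding, top_k)
    # Generation pipeline: augment the prompt with the retrieved context.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    return llm_complete_fn(prompt)
```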
An overview of advanced RAG
The ‘vanilla’ design can be optimised:
- pre-retrieval: this focuses on how to structure and preprocess your data for data indexing optimizations as well as query optimizations
- retrieval: this revolves around improving the embedding models and metadata filtering to improve the vector search step
- post-retrieval: this mainly targets different ways to filter out noise from the retrieved documents and compress the prompt before feeding it to an LLM for answer generation
The 3 stages of advanced RAG apps:
The pre-retrieval steps are performed in two different ways:
- data indexing: it is part of the RAG ingestion pipeline. It is mainly implemented within the cleaning or chunking modules to preprocess the data for better indexing
- query optimization: the algorithm is performed directly on the user’s query before embedding it and retrieving the chunks from the vector DB
Data indexing techniques focus on better preprocessing and structuring the data to improve retrieval efficiency, these include:
- sliding window: chunks the text into overlapping windows so that context at chunk boundaries is not lost (a rough sketch of this, together with small-to-big, follows below)
- enhancing data granularity: involves data cleaning, such as removing irrelevant details, verifying accuracy, and updating outdated information. Clean, accurate datasets lead to sharper retrieval
- metadata: adding tags like dates, URLs, and external IDs helps filter results efficiently during retrieval
- optimising index structures: uses varied chunk sizes and multi-indexing strategies to refine retrieval precision
- small-to-big: decouples the chunk size for embedding from the larger context in the final prompt. Small chunks improve retrieval accuracy, while the larger context adds depth to the generated response
The intuition behind this is that if we use the whole text for computing the embedding, we might introduce too much noise, or the text could contain multiple topics, which results in a poor overall semantic representation of the embedding.
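As a concrete (toy) illustration of the sliding-window and small-to-big ideas, word-based chunking with overlap plus a child-to-parent mapping could look something like this (chunk sizes are arbitrary, not tuned):

```python
# Toy sliding-window chunker plus a small-to-big mapping: small chunks are what
# gets embedded, while each keeps a pointer to its larger parent window, which
# is what ends up in the final prompt.

def sliding_window(text: str, size: int = 256, overlap: int = 64) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def small_to_big(text: str, small_size: int = 64, big_size: int = 512) -> list[dict]:
    chunks = []
    for parent in sliding_window(text, size=big_size, overlap=0):
        for child in sliding_window(parent, size=small_size, overlap=16):
            # embed `child`, but store `parent` as the context fed to the LLM
            chunks.append({"embed_text": child, "context": parent})
    return chunks
```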
On the query optimization side, we can leverage techniques such as query routing, query rewriting, and query expansion to refine the retrieved information for the LLM further:
- query routing: based on the user's input, we might have to interact with different categories of data and query each category differently. Query routing is used to decide what action to take based on the user's input, similar to if/else statements, except the decisions are made using natural language instead of logical statements
- query rewriting: sometimes, the user's initial query might not perfectly align with the way your data is structured. Query rewriting tackles this by reformulating the question to match the indexed information better
- Hypothetical document embeddings (HyDE): this technique involves having an LLM create a hypothetical response to the query. Then, both the original query and the LLM’s response are fed into the retrieval stage
- query expansion: this approach aims to enrich the user's question by adding additional terms or concepts, resulting in different perspectives on the same initial question. For example, when searching for 'disease', we can leverage synonyms and related terms associated with the original query words and also include 'illnesses' or 'ailments' (see the sketch after this list)
- self-query: the core idea is to map unstructured queries into structured ones. An LLM identifies key entities, events, and relationships within the input text. These entities are then used as filtering parameters to reduce the vector search space (e.g. identify cities within the query, such as 'Paris', and add them to the filters to shrink the search space)
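Here is a hedged sketch of what HyDE and query expansion can look like in practice; `llm_complete_fn` and `embed_fn` are placeholders for your actual LLM and embedding clients:

```python
# Placeholder sketches of two query-optimization techniques.
from typing import Callable

def hyde_embedding(
    query: str,
    llm_complete_fn: Callable[[str], str],
    embed_fn: Callable[[str], list[float]],
) -> list[float]:
    # HyDE: let the LLM draft a hypothetical answer, then embed query + draft
    # and use that vector for the retrieval step.
    hypothetical = llm_complete_fn(f"Write a short passage that answers: {query}")
    return embed_fn(query + "\n" + hypothetical)

def expand_query(query: str, llm_complete_fn: Callable[[str], str]) -> list[str]:
    # Query expansion: ask the LLM for paraphrases / synonym-rich variants,
    # then search with all of them and merge the results.
    prompt = f"Rewrite the following search query in 3 different ways:\n{query}"
    variants = llm_complete_fn(prompt).splitlines()
    return [query] + [v.strip("- ").strip() for v in variants if v.strip()]
```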
Retrieval
The retrieval step can be optimized in two fundamental ways:
- improving the embedding models used in the RAG ingestion pipeline to encode the chunked documents and, at inference time, transform the user’s input
- leveraging the DB’s filter and search features: this step will be used solely at inference time when you have to retrieve the most similar chunks based on user input
- hybrid search: combines vector and keyword-based searches. It uses keyword search for exact matches and vector search for broader semantic similarities. The balance between the two is controlled by a parameter (often called alpha), and results from each method are normalized and combined
- filtered vector search: uses metadata filtering with vector search by applying specific keyword filters to metadata either before or after the vector search, reducing the search space without performing a separate keyword-based search
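The alpha blend in hybrid search is usually just a normalized weighted sum of the two score lists, roughly:

```python
# Sketch of the alpha blend used in hybrid search: alpha = 1.0 means pure vector
# search, alpha = 0.0 means pure keyword (e.g. BM25) search. Scores are min-max
# normalized per method before being combined.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_scores(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> dict[str, float]:
    v, k = normalize(vector_scores), normalize(keyword_scores)
    doc_ids = set(v) | set(k)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in doc_ids}
```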
Post-retrieval
The post-retrieval optimizations are solely performed on the retrieved data to ensure that the LLM’s performance is not compromised by issues such as limited context windows or noisy data. This is because the retrieved context can sometimes be too large or contain irrelevant information, both of which can distract the LLM.
Two popular methods performed at the post-retrieval step are:
- prompt compression: eliminate unnecessary details while keeping the essence of the data
- re-ranking: use a cross-encoder ML model to give a matching score between the user's input and every retrieved chunk. The retrieved items are sorted by this score and only the top N results are kept as the most relevant. This works because the re-ranking model can capture more complex relationships between the user input and the content than a simple similarity search. However, we can't apply this model at the initial retrieval step because it is costly; that is why a popular strategy is to first retrieve the data using a similarity distance between embeddings and then refine the retrieved information with a re-ranking model
Bi-encoder (the standard embedding model) versus cross-encoder
The re-ranking algorithm
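A cross-encoder re-ranker over the retrieved chunks can be just a few lines with sentence-transformers; the model name below is one common choice, not necessarily the one used in the book:

```python
# Re-ranking sketch: score (query, chunk) pairs with a cross-encoder, keep top N.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one common choice

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```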
Exploring the LLM Twin’s RAG feature pipeline architecture
Any RAG system is split into two independent components:
- the ingestion pipeline takes in raw data, cleans, chunks, embeds, and loads it into a vector DB
- the inference pipeline queries the vector DB for relevant context and ultimately generates an answer by leveraging an LLM
Going back to the main goal, the LLM Twin, the next step is to create a feature store.
The feature store will be the central access point for all the features used within the training and inference pipelines. The training pipeline will use the cleaned data from the feature store (stored as artifacts) to fine-tune LLMs. The inference pipeline will query the vector DB for chunked documents for RAG. That is why we are designing a feature pipeline and not only a RAG ingestion pipeline. In practice, the feature pipeline contains multiple subcomponents, one of which is the RAG logic.
Here is an overview of the architecture of the RAG feature pipeline:
Batch vs streaming pipelines are discussed, and a nice graph is shown of which tools fit which use case:
In this case, the pipeline will be batch, as new data does not require immediate processing, and batch keeps things simple.
Change data capture: syncing the data warehouse and feature store
Data is constantly changing, which can result in databases, data lakes, data warehouses, and feature stores getting out of sync. Change data capture (CDC) is a strategy that lets you optimally keep two or more data stores in sync without heavy compute and I/O overhead. It captures any CRUD operation done on the source DB and replicates it on a target DB.
The syncing issues also apply when building a feature pipeline. One key design choice concerns how to sync the data warehouse with the feature store to have data fresh enough for our particular use case.
In this LLM Twin use case, a naive batch processing pipeline reads and processes all raw data from the data warehouse in batches, updating or inserting records in the Qdrant vector DB. This approach works well for smaller datasets but raises scalability concerns as data volumes grow. Specific issues include handling millions of records, reflecting deletions from the data warehouse, and processing only new or updated items.
A Change Data Capture (CDC) pattern addresses these issues through two main approaches:
- push: the source DB actively identifies and sends changes to target systems for near-instant updates, with messaging systems buffering data if targets are unavailable
- pull: target systems periodically request changes from a passive source DB, better for large-scale transfers where immediate updates aren’t required but also requiring messaging systems to avoid data loss
Push is best for real-time needs, while pull fits non-real-time, large-scale data updates. To detect changes there are timestamp, trigger and log-based approaches.
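For intuition, the timestamp-based (pull) variant is the simplest: poll the source table for rows whose `updated_at` is newer than the last sync watermark. A minimal sketch, with placeholder table/column names and a placeholder feature-store client:

```python
# Timestamp-based CDC (pull) sketch: periodically fetch rows changed since the
# last sync and upsert them downstream. Table/column names are placeholders,
# and sqlite is used only to keep the example self-contained.
import sqlite3
from datetime import datetime, timezone

def pull_changes(conn: sqlite3.Connection, since: datetime) -> list[tuple]:
    cursor = conn.execute(
        "SELECT id, payload, updated_at FROM documents WHERE updated_at > ?",
        (since.isoformat(),),
    )
    return cursor.fetchall()

def sync(conn: sqlite3.Connection, feature_store, last_sync: datetime) -> datetime:
    for row_id, payload, _updated_at in pull_changes(conn, last_sync):
        feature_store.upsert(row_id, payload)  # placeholder feature-store client
    return datetime.now(timezone.utc)  # watermark for the next run
```

One caveat worth remembering: plain timestamp polling cannot see hard deletes, which is part of why log-based CDC (reading the database's change log, e.g. with a tool like Debezium) tends to win at scale.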
Implementing the LLM Twin’s RAG feature pipeline
I wanted to cover this step on stream, but I was in the lab until ~9pm, so I could not stream today.
That is all for today!
See you tomorrow :)