Hello :) Today is Day 303!
A quick summary of today:
- using a powerful GPU to fine-tune models
- reading more of the LLM Engineer's Handbook
Using my professor’s A6000 GPU
Today, I met the professor who is leading the overall project; it has several parts, and mine is developing the company-reviewer LLM I have been talking about. Until now I have been using the resources available online, but the client came in for a meeting today, so I went to meet them and also got to use my professor's second PC, which has a 48 GB A6000 GPU 🤯
I had to set up conda and Unsloth. Unsloth is not really supported on Windows (there is a way to do it, but it is not optimal), so no problem: I installed WSL with Ubuntu and set everything up there.
Because I am using WSL, I faced some weird issues; it turns out other WSL users hit them as well. There is a workaround, but it means I am not utilising Unsloth's full potential. The issue is quite recent and there have been no updates on a fix. Nonetheless, I first ran a Llama 3.1 8B fine-tune with a 4K context, and the results looked subpar even at a glance. The run took 1 hour; on a T4 it would have taken 7-8 hrs 🤯
Since that did not succeed, I started an 8K-context Llama 3.1 8B run. It trained for ~1.5 hrs and the results look promising. I saved the outputs in a DataFrame, and from a manual review (human evaluation) they seem good and follow the output format that we need.
Here are the results (alongside the gpt ones):
| Model | Text RougeLSum | Ratings RMSE |
|---|---|---|
| gpt4o-mini | 0.8768 | 0.5924 |
| gpt4o-mini-fewshot | 0.8499 | 0.6797 |
| llama-3.1-8b-8K | 0.7827 | 0.8580 |
I call it 8K because that’s the max sequence length I used when training the model. Here is a link to it on HuggingFace.
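For reference, the setup was roughly the standard Unsloth + TRL recipe, something like the sketch below (the dataset path and LoRA hyperparameters are placeholders, not my exact script, and some argument names shift a bit between trl versions):

```python
# Minimal Unsloth QLoRA fine-tuning sketch (placeholder dataset path and
# hyperparameters - not the exact training script I ran).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 8192  # the "8K" context used for this run

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # 4-bit quantisation so the 8B model trains comfortably on the A6000
)

# Attach LoRA adapters - only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset of examples rendered into a single `text` field.
dataset = load_dataset("json", data_files="company_reviews_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```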
While the models were training, I read the chapter below.
LLM Engineer's Handbook - Chapter 4 - RAG Feature Pipeline
The book explains a bit about RAG and its benefits, and shows the below ‘vanilla’ framework:
- ingestion pipeline: a batch or streaming pipeline used to populate the vector DB
- retrieval pipeline: a module that queries the vector DB and retrieves relevant entries to the user’s input
- generation pipeline: the layer that uses the retrieved data to augment the prompt and an LLM to generate answers
How are these 3 connected?
- On the backend side, the ingestion pipeline runs either on a schedule or constantly to populate the vector DB with external data
- On the client side, the user asks a question
- The question is passed to the retrieval module, which preprocesses the user’s input and queries the vector DB
- The generation pipelines use a prompt template, user input, and retrieved context to create the prompt
- The prompt is passed to an LLM to generate the answer
- The answer is shown to the user
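Stitched together, the retrieval and generation side of that flow is roughly the sketch below; `embed_fn`, `search_fn`, and `llm_complete_fn` are stand-ins for whatever embedding model, vector DB client, and LLM client you actually use:

```python
# Sketch of the 'vanilla' RAG inference flow: retrieve context, build the
# prompt from a template, and ask the LLM. The three callables are stand-ins.
from typing import Callable

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def rag_answer(
    question: str,
    embed_fn: Callable[[str], list[float]],              # embedding model
    search_fn: Callable[[list[float], int], list[str]],  # vector DB query
    llm_complete_fn: Callable[[str], str],               # LLM call
    top_k: int = 5,
) -> str:
    # Retrieval pipeline: embed the user input and query the vector DB.
    query_embedding = embed_fn(question)
    chunks = search_fn(query_embedding, top_k)
    # Generation pipeline: augment the prompt with the retrieved context.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    return llm_complete_fn(prompt)
```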
An overview of advanced RAG
The ‘vanilla’ design can be optimised:
- pre-retrieval: this focuses on how to structure and preprocess your data for data indexing optimizations as well as query optimizations
- retrieval: this revolves around improving the embedding models and metadata filtering to improve the vector search step
- post-retrieval: this mainly targets different ways to filter out noise from the retrieved documents and compress the prompt before feeding it to an LLM for answer generation
The 3 stages of advanced RAG apps:
The pre-retrieval steps are performed in two different ways:
- data indexing: it is part of the RAG ingestion pipeline. It is mainly implemented within the cleaning or chunking modules to preprocess the data for better indexing
- query optimization: the algorithm is performed directly on the user’s query before embedding it and retrieving the chunks from the vector DB
Data indexing techniques focus on better preprocessing and structuring the data to improve retrieval efficiency, these include:
- sliding window: chunks the text into overlapping windows so that context at chunk boundaries is not lost (a rough sketch of this, together with small-to-big, follows below)
- enhancing data granularity: involves data cleaning, such as removing irrelevant details, verifying accuracy, and updating outdated information. Clean, accurate datasets lead to sharper retrieval
- metadata: adding tags like dates, URLs, and external IDs helps filter results efficiently during retrieval
- optimising index structures: uses varied chunk sizes and multi-indexing strategies to refine retrieval precision
- small-to-big: decouples the chunk size for embedding from the larger context in the final prompt. Small chunks improve retrieval accuracy, while the larger context adds depth to the generated response
The intuition behind this is that if we use the whole text for computing the embedding, we might introduce too much noise, or the text could contain multiple topics, which results in a poor overall semantic representation of the embedding.
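As a concrete (toy) illustration of the sliding-window and small-to-big ideas, word-based chunking with overlap plus a child-to-parent mapping could look something like this (chunk sizes are arbitrary, not tuned):

```python
# Toy sliding-window chunker plus a small-to-big mapping: small chunks are what
# gets embedded, while each keeps a pointer to its larger parent window, which
# is what ends up in the final prompt.

def sliding_window(text: str, size: int = 256, overlap: int = 64) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def small_to_big(text: str, small_size: int = 64, big_size: int = 512) -> list[dict]:
    chunks = []
    for parent in sliding_window(text, size=big_size, overlap=0):
        for child in sliding_window(parent, size=small_size, overlap=16):
            # embed `child`, but store `parent` as the context fed to the LLM
            chunks.append({"embed_text": child, "context": parent})
    return chunks
```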
On the query optimization side, we can leverage techniques such as query routing, query rewriting, and query expansion to refine the retrieved information for the LLM further:
- query routing: based on the user's input, we might have to interact with different categories of data and query each category differently. Query routing is used to decide what action to take based on the user's input, similar to if/else statements, except the decisions are made using natural language instead of logical statements
- query rewriting: sometimes, the user's initial query might not perfectly align with the way your data is structured. Query rewriting tackles this by reformulating the question to match the indexed information better
- Hypothetical document embeddings (HyDE): this technique involves having an LLM create a hypothetical response to the query. Then, both the original query and the LLM’s response are fed into the retrieval stage
- query expansion: this approach aims to enrich the user's question by adding additional terms or concepts, resulting in different perspectives on the same initial question. For example, when searching for 'disease', we can leverage synonyms and related terms associated with the original query words and also include 'illnesses' or 'ailments' (see the sketch after this list)
- self-query: the core idea is to map unstructured queries into structured ones. An LLM identifies key entities, events, and relationships within the input text. These entities are then used as filtering parameters to reduce the vector search space (e.g. identify cities within the query, such as 'Paris', and add them to the filters to shrink the search space)
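Here is a hedged sketch of what HyDE and query expansion can look like in practice; `llm_complete_fn` and `embed_fn` are placeholders for your actual LLM and embedding clients:

```python
# Placeholder sketches of two query-optimization techniques.
from typing import Callable

def hyde_embedding(
    query: str,
    llm_complete_fn: Callable[[str], str],
    embed_fn: Callable[[str], list[float]],
) -> list[float]:
    # HyDE: let the LLM draft a hypothetical answer, then embed query + draft
    # and use that vector for the retrieval step.
    hypothetical = llm_complete_fn(f"Write a short passage that answers: {query}")
    return embed_fn(query + "\n" + hypothetical)

def expand_query(query: str, llm_complete_fn: Callable[[str], str]) -> list[str]:
    # Query expansion: ask the LLM for paraphrases / synonym-rich variants,
    # then search with all of them and merge the results.
    prompt = f"Rewrite the following search query in 3 different ways:\n{query}"
    variants = llm_complete_fn(prompt).splitlines()
    return [query] + [v.strip("- ").strip() for v in variants if v.strip()]
```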
Retrieval
The retrieval step can be optimized in two fundamental ways:
- improving the embedding models used in the RAG ingestion pipeline to encode the chunked documents and, at inference time, transform the user’s input
- leveraging the DB’s filter and search features: this step will be used solely at inference time when you have to retrieve the most similar chunks based on user input
- hybrid search: combines vector and keyword-based searches. It uses keyword search for exact matches and vector search for broader semantic similarities. The balance between the two is controlled by a parameter (often called alpha), and results from each method are normalized and combined
- filtered vector search: uses metadata filtering with vector search by applying specific keyword filters to metadata either before or after the vector search, reducing the search space without performing a separate keyword-based search
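The alpha blend in hybrid search is usually just a normalized weighted sum of the two score lists, roughly:

```python
# Sketch of the alpha blend used in hybrid search: alpha = 1.0 means pure vector
# search, alpha = 0.0 means pure keyword (e.g. BM25) search. Scores are min-max
# normalized per method before being combined.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_scores(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> dict[str, float]:
    v, k = normalize(vector_scores), normalize(keyword_scores)
    doc_ids = set(v) | set(k)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in doc_ids}
```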
Post-retrieval
The post-retrieval optimizations are solely performed on the retrieved data to ensure that the LLM’s performance is not compromised by issues such as limited context windows or noisy data. This is because the retrieved context can sometimes be too large or contain irrelevant information, both of which can distract the LLM.
Two popular methods performed at the post-retrieval step are:
- prompt compression: eliminate unnecessary details while keeping the essence of the data
- re-ranking: use a cross-encoder ML model to give a matching score between the user's input and every retrieved chunk. The retrieved items are sorted by this score and only the top N results are kept as the most relevant. This works because the re-ranking model can capture more complex relationships between the user input and the content than a simple similarity search. However, we can't apply this model at the initial retrieval step because it is costly; that is why a popular strategy is to first retrieve the data using a similarity distance between embeddings and then refine the retrieved information with a re-ranking model
Bi-encoder (the standard embedding model) versus cross-encoder
The re-ranking algorithm
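A cross-encoder re-ranker over the retrieved chunks can be just a few lines with sentence-transformers; the model name below is one common choice, not necessarily the one used in the book:

```python
# Re-ranking sketch: score (query, chunk) pairs with a cross-encoder, keep top N.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one common choice

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```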
Exploring the LLM Twin’s RAG feature pipeline architecture
Any RAG system is split into two independent components:
- the ingestion pipeline takes in raw data, cleans, chunks, embeds, and loads it into a vector DB
- the inference pipeline queries the vector DB for relevant context and ultimately generates an answer by leveraging an LLM
Going back to the main goal, the LLM Twin, the next step is to create a feature store.
The feature store will be the central access point for all the features used within the training and inference pipelines. The training pipeline will use the cleaned data from the feature store (stored as artifacts) to fine-tune LLMs. The inference pipeline will query the vector DB for chunked documents for RAG. That is why we are designing a feature pipeline and not only a RAG ingestion pipeline. In practice, the feature pipeline contains multiple subcomponents, one of which is the RAG logic.
Here is an overview of the architecture of the RAG feature pipeline:
Batch vs streaming pipelines are discussed, and a nice graph is shown of which tools fit which use case:
In this case, the pipeline will be batch, as new data does not require immediate processing, and batch keeps things simple.
Change data capture: syncing the data warehouse and feature store
Data is constantly changing, which can result in databases, data lakes, data warehouses, and feature stores getting out of sync. Change data capture (CDC) is a strategy that lets you optimally keep two or more data stores in sync without heavy compute and I/O overhead. It captures any CRUD operation done on the source DB and replicates it on a target DB.
The syncing issues also apply when building a feature pipeline. One key design choice concerns how to sync the data warehouse with the feature store to have data fresh enough for our particular use case.
In this LLM Twin use case, a naive batch processing pipeline reads and processes all raw data from the data warehouse in batches, updating or inserting records in the Qdrant vector DB. This approach works well for smaller datasets but raises scalability concerns as data volumes grow. Specific issues include handling millions of records, reflecting deletions from the data warehouse, and processing only new or updated items.
A Change Data Capture (CDC) pattern addresses these issues through two main approaches:
- push: the source DB actively identifies and sends changes to target systems for near-instant updates, with messaging systems buffering data if targets are unavailable
- pull: target systems periodically request changes from a passive source DB, better for large-scale transfers where immediate updates aren’t required but also requiring messaging systems to avoid data loss
Push is best for real-time needs, while pull fits non-real-time, large-scale data updates. To detect changes there are timestamp, trigger and log-based approaches.
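For intuition, the timestamp-based (pull) variant is the simplest: poll the source table for rows whose `updated_at` is newer than the last sync watermark. A minimal sketch, with placeholder table/column names and a placeholder feature-store client:

```python
# Timestamp-based CDC (pull) sketch: periodically fetch rows changed since the
# last sync and upsert them downstream. Table/column names are placeholders,
# and sqlite is used only to keep the example self-contained.
import sqlite3
from datetime import datetime, timezone

def pull_changes(conn: sqlite3.Connection, since: datetime) -> list[tuple]:
    cursor = conn.execute(
        "SELECT id, payload, updated_at FROM documents WHERE updated_at > ?",
        (since.isoformat(),),
    )
    return cursor.fetchall()

def sync(conn: sqlite3.Connection, feature_store, last_sync: datetime) -> datetime:
    for row_id, payload, _updated_at in pull_changes(conn, last_sync):
        feature_store.upsert(row_id, payload)  # placeholder feature-store client
    return datetime.now(timezone.utc)  # watermark for the next run
```

One caveat worth remembering: plain timestamp polling cannot see hard deletes, which is part of why log-based CDC (reading the database's change log, e.g. with a tool like Debezium) tends to win at scale.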
Implementing the LLM Twin’s RAG feature pipeline
I wanted to cover this step on stream, but I was in the lab until ~9pm, so I could not stream today.
That is all for today!
See you tomorrow :)