(Day 241) Techniques for improving RAG pipelines

Ivan Ivanov · August 29, 2024

Hello :) Today is Day 241!

A quick summary of today:

  • ways to improve RAG pipelines
  • sample project for the LLM zoomcamp

Techniques for improving RAG pipelines

  • small-to-big chunk retrieval - use small chunks in the embedding stage and large chunks in the answering stage (helps tackle the problem of choosing the right chunk size for embedding)
  • leverage document metadata - adding the document name and path can be useful, and also lets the user see the original document
  • hybrid search (vector-based plus keyword-based search) - vector search looks for the semantically closest vectors in the embedding space, while keyword search looks for lexical similarity

A simple formula for this could be: hybrid_score = (1 - alpha) * match_score + alpha * vec_score
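
To make the formula concrete, here is a minimal sketch of the combination in code (my own example, not from any particular library; the scores and alpha are made up, and both score lists are assumed to be normalized to [0, 1] over the same candidate documents):

def hybrid_scores(match_scores, vec_scores, alpha=0.5):
    """match_scores: keyword scores (e.g. BM25); vec_scores: cosine similarities."""
    return [(1 - alpha) * m + alpha * v for m, v in zip(match_scores, vec_scores)]

# alpha = 0 -> pure keyword search, alpha = 1 -> pure vector search
print(hybrid_scores([0.9, 0.2, 0.5], [0.3, 0.8, 0.6], alpha=0.7))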

  • user query re-writing - rephrase user questions in a more structured way
  • retrieved document re-ranking - can be done using an LLM, or using relevance scoring approaches like Normalized Discounted Cumulative Gain (NDCG), MAP@5, or reciprocal rank fusion (RRF) - a small RRF sketch follows below
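
For reference, here is a hedged sketch of reciprocal rank fusion (my own code; the document ids are made up). It fuses ranked lists of doc ids into a single ranking, with k=60 as a commonly used constant:

from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: a list of ranked lists of doc ids, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc7", "doc2"]
print(rrf([keyword_hits, vector_hits]))  # doc1 and doc7 end up at the top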

LLM zoomcamp project example

Alexey, the main instructor and creator of DataTalksClub, decided to upload a series of videos showcasing how to create a sample project for the LLM zoomcamp. There are 7 videos in total, and they give others (and me) a clearer idea of what he is looking for in the final project for the course.

  1. Creating a dataset

First is creating a sample dataset using ChatGPT (creating my own dataset is not required; it is just used as an example, and also to show off ChatGPT's abilities).

  2. Setting up a simple RAG flow

  3. Evaluating the retrieval

For each exercise, ChatGPT can be used to create sample questions. For instance, for push-ups, ChatGPT creates 5 questions related to form, back position, repetitions, and so on. These questions are then used to evaluate retrieval: for a given question, do we retrieve the relevant exercise and its info from our database?

Two evaluation approaches were presented: a basic approach using minsearch (a small custom text search implementation that ranks documents by vector similarity) and a more advanced approach involving hyperparameter tuning of the boosting weights. In the advanced approach, the focus was on identifying the most relevant parts of a document. As expected, exercise_name, body_part, and muscle_groups_activated were found to be the most relevant fields, while type_of_activity, type_of_equipment, type, and instructions were less significant.

The basic approach - using minsearch without any boosting - gave the following metrics:

  • Hit rate: 94%
  • MRR: 82%

The improved version (with tuned boosting):

  • Hit rate: 94%
  • MRR: 90%
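
Both metrics are easy to compute. A minimal sketch (my own, with made-up ranks), assuming that for each question we know the 1-based rank at which the relevant exercise appears in the retrieved results (None if it was not retrieved at all):

def hit_rate(ranks):
    return sum(r is not None for r in ranks) / len(ranks)

def mrr(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2, 1]  # hypothetical ranks for 5 questions
print(f"Hit rate: {hit_rate(ranks):.0%}, MRR: {mrr(ranks):.2f}")  # Hit rate: 80%, MRR: 0.57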

  4. Evaluating RAG

The 5 questions generated above for each exercise are now asked through our RAG system. Here we use LLM-as-a-judge to decide whether the answer generated by our RAG is relevant to the question (another option is cosine similarity, if we had an original/actual answer to compare against). Here is a sample prompt for the judge LLM:

You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it as "NON_RELEVANT," "PARTLY_RELEVANT," or "RELEVANT."

Here is the data for evaluation:

Question: What is the starting position for doing push-ups?
Generated Answer: The starting position for doing push-ups is to begin in a high plank position with your hands under your shoulders.

Please analyze the content and context of the generated answer in relation to the question and provide your evaluation in parsable JSON without using code blocks:

{
   "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
   "Explanation": "[Provide a brief explanation for your evaluation]"
}

This particular case resulted in:

{
   "Relevance": "RELEVANT",
   "Explanation": "The generated answer accurately describes the starting position for doing push-ups, which directly addresses the question asked."
}
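
Putting the judge together in code might look roughly like this - a hedged sketch using the OpenAI Python SDK (the model name is my assumption, and the prompt is a trimmed version of the one above):

import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = """You are an expert evaluator for a RAG system.
Classify the relevance of the generated answer to the given question as
"NON_RELEVANT", "PARTLY_RELEVANT" or "RELEVANT".

Question: {question}
Generated Answer: {answer}

Provide your evaluation in parsable JSON without using code blocks:
{{"Relevance": "...", "Explanation": "..."}}"""

def judge_relevance(question, answer):
    prompt = PROMPT_TEMPLATE.format(question=question, answer=answer)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # the prompt asks for plain JSON, so the reply can be parsed directly
    return json.loads(response.choices[0].message.content)

verdict = judge_relevance(
    "What is the starting position for doing push-ups?",
    "Begin in a high plank position with your hands under your shoulders.",
)
print(verdict["Relevance"], "-", verdict["Explanation"])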

For validation, a small dataset of around 200 exercises was used, and for each one a question and answer were generated. After running LLM-as-a-judge on those, the RAG evaluation results are:

  • 167 (83%) RELEVANT
  • 30 (15%) PARTLY_RELEVANT
  • 3 (1.5%) NON_RELEVANT

Such stats can be shown in a Grafana monitoring dashboard.

This got me thinking about whether I can build a RAG project on top of this very blog. The problem is that I have a lot of pictures, and for many days I hand-wrote my notes when covering a course or a book, so I would have to pass those images through an LLM to extract the text. But I cannot pass all the images, because I would get image info that is irrelevant or maybe not describable, and it may cost money if I want the highest-quality image descriptions (i.e. using GPT's API). Then again, I could just use the text part of my blog posts, and that may work fine. So it is just about deciding what I want to do for this project.

Interview for a university course

Today I was in a meeting with other scholarship students at my university, and our teacher (who is responsible for the scholarship students) mentioned that the ‘Hannam Design Factory’ department (its page is all in English) is looking for students for its projects. The way that particular department operates is that it runs projects which serve as 6- or 9-credit courses throughout the year, and the idea is for the students to work on a realistic project assigned by a company. Apparently they look for people with an engineering background, but I said that while I am not an engineer, I can help with AI if possible. Later, the teacher messaged me to let me know that I should go for an interview at the department tomorrow morning. This is the first time I am interviewing to take a university course haha. But jokes aside, I believe there is an interview because I would be part of a project from some company, and they just want to make sure I understand the assignment and have the skills to do it. I will update on this tomorrow ^^

Big Data on Kubernetes Chapter 2: K8s architecture

  1. Control Plane Components:
    • API Server: The entry point for the control plane, managing communication and maintaining the desired state of the cluster
    • etcd: The distributed key-value store that holds cluster data, ensuring consistency across the system
    • Controller Manager: Responsible for ensuring that the cluster state matches the desired state by running various controllers
    • Scheduler: Assigns workloads (Pods) to nodes based on resource availability
  2. Node Components:
    • kubelet: the agent running on each node that communicates with the control plane and ensures containers are running as expected
    • kube-proxy: maintains network rules on each node, routing traffic for Services to the right Pods
  3. API Resources:
    • Pods: the smallest deployable units in Kubernetes, encapsulating containers with shared storage and networking
    • Deployments: enable declarative management of Pod replicas, ensuring scalability and self-healing of applications
    • StatefulSets: manage stateful applications by ensuring persistent storage and stable networking
    • Jobs: Allow batch processing workloads to run to completion
    • Services: Provide stable networking and enable loose coupling between Pods, abstracting away direct Pod access
  4. Ingress Resources and Controllers: Ingress resources define external access rules for cluster services, while ingress controllers handle the actual traffic routing. The Gateway API provides a new, centralized method for managing ingress configurations
  5. Storage in Kubernetes:
    • PersistentVolumes (PVs): provide portable, network-attached storage that can be dynamically provisioned
    • PersistentVolumeClaims (PVCs): request and bind to PVs for efficient storage management
    • StorageClasses: define different classes of storage to cater to varying performance and availability requirements
  6. ConfigMaps and Secrets:
    • ConfigMaps: inject configuration data into Pods in a decoupled manner, allowing for flexible application configuration
    • Secrets: Securely manage sensitive data, such as passwords or API keys, and inject them into Pods while keeping them separate from the application code
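
To make a few of these resources more concrete, here is a minimal sketch (my own, not from the book) that uses the official Kubernetes Python client (pip install kubernetes) to list some of them from a running cluster:

from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config; inside a Pod use config.load_incluster_config()

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Pods: the smallest deployable units
for pod in core.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

# Deployments: desired vs. available replicas in the "default" namespace
for dep in apps.list_namespaced_deployment(namespace="default").items:
    print(dep.metadata.name, dep.spec.replicas, dep.status.available_replicas)

# ConfigMaps: configuration data kept separate from the application code
for cm in core.list_namespaced_config_map(namespace="default").items:
    print(cm.metadata.name, list((cm.data or {}).keys()))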

That is all for today!

See you tomorrow :)