Hello :) Today is Day 86!
A quick summary of today:
- Coded, planned, recorded and posted a video tutorial on building a chat-with-your-PDF RAG system for free
- All the code, the Colab link and the PDF used are on this GitHub repo
Well, after waking up today, I definitely did not expect to plan, record and upload an almost one-hour tutorial to YouTube.
I was looking through chat-with-your-PDF videos to see what I could improve in my pdf_rag_from_scratch, but I saw that most of them require an OpenAI API key. I did not like that, given how many free resources and models are available.
Then I found this great resource from Hugging Face - Building A RAG System with Gemma, MongoDB and Open Source Model. Instead of a PDF, they use a dataframe of films, so I decided to build on that and make the code preprocess a PDF, embed it, upload it to MongoDB, load Gemma, create a prompt and chat with the PDF (kind of a combination of the tutorial + my pdf_rag_from_scratch).
The code itself is not that complicated, but I wanted to write it once or twice beforehand so that I would not run into problems while coding live in the recording. The whole process from idea to published video took me maybe 8 hours. Mind that I also had to find an app to edit the video (there was not much editing, but the app’s processing time was long because I wanted the video in 1080p).
Anyway ~ below I will provide an overall summary of the code
1. Download libraries
2. Preprocess PDF
2.1 Load PDF with llama-index
2.2 Chunk PDF text using langchain
2.3 Embed chunks
3. Set up MongoDB
An important part is setting up an Atlas Vector Search index
3.1 Connect to the db
3.2 Delete existing data (if any), and insert the new data
4. Find relevant texts
4.1 Perform vector search in db + get context
5. Load Gemma using Hugging Face
6. Prompt engineering + talk with your PDF
I used a base_prompt similar to the one in pdf_rag_from_scratch.
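To make the outline a bit more concrete, here is a rough sketch of the preprocessing and retrieval parts (2.2, 3.2 and 4.1). It is a simplified stand-in, not the notebook code: the chunker is a plain character splitter instead of the langchain one, and the names `chunk_text`, `to_documents`, `vector_search_pipeline`, the index name `vector_index` and the field names `text` / `embedding` are all assumptions made for illustration.

```python
from typing import Any


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    A simplified stand-in for a library text splitter (assumption).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks


def to_documents(chunks: list[str], embeddings: list[list[float]]) -> list[dict[str, Any]]:
    """Pair each chunk with its embedding, ready to insert into a collection."""
    if len(chunks) != len(embeddings):
        raise ValueError("need exactly one embedding per chunk")
    return [{"text": c, "embedding": list(e)} for c, e in zip(chunks, embeddings)]


def vector_search_pipeline(
    query_vector: list[float],
    index_name: str = "vector_index",  # assumed name of the Atlas index
    path: str = "embedding",           # assumed field holding the vectors
    limit: int = 4,
    num_candidates: int = 100,
) -> list[dict[str, Any]]:
    """Build an aggregation pipeline that uses the $vectorSearch stage."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Keep only the chunk text and the similarity score.
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

With pymongo, the documents would typically go in via `collection.delete_many({})` followed by `collection.insert_many(docs)`, and the pipeline would be run with `collection.aggregate(pipeline)`. The query vector has to come from the same embedding model that embedded the chunks.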
Query: Do you pay or charge interest?
Answer: Yes, the Core Banking Agreement states that interest is paid and charged on a daily basis, and the interest rate applicable to your account(s) is stated in the Product & Services Terms & Conditions or, if no such terms are provided, on the website.
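For a query like the one above, the prompt engineering step boils down to pasting the retrieved chunks and the question into one template. A minimal sketch, assuming a hypothetical `BASE_PROMPT` and `build_prompt` helper (the wording of the template here is a placeholder, not the actual base_prompt from pdf_rag_from_scratch):

```python
# Placeholder template (assumption): the real base_prompt may be worded differently.
BASE_PROMPT = """Answer the query using only the context items below.
If the context does not contain the answer, say that you do not know.

Context:
{context}

Query: {query}
Answer:"""


def build_prompt(query: str, context_items: list[str]) -> str:
    """Fill the template with the retrieved chunks and the user query."""
    # One bullet per retrieved chunk so the model can tell them apart.
    context = "\n".join(f"- {item.strip()}" for item in context_items)
    return BASE_PROMPT.format(context=context, query=query.strip())


if __name__ == "__main__":
    # The context strings are made-up placeholders, not text from the PDF.
    prompt = build_prompt(
        "Do you pay or charge interest?",
        ["first retrieved chunk", "second retrieved chunk"],
    )
    print(prompt)
```

With an instruction-tuned model loaded through transformers, the resulting string would usually be wrapped as a user message and passed through `tokenizer.apply_chat_template` before calling `model.generate`.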
The answers are not perfect, but they are a good starting point for fine-tuning.
That is all for today!
See you tomorrow :)