(Day 80) Starting to write my own RAG from scratch on a bank's T&C pdf

Ivan Ivanov · March 21, 2024

applying-knowledge nlp

Hello :) Today is Day 80!

A quick summary of today:

based on yesterday’s RAG from scratch, I decided to create a new one, based on a pdf chosen by me

I uploaded the code so far on this github repo, and below is an overview of my progress so far.

Firstly, what document?

I wanted to use a document that is shorter (tutorial used 1200 page doc), but a bit more complex - including tables, more numbers. So I chose a bank’s terms and conditions 32 page document (source). The file itself is in the github repo.

Sample content

Firstly, I decided to follow the tutorial’s code, but just shortening it and make it more comfortable to use. To embed the text I tried using mixedbread-ai/mxbai-embed-large-v1 (new SoTA) and all-mpnet-base-v2 (from tutorial). Each model has to be used following specific instructions, for example, the all-mpnet one directly outputs normalised scores, but the mixedbread-ai one outputs raw scores, and cosine similarity function needs to be applied.

Next, time to create the RAG itself. I decided to use google/gemma-2b-it again. Also yesterday, I got a ran out of memory error, but it turns out I can run with gemma on my mps, but I need to close chrome and other draining programs, and it runs fine (slower than a powerful cuda-based gpu, but still running).

Based on a quick look over the file, I came up with these sample queries

And then, seeing the results. For many of these, the model said there is no relevant info in the pdf about it. I found out that the reason is - answers ‘hide’ in tables. The PyMuPDF library does not read tables well, and so the model gets further confused and misses knowledge. Then I started looking at how to read tables as well as text from a pdf.

Note that, I keep in mind that there could be other issues, like chunk length, preprocessing the text, preprocessing when loading the saved embeddings csv. But at first I decided to first go with finding a way to read tables better.

I found tabula, pdfplumber, and also even PyMuPDF has a way to read tables.

Actually the same library (PyMuPDF) reads tables well, the problem now became how can I incorporate them in the text. I could not find a way to add the table at the exact spot where it came up on. the page (there might be a way with this library but today I could not find it). So I tried, just appending the table at the end of the page text. But this means that we are repeating text, and because chunks are only a fixed length, the table and the text itself + the deficiently read table, might not even appear together. In addition, the model might not know that this table at the end of the page is related to this passage. I did want to add the table at the right spot, but the problem is that the table’s title is not included in the extracted table - the title is part of the text above it. So for tomorrow, I need to find a way to read tables properly, or find a library that reads text and tables properly and places them in the correct order.

Some results

The last query I ran, just now before uploading to github, I got

It might look alright, but it is wrong actually. Before I mention how it is wrong, I want to say that before adding the tables at the end of the page’s text, the answer to this query was something along the lines of: the provided context does not talk about this. So even though adding the table’s text (not in the best way) to the end of a page’s text is not the best method, the model improved a little and is giving some kind of answer now.

So… the correct answer to that query is (from the pdf)

There is some confussion, likely still due to the table being read incorrectly (among other reasons).

Tomorrow, we continue ^^

That is all for today!

See you tomorrow :)