(Day 82) Looking for better parsing methods and prompting techniques for my PDF RAG

Ivan Ivanov · March 23, 2024

Hello :) Today is Day 82!

A quick summary of today:

  • looked online for better ways to prompt gemma-2b-it
  • switched to LlamaParse for my PDF preprocessing step (w.i.p.)

Everything is on this GitHub repo.

Firstly, about the prompting techniques

Before switching my base prompt, I went back and reduced the chunk_size of LangChain's RecursiveCharacterTextSplitter, trying values between 250 and 800. I settled on 800, thinking that 1500 was overkill, and I saw a small improvement in the responses to the base questions.
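For reference, a minimal sketch of that splitter setup (the chunk_overlap value here is an assumption, not necessarily what the repo uses):

```python
# Minimal sketch: chunking the PDF text for retrieval.
# chunk_overlap=100 is an assumed value, not necessarily the repo's.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,    # down from 1500
    chunk_overlap=100,
)
chunks = splitter.split_text(pdf_text)  # pdf_text: the full document string
```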

The custom query list and base prompt at this point were:

[image]

And the answers to the queries:

[image]

Next, I decided to see how the outputs change if I modify the base prompt a bit. The above prompt was modified by adding:

‘You are a helpful assistant to customers about our bank’s terms and conditions.’

And at the end, I saw that we can add these special tokens to help the model better distinguish between the user’s turn and the model’s turn to respond.

User query: <start_of_turn>user{query}<end_of_turn>

Answer: <start_of_turn>model
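Putting it together, the prompt assembly looks roughly like this (base_prompt and query stand in for the actual strings in the repo):

```python
# Sketch: wrapping the user query in Gemma's chat-turn tokens.
# base_prompt and query stand in for the actual strings from the repo.
prompt = (
    base_prompt
    + f"<start_of_turn>user\n{query}<end_of_turn>\n"
    + "<start_of_turn>model\n"
)
```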

Below is the full prompt for reference

[image]

And now the answers to the queries are:

[image]

I used this website to look for prompting techniques: https://www.promptingguide.ai/models/gemma and also this Kaggle notebook: https://www.kaggle.com/code/sooyoungher/study-prompt-engineering-gemma-2b-it-in-langchain#Prompt-Engineering

Afterwards, I changed the base prompt a bit more. Before, towards the end it was:

‘Now use the following context items to answer the user query {context} Relevant passages: '

But I decided to change this a bit to:

‘Now use the following context items: {context} \n And answer the user’s query:’

[image]

(also seen in the pic, I added a line telling the model what to say in case it does not know the answer, but it did not work)
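Roughly, the new tail of the prompt could be assembled like this (a sketch; the exact fallback wording is in the screenshot above):

```python
# Sketch: revised tail of the base prompt, plus the fallback instruction.
# context is the retrieved passages joined into one string.
prompt_tail = (
    f"Now use the following context items: {context}\n"
    "And answer the user's query:\n"
    "If the context does not contain the answer, say that you do not know."  # approximate wording
)
```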

Now, the answers to the queries are:

[image]

(A side note: earlier I noticed that if we ask What is a Business Day? and What is a business day?, in the former case we get an answer, but in the latter the model says there is no info in the context. It is interesting that it takes uppercase letters into account like that, but in a bit you will see that this fixes itself with a better base prompt, and later once the PDF preprocessing is done using LlamaParse.)

Next, seeing there is some improvement, I decided to improve the base prompt further: give it better examples, but also reduce them from 4 to 2.

[image]
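Just to illustrate the structure, a sketch of what a 2-example base prompt can look like (the example Q&A pairs below are hypothetical placeholders, not the actual ones from the screenshot):

```python
# Sketch of a 2-example few-shot base prompt.
# The example Q&A pairs are hypothetical placeholders, not the real ones.
base_prompt = """You are a helpful assistant to customers about our bank's terms and conditions.

Example 1:
Query: What is the minimum balance for a savings account?
Answer: According to the terms and conditions, the minimum balance is ...

Example 2:
Query: How can I close my account?
Answer: You can close your account by ...
"""
```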

In addition, I included some more questions in the queries list for the model to answer.

[image]

The output is:

[image] [image]

There is definitely improvement. It comes from the base prompt changes, and also from the fact that in the text preprocessing I commented out this line: text = text.replace('\n', ' ').replace('  ', ' '), which seems to have had a bad effect on the then-current state of preprocessing and loading data into the model.
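For context, the cleanup step now looks roughly like this (a sketch; format_text is a hypothetical name, not necessarily the repo's actual helper):

```python
# Sketch of the PDF text cleanup, with the harmful line disabled.
# format_text is a hypothetical name for the repo's actual helper.
def format_text(text: str) -> str:
    # text = text.replace('\n', ' ').replace('  ', ' ')  # commented out: hurt the results
    return text.strip()
```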

Switching to LlamaIndex for PDF parsing

Afterwards, I remembered that besides LangChain I had saved one more PDF parsing method - LlamaIndex. I learned more about it from this video, and the gist is below.

It was not so much LlamaIndex itself that stood out to me, but an idea they talk about: parsing PDFs and turning them into HTML/Markdown. They mention an issue when parsing complex PDFs that include tables, such as annual reports: if we want to hand a RAG model to our stakeholders, we need the model to know for sure the values in a financial statement report. But reading complex tables is not so easy, they said. However, if we can parse the PDF and turn it into Markdown, Markdown can represent tables very well, and Markdown tables are read well by LLMs -> this got me really excited. It means that if I can use LlamaParse, I can make my custom RAG more powerful and even give it more complex PDFs.

So I decided to go to LlamaIndex’s page to see how to use its API. I got an API key, and the implementation was not hard. At the moment of writing this blog the code is not very structured, I just wanted to see it work - and it works!
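For reference, a minimal sketch of the LlamaParse call (the file name and env-var handling here are my assumptions, not the repo's exact code):

```python
# Minimal sketch: parsing a PDF into Markdown with LlamaParse.
# File name and env-var handling are assumptions, not the repo's exact code.
import os
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # tables come back as Markdown, which LLMs read well
)
documents = parser.load_data("terms_and_conditions.pdf")
markdown_text = "\n".join(doc.text for doc in documents)
```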

The answers to the above queries list, just by switching from LangChain’s recursive character splitter to LlamaIndex’s parser, are below:

[image] [image]

At the moment, using the LlamaParse API is free, but if it becomes paid in the future, going back to LangChain is always an option. I will explore this more, and also overall ways to improve the whole process ^^

Finally, a small thing: if we run rag.py now, we can keep asking the LLM questions until we enter q to exit. Not a big thing, but it feels more like chatting with a customer support bot.
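A sketch of that loop (ask_rag is a hypothetical stand-in for the repo's actual query function):

```python
# Sketch: simple chat loop that exits when the user enters 'q'.
# ask_rag is a hypothetical stand-in for the repo's query function.
while True:
    query = input("Ask a question (or 'q' to quit): ").strip()
    if query.lower() == "q":
        break
    print(ask_rag(query))
```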

That is all for today!

See you tomorrow :)