(Day 174) Starting LLM zoomcamp + Learning about Data Warehouses + BigQuery

Ivan Ivanov · June 23, 2024

Hello :) Today is Day 174!

A quick summary of today:

  • did Module 1 and its homework from DataTalksClub’s LLM camp
  • covered Module 2, using free models and creating a simple FAQ bot
  • covered Module 3 of the data engineering camp, on data warehouses

LLM camp module 1: basic RAG app

Maybe a big thing today is that I finally gave in and loaded $10 into my OpenAI account. I am officially in the belly of the beast haha. I finally decided to use the almighty GPT API. That being said, I will use the cheapest model, gpt-3.5-turbo, and be conservative with my prompts whenever I can.
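
For reference, a minimal call with the openai Python client looks something like this (the question is just an illustration; the client reads the API key from the environment):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Is it too late to join the course?"}],
)
print(response.choices[0].message.content)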

As for the LLM zoomcamp material from Module 1, which serves as an intro to RAG ~

[image: the instructor’s RAG flow diagram]

This is a nice, simple graphic made by the instructor, and it explains the basics pretty well. I have already had plenty of experience with RAG apps, but I still wanted to give this camp a go as I know I will learn something new. And even today I did - mainly the GPT API and the Elasticsearch library for a search database. Here is a bit of the material from today. Everything is on this repo.

The built RAG app is based on DataTalksClub’s Q&A documents, where they have gathered tons and tons of questions so that students can search and ask the doc itself, rather than the same questions appearing in Slack all the time.

First, load the documents (an example entry is shown below)

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
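
A minimal sketch of the loading step, assuming the course repo's documents.json layout (a list of courses, each with its own list of documents; the URL and field names follow the example above, so treat them as approximate):

import requests

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
documents_raw = requests.get(docs_url).json()

# flatten into one list, tagging each document with its course name
documents = []
for course in documents_raw:
    for doc in course['documents']:
        doc['course'] = course['course']
        documents.append(doc)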

Next, create a search index

We can run Elasticsearch using Docker.

And to create an index, we need the following format:

[image: index settings and mappings]
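
A sketch of what that looks like with the elasticsearch Python client (the field names come from the document example above; the index name and shard settings are illustrative):

from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200')

index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}  # keyword type so we can filter on it exactly
        }
    }
}

index_name = "course-questions"
es_client.indices.create(index=index_name, body=index_settings)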

Then we add documents to the index (the actual doc count is ~985, so a simple for loop is fine in this case):

from tqdm.auto import tqdm  # progress bar for the indexing loop

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

Next, querying Elasticsearch

We need the following query format:

[image: Elasticsearch search query format]
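
Roughly, the query combines a multi_match over the text fields with a filter on the course (the field boosts and result size are illustrative; es_client and index_name come from the indexing step above):

query = 'I just discovered the course. Can I still join it?'

search_query = {
    "size": 5,  # return the top 5 hits to use as context
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^3", "text", "section"],  # boost matches on the question field
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {"course": "data-engineering-zoomcamp"}  # only search one course's FAQ
            }
        }
    }
}

response = es_client.search(index=index_name, body=search_query)
results = [hit['_source'] for hit in response['hits']['hits']]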

Finally, sending the context + query to GPT

[image: building the prompt and sending it to GPT]
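
Something like the following, gluing the search hits into a prompt (the prompt wording is my paraphrase; client is the OpenAI client from earlier and results come from the search above):

# stitch the retrieved documents into a single context string
context = "\n\n".join(
    f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
    for doc in results
)

prompt = (
    "You're a course teaching assistant. Answer the QUESTION based only on the CONTEXT.\n\n"
    f"QUESTION: {query}\n\nCONTEXT:\n{context}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content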

Query: ‘I just discovered the course. Can I still join it?’ (Answer should be yes)

Returned result: ‘Based on the information provided, yes, you can still join the course even after the start date. Registration is not necessary, and you are still eligible to submit homework. Just be aware of deadlines for final projects and work at your own pace.’ Nothing too new for me, but it was nice to discover the Elasticsearch library and to start using the infamous GPT API.

As for the homework, it was quite similar to the above. Here is my submission.

Also, all of this made me think about what to do when I finish this 365-day learning journey. I should create an app like this, one that lets people chat with my posts and learn about my journey more easily.

LLM camp module 2: open-source LLMs

Here the instructor introduced us to free alternatives to OpenAI, like Hugging Face and Ollama.

I learned how to set up Ollama locally using Docker.

First, a Docker Compose file was needed to run both Elasticsearch and Ollama

[image: docker-compose file running Elasticsearch and Ollama]

Then we used phi3, and because the model needs to be downloaded locally, I went into the Ollama container and pulled it

[image: pulling phi3 inside the Ollama container]

And then I just modified the final notebook from Module 1 so that it uses Ollama, and the output is surprisingly good.
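
The nice part is how small the change is: Ollama exposes an OpenAI-compatible endpoint, so only the client construction and model name need to differ (the port is Ollama's default; the api_key is a dummy value that Ollama ignores):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="phi3",
    messages=[{"role": "user", "content": prompt}],  # same prompt as in Module 1
)
print(response.choices[0].message.content)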

Query: ‘I just discovered the course. Can I still join it?’ Result: ‘ Yes, you can still join the course even after the start date. We will keep all materials available for you to follow at your own pace after it finishes. You are also eligible to submit homeworks and prepare for future cohorts if needed.\n\nHowever, please note that there will be deadlines for submitting final projects. It is recommended not to leave everything for the last minute.’

Here is the full code.

The final bit of content in Module 2 is creating a basic Streamlit FAQ bot using Ollama and Elasticsearch.

[image: the Streamlit FAQ bot]
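
The app itself is only a few lines of Streamlit; roughly something like this, where rag() is a hypothetical helper wrapping the Elasticsearch search plus the Ollama call from above:

import streamlit as st

st.title("Course FAQ bot")

query = st.text_input("Ask a question about the course")

if st.button("Ask"):
    with st.spinner("Thinking..."):
        answer = rag(query)  # hypothetical helper: retrieve context, then call the LLM
    st.write(answer)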

Very nice so far. Can’t wait for the next topics which will come in the next 2 weeks~

Data engineering camp Module 3: data warehouses

Firstly, I learned about OLAP vs OLTP

[images: OLAP vs OLTP comparison]

What is a data warehouse?

It is an OLAP solution used for reporting and data analysis. A warehouse might have many data sources that all feed into a staging area, which is then written to the data warehouse itself. A warehouse can consist of metadata, raw data, and summary data. From there, the data can be transformed into various data marts, and users access the data through those.

[image: data warehouse architecture]

What is BigQuery?

  • a serverless data warehouse
    • there are no servers to manage or database software to install
  • it includes scalable and highly available software and infrastructure
  • there are built-in features like ML, geospatial analysis, BI
  • it maximises flexibility by separating the compute engine that analyses our data from our storage

Then a bit about partitioning in BigQuery

[image: partitioning in BigQuery]

And clustering

[images: clustering in BigQuery]
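
For illustration, creating a partitioned and clustered table with the BigQuery Python client might look like this (the dataset, table, and column names are just examples in the spirit of the zoomcamp's taxi data):

from google.cloud import bigquery

client = bigquery.Client()

# example only: partition by pickup date, cluster by vendor to cheapen filtered scans
ddl = """
CREATE OR REPLACE TABLE my_dataset.yellow_tripdata_partitioned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM my_dataset.yellow_tripdata_raw;
"""

client.query(ddl).result()  # wait for the job to finish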

Choose clustering over partitioning when:

  • partitioning would result in a small amount of data per partition (approx. < 1 GB)
  • partitioning would result in a large number of partitions, beyond the limits on partitioned tables
  • partitioning would result in your mutation operations modifying the majority of partitions frequently

Automatic Reclustering

As data is added to a clustered table, the newly inserted data can be written to blocks that contain key ranges that overlap with the key ranges in previously written blocks. These overlapping keys weaken the sort property of the table. To maintain the performance characteristics of a clustered table, BigQuery performs automatic reclustering in the background to restore the sort property of the table.

For partitioned tables, clustering is maintained for data within the scope of each partition.

Best Practices in BigQuery

Cost Reduction

  • Avoid SELECT *
  • Price your queries before running them (see the dry-run sketch after this list)
  • Use clustered or partitioned tables
  • Use streaming inserts with caution
  • Materialize query results in stages
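
On pricing queries before running them: BigQuery supports dry runs, which report the bytes a query would scan without executing it. A minimal sketch with the Python client (the table name is hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# dry_run estimates the scan; disabling the cache keeps the estimate honest
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT tpep_pickup_datetime, VendorID FROM my_dataset.yellow_tripdata_raw",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed:,} bytes")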

Query Performance

  • Filter on partitioned columns
  • Denormalize data
  • Use nested or repeated columns
  • Use external data sources appropriately
  • Reduce data before using a JOIN
  • Do not treat WITH clauses as prepared statements
  • Avoid oversharding tables
  • Avoid JavaScript user-defined functions
  • Use approximate aggregation functions
  • Order last: keep ORDER BY in the outermost query only
  • Optimize join patterns: as a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size

That is all for today!

See you tomorrow :)