(Day 302) Data engineering pipeline from the LLM Engineer's Handbook

Ivan Ivanov · October 29, 2024

Hello :) Today is Day 302!

A quick summary of today:

  • covered ch. 2 and 3 of the book
  • streamed
  • small update on the company reviewer project

LLM Engineer’s Handbook Chapter 2 - Tooling and Installation

As the name of the chapter suggests, this chapter covers how to set up the code locally.

After some environment setup, I ran a ZenML pipeline using Poetry, and in the UI we can see:

image

A run looks like:

image
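For anyone following along without the book’s repo, a minimal ZenML pipeline (a toy example of my own, not the book’s code) looks roughly like this:

from zenml import pipeline, step


@step
def say_hello() -> str:
    return "hello"


@pipeline
def hello_pipeline() -> None:
    say_hello()


if __name__ == "__main__":
    # Calling the decorated pipeline function triggers a run that shows up in the ZenML dashboard.
    hello_pipeline()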

MLOps and LLMOps tools:

  • HuggingFace is used as a model registry
  • ZenML is used as an orchestrator and to save artifacts and metadata
  • Comet ML is used for experiment tracking
  • Opik is used for prompt monitoring

Databases

  • MongoDB as a NoSQL DB
  • Qdrant as a vector DB

AWS

Optionally, I can use AWS SageMaker for training and inference

Chapter 3 - Data Engineering

Designing the LLM Twin’s data collection pipeline

image

  • input: a list of links and their associated user (the author)
  • output: a list of raw documents stored in the NoSQL data warehouse

The ETL pipeline will detect the domain of each link and, based on that, call a specialized crawler (a minimal dispatcher sketch follows the list below).

image

  • Medium crawler: used for data from Medium. It outputs an article document. It logs in to Medium and crawls the HTML of the article, then extracts, cleans, and normalises the text from the HTML, and loads the standardised text of the article into MongoDB (our NoSQL data warehouse)
  • custom article crawler: similar to the Medium crawler, but more generic. It doesn’t do any logins; it just gathers all the HTML from a particular link
  • GitHub crawler: collects data from GitHub. It outputs a repository document. It clones the repository, parses the repository file tree, cleans and normalises the files, and loads them into the DB
  • LinkedIn crawler: collects data from LinkedIn. It outputs multiple post documents. It logs in to LinkedIn, navigates to the user’s feed, and crawls all the user’s latest posts. For each post, it extracts its HTML, cleans and normalises it, and loads it into the DB
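To make the dispatching idea concrete, here is a minimal sketch of a domain-based dispatcher (my own illustration with placeholder crawler classes, not the book’s CrawlerDispatcher):

from urllib.parse import urlparse


class MediumCrawler:
    def extract(self, link: str) -> None:
        print(f"Crawling Medium article: {link}")


class GithubCrawler:
    def extract(self, link: str) -> None:
        print(f"Cloning GitHub repository: {link}")


class CustomArticleCrawler:
    def extract(self, link: str) -> None:
        print(f"Fetching generic article HTML: {link}")


class Dispatcher:
    # Maps a domain to the crawler that knows how to handle it; anything
    # unknown falls back to the generic article crawler.
    _registry = {
        "medium.com": MediumCrawler,
        "github.com": GithubCrawler,
    }

    def get_crawler(self, link: str):
        domain = urlparse(link).netloc.removeprefix("www.")
        crawler_cls = self._registry.get(domain, CustomArticleCrawler)
        return crawler_cls()


if __name__ == "__main__":
    link = "https://github.com/some-user/some-repo"
    Dispatcher().get_crawler(link).extract(link)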

How is the ETL process connected to the feature pipeline?

The feature pipeline ingests the raw data from the MongoDB data warehouse, cleans it further, processes it into features, and stores it in the Qdrant vector DB to make it accessible for the LLM training and inference pipelines. The ETL process is independent of the feature pipeline. The two pipelines communicate with each other strictly through the MongoDB data warehouse. Thus, the data collection pipeline can write data to MongoDB, and the feature pipeline can read from it independently and on different schedules.
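As a toy illustration of that decoupling (my own sketch; the database, collection, and field names are assumptions, not the book’s), the feature pipeline only ever reads what the ETL pipeline has already written to MongoDB:

from pymongo import MongoClient

# Read the raw documents that the ETL pipeline wrote earlier, on its own schedule.
client = MongoClient("mongodb://localhost:27017")
raw_documents = client["llm_twin"]["articles"].find({})

for document in raw_documents:
    cleaned_text = document["content"].strip()  # stand-in for the real cleaning/feature logic
    # ...chunk and embed cleaned_text here, then upsert the vectors into Qdrant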

Why is MongoDB used as a data warehouse?

Using a transactional database, such as MongoDB, as a data warehouse is uncommon. However, in this use case, we are working with small amounts of data, which MongoDB can handle. Even if we plan to compute statistics on top of our MongoDB collections, it will work fine at the scale of our LLM Twin’s data (hundreds of documents). Since we mainly work with unstructured text, selecting a NoSQL database that doesn’t enforce a schema makes development easier and faster. However, when working with big data (millions of documents or more), a dedicated data warehouse such as Snowflake or BigQuery would be ideal.

Implementing the LLM Twin’s data collection pipeline

Below is the digital_data_etl pipeline:

from zenml import pipeline
from steps.etl import crawl_links, get_or_create_user

@pipeline
def digital_data_etl(user_full_name: str, links: list[str]) -> str:
    user = get_or_create_user(user_full_name)
    last_step = crawl_links(user=user, links=links)

    return last_step.invocation_id
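
For context, invoking this pipeline directly could look roughly like the following (the author name and link are placeholders for illustration, not the book’s run configuration):

if __name__ == "__main__":
    digital_data_etl(
        user_full_name="Jane Doe",  # placeholder author
        links=["https://medium.com/@janedoe/some-article"],  # placeholder link
    )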

Diving deeper ~ we will explore each of the two steps in the above pipeline

First, the get_or_create_user step:

from loguru import logger
from typing_extensions import Annotated
from zenml import get_step_context, step

from llm_engineering.application import utils
from llm_engineering.domain.documents import UserDocument

@step
def get_or_create_user(user_full_name: str) -> Annotated[UserDocument, "user"]:
    logger.info(f"Getting or creating user: {user_full_name}")

    first_name, last_name = utils.split_user_full_name(user_full_name)

    user = UserDocument.get_or_create(first_name=first_name, last_name=last_name)

    step_context = get_step_context()
    step_context.add_output_metadata(output_name="user", metadata=_get_metadata(user_full_name, user))

    return user


def _get_metadata(user_full_name: str, user: UserDocument) -> dict:
    return {
        "query": {
            "user_full_name": user_full_name,
        },
        "retrieved": {
            "user_id": str(user.id),
            "first_name": user.first_name,
            "last_name": user.last_name,
        },
    }
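
The only helper not shown here is utils.split_user_full_name; a minimal version might look like this (my sketch, not the book’s implementation):

def split_user_full_name(user_full_name: str) -> tuple[str, str]:
    # Split e.g. "Jane A. Doe" into ("Jane A.", "Doe").
    tokens = user_full_name.strip().split()
    if not tokens:
        raise ValueError("Cannot split an empty user full name.")
    if len(tokens) == 1:
        return tokens[0], tokens[0]
    return " ".join(tokens[:-1]), tokens[-1]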

Next is the crawl_links step:

from urllib.parse import urlparse

from loguru import logger
from tqdm import tqdm
from typing_extensions import Annotated
from zenml import get_step_context, step

from llm_engineering.application.crawlers.dispatcher import CrawlerDispatcher
from llm_engineering.domain.documents import UserDocument

@step
def crawl_links(user: UserDocument, links: list[str]) -> Annotated[list[str], "crawled_links"]:
    dispatcher = CrawlerDispatcher.build().register_linkedin().register_medium().register_github()

    logger.info(f"Starting to crawl {len(links)} link(s).")

    metadata = {}
    successful_crawls = 0
    for link in tqdm(links):
        successful_crawl, crawled_domain = _crawl_link(dispatcher, link, user)
        successful_crawls += successful_crawl

        metadata = _add_to_metadata(metadata, crawled_domain, successful_crawl)

    step_context = get_step_context()
    step_context.add_output_metadata(output_name="crawled_links", metadata=metadata)

    logger.info(f"Successfully crawled {successful_crawls} / {len(links)} links.")

    return links


def _crawl_link(dispatcher: CrawlerDispatcher, link: str, user: UserDocument) -> tuple[bool, str]:
    crawler = dispatcher.get_crawler(link)
    crawler_domain = urlparse(link).netloc

    try:
        crawler.extract(link=link, user=user)

        return (True, crawler_domain)
    except Exception as e:
        logger.error(f"An error occurred while crawling: {e!s}")

        return (False, crawler_domain)


def _add_to_metadata(metadata: dict, domain: str, successful_crawl: bool) -> dict:
    if domain not in metadata:
        metadata[domain] = {}
    metadata[domain]["successful"] = metadata.get(domain, {}).get("successful", 0) + successful_crawl
    metadata[domain]["total"] = metadata.get(domain, {}).get("total", 0) + 1

    return metadata
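
To make the metadata shape concrete: after crawling, say, two Medium links (one of them failing) and one GitHub link, _add_to_metadata would leave something like:

{
    "medium.com": {"successful": 1, "total": 2},
    "github.com": {"successful": 1, "total": 1},
}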

I will go over the code on stream (link below) because there is a lot of code related to all the crawlers and interacting with MongoDB.

After running the data ETL pipeline, we can see the data in MongoDB. VS Code has a nice MongoDB extension that lets you view the database inside VS Code:

image

And I can see I have two users (because I ran the pipeline for the two authors’ data).

All of the book’s code can be found in the book’s GitHub repo.

Company reviewer

Today, I gave a brief report on my progress on the company reviewer LLM fine-tuning, as he has invited me to meet with the client company tomorrow. Below is a summary of the email I sent him.

As for OpenAI’s gpt4o-mini fine-tuning

Model                 Text RougeLSum    Ratings RMSE
gpt4o-mini            0.8768            0.5924
gpt4o-mini-fewshot    0.8499            0.6797

I suspect the slight decrease in performance with the few-shot variant is due to the larger prompt, and it might improve with different prompts. I might need some extra credits because these fine-tunings and evaluations are draining my account haha. I have seen online that gpt4o-mini’s performance is very close to, if not the same as, gpt4o’s, but it might also be worth exploring gpt4o if you think that is something of interest.
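
For reference, metrics like these can be computed roughly as follows (a sketch assuming the Hugging Face evaluate package and scikit-learn; the values below are made up, not my actual results):

import evaluate
from sklearn.metrics import mean_squared_error

# Text quality: ROUGE-Lsum between generated and reference review texts.
rouge = evaluate.load("rouge")
text_scores = rouge.compute(
    predictions=["the generated review text"],
    references=["the reference review text"],
)
print(text_scores["rougeLsum"])  # the "Text RougeLSum" column

# Rating quality: RMSE between predicted and reference numeric ratings.
predicted_ratings = [4.0, 3.5]
reference_ratings = [5.0, 3.0]
rmse = mean_squared_error(reference_ratings, predicted_ratings) ** 0.5  # the "Ratings RMSE" column
print(rmse)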

As for open-source model LoRA fine-tuning

I have been facing some issues related to the LoRA fine-tuning library I was using (unsloth). I reported the problem to the library developers and it got fixed, but it broke again and I reported it again just this morning. When there were no errors, I got to fine-tune two open-source models, Llama3.2 and Llama3.1, but the responses were not good: even just doing manual reviews, I noticed the text did not make sense and was not related at all to our data. This is something that I think requires a more powerful GPU, or purchasing Colab GPU hours, as the free T4 GPUs are not suitable for the bigger data (bigger compared to the movie review data I was using). Since a single case is bigger, I need to increase the input context of open-source models like Llama and Qwen, but this increase in input context requires more compute power.
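
For completeness, here is a generic LoRA configuration with Hugging Face peft (a stand-in sketch, not the unsloth code I was actually running; the model name and hyperparameters are placeholders):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; the real runs used Llama3.1/Llama3.2 via unsloth.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # rank of the LoRA update matrices
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable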

As for prompt tuning

I found this notebook that is an implementation and example of prompt tuning. However, whenever I run it on my laptop, on Colab’s free CPU/GPU, or on Kaggle’s free GPU, I run out of memory, which I guess means I need a more powerful machine to do it. Or maybe this is just not a good implementation of PEFT prompt tuning ~
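
For reference, a minimal PEFT prompt-tuning setup looks like this (my own sketch based on the peft library, not necessarily what that notebook does; the model name and init text are placeholders):

from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Write a review of the following company:",  # placeholder init text
    num_virtual_tokens=16,     # only these soft-prompt embeddings are trained
    tokenizer_name_or_path=model_name,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()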

Stream

I streamed a bit late today as I had some university engagements. I covered some of the code from Chapter 3 of the LLM Engineer’s Handbook.


That is all for today!

See you tomorrow :)