(Day 301) LLM Engineer's Handbook

Ivan Ivanov · October 28, 2024

Hello :) Today is Day 301!

A quick summary of today:

  • streamed and talked about translating pandas PD model code to pyspark
  • started reading a new promising book - LLM Engineer’s Handbook

Stream - translating a pd model from pandas to pyspark

Here is a colab notebook with the code from today’s work and stream.

Overall, it was a nice pyspark exercise, using pyspark's ML library, and also going from model coefficients to a scorecard to assess a customer's credit score (i.e. FICO). There are a couple of extra steps that could be taken to improve the above notebook - adding a test dataset, and using the model to assess the whole customer base's credit scores and overall PDs. The other models for calculating Expected Loss (EAD and LGD) I guess I will leave for the next few days.
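
For context, here is a minimal sketch of the kind of workflow from the stream - fitting a PD model with pyspark's ML library and turning predicted probabilities into a credit score via standard scorecard scaling. The file name, feature columns, and scaling constants below are placeholders, not the notebook's exact values:

```python
import math

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.appName("pd-scorecard").getOrCreate()

# Hypothetical loan-level dataset with a binary default flag (1 = default)
df = spark.read.parquet("loans.parquet")
features = ["loan_amnt", "annual_inc", "dti"]  # placeholder feature names

assembler = VectorAssembler(inputCols=features, outputCol="features")
train = assembler.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="default", maxIter=50)
model = lr.fit(train)
print(model.coefficients, model.intercept)  # the coefficients behind the scorecard

# Standard scorecard scaling: base_score corresponds to base_odds (good:bad),
# and every pdo points the odds of being "good" double
pdo, base_score, base_odds = 20, 600, 50
factor = pdo / math.log(2)
offset = base_score - factor * math.log(base_odds)

scored = (
    model.transform(train)
    .withColumn("prob_default", vector_to_array("probability")[1])
    .withColumn(
        "credit_score",
        offset + factor * F.log((1 - F.col("prob_default")) / F.col("prob_default")),
    )
)
scored.select("prob_default", "credit_score").show(5)
```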

RAG on the scikit-learn docs

I finally got time to check out probabl's stream from last Friday that shows the functionality of a custom-built RAG tool for the scikit-learn docs. I had already checked the paper and the repo some time ago, but Vincent (the instructor) showed the actual app and how anyone can clone the open-source repo and run it locally - so that was nice.

LLM Engineer’s Handbook

I saw the author of this book advertising it on LinkedIn and it was getting lots of traction from other well-known LLM/ML/data people, so I had it bookmarked. I was refreshing the pages in my chrome tabs just now and it turns out that all of the chapters are out (last time I checked, last week, only the first 2-3 had come out). So ~ I'm excited to read it, as it covers LLMOps.

Chapter 1 - Understanding the LLM Twin Concept and Architecture

Understanding the LLM Twin concept

In a nutshell, an LLM Twin is an AI version of me that has learned my writing style, voice, and personality.

To teach an AI to act like me, I could use:

  • LinkedIn posts and X threads: specialize the LLM in writing social media content
  • messages with friends and family: adapt the LLM to an unfiltered version of myself
  • academic papers and articles: calibrate the LLM in writing formal and educative content
  • code: specialize the LLM in implementing code as I would

All these can be reduced to one core strategy: collecting my digital data (or some parts of it) and feeding it to an LLM using different algorithms. Ultimately, the LLM reflects the voice and style of the collected data.

Unfortunately, this raises many technical and moral issues. First, on the technical side, how can we access this data? Do we have enough digital data to project ourselves into an LLM? What kind of data would be valuable? Secondly, on the moral side, is it OK to do this in the first place? Do I want to create a copycat of myself? Will it write using my voice and personality, or just try to replicate it?

The book's answer on the moral side: building an LLM Twin is fine because the LLM will be fine-tuned only on my personal digital data, and it won't use other people's data.

Planning the MVP of the LLM Twin product

An MVP is a version of a product that includes just enough features to draw in early users and test the viability of the product concept in the initial stages of development.

An MVP is a powerful strategy because of the following reasons:

  • accelerated time-to-market: launch a product quickly to gain early traction
  • idea validation: test it with real users before investing in the full development of the product
  • market research: gain insights into what resonates with the target audience
  • risk minimization: reduces the time and resources needed for a product that might not achieve market success

Features for the LLM twin MVP:

  • collect data from LinkedIn, Medium, Substack, and GitHub profiles
  • fine-tune an open-source LLM using the collected data
  • populate a vector database (DB) using our digital data for RAG
  • create LinkedIn posts leveraging the following:
    • user prompts
    • RAG to reuse and reference old content
    • new posts, articles, or papers as additional knowledge to the LLM
  • have a simple web interface to interact with the LLM Twin and be able to do the following:
    • configure our social media links and trigger the collection step
    • send prompts or links to external resources

Building ML systems with feature/training/inference (FTI) pipelines

The problem with building ML systems

Common elements/problems to consider as an MLOps engineer:

[image: the common components/problems of an ML system]

The critical question is this: How do we connect all these components into a single homogeneous system? There are existing solutions.

The issue with previous solutions

The main issue with monolithic batch architectures in machine learning (ML) applications is that while they solve the training-serving skew by using the same code for feature creation in training and inference, they have limitations when scaling. These architectures lack feature reusability, require code refactoring for large data, and complicate sharing and modularizing tasks across teams. Transitioning to real-time processing is also difficult, as client applications must pass all state data, leading to inefficiencies and errors.

Real-time architectures further complicate predictions, as clients need full access to user state data. For example, recommending movies for a user requires transferring not just an ID but a complete user profile. This setup is complex, inefficient, and unsuitable for client applications. Although platforms like GCP offer a scalable production-ready architecture, it is complex, making it challenging to scale gradually or understand without advanced experience in ML production.

The solution – ML pipelines for ML systems

The FTI pipeline solution simplifies ML systems by dividing them into three core pipelines: feature, training, and inference. This approach mirrors the structure of traditional software layers (database, business logic, UI), creating a straightforward framework for building ML systems. Each pipeline—feature, training, and inference—serves a specific role and can be independently managed, scaled, or developed using different technologies, making the design flexible.

[image: the feature/training/inference (FTI) pipeline architecture]

  1. Feature pipeline: processes raw data into features and stores them in a feature store, ensuring versioning and consistency across training and inference
  2. Training pipeline: uses stored features to train models, which are then stored in a model registry with metadata, tracking versions and the data used for training
  3. Inference pipeline: combines features from the feature store and models from the registry to generate predictions, supporting both batch and real-time modes. Versioning across features, labels, and models allows for controlled updates or rollbacks

This modular design avoids issues like training-serving skew and supports easy adjustments to models and features.
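
To make the split concrete for myself, here is a toy Python sketch of the three interfaces. The "feature store" and "model registry" are just dicts standing in for real services, and the "model" is deliberately trivial - none of this is from the book:

```python
# Toy sketch of the FTI split: each pipeline only talks to the feature store
# and model registry, never directly to the other pipelines.
feature_store: dict[str, list[dict]] = {}
model_registry: dict[str, dict] = {}

def feature_pipeline(raw_rows: list[dict]) -> None:
    """Turn raw data into versioned features and persist them."""
    features = [{"x": r["value"] * 2, "label": r["label"]} for r in raw_rows]
    feature_store["features_v1"] = features

def training_pipeline() -> None:
    """Read features, 'train' a model, register it with metadata."""
    rows = feature_store["features_v1"]
    # trivial stand-in for training: remember the mean of x per label
    means: dict[str, list[float]] = {}
    for r in rows:
        means.setdefault(r["label"], []).append(r["x"])
    model = {label: sum(xs) / len(xs) for label, xs in means.items()}
    model_registry["model_v1"] = {"model": model, "trained_on": "features_v1"}

def inference_pipeline(x: float) -> str:
    """Combine the registered model with fresh features to predict."""
    model = model_registry["model_v1"]["model"]
    return min(model, key=lambda label: abs(model[label] - x * 2))

feature_pipeline([{"value": 1.0, "label": "low"}, {"value": 10.0, "label": "high"}])
training_pipeline()
print(inference_pipeline(9.0))  # -> "high"
```

The point is just the interface: swap the dicts for a real feature store and model registry, and each function can become its own independently deployed and scaled service.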

Benefits of the FTI architecture

  • as there are just three components, it is intuitive to use and easy to understand
  • each component can be written in its own tech stack, so we can quickly adapt it to specific needs, such as big or streaming data. It also allows us to pick the best tools for the job
  • as there is a transparent interface between the three components, each one can be developed by a different team (if necessary), making the development more manageable and scalable
  • every component can be deployed, scaled, and monitored independently

In most cases, the system will include more than just 3 pipelines. For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can be composed of the training and evaluation components.

Designing the system architecture of the LLM Twin

Listing the technical details of the LLM Twin architecture

Data

  • Collect data from LinkedIn, Medium, Substack, and GitHub automatically and on a schedule
  • Standardize the crawled data and store it in a data warehouse
  • Clean the raw data
  • Create instruction datasets for fine-tuning an LLM
  • Chunk and embed the cleaned data, then store the vectorized data in a vector database for RAG
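
The chunk-and-embed step is the one I most wanted to picture, so here is a minimal sketch of it. The embedding model, chunking rule, and in-memory "vector DB" are my own placeholder choices, not the book's stack:

```python
# Minimal chunk -> embed -> store -> retrieve sketch for the RAG side.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, max_words: int = 100) -> list[str]:
    """Naive fixed-size word chunking; real pipelines use smarter splitting."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

documents = ["...cleaned article text...", "...cleaned LinkedIn post text..."]
chunks = [c for doc in documents for c in chunk(doc)]
vectors = embedder.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)

# Stand-in "vector database": a matrix plus the chunk texts
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)
    scores = vectors @ q.T  # cosine similarity, since vectors are normalized
    top = np.argsort(scores[:, 0])[::-1][:k]
    return [chunks[i] for i in top]

print(retrieve("what did I write about credit risk?"))
```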

Training Tasks

  • Fine-tune LLMs of various sizes (7B, 14B, 30B, or 70B parameters)
  • Fine-tune on instruction datasets of multiple sizes
  • Switch between LLM types (e.g., Mistral, Llama, GPT)
  • Track and compare experiments
  • Test potential production LLM candidates before deployment
  • Automatically start training when new instruction datasets are available
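
For my own reference, here is a rough sketch of what a single fine-tuning run on such an instruction dataset could look like with Hugging Face transformers. The model name, prompt format, file name, and hyperparameters are all illustrative assumptions, not the book's recipe:

```python
# Hedged sketch of one fine-tuning run on an instruction dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # swap for Llama, a smaller model, etc.
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical instruction dataset produced upstream by the feature pipeline
dataset = load_dataset("json", data_files="instructions.jsonl")["train"]

def format_and_tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="llm-twin-checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
    report_to="none",  # point this at an experiment tracker to compare runs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```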

Inference Code Properties

  • Provide a REST API for client interactions with the LLM Twin
  • Enable real-time access to the vector database for RAG
  • Support inference with LLMs of various sizes
  • Autoscale based on user requests
  • Automatically deploy LLMs that pass evaluation
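
As a mental model of the REST API part, here is a tiny FastAPI sketch. The retrieve and generate helpers are placeholders for the real vector DB lookup and fine-tuned LLM call:

```python
# Hedged sketch of the inference pipeline's REST interface.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="LLM Twin inference API")

class Query(BaseModel):
    prompt: str
    top_k: int = 3

def retrieve(prompt: str, k: int) -> list[str]:
    return ["<chunk retrieved from the vector DB>"] * k  # placeholder

def generate(prompt: str, context: list[str]) -> str:
    return f"Answer to: {prompt} (using {len(context)} context chunks)"  # placeholder

@app.post("/generate")
def generate_endpoint(query: Query) -> dict:
    context = retrieve(query.prompt, query.top_k)
    answer = generate(query.prompt, context)
    return {"answer": answer, "context": context}

# Assuming this file is app.py, run with: uvicorn app:app --reload
```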

LLMOps Features

  • Instruction dataset versioning, lineage, and reusability
  • Model versioning, lineage, and reusability
  • Experiment tracking
  • Continuous training, continuous integration, and continuous delivery (CT/CI/CD)
  • Prompt and system monitoring

How to design the LLM Twin architecture using the FTI pipeline design

[image: the LLM Twin architecture built from the FTI pipelines]

The architecture involves four pipelines:

  1. data collection pipeline: collects text data (articles from Medium/Substack, posts from LinkedIn, and code from GitHub) via ETL (Extract, Transform, Load) and stores it in a NoSQL database that serves as the data warehouse. Data is categorized by type (article, post, code) for modularity (see the sketch after this list)
  2. feature pipeline: processes collected data by cleaning, chunking, and embedding, storing processed data in a vector database (logical feature store) for model fine-tuning and retrieval-augmented generation (RAG)
  3. training pipeline: fine-tunes the LLM using artifacts from the feature store. Experiments determine optimal hyperparameters, with the final model registered for production and tested rigorously before deployment
  4. inference pipeline: uses the fine-tuned LLM and vector DB for RAG-based query responses through a REST API. It includes prompt monitoring for quality control and automated alerts based on set criteria.
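
To picture stage 1, here is a hedged sketch of a crawl -> normalize -> store step against a NoSQL warehouse. The URL, database and collection names, and document schema are my own assumptions, not the book's implementation:

```python
# Sketch of the data collection ETL: crawl a page, normalize it, and store it
# by type in a NoSQL data warehouse (MongoDB here).
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
warehouse = client["llm_twin"]

def crawl_article(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "type": "article",  # article / post / code
        "url": url,
        "title": soup.title.string if soup.title else "",
        "text": soup.get_text(separator="\n", strip=True),
        "crawled_at": datetime.now(timezone.utc),
    }

doc = crawl_article("https://example.com/my-article")
warehouse["articles"].insert_one(doc)  # one collection per data category
```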

Wow, this book seems extremely promising. The authors say that, for learning purposes, they share their data, and that is nice because I do not have much public data on social media haha. In my case, I would maybe have just used my personal chats, and also this blog, which is pretty substantial in size to be fair.


That is all for today!

See you tomorrow :)