(Day 304) LLMs + some dbt

Ivan Ivanov · October 31, 2024

Hello :) Today is Day 304!

A quick summary of today:

  • reading more of the LLM Engineer’s handbook
  • training a 16k llama 3.2 llm on a6000
  • doing dbt fundamentals course

Re-reading the explanations about Implementing the LLM Twin’s RAG feature pipeline

This is the 2nd part of the 3rd chapter of the LLM Engineer’s handbook. It introduces some concepts I’m not so familiar with, so I wanted to re-read them before covering them on stream and checking my understanding.

These are the concepts they cover:

  • settings: a global settings class, built on Pydantic’s BaseSettings, that loads sensitive and non-sensitive variables from a .env file
  • ZenML pipeline and steps

ZenML pipeline and steps

The pipeline performs the five core phases of RAG: extracting raw docs, cleaning, chunking, embedding, and loading them to the logical feature store.

from zenml import pipeline
from llm_engineering.interfaces.orchestrator.steps import feature_engineering as fe_steps

@pipeline
def feature_engineering(author_full_names: list[str]) -> list[str]:
    raw_documents = fe_steps.query_data_warehouse(author_full_names)
    cleaned_documents = fe_steps.clean_documents(raw_documents)
    last_step_1 = fe_steps.load_to_vector_db(cleaned_documents)
    embedded_documents = fe_steps.chunk_and_embed(cleaned_documents)
    last_step_2 = fe_steps.load_to_vector_db(embedded_documents)

    return [last_step_1.invocation_id, last_step_2.invocation_id]

In ZenML it looks like:

image

And each ellipse provides metadata about the type of docs at that state and how to load them as artifacts.

The pipeline accepts author_full_names. This data is injected through a YAML file at runtime, so we don’t need to modify the code for different input values; we just adjust the YAML file and pass it.
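For reference, the wiring would look roughly like this (a sketch; the import path, file path, and YAML layout are my assumptions, not taken from the book):

# Sketch: invoking the pipeline with parameters supplied from a YAML config.
# configs/feature_engineering.yaml (assumed path) would contain something like:
#   parameters:
#     author_full_names: ["Ivan Ivanov"]
from pipelines import feature_engineering  # hypothetical import path

if __name__ == "__main__":
    feature_engineering.with_options(config_path="configs/feature_engineering.yaml")()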

Querying the data warehouse

# other imports
from zenml import get_step_context, step

@step
def query_data_warehouse(
    author_full_names: list[str],
) -> Annotated[list, "raw_documents"]:

    documents = []
    authors = []
    for author_full_name in author_full_names:
        logger.info(f"Querying data warehouse for user: {author_full_name}")
        first_name, last_name = utils.split_user_full_name(author_full_name)
        logger.info(f"First name: {first_name}, Last name: {last_name}")
        user = UserDocument.get_or_create(first_name=first_name, last_name=last_name)
        authors.append(user)
        results = fetch_all_data(user)
        user_documents = [doc for query_result in results.values() for doc in query_result]
        documents.extend(user_documents)
    step_context = get_step_context()
    step_context.add_output_metadata(output_name="raw_documents", metadata=_get_metadata(documents))

    return documents
  1. it tries to get or create a UserDocument instance using the first and last name; if a user doesn’t exist, an error is thrown
  2. it fetches all the raw data for the user from the data warehouse and extends the documents list to include these user docs
  3. it computes a descriptive metadata dictionary that is logged and tracked in ZenML
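The utils.split_user_full_name() helper used in the step above isn’t shown in this part; a minimal sketch of what it could do (my assumption about how multi-word names are handled):

def split_user_full_name(user_full_name: str) -> tuple[str, str]:
    # Split "First [Middle ...] Last" into a first and last name.
    name_tokens = user_full_name.split()
    if len(name_tokens) == 0:
        raise ValueError("Cannot split an empty full name.")
    if len(name_tokens) == 1:
        # Only one token: reuse it for both fields.
        return name_tokens[0], name_tokens[0]

    return " ".join(name_tokens[:-1]), name_tokens[-1]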

The fetch function leverages a thread pool that runs each query on a different thread. As we have multiple data categories, we have to make a different query for the articles, posts, and repositories, as they are stored in different collections. Each query calls the data warehouse, which is bounded by the network I/O and the data warehouse latency, not by the machine’s CPU. Thus, by moving each query to a different thread, we can parallelize them. Ultimately, instead of the total time being the sum of each query’s latency, the time to run this fetch function will be roughly the maximum across all the calls.

In Python, we want to parallelize work with processes only when the operations are CPU- or memory-bound, because those are the ones affected by the GIL (Global Interpreter Lock). Each process has its own GIL, so parallelizing compute-heavy logic, such as processing a batch of documents or images already loaded in memory, isn’t limited by Python’s GIL.

def fetch_all_data(user: UserDocument) -> dict[str, list[NoSQLBaseDocument]]:

    user_id = str(user.id)
    with ThreadPoolExecutor() as executor:
        future_to_query = {
            executor.submit(__fetch_articles, user_id): "articles",
            executor.submit(__fetch_posts, user_id): "posts",
            executor.submit(__fetch_repositories, user_id): "repositories",
        }
        results = {}
        for future in as_completed(future_to_query):
            query_name = future_to_query[future]
            try:
                results[query_name] = future.result()
            except Exception:
                logger.exception(f"'{query_name}' request failed.")
                results[query_name] = []

    return results

The _get_metadata() function takes the list of queried docs and counts the documents and authors relative to each data category:

def _get_metadata(documents: list[Document]) -> dict:
    metadata = {
        "num_documents": len(documents),
    }
    for document in documents:
        collection = document.get_collection_name()
        if collection not in metadata:
            metadata[collection] = {}
        if "authors" not in metadata[collection]:
            metadata[collection]["authors"] = list()
        metadata[collection]["num_documents"] = metadata[collection].get("num_documents", 0) + 1
        metadata[collection]["authors"].append(document.author_full_name)
    for value in metadata.values():
        if isinstance(value, dict) and "authors" in value:
            value["authors"] = list(set(value["authors"]))

    return metadata

This is the metadata that is exposed in the ZenML UI.

Cleaning the documents

Here, we iterate over the docs and use a CleaningDispatcher that knows what cleaning logic to apply based on the data category.

@step
def clean_documents(
    documents: Annotated[list, "raw_documents"],
) -> Annotated[list, "cleaned_documents"]:

    cleaned_documents = []
    for document in documents:
        cleaned_document = CleaningDispatcher.dispatch(document)
        cleaned_documents.append(cleaned_document)
    step_context = get_step_context()
    step_context.add_output_metadata(output_name="cleaned_documents", metadata=_get_metadata(cleaned_documents))

    return cleaned_documents

Chunk and embed the cleaned documents

Similar to the cleaning dispatcher, the chunking and embedding logic is delegated to a dispatcher that knows how to handle each data category.

@step
def chunk_and_embed(
    cleaned_documents: Annotated[list, "cleaned_documents"],
) -> Annotated[list, "embedded_documents"]:

    metadata = {"chunking": {}, "embedding": {}, "num_documents": len(cleaned_documents)}
    embedded_chunks = []
    for document in cleaned_documents:
        chunks = ChunkingDispatcher.dispatch(document)
        metadata["chunking"] = _add_chunks_metadata(chunks, metadata["chunking"])
        for batched_chunks in utils.misc.batch(chunks, 10):
            batched_embedded_chunks = EmbeddingDispatcher.dispatch(batched_chunks)
            embedded_chunks.extend(batched_embedded_chunks)
    metadata["embedding"] = _add_embeddings_metadata(embedded_chunks, metadata["embedding"])
    metadata["num_chunks"] = len(embedded_chunks)
    metadata["num_embedded_chunks"] = len(embedded_chunks)
    step_context = get_step_context()
    step_context.add_output_metadata(output_name="embedded_documents", metadata=metadata)

    return embedded_chunks
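The utils.misc.batch() helper used above isn’t listed in the chapter; something along these lines (my sketch) would do the job:

from typing import Generator

def batch(records: list, size: int = 10) -> Generator[list, None, None]:
    # Yield consecutive slices of `records`, each with at most `size` elements.
    for i in range(0, len(records), size):
        yield records[i : i + size]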

Loading the documents to the vector DB

As each article, post, or code repository sits in a different collection inside the vector DB, we have to group all the documents based on their data category. Then, we load each group in bulk in the Qdrant vector db:

@step
def load_to_vector_db(
    documents: Annotated[list, "documents"],
) -> Annotated[bool, "successful"]:

    logger.info(f"Loading {len(documents)} documents into the vector database.")
    grouped_documents = VectorBaseDocument.group_by_class(documents)
    for document_class, documents in grouped_documents.items():
        logger.info(f"Loading documents into {document_class.get_collection_name()}")
        for documents_batch in utils.misc.batch(documents, size=4):
            try:
                document_class.bulk_insert(documents_batch)
            except Exception:
                return False

    return True

Pydantic domain entities

To some extent, in implementing the LLM Twin, the domain-driven design (DDD) principles are followed. They state that domain entities are the core of an app.

Pydantic is used to model all domain entities as it is the go-to python package for writing data structures with out-of-the-box type validation.

The domain of the LLM Twin application is split into two dimensions:

  • the data category: post, article, and repository
  • the state of the data: cleaned, chunked, and embedded

A base class is created for each state of the document:

  • class CleanedDocument(VectorBaseDocument, ABC)
  • class Chunk(VectorBaseDocument, ABC)
  • class EmbeddedChunk(VectorBaseDocument, ABC)

All of them inherit from VectorBaseDocument, which is the book’s custom OVM implementation (explained in a bit). Each also inherits from ABC, which makes the class abstract.

Each base abstract class from above will have a subclass that adds the data category dimension. For instance, the CleanedDocument class will have the below subclasses:

  • class CleanedPostDocument(CleanedDocument)
  • class CleanedArticleDocument(CleanedDocument)
  • class CleanedRepositoryDocument(CleanedDocument)

The same structure is repeated for Chunk and EmbeddedChunk.

image

For example, when ingesting a raw document, the cleaning step will yield a CleanedArticleDocument instance, the chunking step will return a list of ArticleChunk objects, and the embedding operation will return EmbeddedArticleChunk instances that encapsulate the embedding and all the necessary metadata to ingest in the vector DB.

The authors say they chose this type of design because the list of states will rarely change, while they do want to extend the list of data categories. Therefore, structuring the classes after the state allows us to plug in another data category by inheriting these base abstract classes.

All the attributes of a cleaned document will be saved within the metadata of the vector DB. For example, the metadata of a cleaned article document will always contain the content, platform, author id, author full name, and link of the article.

Another fundamental aspect is the Config internal class, which defines the name of the collection within the vector DB, the data category of the entity, and whether to leverage the vector index when creating the collection:

class CleanedDocument(VectorBaseDocument, ABC):
    content: str
    platform: str
    author_id: UUID4
    author_full_name: str

class CleanedPostDocument(CleanedDocument):
    image: Optional[str] = None
    class Config:
        name = "cleaned_posts"
        category = DataCategory.POSTS
        use_vector_index = False

class CleanedArticleDocument(CleanedDocument):
    link: str
    class Config:
        name = "cleaned_articles"
        category = DataCategory.ARTICLES
        use_vector_index = False

class CleanedRepositoryDocument(CleanedDocument):
    name: str
    link: str
    class Config:
        name = "cleaned_repositories"
        category = DataCategory.REPOSITORIES
        use_vector_index = False

The last bit of this section is to look at the base abstract classes of the chunk and embedded chunk:

class Chunk(VectorBaseDocument, ABC):
    content: str
    platform: str
    document_id: UUID4
    author_id: UUID4
    author_full_name: str
    metadata: dict = Field(default_factory=dict)
# PostChunk, ArticleChunk, RepositoryChunk

class EmbeddedChunk(VectorBaseDocument, ABC):
    content: str
    embedding: list[float] | None
    platform: str
    document_id: UUID4
    author_id: UUID4
    author_full_name: str
    metadata: dict = Field(default_factory=dict)
# EmbeddedPostChunk, EmbeddedArticleChunk, EmbeddedRepositoryChunk

An enum that aggregates all our data categories in a single structure of constants is also defined (benefits of StrEnum: distinct enumeration members; restricted set of values; comparison and identity; iterability; access to members and values):

class DataCategory(StrEnum):
    POSTS = "posts"
    ARTICLES = "articles"
    REPOSITORIES = "repositories"

The last step to fully understand how the domain objects work is to zoom into the VectorBaseDocument OVM class.

OVM

This is inspired by ORM, but the authors call it OVM because it works with embeddings and vector DBs instead of structured data and SQL tables.

The code is similar to the ODM from chapter 3 (where we interacted with MongoDB).

The OVM base class is called VectorBaseDocument:

import uuid
from abc import ABC
from typing import Generic, Type, TypeVar
from uuid import UUID

import numpy as np
from pydantic import UUID4, BaseModel, Field
from qdrant_client.models import PointStruct, Record

from llm_engineering.infrastructure.db.qdrant import connection

T = TypeVar("T", bound="VectorBaseDocument")

class VectorBaseDocument(BaseModel, Generic[T], ABC):
    id: UUID4 = Field(default_factory=uuid.uuid4)

    @classmethod
    def from_record(cls: Type[T], point: Record) -> T:
        _id = UUID(point.id, version=4)
        payload = point.payload or {}
        attributes = {
            "id": _id,
            **payload,
        }
        if cls._has_class_attribute("embedding"):
            payload["embedding"] = point.vector or None
        return cls(**attributes)

    def to_point(self: T, **kwargs) -> PointStruct:
        exclude_unset = kwargs.pop("exclude_unset", False)
        by_alias = kwargs.pop("by_alias", True)
        payload = self.dict(exclude_unset=exclude_unset, by_alias=by_alias, **kwargs)
        _id = str(payload.pop("id"))
        vector = payload.pop("embedding", {})
        if vector and isinstance(vector, np.ndarray):
            vector = vector.tolist()

        return PointStruct(id=_id, vector=vector, payload=payload)

The VectorBaseDocument class inherits from Pydantic’s BaseModel and helps us structure a single record’s attributes from the vector DB. Every OVM entity is initialized by default with a UUID4 as its unique identifier. Thanks to generics (inheriting from Generic[T]), the signatures of all the subclasses of VectorBaseDocument adapt to the concrete class they are called on.

The create and read methods:

  • the from_record() method adapts a data point from Qdrant’s format to our internal structure based on pydantic
  • the to_point() method takes the attributes of the current instance and adapts them to Qdrant’s PointStruct() format

Ultimately, all operations made to Qdrant will be done through the connection instance, which is instantiated in the application’s infrastructure layer.
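The connection object isn’t covered in this part, but conceptually it is just a Qdrant client created once in the infrastructure layer; a rough sketch (host and port are assumptions on my side):

from qdrant_client import QdrantClient

# In the real project the host/port/API key would come from the global settings class.
connection = QdrantClient(host="localhost", port=6333)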

The dispatcher layer

A dispatcher inputs a document and applies dedicated handlers based on its data category (article, post, or repository). A handler can either clean, chunk, or embed a document.

Firstly, the CleaningDispatcher - it mainly implements a dispatch() method that inputs a raw document. Based on its data category, it instantiates and calls a handler that applies the cleaning logic specific to that data point:

class CleaningDispatcher:
    cleaning_factory = CleaningHandlerFactory()

    @classmethod
    def dispatch(cls, data_model: NoSQLBaseDocument) -> VectorBaseDocument:
        data_category = DataCategory(data_model.get_collection_name())
        handler = cls.cleaning_factory.create_handler(data_category)
        clean_model = handler.clean(data_model)
        logger.info(
            "Data cleaned successfully.",
            data_category=data_category,
            cleaned_content_len=len(clean_model.content),
        )

        return clean_model

The key is the CleaningHandlerFactory, which instantiates a different cleaning handler based on the document’s data category:

class CleaningHandlerFactory:
    @staticmethod
    def create_handler(data_category: DataCategory) -> CleaningDataHandler:
        if data_category == DataCategory.POSTS:
            return PostCleaningHandler()
        elif data_category == DataCategory.ARTICLES:
            return ArticleCleaningHandler()
        elif data_category == DataCategory.REPOSITORIES:
            return RepositoryCleaningHandler()
        else:
            raise ValueError("Unsupported data type")

The Dispatcher or Factory classes are nothing fancy, but they offer an intuitive and simple interface for applying various operations to our documents. When manipulating documents, instead of worrying about their data category and polluting our business logic with if-else statements, we have a class dedicated to handling that. We have a single class that cleans any document, which respects the DRY (don’t repeat yourself) principle from software engineering. By respecting DRY, we have a single point of failure, and the code can easily be extended. For example, if we add an extra data category, we only have to extend the Factory class instead of updating multiple places in the code.

The ChunkingDispatcher and EmbeddingDispatcher follow the same pattern. They use a ChunkingHandlerFactory and, respectively, an EmbeddingHandlerFactory that initializes the correct handler based on the data category of the input document. Afterward, they call the handler and return the result.
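As an illustration, a ChunkingDispatcher mirroring the cleaning dispatcher above would look roughly like this (my sketch, following the same excerpt style; the get_category() accessor is an assumption backed by the Config.category attribute):

class ChunkingDispatcher:
    chunking_factory = ChunkingHandlerFactory()

    @classmethod
    def dispatch(cls, data_model: VectorBaseDocument) -> list[VectorBaseDocument]:
        # Pick the right chunking handler based on the document's data category.
        data_category = data_model.get_category()  # assumed accessor
        handler = cls.chunking_factory.create_handler(data_category)
        chunks = handler.chunk(data_model)
        logger.info(
            "Document chunked successfully.",
            data_category=data_category,
            num_chunks=len(chunks),
        )

        return chunks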

The Factory class leverages the abstract factory creational pattern, which instantiates a family of classes implementing the same interface. In our case, these handlers implement the clean() method regardless of the handler type.

Also, the Handler class family leverages the strategy behavioral pattern used to instantiate when we want to use different variants of an algorithm within an object and be able to switch from one algorithm to another during runtime.

Intuitively, in our dispatcher layer, the combination of the factory and strategy patterns works as follows:

  • initially, we knew we wanted to clean the data, but as we knew the data category only at runtime, we couldn’t decide on what strategy to apply
  • we can write the whole code around the cleaning code and abstract away the logic under a Handler() interface, which will represent our strategy
  • when we get a data point, we apply the abstract factory pattern and create the correct cleaning handler for its data type
  • ultimately, the dispatcher layer uses the handler and executes the right strategy

The last component of the RAG feature pipeline is the implementation of the cleaning, chunking, and embedding handlers.

The handlers

The handlers have a one-to-one structure with the domain, meaning that every entity has its own handler.

image

The cleaning handlers

# Other imports.
from typing import Generic, TypeVar

DocumentT = TypeVar("DocumentT", bound=Document)
CleanedDocumentT = TypeVar("CleanedDocumentT", bound=CleanedDocument)

class CleaningDataHandler(ABC, Generic[DocumentT, CleanedDocumentT]):
    @abstractmethod
    def clean(self, data_model: DocumentT) -> CleanedDocumentT:
        pass

And for every post, article and repository a different handler is implemented:

class PostCleaningHandler(CleaningDataHandler):
    def clean(self, data_model: PostDocument) -> CleanedPostDocument:
        return CleanedPostDocument(
            id=data_model.id,
            content=clean_text(" #### ".join(data_model.content.values())),
            # Copy the rest of the parameters from the data_model object.
        )

class ArticleCleaningHandler(CleaningDataHandler):
    def clean(self, data_model: ArticleDocument) -> CleanedArticleDocument:
        valid_content = [content for content in data_model.content.values() if content]
        return CleanedArticleDocument(
            id=data_model.id,
            content=clean_text(" #### ".join(valid_content)),
            platform=data_model.platform,
            link=data_model.link,
            author_id=data_model.author_id,
            author_full_name=data_model.author_full_name,
        )

class RepositoryCleaningHandler(CleaningDataHandler):
    def clean(self, data_model: RepositoryDocument) -> CleanedRepositoryDocument:
        return CleanedRepositoryDocument(
            id=data_model.id,
            content=clean_text(" #### ".join(data_model.content.values())),
            # Copy the rest of the parameters from the data_model object.
        )

The handlers input a raw document domain entity, clean the content, and return a cleaned document. All the handlers use the clean_text() function to clean the text. For simplicity, the same cleaning technique is used for all the data categories.

The chunking handlers

The handler takes cleaned documents as input and returns chunk entities.

This is the abstract class:

# Other imports.
from typing import Generic, TypeVar

CleanedDocumentT = TypeVar("CleanedDocumentT", bound=CleanedDocument)
ChunkT = TypeVar("ChunkT", bound=Chunk)

class ChunkingDataHandler(ABC, Generic[CleanedDocumentT, ChunkT]):
    @property
    def metadata(self) -> dict:
        return {
            "chunk_size": 500,
            "chunk_overlap": 50,
        }

    @abstractmethod
    def chunk(self, data_model: CleanedDocumentT) -> list[ChunkT]:
        pass
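A concrete handler then implements chunk() per data category. As a simplified illustration (not the book’s exact splitting logic), an article handler could slice the cleaned text into overlapping windows and wrap each window in an ArticleChunk:

import hashlib
from uuid import UUID

class ArticleChunkingHandler(ChunkingDataHandler):
    def chunk(self, data_model: CleanedArticleDocument) -> list[ArticleChunk]:
        chunk_size = self.metadata["chunk_size"]
        overlap = self.metadata["chunk_overlap"]
        text = data_model.content

        chunks = []
        for start in range(0, len(text), chunk_size - overlap):
            chunk_content = text[start : start + chunk_size]
            chunks.append(
                ArticleChunk(
                    # Deterministic ID derived from the chunk content.
                    id=UUID(hashlib.md5(chunk_content.encode()).hexdigest(), version=4),
                    content=chunk_content,
                    platform=data_model.platform,
                    document_id=data_model.id,
                    author_id=data_model.author_id,
                    author_full_name=data_model.author_full_name,
                    metadata=self.metadata,
                )
            )

        return chunks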

The embedding handlers

The embedding handlers differ slightly from the others, as the EmbeddingDataHandler() interface contains most of the logic. This is done so that, when the embedding model is called, as many samples as possible are batched together to optimize the inference process. When running the model on a GPU, the batched samples are processed independently and in parallel. Thus, by batching the chunks, we can speed up the inference process by 10x or more, depending on the batch size and hardware we use.

An embed() method is implemented for running inference on a single data point, along with an embed_batch() method. The embed_batch() method takes chunked documents as input, gathers their content into a list, passes them to the embedding model, and maps the results to an embedded chunk domain entity. The mapping is done through the map_model() abstract method, which must be customized for every data category.

# Other imports.
from typing import Generic, TypeVar, cast
from llm_engineering.application.networks import EmbeddingModelSingleton

ChunkT = TypeVar("ChunkT", bound=Chunk)
EmbeddedChunkT = TypeVar("EmbeddedChunkT", bound=EmbeddedChunk)
embedding_model = EmbeddingModelSingleton()

class EmbeddingDataHandler(ABC, Generic[ChunkT, EmbeddedChunkT]):
    """
    Abstract class for all embedding data handlers.
    All data transformations logic for the embedding step is done here
    """
    def embed(self, data_model: ChunkT) -> EmbeddedChunkT:
        return self.embed_batch([data_model])[0]
    def embed_batch(self, data_model: list[ChunkT]) -> list[EmbeddedChunkT]:
        embedding_model_input = [data_model.content for data_model in data_model]
        embeddings = embedding_model(embedding_model_input, to_list=True)
        embedded_chunk = [
            self.map_model(data_model, cast(list[float], embedding))
            for data_model, embedding in zip(data_model, embeddings, strict=False)
        ]
        return embedded_chunk

    @abstractmethod
    def map_model(self, data_model: ChunkT, embedding: list[float]) -> EmbeddedChunkT:
        pass

The last step is to understand how the EmbeddingModelSingleton() works. It is a wrapper over the SentenceTransformer() class from Sentence Transformers that initializes the embedding model.

Writing a wrapper over external packages is often good practice. When we want to change the third-party tool, we only have to modify the internal logic of the wrapper instead of the whole code base.

This is designed with the singleton pattern in mind.
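A stripped-down sketch of the idea (the default model ID and the method signatures are my assumptions, not the book’s exact implementation; the __call__ signature matches how the embedding handler uses it above):

from sentence_transformers import SentenceTransformer

class EmbeddingModelSingleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the wrapper only once; every call returns the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, model_id: str = "sentence-transformers/all-MiniLM-L6-v2") -> None:
        # Guard against re-loading the (expensive) model on repeated instantiation.
        if not hasattr(self, "_model"):
            self._model = SentenceTransformer(model_id)

    def __call__(self, input_text: list[str], to_list: bool = True):
        embeddings = self._model.encode(input_text)
        return embeddings.tolist() if to_list else embeddings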

Next, I read the refactoring guru links referenced in the book:

The Builder design pattern

Intent

It is a creational design pattern that lets us construct complex objects step by step. The pattern allows us to produce different types and representations of an object using the same construction code.

Problem

  1. complex initialization: it simplifies the construction of complex objects by allowing step-by-step configuration, avoiding unwieldy constructors
  2. telescoping constructor issue: it prevents cluttered constructors with many optional parameters, improving readability
  3. avoiding subclass explosion: it eliminates the need for multiple subclasses for different configurations, reducing complexity in the class hierarchy
  4. separation of concerns: it separates the object’s construction from its representation, allowing for easier maintenance and flexibility
  5. enhanced usability: client code becomes more intuitive, as parameters can be set explicitly, making the code clearer and easier to understand

Solution

The Builder pattern suggests that we extract the object construction code out of its own class and move it to separate objects called builders.

The pattern organizes object construction into a set of steps. To create an object, we execute a series of these steps on a builder object. The important part is that we don’t need to call all of the steps. We can call only those steps that are necessary for producing a particular configuration of an object.

Some of the construction steps might require different implementation when we need to build various representations of the product. For example, walls of a cabin may be built of wood, but the castle walls must be built with stone.

In this case, we can create several different builder classes that implement the same set of building steps, but in a different manner. Then we can use these builders in the construction process to produce different kinds of objects.

Director

We can go further and extract a series of calls to the builder steps we use to construct a product into a separate class called director. This type of class defines the order in which to execute the building steps, while the builder provides the implementation for those steps.
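A toy Python sketch of the builder plus director split (my own example, not from the book or the site):

class House:
    def __init__(self) -> None:
        self.parts: list[str] = []

class HouseBuilder:
    def __init__(self) -> None:
        self.house = House()

    def build_walls(self, material: str) -> "HouseBuilder":
        self.house.parts.append(f"{material} walls")
        return self

    def build_roof(self) -> "HouseBuilder":
        self.house.parts.append("roof")
        return self

    def result(self) -> House:
        return self.house

class Director:
    """Defines the order of the building steps; the builder defines how each step is done."""

    def construct_cabin(self, builder: HouseBuilder) -> House:
        return builder.build_walls("wood").build_roof().result()

    def construct_castle(self, builder: HouseBuilder) -> House:
        return builder.build_walls("stone").build_roof().result()

cabin = Director().construct_cabin(HouseBuilder())
print(cabin.parts)  # ['wood walls', 'roof']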

Structure

image

Applicability

  • use the Builder pattern to get rid of a ‘telescoping (overloaded) constructor’
  • use the Builder pattern when we want our code to be able to create different representations of some product
  • it can be applied when construction of various representations of the product involves similar steps that differ only in the details
  • the Builder allows us to construct products step-by-step

The Abstract Factory design pattern

image

Abstract Factory is a creational design pattern that lets us produce families of related objects without specifying their concrete classes.

Solution

The first thing the Abstract Factory pattern suggests is to explicitly declare interfaces for each distinct product of the product family. Then we can make all variants of products follow those interfaces. For example, all chair variants can implement the Chair interface; all coffee table variants can implement the CoffeeTable interface, and so on

image

The next move is to declare the Abstract Factory—an interface with a list of creation methods for all products that are part of the product family (for example, createChair, createSofa and createCoffeeTable). These methods must return abstract product types represented by the interfaces we extracted previously: Chair, Sofa, CoffeeTable and so on.

image

If the client is only exposed to the abstract interfaces, what creates the actual factory objects? Usually, the application creates a concrete factory object at the initialization stage. Just before that, the app must select the factory type depending on the configuration or the environment settings.
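A toy Python sketch of the pattern (my own example, trimmed to chairs only):

from abc import ABC, abstractmethod

class Chair(ABC):
    @abstractmethod
    def sit_on(self) -> str: ...

class ModernChair(Chair):
    def sit_on(self) -> str:
        return "sitting on a modern chair"

class VictorianChair(Chair):
    def sit_on(self) -> str:
        return "sitting on a victorian chair"

class FurnitureFactory(ABC):
    @abstractmethod
    def create_chair(self) -> Chair: ...

class ModernFurnitureFactory(FurnitureFactory):
    def create_chair(self) -> Chair:
        return ModernChair()

class VictorianFurnitureFactory(FurnitureFactory):
    def create_chair(self) -> Chair:
        return VictorianChair()

def client_code(factory: FurnitureFactory) -> None:
    # The client only depends on the abstract interfaces.
    print(factory.create_chair().sit_on())

client_code(ModernFurnitureFactory())
client_code(VictorianFurnitureFactory())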

Structure

image

Applicability

  • use this design pattern when our code needs to work with various families of related products, but we don’t want it to depend on the concrete classes of these products - they might be unknown beforehand or we simply want to allow for future extensibility
  • consider implementing the Abstract Factory when we have a class with a set of factory methods that blur its primary responsibility

The Strategy design pattern

Strategy is a behavioral design pattern that lets us define a family of algorithms, put each of them into a separate class, and make their objects interchangeable.

Problem

image

Imagine we build a navigator and over time we need to add more and more features. After some time, the code of the navigator becomes bloated.

Solution

The Strategy pattern suggests that you take a class that does something specific in a lot of different ways and extract all of these algorithms into separate classes called strategies.

The original class, called context, must have a field for storing a reference to one of the strategies. The context delegates the work to a linked strategy object instead of executing it on its own.

The context isn’t responsible for selecting an appropriate algorithm for the job. Instead, the client passes the desired strategy to the context. In fact, the context doesn’t know much about strategies. It works with all strategies through the same generic interface, which only exposes a single method for triggering the algorithm encapsulated within the selected strategy.

This way the context becomes independent of concrete strategies, so you can add new algorithms or modify existing ones without changing the code of the context or other strategies.
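A toy Python sketch of the navigator example (mine, not from the site):

from abc import ABC, abstractmethod

class RouteStrategy(ABC):
    @abstractmethod
    def build_route(self, start: str, end: str) -> str: ...

class RoadStrategy(RouteStrategy):
    def build_route(self, start: str, end: str) -> str:
        return f"driving route from {start} to {end}"

class WalkingStrategy(RouteStrategy):
    def build_route(self, start: str, end: str) -> str:
        return f"walking route from {start} to {end}"

class Navigator:
    """The context: stores a strategy and delegates the work to it."""

    def __init__(self, strategy: RouteStrategy) -> None:
        self.strategy = strategy

    def navigate(self, start: str, end: str) -> str:
        return self.strategy.build_route(start, end)

navigator = Navigator(RoadStrategy())
print(navigator.navigate("home", "office"))
navigator.strategy = WalkingStrategy()  # swap the algorithm at runtime
print(navigator.navigate("home", "office"))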

image

Pros

  • runtime flexibility: algorithms can be swapped at runtime, allowing for dynamic behavior
  • encapsulation: implementation details of algorithms are isolated from the client code, promoting cleaner architecture
  • composition over inheritance: replaces inheritance with composition, reducing complexity and increasing flexibility
  • open/closed principle: new strategies can be introduced without modifying existing code, enhancing maintainability

Cons

  • overhead for few algorithms: if there are only a couple of algorithms that rarely change, introducing new classes and interfaces may complicate the code unnecessarily
  • client awareness: clients must understand the differences between strategies to select the appropriate one, which can lead to confusion
  • functional alternatives: modern languages may support functional types that allow implementing algorithms with anonymous functions, reducing the need for the strategy pattern and avoiding code bloat

The Singleton design pattern

Singleton is a creational design pattern that lets us ensure that a class has only one instance, while providing a global access point to this instance.

Problem/Reason to Use

  • ensure that a class has just a single instance: why would anyone want to control how many instances a class has? The most common reason for this is to control access to some shared resource—for example, a database or a file

  • provide a global access point to that instance: remember those global variables that someone used to store some essential objects? While they’re very handy, they’re also very unsafe since any code can potentially overwrite the contents of those variables and crash the app

Nowadays, the Singleton pattern has become so popular that people may call something a singleton even if it solves just one of the listed problems.
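A minimal Python sketch of the pattern (mine, using __new__):

class AppConfig:
    _instance = None

    def __new__(cls):
        # Reuse the single instance if it already exists.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = AppConfig()
b = AppConfig()
assert a is b  # both names point to the same instance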

Structure

image

I found that the website has all of its design patterns in Python with examples - https://refactoring.guru/design-patterns/python

Definitely something to checkout in the future.

Went to my professor’s lab to use the A6000 again

I tried tuning Gemma 2 today, but to do that I had to install the flash attention package. Not sure if it’s bugged or just a big build, but the ‘Building wheel for flash-attn’ part took forever. In the meantime I decided to fine-tune Llama 3.2-3B instead, and using a 16K context worked thanks to it being a smaller model.

The results are:

Model                Text RougeLSum    Ratings RMSE
gpt4o-mini           0.8768            0.5924
gpt4o-mini-fewshot   0.8499            0.6797
llama-3.1-8b-8k      0.7827            0.8580
llama-3.2-3b-16k     0.8537            0.7636

Streamed

On stream I started covering the dbt Fundamentals course, the official course from dbt’s education team. For the practical part it requires setting up a connection using dbt Cloud (which is not hard) with some cloud provider or data warehouse tool (and this is the unfortunate part). So I ended up just watching the videos and doing the quizzes.

After I ended the stream I slept and then decided to finish the dbt fundamentals course.

image

At the end I saw this:

The dbt Certified Developer Path
The courses on the path are:

  • dbt Fundamentals
  • Refactoring SQL for Modularity
  • Jinja, Macros, and Packages
  • Advanced Materializations
  • Analyses and Seeds
  • Advanced Testing
  • Advanced Deployment
I will continue with the 2nd course next.


That is all for today!

See you tomorrow :)