Hello :) Today is Day 364!
A quick summary of today:
- read Real-World ML Systems on Kubernetes - Chapter 2
- read Software Engineering for Data Scientists - Chapter 1
- read a white paper by Google on MLOps
- streamed for the last time
Real-World ML Systems on Kubernetes - Chapter 2 - Fundamentals of Kubernetes
At its core, Kubernetes is a highly extensible orchestration system for containerized workloads.
K8s is cloud-agnostic, which is one of the reasons for its wide adoption by businesses.
K8s architecture
Consists of 2 major components:
- control plane which comprises the internal components that K8s needs to function
- data plane where apps run
The control plane
This is where the brains of a K8s cluster live. It comprises a highly available set of servers running K8s’ system processes and a database
- on the cloud, the control plane is managed by the cloud provider (maintaining, scaling, upgrading, monitoring, troubleshooting)
Components:
- API server: the kube-apiserver is the central control point of the K8s cluster; it exposes the K8s API, which both internal and external components use to interact with the cluster
- controller manager: runs a collection of controllers that manage the state of the cluster, ensuring the current state matches the desired state
- scheduler: decides where to run a particular workload
- cloud controller manager: manages cloud-specific resources like VMs, networking, and storage
- etcd: the cluster’s state is persisted in the highly available etcd key-value store, and the API server interacts with it to read/write this state
The data plane
Consists of the infrastructure that executes the workloads we deploy. Technically, we can run workloads on the same hosts that run a cluster’s control plane, but this is an uncommon practice available only in self-managed clusters.
Key components:
- nodes: VMs that run containers
- CNI: provides networking for containers
- CoreDNS: DNS server in a cluster
More in-depth:
- nodes run 3 processes: the container runtime, responsible for creating, running, and stopping containers (e.g. containerd, Docker, CRI-O); the kubelet, which acts as the agent between the K8s control plane and the container runtime; and kube-proxy, which implements K8s’ network abstraction by maintaining network rules
- autoscaling nodes: a typical data plane is made up of a group of nodes that provide resources to run workloads; over time nodes might run out of resources and more nodes need to be added, and this autoscaling is usually done with the K8s Cluster Autoscaler
- Container Network Interface (CNI): provides network connectivity to containers that run inside a K8s cluster
- CoreDNS: handles name resolution and provides service discovery within a cluster
Interacting with a K8s cluster
The primary way is through the kubectl CLI.
Common commands:
| Command | Operation |
|---|---|
| kubectl get | List all resources |
| kubectl describe | Describe a resource |
| kubectl delete | Delete a resource |
| kubectl edit | Edit a resource |
| kubectl logs | View container logs |
| kubectl patch | Update a resource |
| kubectl apply -f filename.yaml | Create or update a resource |
| kubectl config view | View kubectl configuration |
| kubectl events | Show events |
| kubectl exec | Execute a command inside a running container |
| kubectl port-forward | Forward a port on the local machine to a Pod |
| kubectl run --image=<container_image> | Create a Pod from an image |
K8s Objects
Objects are persistent entities that represent the state of the K8s cluster. When an object is created, the K8s system will constantly work to ensure that object exists, and its desired state is maintained.
Containers
Popularised by Docker, containers are bundled packages that contain all code, dependencies and configs needed to run an app
Pods
A pod contains one or more containers that run on the same node. Typically, there is one container running the main app and sidecar containers that run processes that provide helper functions
All containers within a Pod run on the same node and share the Linux network namespace, which enables them to intercommunicate over localhost (the loopback interface).
Every Pod gets its own IP address, and other Pods in the cluster can connect to it using that IP address (or a DNS name). The Pod’s IP is private to the cluster, so we cannot connect to the Pod directly from outside. However, we can use port forwarding to test an app running inside the cluster.
Port forwarding tunnels traffic through the Kubernetes API server and Kubelet. This tunnel is temporary and exists only for the duration of the kubectl port-forward command.
- resource allocation: within a Pod we can define the minimum and maximum resources each container gets, and the scheduler considers these requirements when assigning a Pod to a node; if no node meets the minimum resource requirements, the Pod will not be scheduled and remains in a Pending state until a suitable node is available
- probes: K8s can detect failures in an app container and restart it or stop sending it more work; probes are used to determine the health and readiness of containers in a Pod and are defined per container
- init containers: normally all containers in a Pod are started simultaneously; init containers are special containers that run before the main app container starts, and their job is to handle any initialisation or config tasks that need to happen before the main app can start (e.g. if a website needs to wait for a database to be up so it can connect to it)
- sidecar containers: a sidecar container can be created by setting an init container’s restartPolicy to Always; the container then starts before the main application, ensuring setup tasks are completed, and keeps running for the Pod’s entire lifetime, acting as a sidecar; this combines the startup ordering of init containers with the persistence of sidecar containers (see the example manifest below)
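To make this concrete, here’s a minimal Pod manifest sketch (my own example, not from the book; the names, image, and port are placeholders, and the native sidecar support via restartPolicy: Always needs a fairly recent K8s version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                     # hypothetical name
  labels:
    app: my-app
spec:
  initContainers:
    - name: heartbeat-sidecar      # init container + restartPolicy: Always = sidecar
      image: busybox:1.36
      restartPolicy: Always
      command: ["sh", "-c", "while true; do echo heartbeat; sleep 30; done"]
  containers:
    - name: main
      image: nginx:1.27            # placeholder app image
      ports:
        - containerPort: 80
      resources:
        requests:                  # the scheduler only places the Pod on a node with this much free
          cpu: "250m"
          memory: "128Mi"
        limits:                    # the container is throttled/killed beyond this
          cpu: "500m"
          memory: "256Mi"
      readinessProbe:              # K8s stops routing traffic to the Pod if this fails
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 10
```

After applying it with kubectl apply -f pod.yaml, the app can be tested locally with kubectl port-forward pod/my-app 8080:80.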
Deployments
Apps are deployed in a K8s cluster by creating Deployments which contain declarative configuration that instructs K8s how we’d like to manage and update a workload.
A Deployment is made up of one or more Pods, giving us the ability to manage a group of identical Pods as a single resource.
Behind the scenes, when a Deployment is created, K8s creates a Replica Set which represents a group of pods. This Replica Set maintains the number of desired replicas specified in the config.
K8s Deployments are perfect for horizontal scaling, as we can increase or decrease the number of replicas depending on traffic and optimize resource utilization.
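A minimal Deployment sketch (again my own placeholder names and image, not from the book) showing the replica count, the label selector, and the Pod template:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                      # desired number of identical Pods, maintained by a ReplicaSet
  selector:
    matchLabels:
      app: my-app                  # must match the Pod template labels below
  strategy:
    type: RollingUpdate            # the default; "Recreate" is the alternative
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: main
          image: registry.example.com/my-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```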
Horizontal Pod Autoscaler (HPA) helps with automatically scaling the number of Pods in a Deployment. It has configurable scaling policies that define the min and max number of Pods as well as the rate at which Pods are added/removed.
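As a sketch, an HPA targeting the Deployment above might look like this (the 70% CPU target and replica bounds are just example values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:                  # the workload being scaled
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add Pods when average CPU utilisation exceeds 70%
```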
To update a deployment in Kubernetes, create a new container image for the application, push it to a container registry, and update the deployment’s image field. Kubernetes performs a rolling update by default, creating new Pods with the updated image and terminating old ones gradually to ensure availability. Alternatively, the ‘Recreate’ strategy can be used, where all old Pods are terminated before new ones are created, causing temporary downtime.
We can also easily rollback deployments if needed
Services
As Deployments scale up and down, Pods get created and destroyed, so how can apps in the cluster connect to Pods whose IPs keep changing? A Service acts as a stable network endpoint that our apps can use to find and communicate with each other. A Service gives us a stable IP address and DNS name that doesn’t change as we scale and update Pods.
There are different types of K8s Services:
- ClusterIP: exposes the service on a cluster-internal IP, allowing other apps running in the cluster to access it
- NodePort: exposes the service on each node’s IP at a static port, making it accessible from outside the cluster
- LoadBalancer: provisions a cloud-provided load balancer and assigns it a fixed, external IP to access the service from outside the cluster
- ExternalName: maps the service to an external DNS name, allowing internal apps to access external services
A Kubernetes Service of type LoadBalancer is designed to expose an application to the public internet by provisioning a cloud load balancer. The LoadBalancer Service will have a public IP address or DNS name that external users can connect to. The load balancer then forwards traffic to the appropriate Kubernetes Service and pods.
In the cloud, we’ll mostly deal with ClusterIP (internal) and LoadBalancer (external).
A Service uses labels and selectors to identify the group of Pods to route traffic to (see the sketch below).
- labels are key-value pairs attached to K8s objects like Pods, Deployments, Services, and nodes; they provide identifying metadata
- selectors are used to target a set of objects based on their labels
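Sketch of a ClusterIP Service selecting the Pods labelled app: my-app from the earlier examples (placeholder names and ports):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                     # reachable from other Pods in the namespace as http://my-app
spec:
  type: ClusterIP                  # internal only; change to LoadBalancer for an external IP
  selector:
    app: my-app                    # traffic is routed to Pods carrying this label
  ports:
    - port: 80                     # port the Service exposes
      targetPort: 8080             # port the container listens on
```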
Namespaces
With namespaces, each team can deploy their objects, like apps and services, into their own designated area of the Kubernetes cluster. They can manage and modify their own stuff without interfering with what the other teams are doing.
A Pod can connect to any Service in the same namespace using just the name of the Service. If a Pod wants to call a Service in another namespace, it must add the target’s namespace to the DNS name (e.g. my-service.other-namespace).
Ingress
A LoadBalancer Service provides an external entry point for a single K8s Service, giving each service its own public IP and DNS name. While simple, managing multiple LoadBalancers for numerous services can become expensive and complex.
An Ingress, on the other hand, acts as a centralized entry point for multiple services, routing traffic based on URL paths or hostnames
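A sketch of path-based routing with an Ingress (assumes an ingress controller such as NGINX is installed; the hostname and Service names are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx          # assumes the NGINX ingress controller
  rules:
    - host: example.com            # hypothetical hostname
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service  # /api traffic goes to this Service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service  # everything else goes here
                port:
                  number: 80
```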
Volumes
Volumes can be used to attach persistent storage to a container, allowing the app running inside the container to read and write data to a designated location.
For ML workloads, object storage systems (like Amazon S3, GCS) are preferred for their scalability, durability, and compatibility with ML frameworks like TensorFlow and Spark. Applications access data via embedded code or mounted storage systems.
K8s also simplifies traditional storage integration through the Container Storage Interface (CSI). CSI allows storage vendors to develop drivers for seamless K8s compatibility, supporting various storage types (block, file, object). This decoupled approach lets storage providers innovate independently while Kubernetes users access modern storage features efficiently.
We can add storage to workloads through:
- static provisioning: administrators manually allocate Persistent Volumes (PVs) beforehand. Persistent Volume Claims (PVCs) are then created by users to request and attach storage to Pods
- dynamic provisioning: K8s automatically provisions storage volumes based on PVC requests and a defined StorageClass, simplifying management for dynamic workloads
Persistent Volumes (PVs) are the actual storage resources. Access modes include:
- ReadWriteOnce: the volume can be mounted read-write by a single node
- ReadOnlyMany: the volume can be mounted read-only by many nodes
- ReadWriteMany: the volume can be mounted read-write by many nodes
A Persistent Volume Claim (PVC) specifies storage requirements (size, access mode) and is what Pods use to bind to PVs (see the sketch below).
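A sketch of dynamic provisioning: a PVC requesting 10Gi from a (hypothetical) StorageClass, and a Pod mounting it:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce                # mountable read-write by a single node
  storageClassName: standard       # hypothetical StorageClass; triggers dynamic provisioning
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: main
      image: python:3.12-slim      # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data         # the app reads/writes under /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc        # binds the Pod to the claim above
```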
ConfigMaps
Stores app configuration as key-value pairs, enabling us to externalize settings like db connection strings or logging levels. It decouples configuration from application code, allowing the configuration to be provided to containers as environment variables or files
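For example (placeholder keys and values), a ConfigMap injected as environment variables could look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  DB_HOST: "postgres.default.svc.cluster.local"   # hypothetical values
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: main
      image: registry.example.com/my-app:1.0.0    # placeholder image
      envFrom:
        - configMapRef:
            name: app-config       # every key becomes an environment variable
```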
Secrets
Similar to ConfigMaps but for sensitive information.
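A minimal Secret sketch (placeholder credentials; in a real setup these would come from a secret manager rather than being committed to Git):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:                        # written as plain text, stored base64-encoded by K8s
  DB_USER: "app"
  DB_PASSWORD: "change-me"
```

It can then be consumed the same way as a ConfigMap, e.g. via envFrom with a secretRef.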
Jobs
The key difference between a Job and a regular Deployment is that a Job will run a task to completion and then stop, rather than continuously running the application.
There are 3 types of jobs:
- non-parallel jobs: simplest type, designed to run a single task to completion
- multiple parallel jobs: for tasks that can be parallelised, like processing a large batch of data
- parallel jobs with fixed completions: similar to multiple parallel jobs but with a specified number of successful completions required before the job is considered done
We can schedule jobs in a cluster using CronJobs which are similar to traditional cron jobs.
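A sketch of a parallel Job with fixed completions, and a CronJob that runs a (hypothetical) retraining task nightly; the images and entrypoints are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-scoring
spec:
  completions: 5                   # done after 5 successful Pod runs
  parallelism: 2                   # run at most 2 Pods at a time
  backoffLimit: 3                  # retries before marking the Job failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/scorer:1.0.0   # placeholder image
          command: ["python", "score.py"]            # hypothetical entrypoint
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 2 * * *"            # standard cron syntax: every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: retrain
              image: registry.example.com/trainer:1.0.0
              command: ["python", "train.py"]
```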
StatefulSets
Manage stateful applications that require stable identities, persistent storage, and ordered deployment.
- stable network identities: each Pod gets a predictable DNS name (e.g., web-0, web-1) that persists across rescheduling
- persistent storage: Pods are bound to specific Persistent Volumes (PVs) using volumeClaimTemplates. Data persists even if a Pod is replaced
- ordered Deployment and scaling: Pods are created, deleted, and scaled sequentially to ensure consistent behavior
- headless Service integration: a headless Service (ClusterIP: None) provides direct DNS entries for each Pod, supporting stateful communication patterns
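The points above could translate into something like this sketch: a headless Service plus a StatefulSet with a volumeClaimTemplate (placeholder names and image):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None                  # headless: gives each Pod a DNS entry (web-0.web, web-1.web, ...)
  selector:
    app: web
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web                 # the headless Service above
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: main
          image: nginx:1.27        # placeholder image
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:            # each replica gets its own PVC (data-web-0, data-web-1, ...)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```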
DaemonSets
This is for when we need a system service running on every single node in our cluster (for example, to collect logs or metrics): we don’t need multiple replicas, we need exactly one instance per node. A DaemonSet ensures one copy of a Pod is running on every node (or a selected subset of nodes) in a cluster, as in the sketch below.
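A DaemonSet sketch for a node-level log collector (the agent image is just an example):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: agent
          image: fluent/fluent-bit:2.2   # example log-shipping agent
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log         # read the node's own log files
```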
Advanced scheduling techniques
In addition to Deployments, where we let K8s pick the node on which a workload should run, and DaemonSets, which run on every node, there are times when we need finer control over which node runs a particular Pod.
- node labels: we can apply labels to K8s nodes, and then use those labels to specify which nodes should run a particular Pod
- affinity and anti-affinity: we can use affinity rules to schedule a Pod based on node labels or on other Pods already running on a node
- taints and tolerations: taints, like labels, are key-value pairs that we can add to nodes; when a node is tainted, K8s requires that a toleration is added to any Pod that should be scheduled on that node
In ML, this can be used to ensure that only workloads that need a GPU get placed on nodes with GPUs, as in the sketch below.
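A sketch of the GPU case, assuming the nodes carry a hypothetical accelerator=nvidia-gpu label and a gpu=true:NoSchedule taint, and that the NVIDIA device plugin is installed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  nodeSelector:
    accelerator: nvidia-gpu        # only consider nodes with this (hypothetical) label
  tolerations:
    - key: "gpu"                   # allows scheduling onto nodes tainted gpu=true:NoSchedule
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # request one GPU via the device plugin
```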
GitOps
A practice where infrastructure is treated like source code, using Git as the single source of truth for infrastructure and runtime configurations. Changes to workloads or infrastructure are made via pull requests, and GitOps tools, like Flux or ArgoCD, automatically apply these changes to the cluster when detected. Unlike traditional infrastructure-as-code (IaC), which requires manual intervention to apply changes, GitOps automates this process. The infrastructure’s actual state is continuously aligned with the desired state defined in the Git repository, managing Kubernetes cluster configurations like software deployments.
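As a sketch of what this looks like with Argo CD (the repo URL and paths are hypothetical), an Application resource points the cluster at a Git repo and keeps it in sync:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-config.git   # hypothetical config repo
    targetRevision: main
    path: k8s                      # directory containing the manifests
  destination:
    server: https://kubernetes.default.svc                  # the cluster Argo CD runs in
    namespace: default
  syncPolicy:
    automated:                     # apply changes automatically when the repo changes
      prune: true
      selfHeal: true
```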
Software Engineering for Data Scientists
This was another book on my list of ML-related books, and as the author says in the preface: learning to write better code will let you do more data science.
Software Engineering for Data Scientists - Chapter 1 - What is good code?
Why good code matters
Good code is crucial, especially when it integrates with larger systems, like ML models in production or tools for other data scientists. As projects grow, the importance of good code increases.
‘Code as craft’ - much like a carpenter takes pride in building a well-made wooden cabinet. Good code is efficient, elegant, provides satisfaction and is easier to maintain.
However, quick, poorly written code—often referred to as technical debt—can create problems down the line. It may involve missing documentation, bad variable names, or disorganized structure, all of which make the code harder to maintain. While technical debt is sometimes necessary due to deadlines or budget constraints, it can lead to more time spent fixing bugs later.
Adapting to changing requirements
When writing code for ML and other projects, change is inevitable, and good code can be easily adapted to such changes. Writing good code from the beginning matters: set up these best practices and follow them, and the benefits will come as the project grows.
The book divides clean code principles into: simplicity, modularity, readability, performance, and robustness.
Simplicity
Simplicity is key in coding, as complex systems are harder to understand and modify. As projects grow, keeping all details in your head becomes impossible, and complexity can lead to unexpected issues, such as errors caused by missing steps in the workflow.
Accidental complexity can be reduced by keeping the code simple, avoiding repetition, and making it modular.
Example principles are:
- DRY
- avoiding verbose code
Modularity
Modular code:
- makes the code easier to read
- it’s easier to locate where a problem comes from
- it’s easier to reuse code in the next project
It’s important to break down big tasks into smaller modules. This isn’t something that can be done right from the get-go, but it’s important to keep in mind that our code will change as a project evolves.
For instance, for a task like loading data from a CSV file, cleaning it, and plotting it, we can start with a skeleton for each function:
def load_data(csv_file):
pass
def clean_data(input_data, max_length):
pass
def plot_data(clean_data, x_axis_limit, line_width):
pass
Readability
…code is read much more often than it is written…
- PEP8
Code is for humans to read and for machines to execute.
We can follow best practices like PEP8, use consistent naming conventions, remove unnecessary prints, and write docstrings
Performance
Good code needs to be performant, and this can be measured in running time and memory usage. This is especially important for production code.
Robustness
This means code should be reproducible, and respond gracefully if system inputs change unexpectedly.
Can be achieved through:
- properly handling errors
- logging
- writing good tests
I guess this would be a fine book to read. It’s not long either.
Practitioners guide to MLOps - White paper by Google
The MLOps lifecycle addresses the challenges organizations face in deploying and managing machine learning (ML) systems effectively. Despite AI/ML being central to digital transformation, most organizations struggle to move beyond pilot projects, with many failing to deploy models into production or maintain them effectively once deployed.
Challenges
- manual and non-reproducible workflows
- inefficient handoffs between data scientists and IT teams
- lack of mid- to senior-level talent
- poor change-management processes and governance models
- challenges in deployment, scaling, and versioning
The role of MLE
MLE integrates software engineering principles with the unique complexities of ML:
- data preparation and maintenance
- monitoring and tracking model performance
- experimentation with data, algorithms, and hyperparameters
- continuous retraining on fresh data
- addressing data inconsistencies between training and serving environments
- ensuring model fairness and security
MLOps defined
MLOps unifies ML development (Dev) with ML system operations (Ops), automating and standardizing critical steps in building, deploying, and managing ML systems. It mirrors the role of DevOps in app development but is tailored to ML-specific problems, such as changing data patterns, adversarial risks, and model drift.
Benefits of MLOps
- shorter development cycles and faster time-to-market
- improved collaboration across teams
- enhanced reliability, scalability, and security
- streamlined governance and operational processes
- higher return on ML investments
In essence, MLOps provides the framework and tools necessary to ensure ML systems are deployed reliably, monitored effectively, and scaled efficiently, aligning them with evolving business goals.
ML-enabled systems
Data engineering (DE) is essential for supporting operational, analytics, and ML tasks by ingesting, integrating, curating, and refining data. Effective DE processes are crucial not only for BI and analytics but also for ML. ML models are built and deployed using curated data, often provided by the DE team, and are integrated into various application systems like BI tools and process control systems. Integrating models requires ensuring their effectiveness within applications and monitoring their performance. Additionally, tracking relevant business KPIs helps assess the model’s impact and guide adjustments.
The MLOps lifecycle
- ML development consists of experimenting and developing a robust and reproducible model training procedure
- training operationalisation is about automating the process of packaging, testing, and deploying repeatable and reliable training pipelines
- continuous training is repeatedly executing the training pipeline in response to new data or code changes, or on a schedule
- model deployment is about packaging, testing, and deploying a model to a serving env for online experimentation and production serving
- prediction serving is about serving the model that is deployed in prod for inference
- continuous monitoring is about monitoring the effectiveness and efficiency of a deployed model
- data and model management is a central, cross-cutting function for governing ML artifacts to support auditability, traceability, and compliance; this can also promote shareability, reusability, and discoverability of ML assets
An end-to-end workflow
This is not a waterfall workflow that has to sequentially pass through all the processes. The processes can be skipped, or the flow can repeat a given phase or a subsequence of the processes.
- the core activity during the ML dev phase is experimentation. As DSs and ML researchers prototype model architectures and training routines, they create labeled datasets, and they use features and other reusable ML artifacts that are governed through the data and model management process. The primary output of this process is a formalized training procedure, which includes data preprocessing, model architecture, and model training settings
- if the ML system requires continuous training, the training procedure is operationalized as a training pipeline. This requires a CI/CD routine to build, test, and deploy the pipeline to the target execution env
- the continuous training pipeline is executed repeatedly based on retraining triggers, and it produces a model as output. The model is retrained as new data becomes available, or if model performance decay is detected. Other training artifacts and metadata produced by the training pipeline are also tracked. If the pipeline produces a successful model candidate, that candidate is then tracked by the model management process as a registered model
- the registered model is annotated, reviewed, and approved for release and is then deployed to a prod env. This process might be relatively opaque if you are using a no-code solution, or it can involve building a custom CI/CD pipeline for progressive delivery
- the deployed model serves predictions using the deployment pattern that you have specified: online, batch, or streaming predictions. In addition to serving predictions, the serving runtime can generate model explanations and capture serving logs to be used by the continuous monitoring process
- the continuous monitoring process monitors the model for predictive effectiveness and service quality. The primary concern of effectiveness monitoring is detecting model decay - for example, data and concept drift. The model deployment can also be monitored for efficiency metrics like latency, throughput, hardware resource utilisation and execution errors
MLOps platform capabilities
To implement MLOps effectively, orgs should establish core technical capabilities, which can be provided by a single ML platform or a combination of vendor tools and custom services.
Key MLOps platform capabilities include foundational infrastructure (reliable, scalable, and secure compute resources), configuration management, and CI/CD tools. Core MLOps capabilities include experimentation, data processing, model training, evaluation, serving, online experimentation, monitoring, pipelines, and model registries. Additionally, cross-cutting capabilities such as an ML metadata repository and an ML dataset/feature repository are necessary for integration and interaction across workflows.
Experimentation
- provide notebook envs that are integrated with version control
- track experiments, including info about the data, hyperparams, and eval metrics, for reproducibility and comparison
- analyse and visualise data and models
- support exploring datasets, finding experiments, and reviewing implementations
- integrate with other data services and ML services in your platform
Data processing
- support interactive execution for quick experimentation and for long-running jobs in prod
- provide data connectors to a wide range of data sources and services, as well as data encoders and decoders for various data structures and formats
- provide both rich and efficient data transformations and ML feature engineering for structured and unstructured data
- support scalable batch and stream data processing for ML training and serving workloads
Model training
- support common ML frameworks and support custom runtime envs
- support large-scale distributed training with different strategies for multiple GPUs and multiple workers
- enable on-demand use of ML accelerators
- allow efficient hparam tuning and target optimisation at scale
- ideally, provide built-in AutoML functionality
Model evaluation
- perform batch scoring of your models on eval datasets at scale
- compute pre-defined or custom eval metrics for your model on different slices of the data
- track trained-model predictive performance across different continuous-training executions
- visualise and compare performances of different models
- provide tools for what-if analysis and for identifying bias and fairness issues
- enable model behaviour interpretation using various explainable AI techniques
Model serving
- provide support for low-latency, near-real-time (online) predictions and high-throughput batch (offline) predictions
- provide built-in support for common ML serving frameworks and for custom runtime envs
- enable composite prediction routines
- allow efficient use of ML inference accelerators with autoscaling
- support model explainability
- support logging of prediction serving requests and responses for analysis
Online experimentation
- support canary and shadow deployments
- support traffic splitting and A/B testing
- support multi-armed bandit tests
Model monitoring
- measure model efficiency metrics
- detect data skews and concept drift
- integrate monitoring
ML pipelines
- trigger pipelines on demand, on a schedule, or in response to specified events
- enable local interactive execution for debugging during development
- integrate with ML metadata tracking
- provide a set of built-in components for common ML tasks
- run on different envs
- optionally, provide GUI-based tools for designing and building pipelines
Model registry
- register, organize, track, and version trained and deployed models
- store model metadata and runtime dependencies for deployability
- maintain model documentation and reporting
- govern the model launching process
Dataset and feature repos
- enable shareability, discoverability, reusability, and versioning of data assets
- allow real-time ingestion and low-latency serving for event streaming and online prediction workloads
- allow high-throughput batch ingestion and serving for ETL process and model training
- enable feature versioning for point-in-time queries
- support various data modalities
ML metadata and artifact tracking
- provide traceability and lineage tracking of ML artifacts
- share and track experimentation and pipeline parameter configurations
- store, access, investigate, visualise, download and archive ML artifacts
- integrate with all other MLOps capabilities
Deep dive of MLOps processes
The paper then goes deeper into each of these processes in turn: ML development, training operationalization, continuous training, model deployment (including a more complex CI/CD system for model deployment), prediction serving, continuous monitoring, and data and model management (covering ML metadata tracking and model governance).
Putting it all together
Delivering business value through ML is not only about building the best ML model for the use case at hand. Delivering this value is also about building an integrated ML system that operates continuously to adapt to changes in the dynamics of the business environment. Such an ML system involves collecting, processing, and managing ML datasets and features; training, and evaluating models at scale; serving the model for predictions; monitoring the model performance in production; and tracking model metadata and artifacts.
The paper wraps all of these processes up into a single end-to-end MLOps workflow diagram.
This white paper seems condensed, yet it is in-depth and detailed. Definitely something to refer back to when working on a project, preparing for an interview, or just thinking about MLOps.
Streamed - Intro to LangSmith
Today I saw that LangChain announced a new free course - Introduction to LangSmith
Tldr - it’s amazing, and the LangSmith platform is great for anyone doing LLM observability. It’s important to note that I’m just a single dev who is learning, and I saw that it’s paid for companies, so maybe MLflow’s free LLM observability features can suffice.
They also showed the Chat LangChain web app, which is kind of like Ask Astro for Airflow.
I also saw that this stream was my 100th video on my YouTube channel. A nice ending.
That is all for today!
See you tomorrow :)