(Day 363) Data mesh + K8s for MLOps

Ivan Ivanov · December 29, 2024

Hello :) Today is Day 363!

A quick summary of today:

  • read more of the Data Mesh book
  • covered Apache Pinot 201
  • started reading the book Real-World ML Systems on Kubernetes

Data Mesh - Chapter 6 - The Inflection Point

image

Great Expectations of Data

Companies have similar goals in mind:

  • provide the best customer experience based on data and personalization
  • reduce operational costs and time through data-driven optimizations
  • empower employees to make better decisions with trend analysis and BI

All of these scenarios require data—a high volume of diverse, up-to-date, and truthful data that can, in turn, fuel the underlying analytics and machine learning models.

Before, companies’ data aspirations were mostly about producing good BI reports, but now they go beyond that - using ML, smart assistants, personalization, and streamlining operations. These new expectations require a new approach to data management - an approach that can seamlessly fulfill the diversity of uses for data.

The Great Divide of Data

Many of the technical complexities companies face stem from the division of data between operational and analytical.

image

The two-plane data topology creates a fragile integration architecture, where operational data feeds analytical data via ETL pipelines lacking explicit contracts. This results in brittle pipelines prone to failures from upstream changes. Over time, these pipelines grow complex, managing bidirectional data flow. These challenges, along with reliance on centralized data warehouses or lakes, drive the need for improved solutions.

image

Scale: Encounter of a new kind

The data industry has evolved to handle large-scale challenges in volume, velocity, and variety, with parallel processing, stream processing, and specialized storage systems addressing these needs.

However, the new challenge lies in managing data from diverse sources and locations. Unlocking insights requires connecting data from multiple sources rather than centralizing it. For example, intelligent healthcare and banking rely on integrating diverse data types beyond organizational control. This shift necessitates a decentralized approach to data management, focusing on connecting data wherever it resides.

Beyond order

Organizations need to adapt to complexity, uncertainty, and volatility. Modern businesses experience continuous change reflected in their data, showing the need for real-time, high-quality, and trustworthy insights for agile decision-making. Future data management systems must:

  • enable rapid response to change as a fundamental feature, not an optimization
  • move away from rigid schemas toward flexible, dynamic models that embrace complexity
  • facilitate autonomous, peer-to-peer team collaborations
  • support multi-platform data access and management across cloud and on-premise environments

Approaching the Plateau of Return

The chapter highlights the gap between investments in data and AI and the returns from them. According to NewVantage Partners’ annual reports, while most companies invest heavily in data and AI (99%, with 62% spending over $50M), only a minority report meaningful progress: 24.4% have a data culture, 24% are data-driven, and 41.2% compete using data.

Challenges include:

  • legacy systems
  • cultural resistance
  • competing priorities
  • skill gaps
  • governance issues
  • data usability friction

Future data management strategies must address these barriers to achieve better returns on investment.

llmops-python-package

Apparently this package for LLMOps was announced yesterday (or this week). It includes tools for model registry, experiment tracking, and real-time inference, with a focus on automation, CI/CD, code quality, model management and observability. It’s nice seeing packages like this.

Data Mesh - Chapter 7 - Navigating the Inflection Point

Data mesh is all about:

  • embracing change: data mesh assumes constant evolution and diversity in data and its uses as the natural state, enabling organizations to respond gracefully to change in a complex business environment
  • agility at scale: data mesh aims to sustain agility in the face of growth by removing organizational bottlenecks and reducing the need for coordination and synchronization
  • maximizing data value: data mesh strives to increase the value derived from data relative to investment by simplifying the data landscape and applying product thinking

image

Key Ideas and Facts:

Responding Gracefully to Change

  • complexity as the default: data mesh acknowledges the inherent complexity of modern businesses and embraces the continuous evolution of data landscapes
  • alignment is key: the principle of domain data ownership aligns business, technology, and analytical data by assigning responsibility to those closest to the data. This echoes the decomposition seen in microservices architecture

image

  • bridging the gap: data mesh closes the gap between operational and analytical data planes by integrating data products tightly with operational systems, enabling near real-time insights and faster feedback loops

image

  • flexibility in data models: data mesh removes the need for a centralized canonical data model, allowing domains to evolve their models independently without hindering downstream data users. Well-defined data contracts ensure compatibility and smooth transitions

Sustaining Agility in the Face of Growth

  • remove centralised and monolithic bottlenecks: data mesh replaces centralized data lakes and warehouses with a peer-to-peer data sharing model, removing architectural bottlenecks and reducing reliance on intermediary data teams

image

  • reduce coordination of data pipelines: data mesh shifts from technical partitioning of data management to a domain-oriented approach, reducing coordination overhead associated with data pipelines

image

  • reduce coordination of data governance: federated computational governance empowers domain data product owners with automated policy enforcement, reducing friction in data access and sharing

image

Increasing the Ratio of Value from Data to Investment

  • abstract technical complexity with a data platform: a data-product-developer-centric platform abstracts technical complexity, enabling generalist developers to manage data products, reducing reliance on specialized data engineers
  • open interfaces: standardized interfaces across data products foster interoperability and a collaborative ecosystem of technologies, lowering integration costs and encouraging innovation
  • embed product thinking everywhere: applying product thinking to both data and the data platform prioritizes user happiness, focusing on delivering value and a seamless user experience
  • go beyond the boundaries: while data mesh focuses on democratizing data access, the author acknowledges the need for continuous delivery of robust analytical and ML solutions to fully realize the value of data

Data Mesh - Chapter 8 - Before the Inflection Point

Evolution of Analytical data architectures

First generation: The Warehouse

image

Second generation: The Data lake

image

Third generation: Multicloud

  • support streaming for near real-time data availability with architectures like Kappa
  • attempt to unify batch and stream
  • fully embrace cloud-based managed services and use modern cloud-native implementations with isolated compute and storage
  • converge the warehouse and lake into 1 technology

Characteristics of Analytical data architectures

image

There has been an explosion in the number of solutions in big data and AI.

But there are fundamental assumptions that have remained the same:

  • data must be centralised to be useful - managed by a centralised organisation, with an intention to have an enterprise-wide taxonomy
  • data management architecture, technology, and organisation are monolithic
  • the enabling technologies drive the paradigm - architecture and organisation

Monolithic structure

A monolithic architecture, technology stack, and organizational structure lead to bottlenecks, slow data delivery, and difficulty in adapting to evolving data needs - and one of the core assumptions of data mesh challenges exactly this.

The 30,000-foot view of the monolithic data platform:

image

Siloed hyper-specialized data team:

image

Centralised data ownership

This approach, while aiming to combat siloed data, creates a disconnect between data originators and consumers, impacting data quality and hindering domain-specific data understanding.

Centralized data ownership with no clear domain boundaries:

image

Technology oriented

Modern data architectures prioritize technical functions over business needs, creating silos and requiring heavy synchronization. Teams organized by activities (like ingestion or serving) struggle to deliver outcomes efficiently, which can slow down innovation. Shifting to domain-oriented architectures embeds these functions within business domains and enables faster, independent changes and better scalability.


Apache Pinot 201

What is a Pinot cluster?

At its core, it’s a collection of servers or processes that work together to handle data ingestion, storage, and queries. A typical cluster consists of:

image

  • Controller - manages cluster metadata
  • Broker - handles query routing
  • Server - stores and processes data in segments
  • Minions - perform auxiliary tasks offloaded from other components (e.g., batch ingestion, segment merges)
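To make these roles concrete, here is a minimal sketch of starting each component by hand with pinot-admin.sh (the ports and Zookeeper address are my assumptions based on the usual defaults; the docker compose files from the StarTree repo below do the equivalent for you):

# assuming a local Apache Pinot distribution (or a shell inside the official Pinot image)
# Zookeeper holds the cluster state that the controller manages
bin/pinot-admin.sh StartZookeeper -zkPort 2181
bin/pinot-admin.sh StartController -zkAddress localhost:2181 -controllerPort 9000
bin/pinot-admin.sh StartBroker -zkAddress localhost:2181
bin/pinot-admin.sh StartServer -zkAddress localhost:2181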

I used this GitHub repo from StarTree to set up a local Pinot cluster and follow the exercises.

image

One thing that is not ideal is that there is a different docker compose file for every exercise, and each subsequent one adds something. I will use the docker compose from the last exercise, hoping it’s sufficient and I won’t need to run docker-compose for each exercise.

This is what the Pinot UI looks like when I ran the last docker-compose:

image

We can see controllers, brokers, servers, minions, tenants, tables, and the minion task manager.

Tenants

A tenant is a logical component defined as a group of server/broker nodes with the same tag. Tenants can be defined for brokers and/or servers.

image

Broker tenants

Allow for separation of data and compute at the broker level. A sample use case might be a cluster that hosts data for multiple clients and needs to segregate traffic as well as data.

Server tenants

Allow for separation of data and compute at a server level.

A typical use case might be segregating data by department within a company, or separating critical from non-critical workloads.
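As a rough sketch of how this is wired up (the tenant name here is made up), a table is assigned to tenants via a tenants block in its table config, and the broker/server nodes must carry matching tenant tags:

"tenants": {
  "broker": "clientA",
  "server": "clientA"
}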

Data Flow in Pinot

  • in the case of real-time tables, data is written to a segment during the commit phase
  • the server responsible for the segments communicates the segment changes to the controller
  • the controller writes the segment to the deep store
  • tables can bypass the controller and load data directly into the deep store
  • in case of batch, the data is written directly to the deep store when configured

We can add tables through the CLI, the UI, or the API.
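For example, through the API, a table config JSON can be POSTed to the controller (the file name here is hypothetical, and the schema has to exist first):

curl -X POST -H "Content-Type: application/json" \
  -d @movies_offline_table_config.json \
  "http://localhost:9000/tables"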

Exercise

Here the exercise was to add a table. When I open the main Pinot URL and go to /help, I see a Swagger UI (like in FastAPI):

image

This is the exercise:

- Navigate to the Swagger APIs located at [http://localhost:9000/help](http://localhost:9000/help)
- Find the "Schema" section in the APIs.  
- Find the POST - create schema, and click the button "Try It Out"
- Paste the following code and hit Enter: {json schema for a table}
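The actual schema JSON comes from the tutorial; purely as an illustration of the format (the field names and types below are made up), a Pinot schema looks roughly like this:

{
  "schemaName": "movies",
  "dimensionFieldSpecs": [
    { "name": "title", "dataType": "STRING" },
    { "name": "genre", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "rating", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "releaseTimestamp", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:DAYS" }
  ]
}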

Added it:

image

I can also see it in the UI:

image

image

Ingesting data

  • batch
  • streaming
  • hybrid

Exercise

I ran the docker compose for this specific exercise, and there already was a table:

image

And I loaded data through the CLI:

curl -X POST -F file=@movies.csv -H "Content-Type: multipart/form-data" "http://localhost:9000/ingestFromFile?tableNameWithType=movies_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22:%22csv%22,%22recordReader.prop.delimiter%22:%22,%22%7D"

Response:

{"status":"Successfully ingested file into table: movies_OFFLINE as segment: movies_1735447958741_1735447958741_1735447958617"}

How to query data

The single-stage query engine

image

The multi-stage query engine

image
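To try queries outside the UI, SQL can be POSTed to the broker’s query endpoint (port 8099 is the default broker port in these docker setups - an assumption on my part - and the table/column names are just illustrative); the multi-stage engine can be enabled via a toggle in the query console:

curl -X POST -H "Content-Type: application/json" \
  -d '{"sql": "SELECT genre, COUNT(*) AS cnt FROM movies GROUP BY genre ORDER BY cnt DESC LIMIT 10"}' \
  "http://localhost:8099/query/sql"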

Indexes exercise

  • create table and load data:

docker exec -it pinot-controller sh
/opt/pinot/bin/pinot-admin.sh AddTable -schemaFile /scripts/gitHub_events_schema.json -tableConfigFile /scripts/gitHub_events_offline_table_config.json -exec

  • next, populate some data in the tables:

cd data
wget https://data.gharchive.org/2021-07-21-9.json.gz
wget https://data.gharchive.org/2021-07-21-10.json.gz
wget https://data.gharchive.org/2021-07-21-11.json.gz
wget https://data.gharchive.org/2021-07-21-12.json.gz
gunzip *.gz
/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /scripts/job-spec.yaml

Next, I can create an index just by adding bloomFilterColumns to the tableIndexConfig section of the table config:

"bloomFilterColumns": [
      "id",
      "commit_author_names",
      "label_ids"
    ],

This adds bloom filters on the ‘id’, ‘commit_author_names’, and ‘label_ids’ columns.
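After changing the table config, it needs to be pushed to the controller and the segments reloaded so the bloom filters get built for already-loaded data. A rough sketch, assuming the table is called gitHubEvents (I haven’t checked the exact name in the tutorial’s config):

# update the table config with the new tableIndexConfig
curl -X PUT -H "Content-Type: application/json" \
  -d @/scripts/gitHub_events_offline_table_config.json \
  "http://localhost:9000/tables/gitHubEvents"

# reload all segments so the index is generated for existing data
curl -X POST "http://localhost:9000/segments/gitHubEvents_OFFLINE/reload"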

(I wish 1 docker compose worked for all exercises. It does not as there are different config files and extra assets for each exercise. :/ )

Upserts

  • can be full or partial
  • utilise comparison columns to determine the latest iteration of data
  • snapshots add resiliency to the upsert metadata map, so in-memory upsert metadata is not lost in case of a server outage
  • enabling preload speeds up recovery of the upsert metadata map by using those snapshots
  • the StarTree index is not supported when upserts are on
  • to use upserts, a primary key must be defined in the schema config; it is used to partition the data
  • in the table config, define an upsertConfig section with the appropriate mode (full or partial; partial can use overwrite, increment, append, union, max, min strategies) - a rough sketch follows below
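A rough sketch of those two pieces (the column names are made up; upserts only apply to real-time tables, so a Kafka stream config would also be needed):

In the schema:

"primaryKeyColumns": ["orderId"]

In the table config:

"upsertConfig": {
  "mode": "PARTIAL",
  "partialUpsertStrategies": {
    "amount": "OVERWRITE",
    "clickCount": "INCREMENT"
  },
  "comparisonColumn": "updatedAt"
}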

(I feel like the exercises are a bit incomplete, maybe rushed. Some parts are copy-pasted from previous exercises.)

Transformations during ingestion

We can use Groovy functions.

  • to use Groovy functions, we need to enable them in the controller config by setting the disable flag to false:
    • controller.disable.ingestion.groovy=false

Groovy Function Syntax:

Groovy({groovy script}, argument1, argument2...argumentN)

Example:

"ingestionConfig": {
  "transformConfigs": [{
    "columnName": "fullName",
    "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
  }]
}

Minions in Pinot

  • a minion is a standby component that leverages a task framework to offload computationally intensive tasks from other components
  • some typical minion tasks are: batch ingestion, segment creation, segment purge, segment merge

image

Use cases that work for Pinot

Real-time User Activity Monitoring

  • monitoring user interactions, analyzing engagement, and detecting anomalies in real-time on large-scale platforms like social media
  • Pinot leverages Kafka integration to ingest massive data streams and deliver instant analytics

Fraud Detection

  • rapid detection of fraudulent transactions in finance and e-commerce
  • Pinot can be used for real-time analysis of transaction patterns to flag suspicious activities as they occur

Real-time Personalization

  • delivering personalized experiences across e-commerce, streaming services, and online gaming to boost user engagement
  • Pinot enables real-time recommendation engines that tailor content based on individual preferences and behaviours

Operational Analytics

  • real-time monitoring of infrastructure, applications, and services to maintain efficiency and uptime
  • Pinot processes logs, metrics, and traces in real-time, providing actionable insights for smooth operations

Data Mesh - Chapter 9 - The Logical Architecture

Domain ownership extends domains with analytical data sharing interfaces

Domain ownership of data results in a domain-centric organization of analytical data - this means that a domain’s interfaces must extend to sharing its analytical data

image

Data as a product introduces a new architecture quantum, aka data quantum

A data quantum is the core building block of a data mesh architecture, encapsulating all components needed to ensure usability, secure data sharing, and policy enforcement. Each data quantum represents a domain-oriented data product, including the data itself, transformation code, sharing mechanisms, and governing policies. This concept forms the foundation of data mesh by treating data products as self-contained architectural units.

image

The self-serve data platform drives a multiplane platform architecture

The self-serve platform provides a variety of services that enable different types of data mesh users to perform their roles effectively: data producers (such as data product developers and owners), data consumers (like data analysts and scientists), and data governance team members. The platform is structured into three planes, which consist of coordinated services organized according to the needs and experiences of data mesh users.

image

Federated computational governance embeds computational policies into each data product

The principle of federated computational governance enhances each data product with a computational container capable of hosting a sidecar process. This sidecar enforces computational policies at the appropriate stages of the data lifecycle—such as building, deploying, accessing, reading, or writing—integrating policy execution seamlessly into the data flow.

A data product sidecar operates within the execution context of a data product, handling domain-independent, cross-cutting concerns like policy enforcement. It can also be extended to support standardized features like discovery. Notably, the sidecar’s implementation remains consistent across all data products.

image


Data Mesh - Chapter 10 - The Multiplane Data Platform Architecture

image

Design a Platform Driven by User Journeys

The ultimate purpose of the platform is to serve the cross-functional domain teams so they can deliver or consume data products. Hence, the best way to begin the design of the platform is to understand the main journeys of your platform users and evaluate how you can make it easy for them to complete their journeys successfully using the platform.

High-level personas:

  • data product developer: from generalist developers with general programming skills to specialist data engineers who are well-versed in the existing analytical data processing technologies
  • data product consumers: they need to access and use data to do their jobs (e.g., data scientists, data analysts, data product developers, app developers)
  • data product owners: deliver and promote domain-specific data products, ensuring adoption, value delivery, policy compliance, and interoperability using the platform
  • data governance members: operate within a federated structure, collectively ensuring secure and optimal mesh operations, with roles including security, legal experts, and data product owners
  • data platform product owner: develops platform services as a product, prioritizing user needs to deliver an optimal experience for all user roles
  • data platform developer: builds and operates the data platform, contributing to its design while using its utility and product experience services

Data Product Developer Journey

High-level example of the data product development journey:

image

High-level example of the data product development journey using the platform:

image

Incept, Explore, Bootstrap, and Source

Data products begin with identifying real-world analytical use cases to demonstrate their value. Consumer-aligned data products are created through direct collaboration with their users. Developers explore potential data sources, which may include upstream data products, external systems, or organizational systems, assessing their suitability based on guarantees, documentation, and profiling. Once sources are identified, the developer bootstraps the data product using platform-provisioned infrastructure to connect to and experiment with source data. This exploratory phase includes rapid discovery, source access, quick scaffolding, and infrastructure setup.

Build, Test, Deploy, and Run

image

Data product developers have an end-to-end responsibility of building, testing, deploying, and operating their data products. This stage is a continuous and iterative series of activities that data product developers perform to deliver all the necessary components of a successful data product.

Maintain, Evolve, and Retire

  • maintaining and evolving a data product involves continuous updates to transformation logic, data models, access methods, and policies while ensuring uninterrupted data processing and sharing. The platform simplifies maintenance by handling complexities like schema versioning and resource management, allowing developers to focus on product-specific changes.

  • evolution can range from minor updates, like bug fixes, to significant changes, such as migrating storage vendors. Monitoring operational health, including performance, reliability, and costs, is crucial, both for individual data products and the mesh as a whole, with insights and alarms for potential bottlenecks.

  • data products may retire if migrated to new versions or if their data is no longer needed. The platform ensures graceful retirement, allowing downstream users to transition. Dormant products serve old data and enforce policies, while fully retired products cease to exist.

Example ML model development journey

image

Data Product Consumer Journey

Incept, Explore, Bootstrap, Source

  • inception: developers hypothesize intelligent actions or decisions based on existing data, such as creating a playlist loved by listeners
  • exploration: they search and evaluate data products for bias and fairness using APIs and data sampling

Build, Test, Deploy, Run

  • build: developers train models using data transformed into features (e.g., playlist features), treating these transformations as data products
  • test: they track and version data using timestamps for model reproducibility
  • deploy: models are packaged to run as part of a data product and executed using specialized infrastructure like GPUs or TPUs

Maintain, Evolve, and Retire

  • developers monitor model performance by evaluating user behavior metrics (playlist replay rates and listening duration)
  • metrics are collected using operational monitoring tools integrated with user applications

Real-World ML Systems on Kubernetes

Saw this intriguing book on Manning’s MEAP and decided to get it with one of my free credits from when I had a sub. I want to learn more about how K8s is applied in MLOps, and the authors also seem interesting - Re Alvarez Parmar, a Principal Specialist Solutions Architect at AWS, and Elamaran (Ela) Shanmugam, a Sr. Specialist Solutions Architect at AWS as well.

This is the TOC at the time of writing this post:

image


Real-World ML Systems on Kubernetes - Chapter 1 - Accelerating machine learning innovation using MLOps and Kubernetes

The complexity of taking ML models to production has led to the rise of MLOps, a blend of ML and DevOps, focused on creating production-grade ML systems. MLOps integrates software engineering, data management, and operational principles to streamline the development and scaling of machine learning models. Kubernetes, an open-source platform for managing containerized applications, plays a vital role in MLOps by enabling scalability, flexibility, and alignment with software development best practices.

This book explores designing scalable MLOps platforms using Kubernetes, offering insights into improving developer productivity, managing systems at scale, and ensuring reliability. Kubernetes’ ability to run on diverse infrastructures and its widespread adoption makes it an ideal foundation for modern MLOps solutions.

The need for speed

Developing machine learning applications is a highly iterative process - EDA, features, hyperparams - getting a good model might require lots of experimenting. To make this faster, we could use parallelisation to run experiments concurrently and reduce the time it takes to learn what works and what doesn’t.

MLOps is the application of DevOps principles to machine learning engineering, and automation starts as soon as developers merge their code changes into a central repository. Changes in code trigger automated builds and tests to ensure that the code meets standards and is free of any known bugs. And since there is a lot of iterativeness in model development, adopting DevOps practices is a good idea for MLOps engineers.

Developing ML systems

Here the chapter mentions the CRISP-DM process (which was also referenced in the CRISP-ML(Q) paper), and then CRISP-ML(Q) itself.

CRISP-DM consists of 6 major phases of data mining:

  1. business understanding
  2. data understanding
  3. data preparation
  4. model training
  5. evaluation
  6. deployment

And CRISP-ML(Q) adds a 7th - monitoring and maintenance, to address the unique challenges in ML.

Typical ML lifecycle:

image

And it is a circle - an ongoing process.

MLOps platforms

An MLOps platform streamlines the development of machine learning systems by providing end-to-end capabilities that support collaboration among data engineers and machine learning engineers. It automates hardware provisioning, software configuration, and workflows for tasks like data processing, model training, deployment, and monitoring, enabling efficient and scalable solutions.

K8s-based MLOps platforms enhance efficiency by automating repetitive tasks and offering services tailored to different development roles. Some features include single sign-on, unified logging and monitoring, resource management, version control, collaboration tools, and integration with machine learning frameworks, creating a seamless environment for building and deploying machine learning systems.

And building an ML system requires collaboration between different roles:

image

Benefits of building MLOps on K8s

  • the platform will be cloud-native
  • K8s has a vibrant community
  • it enables the creation of immutable infrastructure, enhancing a system’s consistency, predictability, and security
  • it abstracts the underlying infrastructure
  • provides a highly scalable infrastructure for deploying and managing machine learning models
  • supports scalable distributed training to speed up model development
  • provides scalable infrastructure for distributed data ingestion and analytics, with near-native integration with tools like Airflow, Spark, and others
  • integrates seamlessly with CI/CD pipelines
  • has extensive support for observability through Prometheus, Grafana, and 3rd-party tools
  • offers capabilities to isolate workloads, making it ideal for multitenant scenarios
  • it supports policies and the enforcement of security standards in clusters

When to choose K8s as your ML platform

  • your organization is already familiar with it
  • you prefer to build your own systems instead of buying solutions
  • you must remain cloud agnostic at all costs

Today I wrote ~8 pages on Google Docs for my final post. I am not sure if this will be the final version as I keep changing things. I don’t want to be philosophical, but I also want to share my experience and mindset throughout this journey ~ we’ll see ~

That is all for today!

See you tomorrow :)