Hello :) Today is Day 363!
A quick summary of today:
- read more of the Data Mesh book
- covered Apache Pinot 201
- started reading the book Real-World ML Systems on Kubernetes
Data Mesh - Chapter 6 - The Inflection Point
Great Expectations of Data
Companies have similar goals in mind:
- provide the best customer experience based on data and personalization
- reduce operational costs and time through data-driven optimizations
- empower employees to make better decisions with trend analysis and BI
All of these scenarios require data—a high volume of diverse, up-to-date, and truthful data that can, in turn, fuel the underlying analytics and machine learning models.
Before, companies’ data aspirations were limited to making good BI reports, but now they go beyond that: using ML, smart assistants, personalization, and streamlining operations. These new expectations require a new approach to data management, one that can seamlessly fulfill the diversity of uses for data.
The Great Divide of Data
Many of the technical complexities companies face stem from the division of data between operational and analytical.
The two-plane data topology creates a fragile integration architecture, where operational data feeds analytical data via ETL pipelines lacking explicit contracts. This results in brittle pipelines prone to failures from upstream changes. Over time, these pipelines grow complex, managing bidirectional data flow. These challenges, along with reliance on centralized data warehouses or lakes, drive the need for improved solutions.
Scale: Encounter of a new kind
The data industry has evolved to handle large-scale challenges in volume, velocity, and variety, with parallel processing, stream processing, and specialized storage systems addressing these needs.
However, the new challenge lies in managing data from diverse sources and locations. Unlocking insights requires connecting data from multiple sources rather than centralizing it. For example, intelligent healthcare and banking rely on integrating diverse data types beyond organizational control. This shift necessitates a decentralized approach to data management, focusing on connecting data wherever it resides.
Beyond order
Organizations need to adapt to complexity, uncertainty, and volatility. Modern businesses experience continuous change reflected in their data, showing the need for real-time, high-quality, and trustworthy insights for agile decision-making. Future data management systems must:
- enable rapid response to change as a fundamental feature, not an optimization
- move away from rigid schemas toward flexible, dynamic models that embrace complexity
- facilitate autonomous, peer-to-peer team collaborations
- support multi-platform data access and management across cloud and on-premise environments
Approaching the Plateau of Return
This section highlights the gap between investment in data and AI and the returns on it. According to NewVantage Partners’ annual reports, while most companies invest heavily in data and AI (99%, with 62% spending over $50M), only a minority report meaningful progress: 24.4% have a data culture, 24% are data-driven, and 41.2% compete using data.
Challenges include:
- legacy systems
- cultural resistance
- competing priorities
- skill gaps
- governance issues
- data usability friction
Future data management strategies must address these barriers to achieve better returns on investment.
llmops-python-package
Apparently this package for LLMOps was announced yesterday (or this week). It includes tools for model registry, experiment tracking, and real-time inference, with a focus on automation, CI/CD, code quality, model management and observability. It’s nice seeing packages like this.
Data Mesh - Chapter 7 - Navigating the Inflection Point
Data mesh is all about:
- embracing change: data mesh assumes constant evolution and diversity in data and its uses as the natural state, enabling organizations to respond gracefully to change in a complex business environment
- agility at scale: data mesh aims to sustain agility in the face of growth by removing organizational bottlenecks and reducing the need for coordination and synchronization
- maximizing data value: data mesh strives to increase the value derived from data relative to investment by simplifying the data landscape and applying product thinking
Key Ideas and Facts:
Responding Gracefully to Change
- complexity as the default: data mesh acknowledges the inherent complexity of modern businesses and embraces the continuous evolution of data landscapes
- alignment is key: the principle of domain data ownership aligns business, technology, and analytical data by assigning responsibility to those closest to the data. This echoes the decomposition seen in microservices architecture
- bridging the gap: data mesh closes the gap between operational and analytical data planes by integrating data products tightly with operational systems, enabling near real-time insights and faster feedback loops
- flexibility in data models: data mesh removes the need for a centralized canonical data model, allowing domains to evolve their models independently without hindering downstream data users. Well-defined data contracts ensure compatibility and smooth transitions
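To make the data contract idea a bit more concrete, here is a minimal Python sketch. The contract format (field name mapped to a type) and the names `ORDERS_CONTRACT_V1`, `violations` and `is_backward_compatible` are my own assumptions for illustration; the book does not prescribe a format.

```python
# Hypothetical contract format: field name -> expected Python type.
ORDERS_CONTRACT_V1 = {"order_id": str, "amount": float}
# A later version that only adds a field.
ORDERS_CONTRACT_V2 = {"order_id": str, "amount": float, "currency": str}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def is_backward_compatible(old, new):
    """A new contract is compatible if it keeps every old field and its type."""
    return all(new.get(field) is expected for field, expected in old.items())


record = {"order_id": "A-1", "amount": 9.99, "currency": "EUR"}
print(violations(record, ORDERS_CONTRACT_V1))  # extra fields are tolerated
print(is_backward_compatible(ORDERS_CONTRACT_V1, ORDERS_CONTRACT_V2))
```

The point of the sketch is that a domain can add fields to its model without breaking downstream users, as long as the fields the old contract promised are still there.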
Sustaining Agility in the Face of Growth
- remove centralised and monolithic bottlenecks: data mesh replaces centralized data lakes and warehouses with a peer-to-peer data sharing model, removing architectural bottlenecks and reducing reliance on intermediary data teams
- reduce coordination of data pipelines: data mesh shifts from technical partitioning of data management to a domain-oriented approach, reducing coordination overhead associated with data pipelines
- reduce coordination of data governance: federated computational governance empowers domain data product owners with automated policy enforcement, reducing friction in data access and sharing
Increasing the Ratio of Value from Data to Investment
- abstract technical complexity with a data platform: a data-product-developer-centric platform abstracts technical complexity, enabling generalist developers to manage data products, reducing reliance on specialized data engineers
- open interfaces: standardized interfaces across data products foster interoperability and a collaborative ecosystem of technologies, lowering integration costs and encouraging innovation
- embed product thinking everywhere: applying product thinking to both data and the data platform prioritizes user happiness, focusing on delivering value and a seamless user experience
- go beyond the boundaries: while data mesh focuses on democratizing data access, the author acknowledges the need for continuous delivery of robust analytical and ML solutions to fully realize the value of data
Data Mesh - Chapter 8 - Before the Inflection Point
Evolution of Analytical data architectures
First generation: The Warehouse
Second generation: The Data lake
Third generation: Multicloud
- support streaming for near real-time data availability with architectures like Kappa
- attempt to unify batch and stream
- fully embrace cloud-based managed services and use modern cloud-native implementations with isolated compute and storage
- converge the warehouse and lake into 1 technology
Characteristics of Analytical data architectures
There has been an explosion in the number of solutions in big data and AI
But there are fundamental assumptions that have remained the same:
- data must be centralised to be useful - managed by a centralised organisation, with an intention to have an enterprise-wide taxonomy
- data management architecture, technology, and organisation are monolithic
- the enabling technologies drive the paradigm - architecture and organisation
Monolithic structure
A monolithic architecture, technology stack, and organizational structure lead to bottlenecks, slow data delivery, and difficulty adapting to evolving data needs. Challenging this monolithic assumption is one of the core ideas of data mesh.
The 30,000-foot view of the monolithic data platform:
Siloed hyper-specialized data team:
Centralised data ownership
This approach, while aiming to combat siloed data, creates a disconnect between data originators and consumers, impacting data quality and hindering domain-specific data understanding.
Centralized data ownership with no clear domain boundaries:
Technology oriented
Modern data architectures prioritize technical functions over business needs, creating silos and requiring heavy synchronization. Teams organized by activities (like ingestion or serving) struggle to deliver outcomes efficiently which can slow down innovation. Shifting to domain-oriented architectures embeds functions within business domains and enables faster and independent changes and better scalability.
Apache Pinot 201
What is a Pinot cluster?
At its core, it’s a collection of servers or processes that work together to handle ingestion, storage and query processing. A typical cluster consists of:
- Controller - manages cluster metadata
- Broker - handles query routing
- Server - stores and processes data in segments
- Minions - perform different tasks
I used this GitHub repo from StarTree to set up a local Pinot cluster and follow the exercises.
One thing that is not ideal is that for every exercise there is a different docker compose file and each next one adds something. I will use the docker compose in the last exercise hoping it’s sufficient and I won’t need to run docker-compose for each exercise.
This is what the Pinot UI looks like when I ran the last docker-compose:
We can see controllers, brokers, servers, minions, tenants, tables and the minion task manager.
Tenants
A tenant is a logical component defined as a group of server/broker nodes with the same tag. Tenants can be defined for brokers and/or servers.
Broker tenants
Allow for separation of data and compute at the broker level. A sample use case might be a cluster that hosts data for multiple clients, and might need to segregate traffic as well as data.
Server tenants
Allow for separation of data and compute at a server level.
A typical use case might be segregating data by department within a company, or separating critical from non-critical tasks.
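To picture "a group of nodes with the same tag", here is a small Python sketch. The node names and tag strings are made up for illustration, and this is not how Pinot stores tags internally; it only shows the grouping idea.

```python
from collections import defaultdict

# Hypothetical node -> tags mapping; instance names and tags are made up.
NODE_TAGS = {
    "broker-1": ["clientA_BROKER"],
    "broker-2": ["clientB_BROKER"],
    "server-1": ["clientA_OFFLINE"],
    "server-2": ["clientA_OFFLINE"],
    "server-3": ["clientB_OFFLINE"],
}


def group_by_tag(node_tags):
    """Group node names by tag: each tag corresponds to one tenant's nodes."""
    tenants = defaultdict(list)
    for node, tags in node_tags.items():
        for tag in tags:
            tenants[tag].append(node)
    return dict(tenants)


tenants = group_by_tag(NODE_TAGS)
print(tenants["clientA_OFFLINE"])
```

With this layout, client A's data and queries only ever touch client A's brokers and servers.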
Data Flow in Pinot
- in the case of real time tables, data is written to a segment during the commit phase
- the server responsible for the segments communicates the segment changes to the controller
- the controller writes the segment to the deep store
- tables can bypass the controller and load data directly into the deep store
- in case of batch, the data is written directly to the deep store when configured
We can add tables through the CLI, the UI, or the API.
Exercise
Here the exercise was to add a table. When I open the main Pinot URL and go to /help, I see the Swagger API docs (like in FastAPI):
This is the exercise:
- Navigate to the Swagger APIs located at [http://localhost:9000/help](http://localhost:9000/help)
- Find the "Schema" section in the APIs.
- Find the POST - create schema, and click the button "Try It Out"
- Paste the following code and hit Enter: {json schema for a table}
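The same step could probably be scripted instead of clicking through Swagger. Below is a rough Python sketch that builds a POST request for the controller's /schemas endpoint. The schema itself (`example_events` and its columns) is invented for illustration and is not the one from the exercise. The request is only built, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumption: the controller runs locally on port 9000, as in the exercise.
CONTROLLER_URL = "http://localhost:9000"

# Hypothetical schema; the table and column names are invented for illustration.
EXAMPLE_SCHEMA = {
    "schemaName": "example_events",
    "dimensionFieldSpecs": [
        {"name": "userId", "dataType": "STRING"},
        {"name": "eventType", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
    "dateTimeFieldSpecs": [
        {
            "name": "ts",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}


def build_schema_request(schema, base_url=CONTROLLER_URL):
    """Build (but do not send) a POST request for the controller's /schemas endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/schemas",
        data=json.dumps(schema).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_schema_request(EXAMPLE_SCHEMA)
print(request.get_method(), request.full_url)
# To actually send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```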
Added it:
I can also see it in the UI:
Ingesting data
- batch
- streaming
- hybrid
Exercise
I ran the docker compose for this specific exercise, and there already was a table:
And I loaded data through the CLI:
curl -X POST -F file=@movies.csv -H "Content-Type: multipart/form-data" "http://localhost:9000/ingestFromFile?tableNameWithType=movies_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22:%22csv%22,%22recordReader.prop.delimiter%22:%22,%22%7D"
Response:
{"status":"Successfully ingested file into table: movies_OFFLINE as segment: movies_1735447958741_1735447958741_1735447958617"}
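The batchConfigMapStr parameter in that URL looks cryptic, but it is just a small JSON object that has been percent-encoded. Here is a short Python sketch of how such a URL could be put together; the choice to leave ':' and ',' unencoded is simply to match the URL from the exercise.

```python
import json
from urllib.parse import quote, unquote

batch_config = {"inputFormat": "csv", "recordReader.prop.delimiter": ","}

# Compact JSON (no spaces), then percent-encode everything except ':' and ','.
compact_json = json.dumps(batch_config, separators=(",", ":"))
encoded = quote(compact_json, safe=":,")

url = (
    "http://localhost:9000/ingestFromFile"
    "?tableNameWithType=movies_OFFLINE"
    f"&batchConfigMapStr={encoded}"
)
print(url)

# Decoding gives back the original config.
assert json.loads(unquote(encoded)) == batch_config
```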
How to query data
The single-stage query engine
The multi-stage query engine
Indexes exercise
- create table and load data
docker exec -it pinot-controller sh
/opt/pinot/bin/pinot-admin.sh AddTable -schemaFile /scripts/gitHub_events_schema.json -tableConfigFile /scripts/gitHub_events_offline_table_config.json -exec
- next, populate some data in the tables
cd data
wget https://data.gharchive.org/2021-07-21-9.json.gz
wget https://data.gharchive.org/2021-07-21-10.json.gz
wget https://data.gharchive.org/2021-07-21-11.json.gz
wget https://data.gharchive.org/2021-07-21-12.json.gz
gunzip *.gz
/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /scripts/job-spec.yaml
Next, I can create an index, for example a bloom filter on the column ‘commit_author_names’, just by adding this to the tableIndexConfig section of the table config:
"bloomFilterColumns": [
  "id",
  "commit_author_names",
  "label_ids"
],
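For intuition about what a bloom filter does: it answers either "definitely not here" or "maybe here" for a value, which is what lets a query engine skip segments on equality filters. The toy Python version below is my own sketch with arbitrary sizes (`num_bits`, `num_hashes`), not Pinot's implementation.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive several bit positions by salting the value with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for position in self._positions(value):
            self.bits[position] = True

    def might_contain(self, value):
        return all(self.bits[position] for position in self._positions(value))


segment_filter = BloomFilter()
for author in ["alice", "bob"]:
    segment_filter.add(author)

print(segment_filter.might_contain("alice"))  # always True for added values
print(segment_filter.might_contain("carol"))  # usually False, but can be a false positive
```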
(I wish 1 docker compose worked for all exercises. It does not as there are different config files and extra assets for each exercise. :/ )
Upserts
- can be full or partial
- utilise comparison columns to determine the latest iteration of data
- snapshots make the upsert metadata map more resilient: they help restore in-memory upsert metadata that is lost in a server outage
- enabling preload speeds up recovery of the upsert metadata map by using snapshots
- the star-tree index is not supported when upserts are on
- to use upserts, a primary key must be defined in the schema config, because it is used to partition the data
- in the table config, define an upsertConfig section with the appropriate mode (full or partial; partial can use overwrite, increment, append, union, max, min strategies)
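To check my understanding of full vs partial upserts, here is a small in-memory Python simulation. It is only a sketch: the function names, the `ts` comparison column and the sample records are all made up, only a few of the strategies are modelled, and Pinot's real behaviour (for example around out-of-order records) may differ.

```python
def full_upsert(current, incoming, comparison_column="ts"):
    """Full upsert: the record with the larger comparison value replaces the other."""
    if current is None or incoming[comparison_column] >= current[comparison_column]:
        return dict(incoming)
    return current


# Hypothetical per-column merge strategies for a partial upsert.
STRATEGIES = {
    "OVERWRITE": lambda old, new: new,
    "INCREMENT": lambda old, new: old + new,
    "MAX": max,
    "MIN": min,
}


def partial_upsert(current, incoming, column_strategies, comparison_column="ts"):
    """Partial upsert: merge column by column; unlisted columns are overwritten."""
    if current is None:
        return dict(incoming)
    if incoming[comparison_column] < current[comparison_column]:
        return current  # out-of-order record, keep what we have
    merged = dict(current)
    for column, new_value in incoming.items():
        strategy = STRATEGIES[column_strategies.get(column, "OVERWRITE")]
        merged[column] = strategy(current[column], new_value)
    return merged


table = {}  # primary key -> latest record
events = [
    {"id": "u1", "ts": 1, "clicks": 2, "best_score": 10},
    {"id": "u1", "ts": 2, "clicks": 3, "best_score": 7},
]
for event in events:
    table[event["id"]] = partial_upsert(
        table.get(event["id"]), event, {"clicks": "INCREMENT", "best_score": "MAX"}
    )
print(table["u1"])
```

After both events, clicks have been summed and best_score keeps the maximum, while a full upsert would simply have kept the second record as-is.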
(I feel like the exercises are a bit incomplete, maybe rushed. Some parts are copy-pasted from previous exercises.)
Transformations during ingestion
We can use groovy functions.
- to use Groovy functions, we need to enable them in the controller config by setting controller.disable.ingestion.groovy=false
Groovy Function Syntax:
Groovy({groovy script}, argument1, argument2...argumentN)
Example:
"ingestionConfig": {
"transformConfigs": [{
"columnName": "fullName",
"transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
}]
}
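When there are many derived columns, these entries could be generated rather than typed by hand. A minimal Python sketch, where the helper name `groovy_transform` is my own invention, that reproduces the example above:

```python
import json


def groovy_transform(column_name, script, arguments):
    """Build one transformConfigs entry using the Groovy({script}, args...) syntax."""
    call = "Groovy({" + script + "}, " + ", ".join(arguments) + ")"
    return {"columnName": column_name, "transformFunction": call}


ingestion_config = {
    "ingestionConfig": {
        "transformConfigs": [
            groovy_transform("fullName", "firstName+' '+lastName", ["firstName", "lastName"])
        ]
    }
}
print(json.dumps(ingestion_config, indent=2))
```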
Minions in Pinot
- a minion is a standby component that leverages a task framework to offload computationally intensive tasks from other components
- some typical minion tasks are: batch ingestion, segment creation, segment purge, segment merge
Use cases that work for Pinot
Real-time User Activity Monitoring
- monitoring user interactions, analyzing engagement, and detecting anomalies in real-time on large-scale platforms like social media
- Pinot leverages Kafka integration to ingest massive data streams and deliver instant analytics
Fraud Detection
- rapid detection of fraudulent transactions in finance and e-commerce
- Pinot can be used for real-time analysis of transaction patterns to flag suspicious activities as they occur
Real-time Personalization
- delivering personalized experiences across e-commerce, streaming services, and online gaming to boost user engagement
- Pinot enables real-time recommendation engines that tailor content based on individual preferences and behaviours
Operational Analytics
- real-time monitoring of infrastructure, applications, and services to maintain efficiency and uptime
- Pinot processes logs, metrics, and traces in real-time, providing actionable insights for smooth operations
Data Mesh - Chapter 9 - The Logical Architecture
Domain ownership extends domains with analytical data sharing interfaces
Domain ownership of data results in a domain-centric organization of analytical data - this means that a domain’s interfaces must extend to sharing its analytical data
Data as a product introduces a new architecture quantum, aka data quantum
A data quantum is the core building block of a data mesh architecture, encapsulating all components needed to ensure usability, secure data sharing, and policy enforcement. Each data quantum represents a domain-oriented data product, including the data itself, transformation code, sharing mechanisms, and governing policies. This concept forms the foundation of data mesh by treating data products as self-contained architectural units.
The self-serve data platform drives a multiplane platform architecture
The self-serve platform provides a variety of services that enable different types of data mesh users to perform their roles effectively: data producers (such as data product developers and owners), data consumers (like data analysts and scientists), and data governance team members. The platform is structured into three planes, which consist of coordinated services organized according to the needs and experiences of data mesh users.
Federated computational governance embeds computational policies into each data product
The principle of federated computational governance enhances each data product with a computational container capable of hosting a sidecar process. This sidecar enforces computational policies at the appropriate stages of the data lifecycle—such as building, deploying, accessing, reading, or writing—integrating policy execution seamlessly into the data flow.
A data product sidecar operates within the execution context of a data product, handling domain-independent, cross-cutting concerns like policy enforcement. It can also be extended to support standardized features like discovery. Notably, the sidecar’s implementation remains consistent across all data products.
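Here is how I picture the sidecar idea in code. This is a toy Python sketch under my own assumptions: the class names, the `mask_email` policy and the sample data are hypothetical and do not come from the book. It only shows one sidecar implementation being shared across data products while the policies vary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A policy takes (consumer, record) and returns the record, possibly modified.
Policy = Callable[[str, Dict], Dict]


def mask_email(consumer: str, record: Dict) -> Dict:
    """Hypothetical policy: hide emails from everyone except the 'compliance' consumer."""
    if consumer == "compliance" or "email" not in record:
        return record
    return {**record, "email": "***"}


@dataclass
class Sidecar:
    """Same implementation for every data product; only the policy list differs."""
    policies: List[Policy] = field(default_factory=list)

    def on_read(self, consumer: str, record: Dict) -> Dict:
        for policy in self.policies:
            record = policy(consumer, record)
        return record


@dataclass
class DataProduct:
    """A data quantum in miniature: data plus the sidecar that governs access."""
    name: str
    records: List[Dict]
    sidecar: Sidecar

    def read(self, consumer: str) -> List[Dict]:
        return [self.sidecar.on_read(consumer, r) for r in self.records]


listeners = DataProduct(
    name="listener-profiles",
    records=[{"id": 1, "email": "a@example.com"}],
    sidecar=Sidecar(policies=[mask_email]),
)
print(listeners.read("analyst"))
print(listeners.read("compliance"))
```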
Data Mesh - Chapter 10 - The Multiplane Data Platform Architecture
Design a Platform Driven by User Journeys
The ultimate purpose of the platform is to serve the cross-functional domain teams so they can deliver or consume data products. Hence, the best way to begin the design of the platform is to understand the main journeys of your platform users and evaluate how you can make it easy for them to complete their journeys successfully using the platform.
High-level personas:
- data product developer: from generalist developers with general programming skills to specialist data engineers who are well-versed in the existing analytical data processing technologies
- data product consumers: they need to access and use data to do their job (e.g. DSs, DAs, data product devs, app devs)
- data product owners: deliver and promote domain-specific data products, ensuring adoption, value delivery, policy compliance, and interoperability using the platform
- data governance members: operate within a federated structure, collectively ensuring secure and optimal mesh operations, with roles including security, legal experts, and data product owners
- data platform product owner: develops platform services as a product, prioritizing user needs to deliver an optimal experience for all user roles
- data platform developer: builds and operates the data platform, contributing to its design while using its utility and product experience services
Data Product Developer Journey
High-level example of the data product development journey:
High-level example of the data product development journey using the platform:
Incept, Explore, Bootstrap, and Source
Data products begin with identifying real-world analytical use cases to demonstrate their value. Consumer-aligned data products are created through direct collaboration with their users. Developers explore potential data sources, which may include upstream data products, external systems, or organizational systems, assessing their suitability based on guarantees, documentation, and profiling. Once sources are identified, the developer bootstraps the data product using platform-provisioned infrastructure to connect to and experiment with source data. This exploratory phase includes rapid discovery, source access, quick scaffolding, and infrastructure setup.
Build, Test, Deploy, and Run
Data product developers have an end-to-end responsibility of building, testing, deploying, and operating their data products. This stage is a continuous and iterative series of activities that data product developers perform to deliver all the necessary components of a successful data product.
Maintain, Evolve, and Retire
- maintaining and evolving a data product involves continuous updates to transformation logic, data models, access methods, and policies while ensuring uninterrupted data processing and sharing. The platform simplifies maintenance by handling complexities like schema versioning and resource management, allowing developers to focus on product-specific changes.
- evolution can range from minor updates, like bug fixes, to significant changes, such as migrating storage vendors. Monitoring operational health, including performance, reliability, and costs, is crucial, both for individual data products and the mesh as a whole, with insights and alarms for potential bottlenecks.
- data products may retire if migrated to new versions or if their data is no longer needed. The platform ensures graceful retirement, allowing downstream users to transition. Dormant products serve old data and enforce policies, while fully retired products cease to exist.
Example ML model development journey
Data Product Consumer Journey
Incept, Explore, Bootstrap, Source
- inception: developers hypothesize intelligent actions or decisions based on existing data, such as creating a playlist loved by listeners
- exploration: they search and evaluate data products for bias and fairness using APIs and data sampling
Build, Test, Deploy, Run
- build: developers train models using data transformed into features (e.g. playlist features), treating these transformations as data products
- test: they track and version data using timestamps for model reproducibility
- deploy: models are packaged to run as part of a data product and executed using specialized infrastructure like GPUs or TPUs
Maintain, Evolve, and Retire
- developers monitor model performance by evaluating user behavior metrics (playlist replay rates and listening duration)
- metrics are collected using operational monitoring tools integrated with user applications
Real-World ML Systems on Kubernetes
Saw this intriguing book on Manning’s MEAP and decided to get it with one of my free credits from when I had a sub. I want to learn more about K8s’ application in MLOps and also the authors seem interesting - Re Alvarez Parmar - a Principal Specialist Solutions Architect at AWS, and Elamaran (Ela) Shanmugam - a Sr. Specialist Solutions Architect with AWS as well.
This is the TOC at the time of writing this post:
Real-World ML Systems on Kubernetes - Chapter 1 - Accelerating machine learning innovation using MLOps and Kubernetes
The complexity of taking ML models to production has led to the rise of MLOps, a blend of ML and DevOps, focused on creating production-grade ML systems. MLOps integrates software engineering, data management, and operational principles to streamline the development and scaling of machine learning models. Kubernetes, an open-source platform for managing containerized applications, plays a vital role in MLOps by enabling scalability, flexibility, and alignment with software development best practices.
This book explores designing scalable MLOps platforms using Kubernetes, offering insights into improving developer productivity, managing systems at scale, and ensuring reliability. Kubernetes’ ability to run on diverse infrastructures and its widespread adoption makes it an ideal foundation for modern MLOps solutions.
The need for speed
Developing machine learning applications is a highly iterative process - EDA, features, hyperparams - getting a good model might require lots of experimenting. To make this faster we could use some parallelisation to conduct experiments concurrently and reduce the time to learn what works and doesn’t.
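A tiny Python sketch of the "run experiments concurrently" idea. `run_experiment` is a stand-in with a made-up loss function, not a real training run, and on a real platform each experiment would more likely be its own process or Kubernetes job than a thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(learning_rate):
    """Stand-in for a training run: returns a made-up validation loss."""
    return (learning_rate - 0.1) ** 2


learning_rates = [0.001, 0.01, 0.1, 1.0]

# Run the experiments concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(run_experiment, learning_rates))

best_loss, best_rate = min(zip(losses, learning_rates))
print(f"best learning rate: {best_rate}")
```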
MLOps is the application of DevOps principles to machine learning engineering, and automation starts as soon as developers merge their code changes into a central repository. Changes in code trigger automated builds and tests to ensure that the code meets standards and is free of any known bugs. And since there is a lot of iterativeness in model development, adopting DevOps practices is a good idea for MLOps engineers.
Developing ML systems
Here the CRISP-DM process is mentioned (which was also referenced in the CRISP-ML(Q) paper), and then the CRISP-ML(Q) itself is mentioned as well.
CRISP-DM consists of 6 major phases of data mining:
- business understanding
- data understanding
- data preparation
- modeling
- evaluation
- deployment
And CRISP-ML(Q) adds a 7th - monitoring and maintenance, to address the unique challenges in ML.
Typical ML lifecycle:
And it is a circle - an ongoing process.
MLOps platforms
An MLOps platform streamlines the development of machine learning systems by providing end-to-end capabilities that support collaboration among data engineers and machine learning engineers. It automates hardware provisioning, software configuration, and workflows for tasks like data processing, model training, deployment, and monitoring, enabling efficient and scalable solutions.
K8s-based MLOps platforms enhance efficiency by automating repetitive tasks and offering services tailored to different development roles. Some features include single sign-on, unified logging and monitoring, resource management, version control, collaboration tools, and integration with machine learning frameworks, creating a seamless environment for building and deploying machine learning systems.
And building an ML system requires collaboration between different roles
Benefits of building MLOps on K8s
- the platform will be cloud-native
- K8s has a vibrant community
- it enables the creation of immutable infrastructure, enhancing a system’s consistency, predictability, and security
- it abstracts the underlying infrastructure
- provides a highly scalable infrastructure for deploying and managing machine learning models
- supports scalable distributed training to speed up model development
- provides the scalable infrastructure for distributed data ingestion and data analytics, with near-native integration with tools like Airflow, Spark, and others
- integrates seamlessly with CI/CD pipelines
- has extensive support for observability through Prometheus, Grafana, and 3rd-party tools
- offers capabilities to isolate workloads, making it ideal for multitenant scenarios
- it supports policies and the enforcement of security standards in clusters
When to choose K8s as your ML platform
- your organization is already familiar with it
- you prefer to build your own systems instead of buying solutions
- you must remain cloud agnostic at all costs
Today I wrote about 8 pages in Google Docs for my final post. I am not sure if this will be the final version as I keep changing things. I don’t want to be philosophical but I also want to share my experience and mindset throughout this journey ~ we’ll see ~
That is all for today!
See you tomorrow :)