Hello :) Today is Day 361!
A quick summary of today:
- read a few chapters of the Architecting Data and Machine Learning Platforms book
- watched talks on Apache Paimon from Flink Forward Jakarta ‘24
Architecting Data and Machine Learning Platforms - Chapter 3 - Designing for Your Data Team
When designing a data platform, technical aspects like performance, cost, and integration of new approaches are important. However, addressing company culture, employee skills, and willingness to adapt is crucial for success. Implementing a data platform often requires employees to learn new skills, change workflows, and potentially transition into new roles.
We should tailor our strategies to existing talent:
- focus on existing strengths
- consider workforce domain knowledge and evaluate the feasibility of upskilling
- explore tools like generative AI or no-code platforms to empower a broader workforce
However, public cloud tech and modern tools have blurred traditional distinctions between roles (data analysts, engineers, scientists) as analysts can now handle tasks previously requiring engineers, while blended ETL-ELT approaches enable flexibility. This convergence of skills leads to more efficient data processing.
Organizations can generally be classified as:
- data analysis-driven: focus on leveraging existing analytical skills
- data engineering-driven: build robust engineering capabilities
- data science-driven: prioritize advanced algorithm development
Data Analysis–Driven Organization
DAs play a central role in decision making. It is important to note that whether an organization is analyst driven is not a black-and-white issue but rather a spectrum of overlapping characteristics
The vision
The vision for modern cloud data platforms is to empower analysts to perform advanced analytics using familiar interfaces like spreadsheets, SQL, and BI tools. Traditional ETL processes are being replaced by ELT, where data is ingested into the cloud and transformations occur within the cloud data warehouse (DWH). This approach simplifies data pipelines, reduces dependency on data engineering teams, and enables analysts to leverage SQL skills for faster, more direct data insights.
To scale cloud data platform adoption, analysts should have access to user-friendly tools (like dbt) and interfaces (Excel, Tableau) for analytics and visualization. Modern BI tools, such as Power BI, Tableau, or Looker, are recommended for handling large datasets efficiently, overcoming the limitations of traditional spreadsheets. This setup accelerates data analysis and enhances the timeliness and usability of insights.
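Since ELT here means the transformations run as SQL inside the warehouse, a minimal sketch of the pattern (with SQLite standing in for a cloud DWH and made-up table names) could look like this:

```python
import sqlite3

# Minimal ELT sketch: load raw rows first, then transform with SQL inside the
# engine. SQLite stands in for a cloud DWH (BigQuery, Snowflake, Redshift);
# the table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "DE"), (2, 25.5, "DE"), (3, 7.0, "FR")],
)

# The "T" of ELT: a SQL transformation materialised as a new table in the
# warehouse, which analysts can write themselves (or manage with a tool like dbt).
conn.execute("""
    CREATE TABLE orders_by_country AS
    SELECT country, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY country
""")
print(conn.execute("SELECT * FROM orders_by_country").fetchall())
```

In practice an analyst would manage such SELECT statements as dbt models rather than as raw scripts, but the division of labour is the same: load first, transform with SQL where the data already lives.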
The personas
Data analysts:
- focus on fulfilling business information needs by designing, maintaining, and transforming data
- tasks include creating table layouts, reorganizing data, and generating reports with insights
- skill expansion is crucial, emphasizing deep business domain knowledge and technical proficiency in analyzing large datasets using tools like SQL and BI platforms
Business analysts:
- domain experts using data for actionable insights
- cloud-based data warehouses (DWHs) and serverless technologies enable analysts to focus on value-driven analysis rather than technical tasks
- while they can build no-code/low-code ML models, they lack the skills for complex workflows like ranking or recommendations, so data science teams are still needed for advanced tasks
Data Engineers:
- manage the downstream data pipeline, including loading, integration, governance, and data quality
- promote ELT processes over ETL, leveraging SQL for post-loading transformations, improving efficiency and accessibility
- handle complex tasks like batch jobs, mainframe data integration, and real-time analytics using tools like Pub/Sub, Kafka, or Kinesis
- set up templates for generic tasks (e.g., DQ checks) that analysts can reconfigure (a minimal sketch follows this list)
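As an illustration of such a template, here is a hedged sketch of a reusable data quality check that engineers could ship and analysts reconfigure; the table, column, and threshold are hypothetical:

```python
import sqlite3

# A reusable data quality check that engineers define once and analysts
# reconfigure per table (the table/column names here are hypothetical).
def null_rate_check(conn, table: str, column: str, max_null_rate: float = 0.01) -> bool:
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    rate = (nulls or 0) / total if total else 0.0
    print(f"{table}.{column}: null rate {rate:.2%} (threshold {max_null_rate:.2%})")
    return rate <= max_null_rate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "a@x.com"), (2, None)])

# Analysts only change the configuration, not the check itself.
null_rate_check(conn, table="customers", column="email", max_null_rate=0.10)
```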
The technological framework
There are three fundamental principles underlying the high-level reference architecture for an analysis-driven organization:
- SQL as a standard
- from EDW/data lake to a structured data lake
- schema-on-read first approach
Combining these principles, we can define a high-level architecture that supports a few key data analysis patterns:
- the ‘traditional’ BI workloads, such as creating a dashboard or report
- an ad hoc analytics interface that allows for the management of data pipelines through SQL (ELT)
- enabling data science use cases with ML techniques
- real-time streaming of data into the DWH and processing of real-time events
Data Engineering–Driven Organization
An engineering-driven organization is focused on data integration.
The vision
When your data transformation needs are complex, you need data engineers to play a central role in the company’s data strategy to ensure that you can build reliable systems cost-effectively. Data engineers are at the crossroads between data owners and data consumers.
Data engineers have the following responsibilities:
- transporting data and enriching data while building integrations between analytical and operational systems (as in the real-time use cases)
- parsing and transforming messy data coming from business units and external suppliers into meaningful and clean data, with documented metadata
- applying DataOps — that is, functional knowledge of the business plus software engineering methodologies applied to the data lifecycle; this includes monitoring and maintenance of data feeds
- deploying ML models and other data science artifacts that analyze or consume data
Building complex data engineering pipelines is expensive but enables increased capabilities:
- enriched data
- better ML training data
- using unstructured data
- productionising
- real-time analytics
The personas
- knowledge: data warehousing and data lakes, db systems, ETL tools, programming languages, data structures and algorithms, automation and scripting, container technology, orchestration platforms
- responsibilities: data pipeline development, data integration, DQ assurance, automation, collaboration with data consumers, scalability and performance optimization, monitoring and troubleshooting, deployment and maintenance of data systems, support for machine learning models, data security and compliance
The technological framework
The technological framework emphasizes that data engineers should focus on data preparation (cleaning, transformation, enrichment) and not on maintaining infrastructure. The ingestion layer uses scalable solutions like Kinesis, Event Hubs, or Pub/Sub for real-time data and S3, Azure Data Lake, or Cloud Storage for batch data. These solutions automatically scale based on workload demands.
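As a concrete flavour of the ingestion layer, here is a hedged sketch of pushing one event into a managed stream using Kinesis via boto3; the stream name, region, and credentials are assumptions, and Pub/Sub or Event Hubs would look similar:

```python
import json
import boto3

# Sketch of pushing a single event into a managed ingestion stream.
# Assumes AWS credentials are configured and that a Kinesis stream named
# "clickstream-events" already exists; these names are invented.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": 42, "action": "add_to_cart", "ts": "2024-12-26T10:00:00Z"}
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),  # controls shard assignment
)
```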
Example reference architecture on AWS:
On GCP:
The reference architecture’s main benefits include the use of serverless, NoOps technologies that reduce maintenance burdens and allow for flexibility and scalability, such as handling traffic spikes. It also supports custom code development for complex parsing and low-latency business logic, with an emphasis on reusing code through libraries and templates. Additionally, it enables the transformation of existing batch pipelines into streaming pipelines.
Data Science–Driven Organization
These organizations leverage data science and machine learning to gain a competitive advantage. They rely on automated decision-making and strive for high business impact from their data initiatives.
For example, a bank that is data analysis driven would have data analysts assess each commercial loan opportunity, build an investment case, and have an executive sign off on it. A data science–driven fintech, on the other hand, would build a loan approval system that makes decisions on the majority of loans using some sort of automated algorithm.
The vision
Principles to follow:
- adaptability
- standardization
- accountability
- business impact
- implementation
A science-driven organization should be built on a technical platform that is highly adaptable in terms of technological openness.
The personas
- DEs
- MLEs
- DSs
- DAs
The technological framework
The authors strongly recommend standardizing on the ML pipelines infrastructure of your public cloud provider rather than cobbling together one-off training and deployment solutions.
Architecting Data and Machine Learning Platforms - Chapter 4 - A Migration Framework
Unless you are at a startup, it is rare that you will build a data platform from scratch. Instead, you will stand up a new data platform by migrating things into it from legacy systems
Modernise data workflows
Before starting to make a migration plan, we ought to have a comprehensive vision of why we are doing it and what we are migrating toward
Holistic view
Data modernization transformation should be considered holistically. There are 3 main pillars:
- what are the business outcomes of the workflows we aim to modernise
- identify and maybe help educate the stakeholders
- tech stack
It’s important to not treat the migration as a pure IT or data science project.
Modernise workflows
Modernizing workflows involves focusing on end-to-end tasks, not just upgrading tools. Instead of like-for-like replacements, rethink workflows from first principles to achieve transformational change. Key principles include:
- automation
- streaming by default
- automatic scaling
- built-in optimization
- automated ML
Transform the workflow itself
Focus on transforming workflows by questioning whether they need precomputation by data engineers. Prebuilt data pipelines are optimizations, not necessities. Instead, consider enabling self-serve, ad hoc workflows using tools like Looker to move calculations into a declarative semantic layer.
This approach empowers business teams to define and manage metrics, ensuring consistent KPIs and reducing dependency on data engineering resources while fostering a library of reusable measures.
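To make the "declarative semantic layer" idea concrete, here is a toy sketch (not LookML or any real tool's syntax) of metrics defined once as data and compiled to SQL on demand; the metric and table names are invented:

```python
# Toy illustration of a declarative semantic layer: metrics are defined once,
# as data, and compiled to SQL when a user asks a question. Names are made up.
METRICS = {
    "revenue": {"table": "orders", "expr": "SUM(amount)"},
    "order_count": {"table": "orders", "expr": "COUNT(*)"},
}

def compile_metric(name: str, group_by: str) -> str:
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['expr']} AS {name} "
        f"FROM {m['table']} GROUP BY {group_by}"
    )

print(compile_metric("revenue", group_by="country"))
```

Because every dashboard and ad hoc query compiles from the same definitions, the KPI stays consistent no matter who asks for it.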
A 4-step migration framework
- Prepare and discover
All stakeholders should conduct a preliminary analysis to identify the list of workloads that need to be migrated and current pain points (e.g., inability to scale, process engine cannot be updated, unwanted dependencies, etc.).
- Assess and plan
Assess the information collected in the previous stage, define the key measures of success, and plan the migration of each component.
Workloads will fall into different groups depending on their business value and the effort required to migrate them.
- Execute
For each identified use case, decide whether to decommission it, migrate it entirely (data, schema, downstream and upstream applications), or offload it (by migrating downstream applications to a different source). Afterward, test and validate any migration done.
- Optimize
Once the process has begun, it can be expanded and improved through continuous iterations. A first modernization step could focus only on core capabilities.
Estimating the overall cost of the solution
It is important to remember that it is not just a matter of the cost of technology — there are always people and process costs that have to be taken into account.
Audit of the existing infrastructure
Can be done:
- manually by the internal IT/infrastructure team
- automatically by the internal IT/infrastructure team
- leveraging a 3rd-party
Request for information/proposal and quotation
Consulting companies do these cost calculations for a living so we can consult with them as well. We can send some documents to potential vendors/partners and once we have all the information we can pick the best path forward. Docs:
- request for information (questionnaire used to collect detailed info about vendors’/possible partners’ solutions and services)
- request for proposal (questionnaire used to collect detailed info about how vendors/partners will leverage their products and services to solve a specific organizational problem)
- request for quotation (questionnaire used to collect detailed info about pricing of different vendors/possible partners based on specific requirements)
PoC/MVP
- proof of concept: builds a small, testable portion of the solution to assess feasibility, usability, and scalability. It focuses on potential redesign needs without covering all features
- minimum viable product: develops a fully functional, narrowly scoped product deployable in production, enabling feedback from real users for improvement and cost estimation
- hybrid: starts with a broad but shallow PoC, transitioning to an MVP to refine and progress toward the complete solution
Setting up security and data governance
Framework
There are three risk factors that security and governance seek to address:
- unauthorised data access
- regulatory compliance
- visibility
Given these risk factors a data governance framework should include:
- data lineage
- data classification
- data catalog
- data quality management
- access management
- auditing
- data protection
Artifacts
To provide the above capabilities to the organization, a central data governance team needs to maintain the following artifacts:
- enterprise dictionary
- data classes
- policy book
- use case policies
- data catalog
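A hedged sketch of how a couple of the artifacts above (data classes plus a policy book) might be kept machine-readable so that access checks can be automated; the classes, roles, and rules are invented:

```python
# Minimal sketch of data classes and a policy book as machine-readable artifacts.
# The classes, roles, and retention values below are invented for illustration.
DATA_CLASSES = {
    "customers.email": "PII",
    "orders.amount": "CONFIDENTIAL",
    "orders.country": "PUBLIC",
}

POLICY_BOOK = {
    "PII": {"allowed_roles": {"privacy_tsar", "data_steward"}, "retention_days": 365},
    "CONFIDENTIAL": {"allowed_roles": {"data_steward", "analyst"}, "retention_days": 730},
    "PUBLIC": {"allowed_roles": {"analyst", "data_scientist"}, "retention_days": 3650},
}

def can_access(role: str, column: str) -> bool:
    data_class = DATA_CLASSES.get(column, "CONFIDENTIAL")  # default to restrictive
    return role in POLICY_BOOK[data_class]["allowed_roles"]

print(can_access("analyst", "customers.email"))   # False: email is classed as PII
print(can_access("analyst", "orders.country"))    # True: country is public
```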
Governance over the life of the data
Data governance involves bringing together people, processes, and technology over the life of the data.
The data lifecycle consists of the following stages:
- data creation
- data processing
- data storage
- data catalog
- data archive
- data destruction
When it comes to data, we can have different ‘hats’:
- legal: ensures that data usage conforms to contractual requirements and government/industry regulations
- data steward: the owner of the data, who sets the policy for a specific item of data
- data governor: sets policies for data classes and identifies which class a particular item of data belongs to
- privacy tsar: ensures that use cases do not leak personally identifiable information
- data user: typically a data analyst or data scientist who is using the data to make business decisions
Schema, Pipeline, and Data migration
Schema migration
When migrating a legacy application to a new system, it is best to first transfer the existing schema as-is, connecting upstream and downstream processes. After ensuring functionality in the new environment, we need to make schema changes in a second phase to minimize downtime risks. Using the facade pattern, we can create views that mask underlying table complexities, presenting a simplified schema to downstream processes while leveraging target system features. If this is not feasible, data must be transformed and converted before ingestion, handled by dedicated transformation pipelines.
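A minimal sketch of the facade pattern, with SQLite standing in for the target system and an invented schema: downstream consumers query the view while the legacy-shaped table stays untouched underneath.

```python
import sqlite3

# Facade pattern sketch: the migrated table keeps its legacy layout, while a
# view exposes the simplified schema downstream consumers expect.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE legacy_customer (
        cust_id INTEGER, fname TEXT, lname TEXT, sts_cd TEXT, crtd_ts TEXT
    )
""")
conn.execute("INSERT INTO legacy_customer VALUES (1, 'Ada', 'Lovelace', 'A', '2024-01-01')")

# Downstream processes query the view, not the legacy table, so the underlying
# table can later be reshaped without breaking them.
conn.execute("""
    CREATE VIEW customer AS
    SELECT cust_id AS customer_id,
           fname || ' ' || lname AS full_name,
           CASE sts_cd WHEN 'A' THEN 'active' ELSE 'inactive' END AS status
    FROM legacy_customer
""")
print(conn.execute("SELECT * FROM customer").fetchall())
```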
Pipeline migration
There are two strategies:
- we are offloading the workload: we retain the upstream data pipes feeding our source system and put an incremental copy of the data into the target system; finally, we update our downstream processes to read from the target system, then continue offloading the next workload until we reach the end (a minimal incremental-copy sketch follows this list). Once completed, we can start fully migrating the data pipeline
- we are fully migrating the workload: we migrate everything in the new system, and then deprecate the corresponding legacy tables
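A minimal sketch of the incremental copy at the heart of the offload strategy, keyed on a watermark column; both databases are SQLite stand-ins and the table and column names are invented:

```python
import sqlite3

# Offload sketch: copy only rows newer than the last watermark from the source
# system into the target system.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 10.0, "2024-12-01"), (2, 20.0, "2024-12-20"), (3, 5.0, "2024-12-25"),
])
target.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")

last_watermark = "2024-12-15"  # stored from the previous run
rows = source.execute(
    "SELECT id, amount, updated_at FROM sales WHERE updated_at > ?", (last_watermark,)
).fetchall()
target.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

print(target.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
```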
General data pipeline patterns:
- ETL
- ELT
- EL
- CDC
Data migration
- data transfer requires planning: we need to identify and involve technical owners, approvers, and the migration (delivery) team
Before proceeding we need to answer:
- what are the datasets we need to move
- where does the underlying data sit within the organisation
- what are the datasets we are allowed to move
- is there any specific regulatory requirement we have to respect
- where is the data going to land
- what is the destination region
- do we need to perform any transformation before the transfer
- what are the data access policies we want to apply
- is it a one-off transfer, or do we need to move data regularly
- what are the resources available for the data transfer
- what is the allocated budget
- do you have adequate bandwidth to accomplish the transfer, and for an adequate period
- do you need to leverage offline solutions
- what is the time needed to accomplish the entire data migration
Regional capacity and network to the cloud
- public internet connection
- partner interconnect
- direct interconnect
Transfer option considerations
- cost
- time
- offline vs. online transfer
- available tools
Migration stages
- upstream processes feeding current legacy data solutions are modified to feed the target env
- downstream processes reading from the legacy env are modified to read from the target env
- historical data is migrated in bulk into the target env; at this point, the upstream processes are migrated to also write to the target env
- downstream processes are now connected to the target env
- data can be kept in sync between the legacy and target envs, leveraging CDC pipelines, until the old env is fully decommissioned
- downstream processes become fully operational, leveraging the target env
Checks to perform to minimize bottlenecks and data transfer issues (a minimal integrity-check sketch follows this list):
- perform a functional test
- perform a performance test
- perform a data integrity check
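A minimal sketch of the data integrity check, comparing a row count and a cheap checksum between the legacy and target copies of a table; SQLite stands in for both systems and the table is invented:

```python
import sqlite3

# Post-migration integrity sketch: compare row counts and a simple aggregate
# checksum between the legacy and target copies of a table.
def summary(conn, table):
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
    ).fetchone()

legacy, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for db in (legacy, target):
    db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

assert summary(legacy, "sales") == summary(target, "sales"), "integrity check failed"
print("row count and checksum match")
```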
Architecting Data and Machine Learning Platforms - Chapter 5 - Architecting a Data Lake
Data lake and the Cloud - a perfect marriage
Challenges with on-premises data lakes
- TCO
- scalability
- governance
- agility
- resource-intensive data and analytics processing
Benefits of cloud data lakes
- no need to store all data in an expensive HDFS cluster
- Hadoop clusters can be created on demand in a short amount of time
- hyperscalers usually offer capabilities to leverage less expensive VMs
- managed VMs
- separation between storage and compute which gives orgs better flexibility in handling data governance challenges
Design and implementation
Batch and Stream
Whether batch or stream, there are four storage areas in a data lake:
- raw/landing/bronze (where raw data is collected and ingested directly from source systems)
- staging/dev/silver (where more advanced users process the data to prepare it for final users)
- production/gold (where the final data used by prod systems is stored)
- sensitive (optional)
Lambda architecture:
Kappa architecture:
Data catalog
A centralized repository of metadata describing organizational datasets, making scattered data across databases, data warehouses, filesystems, and blob storage more discoverable and accessible. It supports various users like data scientists, engineers, and business analysts by enabling efficient data discovery and rationalization.
- connects to data lakes and other platforms, ensuring a comprehensive metadata view without duplicating data
- helps identify and manage duplicated, unused, or similar data, enabling cost savings by focusing on relevant data
- facilitates data syncing, transformations, and updates in the lake while maintaining clarity on dataset levels
- uses JSON/YAML files to formalise agreements on data schema, ingestion frequency, ownership, and access levels
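A sketch of what such a JSON/YAML agreement (often called a data contract) might look like and how it could be validated; the fields and values are invented, and PyYAML is assumed to be available:

```python
import yaml  # PyYAML, assumed to be installed

# Sketch of a dataset agreement kept next to the catalog entry.
# The fields and values are invented for illustration.
CONTRACT_YAML = """
dataset: sales.orders
owner: sales-data-team@example.com
ingestion_frequency: hourly
access_level: confidential
schema:
  - {name: order_id, type: integer, nullable: false}
  - {name: amount, type: float, nullable: false}
  - {name: country, type: string, nullable: true}
"""

contract = yaml.safe_load(CONTRACT_YAML)

def missing_columns(contract: dict, observed_columns: set) -> set:
    """Return declared columns missing from the observed dataset."""
    declared = {col["name"] for col in contract["schema"]}
    return declared - observed_columns

print(missing_columns(contract, observed_columns={"order_id", "amount"}))  # {'country'}
```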
Cloud data lake reference architecture
AWS
- data sources, which are S3 and NoSQL dbs
- storage zones on top of S3 buckets
- data catalog, data governance, security and process engine orchestrated by AWS Lake Formation
- analytical services such as Athena, Redshift, and EMR to provide access to the data
Azure
GCP
When choosing a cloud vendor, you should also consider the datasets that are natively available through their offerings. In the marketing or advertising space, for example, it can be very impactful to implement your business solutions using these datasets. Some examples of datasets include Open Data on AWS, the Google Cloud Public Datasets, and the Azure Open Datasets. If you advertise on Google or sell on Amazon, the ready-built integrations between the different divisions of these companies and their respective cloud platforms can be particularly helpful.
Integrating the data lake: the real superpower
The superpower of a data lake is its ability to connect the data with an unlimited number of process engines to activate potentially any use case you have in mind.
- there are many data sources that can be accessed through APIs
The evolution of data lakes with Apache Iceberg, Apache Hudi, and Delta Lake
- ACID compliance, giving users the certainty that the info they are querying is consistent
- overcoming the inherent limitations of HDFS in terms of file size - with this approach, even small files can be handled efficiently
- logging of every single change made on data, guaranteeing a complete audit if ever necessary and enabling time travel queries
- no difference in handling batch and streaming ingestion and processing
- full compatibility with the Spark engine
- storage based on Parquet that can achieve a high level of compression
- stream processing enablement
- ability to treat the object store as a db
- applying governance at the row and column level
- partition evolution
- schema evolution
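A hedged sketch of two of these capabilities, change logging and time travel, using Delta Lake on PySpark (Iceberg and Hudi expose similar features); it assumes a Spark session configured with the delta-spark package, and the path is invented:

```python
from pyspark.sql import SparkSession

# Sketch of time travel with Delta Lake. Assumes the delta-spark package is
# installed and its jars are available; the path is invented.
spark = (
    SparkSession.builder.appName("time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])
df.write.format("delta").mode("overwrite").save(path)           # version 0
df.limit(1).write.format("delta").mode("overwrite").save(path)  # version 1

# Every change is logged, so older versions stay queryable ("time travel").
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 2, even though the latest version has 1 row
```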
Interactive analytics with notebooks
Jupyter general notebook architecture and Spark-based kernel:
Democratising data processing and reporting
Build trust in the data
Establishing trust in data is essential for organizations, particularly when transitioning from IT-managed to democratized data access. The example of a retail consultancy highlights the importance of transparency in data extraction, transformation, and reporting. When IT teams lack in-depth knowledge, they may struggle to assure decision-makers of data reliability. Empowering end users with tools to explore, audit, and manage data fosters autonomy and trust.
- identify the key stakeholders who have the right information on a timely basis
- defend the data from any kind of exfiltration, both internal and external, with a focus on personnel
- cooperate with other people inside or outside the company to unlock the value of data
Data ingestion is still an IT matter
Data ingestion is a critical step in maintaining an efficient data lake and preventing it from becoming a data swamp filled with unused, stale, and inconsistent data. Poor planning during ingestion can lead to wasted storage, errors, and inefficiencies. To avoid this, organizations should follow some best practices:
- file formats (JSON, Avro, Parquet)
- file sizes (in Hadoop, larger is better; for many small files it is a good idea to use a streaming engine to consolidate the data into a few large files)
- data compression to save space
- directory structure (tailor the structure to the use case)
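A small sketch that combines these practices, writing compressed Parquet into a partitioned directory layout; it assumes pandas and pyarrow are installed, and the columns and path are invented:

```python
import pandas as pd  # assumes pandas and pyarrow are installed

# Sketch combining the practices above: a columnar format (Parquet), compression
# (snappy), and a directory structure partitioned by a column that matches the
# main access pattern. Column names and values are invented.
df = pd.DataFrame({
    "event_date": ["2024-12-25", "2024-12-25", "2024-12-26"],
    "user_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.0],
})

# Produces /tmp/lake/events/event_date=2024-12-25/...parquet, and so on.
df.to_parquet(
    "/tmp/lake/events",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)
```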
Watching projects from Zach Wilson’s Analytics Eng boot camp
1st project
Big Bag Data - build an automated pipe that scrapes, cleans, and transforms data from sold handbags - then build a website that presents this data in a meaningful way
2nd project
F1 Insights: Real-Time Replay & Historical Analytics - github
3rd project
Global historical climatology network daily
Data model:
4th project
Real-time sports betting analytics platform - github
Real-time pipe:
Batch pipe:
5th project
Chess.com analytics project
6th project
Meme coin tracker - understand who the profitable traders are, and find what factors cause a coin to get promoted to better markets
Architecting Data and Machine Learning Platforms - Chapter 6 - Innovating with an EDW
A modern data platform
Organisational goals
- no silos
- democratisation
- discoverability
- stewardship
- single source
- security and compliance
- ease of use
- data science
- agility
Tech challenges
- size and scale
- complex data and use cases
- integration
- real time
Tech trends and tools
- separation of compute and storage
- multitenancy
- separation of authentication and authorisation
- analytics hubs
- multicloud semantic layer
- consistent admin interface
- cross-cloud control plane
- multicloud platforms
- converging of data lakes and DWHs
- built-in ML
- streaming ingest
- streaming analytics
- CDC
- embedded analytics
Hub-and-Spoke architecture
This architecture is an ideal design for modern cloud data platforms centered around a DWH. The DWH acts as the central hub, collecting and managing all enterprise data, while the spokes—custom applications, dashboards, ML models, and more—interact with the DWH via standard interfaces like APIs. This structure enables seamless data access without duplication and offers scalability, flexibility, and resilience.
- well-suited for startups, organizations seeking transformation, and enterprises with minimal legacy constraints
- efficient data ingestion is achieved through mechanisms like scheduled exports for SaaS data, real-time CDC tools for operational databases, and dynamic queries for federated datasets
- tech advancements like the separation of compute and storage, stateless ML APIs, and unified batch/stream processing have made this architecture a powerful alternative to older paradigms that relied on data marts, ETL, and sharded storage
Data ingest
3 ingest mechanisms:
- prebuilt connectors
Third-party vendors such as Fivetran can automatically handle landing data from a wide range of sources into a cloud DWH such as BigQuery, Redshift, or Snowflake
- real-time data capture
- federated querying
BI
- make sure that the BI tool pushes all queries to the enterprise DWH and doesn’t do them on OLAP cubes (extracts of the database/DWH)
- visualisations
- offer embedded analytics
- use a semantic layer
Transformations
Rather than re-implementing transformation logic in every consuming tool, a better option is to define transformations within the DWH using SQL to create views or tables, making the results available to all use cases. Another method, reverse ETL, uses event-driven serverless functions to transform and push data into backend systems.
For handling large datasets, ELT (Extract, Load, Transform) can be used instead of traditional ETL
Alternatively, scheduled queries periodically extract and aggregate data, which reduces query costs but can result in outdated data.
Materialized views offer a balanced approach by storing precomputed results, which are automatically updated in modern DWHs, improving query performance.
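A hedged sketch contrasting the scheduled-query and materialized-view options as warehouse SQL submitted from Python; BigQuery-style syntax and configured credentials are assumed, and the dataset/table names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Option 1: a scheduled/periodic aggregation table. Cheap to query, but the
# data is only as fresh as the last run.
scheduled_sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
FROM sales.orders
GROUP BY day
"""

# Option 2: a materialized view. Precomputed results that the warehouse keeps
# up to date automatically (BigQuery-style syntax; names are invented).
mv_sql = """
CREATE MATERIALIZED VIEW analytics.daily_revenue_mv AS
SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
FROM sales.orders
GROUP BY day
"""

for sql in (scheduled_sql, mv_sql):
    client.query(sql).result()  # submits the DDL to the warehouse
```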
Organisational structure
In many organizations, there are many more business analysts than engineers. Often, this ratio is 100:1. A hub-and-spoke architecture is ideal for organizations that wish to build a data and ML platform that primarily serves business analysts. Because the hub-and-spoke architecture assumes that business analysts are capable of writing ad hoc SQL queries and building dashboards, some training may be necessary.
The central data engineering team is responsible for the filled shapes, while the business unit is responsible for the unfilled shapes
DWH to enable data scientists
Query interface
Before automating decision making, DSs need to do EDA and lots of experimentation. Thankfully, tools like Jupyter notebooks allow executing both Python and SQL (two common languages for DS) in their cells.
Storage API
ML frameworks can use a storage API to read data efficiently and in parallel out of the DWH (e.g., Spark and TensorFlow offer such integrations)
The query interface and the Storage API are the two methods DSs use to access data in the DWH for analysis, but nowadays there is a trend of DWHs being able to implement, train, and run ML algorithms directly inside them.
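A hedged sketch of the notebook-side query path; it assumes BigQuery with configured credentials, where to_dataframe() can use the BigQuery Storage API client under the hood to pull results in parallel. The table and column names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Query interface: SQL in, dataframe out. When a BigQuery Storage API client is
# available, to_dataframe() uses it to download results in parallel, which is
# the faster path for feeding ML frameworks.
df = client.query(
    "SELECT user_id, amount, churned FROM analytics.training_data"
).to_dataframe(create_bqstorage_client=True)

print(df.shape)  # ready to hand off to pandas/scikit-learn/TensorFlow pipelines
```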
Architecting Data and Machine Learning Platforms - Chapter 7 - Converging to a Lakehouse
The need for a unique architecture
Data lakes and DWHs emerged to meet the needs of different users. An organization with both types of users is faced with an unappealing choice.
User personas
Traditional DWH users are BI analysts closer to the business, focusing on deriving insights from data; they tend to be proficient in SQL
Data lake users include DAs, DEs, and DSs. They not only transform the data to make it more accessible to the business but also experiment with it and use it for ML.
Antipattern: Disconnected systems
Different IT departments or teams will manage the DWH and the data lake, resulting in resources being spent on operational aspects rather than on business insight, and leaving little capacity to focus on the key business drivers.
Maintaining 2 separate systems can cause DQ and consistency problems.
Antipattern: Duplicated data
Why not simply connect the two systems with data being synced by the platform team? We will end up with something like:
Drawbacks of such architecture:
- proliferation of the data (data exists in various forms and maybe even outside the boundaries of the platform, which may cause security and compliance risks, data freshness issues, or data versioning issues)
- slowdown in time to market (orgs may spend up to 2 weeks implementing a minor change to a report for management because they must first request that DEs modify the ETLs in order to access the data necessary to complete the task)
- limitations in data size
- infrastructure and operational cost
Converged Architecture
Two forms
The common storage can take one of two forms:
- data is stored in an open source format (Parquet, Avro, etc.) on the cloud and we can use Spark to process it (using Python, SQL, Scala, R); we can also use open table formats like Delta Lake and Apache Iceberg, or
- the common storage can be a highly optimised DWH format where we can natively leverage SQL
We need to choose based on user skills and do a proper evaluation based on some criteria.
Lakehouse on cloud storage
- cloud storage allows compute resources (like Spark clusters) to be scaled independently from storage, and compute can move to the data, rather than the other way around. This is a major advantage over on-premise systems
- instead of separate engines for data lakes and DWHs, a single analytical engine accesses all data, streamlining analysis and eliminating silos
- cloud storage’s high throughput and the ability to bring compute to data result in faster query speeds
- data democratization: enables self-service queries for data engineers, data scientists, and even business users
- centralized data governance and security, ensuring consistent policies
- supports both schema-on-read (data lake) and schema-on-write (DWH) approaches, and easily integrates with streaming data using ACID transactions with technologies like Apache Iceberg, Apache Hudi, and Delta Lake
- by using a single storage layer, this architecture reduces the need to download or copy data for offline processing
Migration strategy:
- iterative approach
- gain org buy-in by showing a quick solution
- phased decommissioning of the old system
- start with new workloads
SQL-first lakehouse
This enables users not only to execute analytics and BI, but also to carry out orchestration and ML.
- SQL’s wide familiarity makes data access and analysis easier for business users, not just technical staff
- SQL handles most transformations, while Spark is used for complex tasks and ML
- business users can train and deploy ML models with SQL, empowering data-driven decisions
- provides a single, integrated platform for data processing, analytics, and ML
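As one concrete example of in-warehouse ML driven purely by SQL, here is a hedged BigQuery ML sketch (other warehouses offer comparable features); credentials are assumed, and all dataset, table, and column names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Training a model with plain SQL inside the warehouse (BigQuery ML syntax as
# one example). Dataset, table, and column names are invented for illustration.
train_sql = """
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM analytics.customers
"""
client.query(train_sql).result()

# Batch prediction is also just SQL.
predict_sql = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL analytics.churn_model,
                (SELECT * FROM analytics.customers_to_score))
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```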
Migration:
- iterative approach
- direct DWH ingestion
- elevate DWH SQL engine
- insource ML workloads into the DWH env
- emphasize SQL-based data transformation skills, potentially requiring a new role (analytics engineers) or skillset within your organization
The benefits of convergence
- quicker time to market
- reduced risk
- predictive analytics
- data sharing
- ACID transactions
- multimodal data support
- unified env
- schema and governance
- streaming analytics
Apache Flink 2.0 talk at Flink Forward Jakarta ‘24
The first thing that caught my attention was the new ForSt DB ("For Streaming" DB) that will be used in Flink 2.0
Next, stream-batch unification
Learned about Apache Paimon - a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
(above pic is from another Paimon talk (links below))
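To make the Flink-plus-Paimon idea concrete, here is a hedged PyFlink sketch of declaring a Paimon catalog and a primary-keyed table from Flink SQL; it assumes the paimon-flink connector jar is on the classpath, and the warehouse path, catalog, and table names are invented:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch of using Paimon as the lake format from Flink SQL (via PyFlink).
# Assumes the paimon-flink connector jar is on the classpath; names are invented.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG lake")

# A primary-keyed Paimon table: streaming upserts from Flink land here, and the
# same table can later be read in batch mode by Flink or Spark.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    )
""")

t_env.execute_sql(
    "INSERT INTO orders VALUES (1, 10.0, TIMESTAMP '2024-12-26 10:00:00')"
).wait()
```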
They introduce this materialized table:
ML models in Flink SQL
Other talks from Flink Forward Jakarta
- Apache Flink 101
- Flink + Paimon: Build Streaming Pipeline on Lakehouse
- Paimon 1.0 Unified Lake Format for Data + AI (similar to the above)
- Embrace Unified Stream and Batch Processing with Flink SQL
That is all for today!
See you tomorrow :)