Hello :) Today is Day 361!
A quick summary of today:
- read a few chapters of the Architecting Data and Machine Learning Platforms book
- watched talks on Apache Paimon from Flink Forward Jakarta ‘24
Architecting Data and Machine Learning Platforms - Chapter 3 - Designing for Your Data Team
When designing a data platform, technical aspects like performance, cost, and integration of new approaches are important. However, addressing company culture, employee skills, and willingness to adapt is crucial for success. Implementing a data platform often requires employees to learn new skills, change workflows, and potentially transition into new roles.
We should tailor our strategies to existing talent:
- focus on existing strengths
- consider workforce domain knowledge and evaluate the feasibility of upskilling
- explore tools like generative AI or no-code platforms to empower a broader workforce
However, public cloud tech and modern tools have blurred traditional distinctions between roles (data analysts, engineers, scientists) as analysts can now handle tasks previously requiring engineers, while blended ETL-ELT approaches enable flexibility. This convergence of skills leads to more efficient data processing.
Organizations can generally be classified as:
- data analysis-driven: focus on leveraging existing analytical skills
- data engineering-driven: build robust engineering capabilities
- data science-driven: prioritize advanced algorithm development
Data Analysis–Driven Organization
DAs play a central role in decision making. It is important to note that whether an organization is analyst driven is not a black-and-white issue but rather a spectrum of overlapping characteristics
The vision
The vision for modern cloud data platforms is to empower analysts to perform advanced analytics using familiar interfaces like spreadsheets, SQL, and BI tools. Traditional ETL processes are being replaced by ELT, where data is ingested into the cloud and transformations occur within the cloud data warehouse (DWH). This approach simplifies data pipelines, reduces dependency on data engineering teams, and enables analysts to leverage SQL skills for faster, more direct data insights.
To scale cloud data platform adoption, analysts should have access to user-friendly tools (like dbt) and interfaces (Excel, Tableau) for analytics and visualization. Modern BI tools, such as Power BI, Tableau, or Looker, are recommended for handling large datasets efficiently, overcoming the limitations of traditional spreadsheets. This setup accelerates data analysis and enhances the timeliness and usability of insights.
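Since ELT here means the transformations run as SQL inside the warehouse, a minimal sketch of the pattern (with SQLite standing in for a cloud DWH and made-up table names) could look like this:

```python
import sqlite3

# Minimal ELT sketch: load raw rows first, then transform with SQL inside the
# engine. SQLite stands in for a cloud DWH (BigQuery, Snowflake, Redshift);
# the table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "DE"), (2, 25.5, "DE"), (3, 7.0, "FR")],
)

# The "T" of ELT: a SQL transformation materialised as a new table in the
# warehouse, which analysts can write themselves (or manage with a tool like dbt).
conn.execute("""
    CREATE TABLE orders_by_country AS
    SELECT country, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY country
""")
print(conn.execute("SELECT * FROM orders_by_country").fetchall())
```

In practice an analyst would manage such SELECT statements as dbt models rather than as raw scripts, but the division of labour is the same: load first, transform with SQL where the data already lives.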
The personas
Data analysts:
- focus on fulfilling business information needs by designing, maintaining, and transforming data
- tasks include creating table layouts, reorganizing data, and generating reports with insights
- skill expansion is crucial, emphasizing deep business domain knowledge and technical proficiency in analyzing large datasets using tools like SQL and BI platforms
Business analysts:
- domain experts using data for actionable insights
- cloud-based data warehouses (DWHs) and serverless technologies enable analysts to focus on value-driven analysis rather than technical tasks
- while they can build no-code/low-code ML models, they lack the skills for complex workflows like ranking or recommendations, so data science teams are still needed for advanced tasks
Data Engineers:
- manage the downstream data pipeline, including loading, integration, governance, and data quality
- promote ELT processes over ETL, leveraging SQL for post-loading transformations, improving efficiency and accessibility
- handle complex tasks like batch jobs, mainframe data integration, and real-time analytics using tools like Pub/Sub, Kafka, or Kinesis
- set up templates for generic tasks (e.g., DQ checks) that analysts can reconfigure (a minimal sketch follows this list)
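As an illustration of such a template, here is a hedged sketch of a reusable data quality check that engineers could ship and analysts reconfigure; the table, column, and threshold are hypothetical:

```python
import sqlite3

# A reusable data quality check that engineers define once and analysts
# reconfigure per table (the table/column names here are hypothetical).
def null_rate_check(conn, table: str, column: str, max_null_rate: float = 0.01) -> bool:
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    rate = (nulls or 0) / total if total else 0.0
    print(f"{table}.{column}: null rate {rate:.2%} (threshold {max_null_rate:.2%})")
    return rate <= max_null_rate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "a@x.com"), (2, None)])

# Analysts only change the configuration, not the check itself.
null_rate_check(conn, table="customers", column="email", max_null_rate=0.10)
```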
The technological framework
There are three fundamental principles underlying the high-level reference architecture for an analysis-driven organization:
- SQL as a standard
- from EDW/data lake to a structured data lake
- schema-on-read first approach
Combining these principles, we can define a high-level architecture that supports a few key data analysis patterns:
- the ‘traditional’ BI workloads, such as creating a dashboard or report
- an ad hoc analytics interface that allows for the management of data pipelines through SQL (ELT)
- enabling data science use cases with ML techniques
- real-time streaming of data into the DWH and processing of real-time events
Data Engineering–Driven Organization
An engineering-driven organization is focused on data integration.
The vision
When your data transformation needs are complex, you need data engineers to play a central role in the company’s data strategy to ensure that you can build reliable systems cost-effectively. Data engineers are at the crossroads between data owners and data consumers.
Data engineers have the following responsibilities:
- transporting data and enriching data while building integrations between analytical and operational systems (as in the real-time use cases)
- parsing and transforming messy data coming from business units and external suppliers into meaningful and clean data, with documented metadata
- applying DataOps — that is, functional knowledge of the business plus software engineering methodologies applied to the data lifecycle; this includes monitoring and maintenance of data feeds
- deploying ML models and other data science artifacts that analyze or consume data
Building complex data engineering pipelines is expensive but enables increased capabilities:
- enriched data
- better ML training data
- using unstructured data
- productionising
- real-time analytics
The personas
- knowledge: data warehousing and data lakes, db systems, ETL tools, programming languages, data structures and algorithms, automation and scripting, container technology, orchestration platforms
- responsibilities: data pipeline development, data integration, DQ assurance, automation, collaboration with data consumers, scalability and performance optimization, monitoring and troubleshooting, deployment and maintenance of data systems, support for machine learning models, data security and compliance
The technological framework
The technological framework emphasizes that data engineers should focus on data preparation (cleaning, transformation, enrichment) and not on maintaining infrastructure. The ingestion layer uses scalable solutions like Kinesis, Event Hubs, or Pub/Sub for real-time data and S3, Azure Data Lake, or Cloud Storage for batch data. These solutions automatically scale based on workload demands.
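As a concrete flavour of the ingestion layer, here is a hedged sketch of pushing one event into a managed stream using Kinesis via boto3; the stream name, region, and credentials are assumptions, and Pub/Sub or Event Hubs would look similar:

```python
import json
import boto3

# Sketch of pushing a single event into a managed ingestion stream.
# Assumes AWS credentials are configured and that a Kinesis stream named
# "clickstream-events" already exists; these names are invented.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": 42, "action": "add_to_cart", "ts": "2024-12-26T10:00:00Z"}
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),  # controls shard assignment
)
```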
Example reference architecture on AWS:
On GCP:
The reference architecture’s main benefits include the use of serverless, NoOps technologies that reduce maintenance burdens and allow for flexibility and scalability, such as handling traffic spikes. It also supports custom code development for complex parsing and low-latency business logic, with an emphasis on reusing code through libraries and templates. Additionally, it enables the transformation of existing batch pipelines into streaming pipelines.
Data Science–Driven Organization
These organizations leverage data science and machine learning to gain a competitive advantage. They rely on automated decision-making and strive for high business impact from their data initiatives.
For example, a bank that is data analysis driven would have data analysts assess each commercial loan opportunity, build an investment case, and have an executive sign off on it. A data science–driven fintech, on the other hand, would build a loan approval system that makes decisions on the majority of loans using some sort of automated algorithm.
The vision
Principles to follow:
- adaptability
- standardization
- accountability
- business impact
- implementation
A science-driven organization should be built on a technical platform that is highly adaptable in terms of technological openness.
The personas
- DEs
- MLEs
- DSs
- DAs
The technological framework
The authors strongly recommend standardizing on the ML pipelines infrastructure of your public cloud provider rather than cobbling together one-off training and deployment solutions.
Architecting Data and Machine Learning Platforms - Chapter 4 - A Migration Framework
Unless you are at a startup, it is rare that you will build a data platform from scratch. Instead, you will stand up a new data platform by migrating things into it from legacy systems
Modernise data workflows
Before starting to make a migration plan, we ought to have a comprehensive vision of why we are doing it and what we are migrating toward
Holistic view
Data modernization transformation should be considered holistically. There are 3 main pillars:
- what are the business outcomes of the workflows we aim to modernise
- identify and maybe help educate the stakeholders
- tech stack
It’s important to not treat the migration as a pure IT or data science project.
Modernise workflows
Modernizing workflows involves focusing on end-to-end tasks, not just upgrading tools. Instead of like-for-like replacements, rethink workflows from first principles to achieve transformational change. Key principles include:
- automation
- streaming by default
- automatic scaling
- built-in optimization
- automated ML
Transform the workflow itself
Focus on transforming workflows by questioning whether they need precomputation by data engineers. Prebuilt data pipelines are optimizations, not necessities. Instead, consider enabling self-serve, ad hoc workflows using tools like Looker to move calculations into a declarative semantic layer.
This approach empowers business teams to define and manage metrics, ensuring consistent KPIs and reducing dependency on data engineering resources while fostering a library of reusable measures.
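To make the "declarative semantic layer" idea concrete, here is a toy sketch (not LookML or any real tool's syntax) of metrics defined once as data and compiled to SQL on demand; the metric and table names are invented:

```python
# Toy illustration of a declarative semantic layer: metrics are defined once,
# as data, and compiled to SQL when a user asks a question. Names are made up.
METRICS = {
    "revenue": {"table": "orders", "expr": "SUM(amount)"},
    "order_count": {"table": "orders", "expr": "COUNT(*)"},
}

def compile_metric(name: str, group_by: str) -> str:
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['expr']} AS {name} "
        f"FROM {m['table']} GROUP BY {group_by}"
    )

print(compile_metric("revenue", group_by="country"))
```

Because every dashboard and ad hoc query compiles from the same definitions, the KPI stays consistent no matter who asks for it.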
A 4-step migration framework
- Prepare and discover
All stakeholders should conduct a preliminary analysis to identify the list of workloads that need to be migrated and current pain points (e.g., inability to scale, process engine cannot be updated, unwanted dependencies, etc.).
- Assess and plan
Assess the information collected in the previous stage, define the key measures of success, and plan the migration of each component.
Workloads will fall into different groups depending on their business value and the effort required to migrate them.
- Execute
For each identified use case, decide whether to decommission it, migrate it entirely (data, schema, downstream and upstream applications), or offload it (by migrating downstream applications to a different source). Afterward, test and validate any migration done.
- Optimize
Once the process has begun, it can be expanded and improved through continuous iterations. A first modernization step could focus only on core capabilities.
Estimating the overall cost of the solution
It is important to remember that it is not just a matter of the cost of technology — there are always people and process costs that have to be taken into account.
Audit of the existing infrastructure
Can be done:
- manually by the internal IT/infrastructure team
- automatically by the internal IT/infrastructure team
- leveraging a 3rd-party
Request for information/proposal and quotation
Consulting companies do these cost calculations for a living so we can consult with them as well. We can send some documents to potential vendors/partners and once we have all the information we can pick the best path forward. Docs:
- request for information (questionnaire used to collect detailed info about vendors’/possible partners’ solutions and services)
- request for proposal (questionnaire used to collect detailed info about how vendors/partners will leverage their products and services to solve a specific organizational problem)
- request for quotation (questionnaire used to collect detailed info about pricing of different vendors/possible partners based on specific requirements)
PoC/MVP
- proof of concept: builds a small, testable portion of the solution to assess feasibility, usability, and scalability. It focuses on potential redesign needs without covering all features
- minimum viable product: develops a fully functional, narrowly scoped product deployable in production, enabling feedback from real users for improvement and cost estimation
- hybrid: starts with a broad but shallow PoC, transitioning to an MVP to refine and progress toward the complete solution
Setting up security and data governance
Framework
There are three risk factors that security and governance seek to address:
- unauthorised data access
- regulatory compliance
- visibility
Given these risk factors a data governance framework should include:
- data lineage
- data classification
- data catalog
- data quality management
- access management
- auditing
- data protection
Artifacts
To provide the above capabilities to the organization, a central data governance team needs to maintain the following artifacts:
- enterprise dictionary
- data classes
- policy book
- use case policies
- data catalog
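A hedged sketch of how a couple of the artifacts above (data classes plus a policy book) might be kept machine-readable so that access checks can be automated; the classes, roles, and rules are invented:

```python
# Minimal sketch of data classes and a policy book as machine-readable artifacts.
# The classes, roles, and retention values below are invented for illustration.
DATA_CLASSES = {
    "customers.email": "PII",
    "orders.amount": "CONFIDENTIAL",
    "orders.country": "PUBLIC",
}

POLICY_BOOK = {
    "PII": {"allowed_roles": {"privacy_tsar", "data_steward"}, "retention_days": 365},
    "CONFIDENTIAL": {"allowed_roles": {"data_steward", "analyst"}, "retention_days": 730},
    "PUBLIC": {"allowed_roles": {"analyst", "data_scientist"}, "retention_days": 3650},
}

def can_access(role: str, column: str) -> bool:
    data_class = DATA_CLASSES.get(column, "CONFIDENTIAL")  # default to restrictive
    return role in POLICY_BOOK[data_class]["allowed_roles"]

print(can_access("analyst", "customers.email"))   # False: email is classed as PII
print(can_access("analyst", "orders.country"))    # True: country is public
```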
Governance over the life of the data
Data governance involves bringing together people, processes, and technology over the life of the data.
The data lifecycle consists of the following stages:
- data creation
- data processing
- data storage
- data catalog
- data archive
- data destruction
When it comes to data, we can have different ‘hats’:
- legal: ensures that data usage conforms to contractual requirements and government/industry regulations
- data steward: the owner of the data, who sets the policy for a specific item of data
- data governor: sets policies for data classes and identifies which class a particular item of data belongs to
- privacy tsar: ensures that use cases do not leak personally identifiable information
- data user: typically a data analyst or data scientist who is using the data to make business decisions
Schema, Pipeline, and Data migration
Schema migration
When migrating a legacy application to a new system, it is best to first transfer the existing schema as-is, connecting upstream and downstream processes. After ensuring functionality in the new environment, we need to make schema changes in a second phase to minimize downtime risks. Using the facade pattern, we can create views that mask underlying table complexities, presenting a simplified schema to downstream processes while leveraging target system features. If this is not feasible, data must be transformed and converted before ingestion, handled by dedicated transformation pipelines.
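A minimal sketch of the facade pattern, with SQLite standing in for the target system and an invented schema: downstream consumers query the view while the legacy-shaped table stays untouched underneath.

```python
import sqlite3

# Facade pattern sketch: the migrated table keeps its legacy layout, while a
# view exposes the simplified schema downstream consumers expect.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE legacy_customer (
        cust_id INTEGER, fname TEXT, lname TEXT, sts_cd TEXT, crtd_ts TEXT
    )
""")
conn.execute("INSERT INTO legacy_customer VALUES (1, 'Ada', 'Lovelace', 'A', '2024-01-01')")

# Downstream processes query the view, not the legacy table, so the underlying
# table can later be reshaped without breaking them.
conn.execute("""
    CREATE VIEW customer AS
    SELECT cust_id AS customer_id,
           fname || ' ' || lname AS full_name,
           CASE sts_cd WHEN 'A' THEN 'active' ELSE 'inactive' END AS status
    FROM legacy_customer
""")
print(conn.execute("SELECT * FROM customer").fetchall())
```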
Pipeline migration
There are two strategies:
- we are offloading the workload: we retain the upstream data pipes feeding our source system and put an incremental copy of the data into the target system; finally, we update our downstream processes to read from the target system, then continue offloading the next workload until we reach the end (a minimal incremental-copy sketch follows this list). Once completed, we can start fully migrating the data pipeline
- we are fully migrating the workload: we migrate everything in the new system, and then deprecate the corresponding legacy tables
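A minimal sketch of the incremental copy at the heart of the offload strategy, keyed on a watermark column; both databases are SQLite stand-ins and the table and column names are invented:

```python
import sqlite3

# Offload sketch: copy only rows newer than the last watermark from the source
# system into the target system.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 10.0, "2024-12-01"), (2, 20.0, "2024-12-20"), (3, 5.0, "2024-12-25"),
])
target.execute("CREATE TABLE sales (id INTEGER, amount REAL, updated_at TEXT)")

last_watermark = "2024-12-15"  # stored from the previous run
rows = source.execute(
    "SELECT id, amount, updated_at FROM sales WHERE updated_at > ?", (last_watermark,)
).fetchall()
target.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

print(target.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
```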
General data pipeline patterns:
- ETL
- ELT
- EL
- CDC
Data migration
- data transfer requires planning: we need to identify and involve technical owners, approvers, and the migration (delivery) team
Before proceeding we need to answer:
- what are the datasets we need to move
- where does the underlying data sit within the organisation
- what are the datasets we are allowed to move
- is there any specific regulatory requirement we have to respect
- where is the data going to land
- what is the destination region
- do we need to perform any transformation before the transfer
- what are the data access policies we want to apply
- is it a one-off transfer, or do we need to move data regularly
- what are the resources available for the data transfer
- what is the allocated budget
- do you have adequate bandwidth to accomplish the transfer, and for an adequate period
- do you need to leverage offline solutions
- what is the time needed to accomplish the entire data migration
Regional capacity and network to the cloud
- public internet connection
- partner interconnect
- direct interconnect
Transfer option considerations
- cost
- time
- offline vs. online transfer
- available tools
Migration stages
- upstream processes feeding current legacy data solutions are modified to feed the target env
- downstream processes reading from the legacy env are modified to read from the target env
- historical data is migrated in bulk into the target env; at this point, the upstream processes are migrated to also write to the target env
- downstream processes are now connected to the target env
- data can be kept in sync between the legacy and target envs, leveraging CDC pipelines, until the old env is fully decommissioned
- downstream processes become fully operational, leveraging the target env
Checks to perform to minimize bottlenecks and data transfer issues (a minimal integrity-check sketch follows this list):
- perform a functional test
- perform a performance test
- perform a data integrity check
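A minimal sketch of the data integrity check, comparing a row count and a cheap checksum between the legacy and target copies of a table; SQLite stands in for both systems and the table is invented:

```python
import sqlite3

# Post-migration integrity sketch: compare row counts and a simple aggregate
# checksum between the legacy and target copies of a table.
def summary(conn, table):
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
    ).fetchone()

legacy, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for db in (legacy, target):
    db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

assert summary(legacy, "sales") == summary(target, "sales"), "integrity check failed"
print("row count and checksum match")
```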
Architecting Data and Machine Learning Platforms - Chapter 5 - Architecting a Data Lake
Data lake and the Cloud - a perfect marriage
Challenges with on-premises data lakes
- TCO
- scalability
- governance
- agility
- resource-intensive data and analytics processing
Benefits of cloud data lakes
- no need to store all data in an expensive HDFS cluster
- Hadoop clusters can be created on demand in a short amount of time
- hyperscalers usually offer capabilities to leverage less expensive VMs
- managed VMs
- separation between storage and compute which gives orgs better flexibility in handling data governance challenges
Design and implementation
Batch and Stream
Whether batch or stream, there are four storage areas in a data lake:
- raw/landing/bronze (where raw data is collected and ingested directly from source systems)
- staging/dev/silver (where more advanced users process the data to prepare it for final users)
- production/gold (where the final data used by prod systems is stored)
- sensitive (optional)
Lambda architecture:
Kappa architecture:
Data catalog
A centralized repository of metadata describing organizational datasets, making scattered data across databases, data warehouses, filesystems, and blob storage more discoverable and accessible. It supports various users like data scientists, engineers, and business analysts by enabling efficient data discovery and rationalization.
- connects to data lakes and other platforms, ensuring a comprehensive metadata view without duplicating data
- helps identify and manage duplicated, unused, or similar data, enabling cost savings by focusing on relevant data
- facilitates data syncing, transformations, and updates in the lake while maintaining clarity on dataset levels
- uses JSON/YAML files to formalise agreements on data schema, ingestion frequency, ownership, and access levels
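A sketch of what such a JSON/YAML agreement (often called a data contract) might look like and how it could be validated; the fields and values are invented, and PyYAML is assumed to be available:

```python
import yaml  # PyYAML, assumed to be installed

# Sketch of a dataset agreement kept next to the catalog entry.
# The fields and values are invented for illustration.
CONTRACT_YAML = """
dataset: sales.orders
owner: sales-data-team@example.com
ingestion_frequency: hourly
access_level: confidential
schema:
  - {name: order_id, type: integer, nullable: false}
  - {name: amount, type: float, nullable: false}
  - {name: country, type: string, nullable: true}
"""

contract = yaml.safe_load(CONTRACT_YAML)

def missing_columns(contract: dict, observed_columns: set) -> set:
    """Return declared columns missing from the observed dataset."""
    declared = {col["name"] for col in contract["schema"]}
    return declared - observed_columns

print(missing_columns(contract, observed_columns={"order_id", "amount"}))  # {'country'}
```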
Cloud data lake reference architecture
AWS
- data sources, which are S3 and NoSQL dbs
- storage zones on top of S3 buckets
- data catalog, data governance, security and process engine orchestrated by AWS Lake Formation
- analytical services such as Athena, Redshift, and EMR to provide access to the data
Azure
GCP
When choosing a cloud vendor, you should also consider the datasets that are natively available through their offerings. In the marketing or advertising space, for example, it can be very impactful to implement your business solutions using these datasets. Some examples of datasets include Open Data on AWS, the Google Cloud Public Datasets, and the Azure Open Datasets. If you advertise on Google or sell on Amazon, the ready-built integrations between the different divisions of these companies and their respective cloud platforms can be particularly helpful.
Integrating the data lake: the real superpower
The superpower of a data lake is its ability to connect the data with an unlimited number of process engines to activate potentially any use case you have in mind.
- there are many data sources that can be accessed through APIs
The evolution of data lakes with Apache Iceberg, Apache Hudi, and Delta Lake
- ACID compliance, giving users the certainty that the info they are querying is consistent
- overcoming the inherent limitations of HDFS in terms of file size - with this approach, even small files can be handled efficiently
- logging of every single change made on data, guaranteeing a complete audit if ever necessary and enabling time travel queries
- no difference in handling batch and streaming ingestion and processing
- full compatibility with the Spark engine
- storage based on Parquet that can achieve a high level of compression
- stream processing enablement
- ability to treat the object store as a db
- applying governance at the row and column level
- partition evolution
- schema evolution
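A hedged sketch of two of these capabilities, change logging and time travel, using Delta Lake on PySpark (Iceberg and Hudi expose similar features); it assumes a Spark session configured with the delta-spark package, and the path is invented:

```python
from pyspark.sql import SparkSession

# Sketch of time travel with Delta Lake. Assumes the delta-spark package is
# installed and its jars are available; the path is invented.
spark = (
    SparkSession.builder.appName("time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])
df.write.format("delta").mode("overwrite").save(path)           # version 0
df.limit(1).write.format("delta").mode("overwrite").save(path)  # version 1

# Every change is logged, so older versions stay queryable ("time travel").
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 2, even though the latest version has 1 row
```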
Interactive analytics with notebooks
Jupyter general notebook architecture and Spark-based kernel:
Democratising data processing and reporting
Build trust in the data
Establishing trust in data is essential for organizations, particularly when transitioning from IT-managed to democratized data access. The example of a retail consultancy highlights the importance of transparency in data extraction, transformation, and reporting. When IT teams lack in-depth knowledge, they may struggle to assure decision-makers of data reliability. Empowering end users with tools to explore, audit, and manage data fosters autonomy and trust.
- identify the key stakeholders who have the right information on a timely basis
- defend the data from any kind of exfiltration, both internal and external, with a focus on personnel
- cooperate with other people inside or outside the company to unlock the value of data
Data ingestion is still an IT matter
Data ingestion is a critical step in maintaining an efficient data lake and preventing it from becoming a data swamp filled with unused, stale, and inconsistent data. Poor planning during ingestion can lead to wasted storage, errors, and inefficiencies. To avoid this, organizations should follow some best practices:
- file formats (JSON, Avro, Parquet)
- file sizes (in Hadoop, larger is better; for many small files it is a good idea to use a streaming engine to consolidate the data into a few large files)
- data compression to save space
- directory structure (tailor the structure to the use case)
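A small sketch that combines these practices, writing compressed Parquet into a partitioned directory layout; it assumes pandas and pyarrow are installed, and the columns and path are invented:

```python
import pandas as pd  # assumes pandas and pyarrow are installed

# Sketch combining the practices above: a columnar format (Parquet), compression
# (snappy), and a directory structure partitioned by a column that matches the
# main access pattern. Column names and values are invented.
df = pd.DataFrame({
    "event_date": ["2024-12-25", "2024-12-25", "2024-12-26"],
    "user_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.0],
})

# Produces /tmp/lake/events/event_date=2024-12-25/...parquet, and so on.
df.to_parquet(
    "/tmp/lake/events",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)
```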
Watching projects from Zach Wilson’s Analytics Eng boot camp
1st project
Big Bag Data - build an automated pipe that scrapes, cleans, and transforms data from sold handbags - then build a website that presents this data in a meaningful way
2nd project
F1 Insights: Real-Time Replay & Historical Analytics - github
3rd project
Global historical climatology network daily
Data model:
4th project
Real-time sports betting analytics platform - github
Real-time pipe:
Batch pipe:
5th project
Chess.com analytics project
6th project
Meme coin tracker - understand who the profitable traders are, and find what factors cause a coin to get promoted to better markets
Architecting Data and Machine Learning Platforms - Chapter 6 - Innovating with an EDW
A modern data platform
Organisational goals
- no silos
- democratisation
- discoverability
- stewardship
- single source
- security and compliance
- ease of use
- data science
- agility
Tech challenges
- size and scale
- complex data and use cases
- integration
- real time
Tech trends and tools
- separation of compute and storage
- multitenancy
- separation of authentication and authorisation
- analytics hubs
- multicloud semantic layer
- consistent admin interface
- cross-cloud control plane
- multicloud platforms
- converging of data lakes and DWHs
- built-in ML
- streaming ingest
- streaming analytics
- CDC
- embedded analytics
Hub-and-Spoke architecture
This architecture is an ideal design for modern cloud data platforms centered around a DWH. The DWH acts as the central hub, collecting and managing all enterprise data, while the spokes—custom applications, dashboards, ML models, and more—interact with the DWH via standard interfaces like APIs. This structure enables seamless data access without duplication and offers scalability, flexibility, and resilience.
- well-suited for startups, organizations seeking transformation, and enterprises with minimal legacy constraints
- efficient data ingestion is achieved through mechanisms like scheduled exports for SaaS data, real-time CDC tools for operational databases, and dynamic queries for federated datasets
- tech advancements like the separation of compute and storage, stateless ML APIs, and unified batch/stream processing have made this architecture a powerful alternative to older paradigms that relied on data marts, ETL, and sharded storage
Data ingest
3 ingest mechanisms:
- prebuilt connectors
Third-party vendors such as Fivetran can automatically handle landing data from a wide range of sources into a cloud DWH such as BigQuery, Redshift, or Snowflake
- real-time data capture
- federated querying
BI
- make sure that the BI tool pushes all queries to the enterprise DWH and doesn’t do them on OLAP cubes (extracts of the database/DWH)
- visualisations
- offer embedded analytics
- use a semantic layer
Transformations
Rather than re-implementing transformation logic in every consuming tool, a better option is to define transformations within the DWH using SQL to create views or tables, making the results available to all use cases. Another method, reverse ETL, uses event-driven serverless functions to transform and push data into backend systems.
For handling large datasets, ELT (Extract, Load, Transform) can be used instead of traditional ETL
Alternatively, scheduled queries periodically extract and aggregate data, which reduces query costs but can result in outdated data.
Materialized views offer a balanced approach by storing precomputed results, which are automatically updated in modern DWHs, improving query performance.
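A hedged sketch contrasting the scheduled-query and materialized-view options as warehouse SQL submitted from Python; BigQuery-style syntax and configured credentials are assumed, and the dataset/table names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Option 1: a scheduled/periodic aggregation table. Cheap to query, but the
# data is only as fresh as the last run.
scheduled_sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
FROM sales.orders
GROUP BY day
"""

# Option 2: a materialized view. Precomputed results that the warehouse keeps
# up to date automatically (BigQuery-style syntax; names are invented).
mv_sql = """
CREATE MATERIALIZED VIEW analytics.daily_revenue_mv AS
SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
FROM sales.orders
GROUP BY day
"""

for sql in (scheduled_sql, mv_sql):
    client.query(sql).result()  # submits the DDL to the warehouse
```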
Organisational structure
In many organizations, there are many more business analysts than engineers. Often, this ratio is 100:1. A hub-and-spoke architecture is ideal for organizations that wish to build a data and ML platform that primarily serves business analysts. Because the hub-and-spoke architecture assumes that business analysts are capable of writing ad hoc SQL queries and building dashboards, some training may be necessary.
The central data engineering team is responsible for the filled shapes, while the business unit is responsible for the unfilled shapes
DWH to enable data scientists
Query interface
Before automating decision making, DSs need to do EDA and lots of experimentation. Thankfully, tools like Jupyter notebooks allow executing both Python and SQL (two common languages for DS) in their cells.
Storage API
ML frameworks can use a storage API to read data efficiently and in parallel out of the DWH (e.g., Spark and TensorFlow offer such integrations)
The query interface and the Storage API are the two methods DSs use to access data in the DWH for analysis, but nowadays there is a trend of DWHs being able to implement, train, and run ML algorithms directly inside them.
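A hedged sketch of the notebook-side query path; it assumes BigQuery with configured credentials, where to_dataframe() can use the BigQuery Storage API client under the hood to pull results in parallel. The table and column names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Query interface: SQL in, dataframe out. When a BigQuery Storage API client is
# available, to_dataframe() uses it to download results in parallel, which is
# the faster path for feeding ML frameworks.
df = client.query(
    "SELECT user_id, amount, churned FROM analytics.training_data"
).to_dataframe(create_bqstorage_client=True)

print(df.shape)  # ready to hand off to pandas/scikit-learn/TensorFlow pipelines
```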
Architecting Data and Machine Learning Platforms - Chapter 7 - Converging to a Lakehouse
The need for a unique architecture
Data lakes and DWHs emerged to meet the needs of different users. An organization with both types of users is faced with an unappealing choice.
User personas
Traditional DWH users are BI analysts closer to the business, focusing on deriving insights from data; they tend to be proficient in SQL
Data lake users include DAs, DEs, and DSs. They not only transform the data to make it more accessible to the business but also experiment with it and use it for ML.
Antipattern: Disconnected systems
Different IT departments or teams will manage the DWH and the data lake, resulting in resources being spent on operational aspects rather than on business insight, and leaving little capacity to focus on the key business drivers.
Maintaining 2 separate systems can cause DQ and consistency problems.
Antipattern: Duplicated data
Why not simply connect the two systems with data being synced by the platform team? We will end up with something like:
Drawbacks of such architecture:
- proliferation of the data (data exists in various forms and maybe even outside the boundaries of the platform, which may cause security and compliance risks, data freshness issues, or data versioning issues)
- slowdown in time to market (orgs may spend up to 2 weeks implementing a minor change to a report for management because they must first request that DEs modify the ETLs in order to access the data necessary to complete the task)
- limitations in data size
- infrastructure and operational cost
Converged Architecture
Two forms
The common storage can take one of two forms:
- data is stored in an open source format (Parquet, Avro, etc.) on the cloud and we can use Spark to process it (using Python, SQL, Scala, R); we can also use open table formats like Delta Lake and Apache Iceberg, or
- the common storage can be a highly optimised DWH format where we can natively leverage SQL
We need to choose based on user skills and do a proper evaluation based on some criteria.
Lakehouse on cloud storage
- cloud storage allows compute resources (like Spark clusters) to be scaled independently from storage, and compute can move to the data, rather than the other way around. This is a major advantage over on-premise systems
- instead of separate engines for data lakes and DWHs, a single analytical engine accesses all data, streamlining analysis and eliminating silos
- cloud storage’s high throughput and the ability to bring compute to data result in faster query speeds
- data democratization: enables self-service queries for data engineers, data scientists, and even business users
- centralized data governance and security, ensuring consistent policies
- supports both schema-on-read (data lake) and schema-on-write (DWH) approaches, and easily integrates with streaming data using ACID transactions with technologies like Apache Iceberg, Apache Hudi, and Delta Lake
- by using a single storage layer, this architecture reduces the need to download or copy data for offline processing
Migration strategy:
- iterative approach
- gain org buy-in by showing a quick solution
- phased decommissioning of the old system
- start with new workloads
SQL-first lakehouse
This enables users not only to execute analytics and BI, but also to carry out orchestration and ML.
- SQL’s wide familiarity makes data access and analysis easier for business users, not just technical staff
- SQL handles most transformations, while Spark is used for complex tasks and ML
- business users can train and deploy ML models with SQL, empowering data-driven decisions
- provides a single, integrated platform for data processing, analytics, and ML
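As one concrete example of in-warehouse ML driven purely by SQL, here is a hedged BigQuery ML sketch (other warehouses offer comparable features); credentials are assumed, and all dataset, table, and column names are invented:

```python
from google.cloud import bigquery  # assumes credentials and a project are configured

client = bigquery.Client()

# Training a model with plain SQL inside the warehouse (BigQuery ML syntax as
# one example). Dataset, table, and column names are invented for illustration.
train_sql = """
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM analytics.customers
"""
client.query(train_sql).result()

# Batch prediction is also just SQL.
predict_sql = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL analytics.churn_model,
                (SELECT * FROM analytics.customers_to_score))
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```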
Migration:
- iterative approach
- direct DWH ingestion
- elevate DWH SQL engine
- insource ML workloads into the DWH env
- emphasize SQL-based data transformation skills, potentially requiring a new role (analytics engineers) or skillset within your organization
The benefits of convergence
- quicker time to market
- reduced risk
- predictive analytics
- data sharing
- ACID transactions
- multimodal data support
- unified env
- schema and governance
- streaming analytics
Apache Flink 2.0 talk at Flink Forward Jakarta ‘24
The first thing that caught my attention was the new ForSt DB ("For Streaming" DB) that will be used in Flink 2.0
Next, stream-batch unification
Learned about Apache Paimon - a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
(above pic is from another Paimon talk (links below))
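To make the Flink-plus-Paimon idea concrete, here is a hedged PyFlink sketch of declaring a Paimon catalog and a primary-keyed table from Flink SQL; it assumes the paimon-flink connector jar is on the classpath, and the warehouse path, catalog, and table names are invented:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch of using Paimon as the lake format from Flink SQL (via PyFlink).
# Assumes the paimon-flink connector jar is on the classpath; names are invented.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG lake")

# A primary-keyed Paimon table: streaming upserts from Flink land here, and the
# same table can later be read in batch mode by Flink or Spark.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    )
""")

t_env.execute_sql(
    "INSERT INTO orders VALUES (1, 10.0, TIMESTAMP '2024-12-26 10:00:00')"
).wait()
```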
They introduce this materialized table:
ML models in Flink SQL
Other talks from Flink Forward Jakarta
- Apache Flink 101
- Flink + Paimon: Build Streaming Pipeline on Lakehouse
- Paimon 1.0 Unified Lake Format for Data + AI (similar to the above)
- Embrace Unified Stream and Batch Processing with Flink SQL
That is all for today!
See you tomorrow :)