(Day 361) More of Architecting Data and Machine Learning Platforms + Some Flink Forward talks

Ivan Ivanov · December 27, 2024

Hello :) Today is Day 361!

A quick summary of today:

  • read a few chapters of the Architecting Data and Machine Learning Platforms book
  • watched talks on Apache Paimon from Flink Forward Jakarta ‘24

Architecting Data and Machine Learning Platforms - Chapter 3 - Designing for Your Data Team

When designing a data platform, technical aspects like performance, cost, and integration of new approaches are important. However, addressing company culture, employee skills, and willingness to adapt is crucial for success. Implementing a data platform often requires employees to learn new skills, change workflows, and potentially transition into new roles.

We should tailor our strategies to existing talent:

  • focus on existing strengths
  • consider workforce domain knowledge and evaluate the feasibility of upskilling
  • explore tools like generative AI or no-code platforms to empower a broader workforce

However, public cloud tech and modern tools have blurred traditional distinctions between roles (data analysts, engineers, scientists) as analysts can now handle tasks previously requiring engineers, while blended ETL-ELT approaches enable flexibility. This convergence of skills leads to more efficient data processing.

Organizations can generally be classified as:

  • data analysis-driven: focus on leveraging existing analytical skills
  • data engineering-driven: build robust engineering capabilities
  • data science-driven: prioritize advanced algorithm development

Data Analysis–Driven Organization

DAs play a central role in decision making. It is important to note that whether an organization is analyst-driven is not a black-and-white issue but rather a spectrum of overlapping characteristics.

The vision

The vision for modern cloud data platforms is to empower analysts to perform advanced analytics using familiar interfaces like spreadsheets, SQL, and BI tools. Traditional ETL processes are being replaced by ELT, where data is ingested into the cloud and transformations occur within the cloud data warehouse (DWH). This approach simplifies data pipelines, reduces dependency on data engineering teams, and enables analysts to leverage SQL skills for faster, more direct data insights.

To scale cloud data platform adoption, analysts should have access to user-friendly tools (like dbt) and interfaces (Excel, Tableau) for analytics and visualization. Modern BI tools, such as Power BI, Tableau, or Looker, are recommended for handling large datasets efficiently, overcoming the limitations of traditional spreadsheets. This setup accelerates data analysis and enhances the timeliness and usability of insights.
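
Since ELT inside the warehouse is the pattern this chapter keeps coming back to, here is a minimal sketch of what it can look like in practice. It assumes the google-cloud-bigquery client and made-up bucket/dataset/table names; other warehouses have equivalent load-then-SQL workflows.

```python
# A minimal ELT sketch (hedged): raw events are loaded as-is into the warehouse,
# and the transformation happens afterwards in SQL, where analysts can own it.
from google.cloud import bigquery

client = bigquery.Client()

# E + L: land the raw file without reshaping it first (hypothetical bucket/table).
load_job = client.load_table_from_uri(
    "gs://raw-bucket/events/2024-12-27.json",
    "analytics_raw.events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()

# T: the transformation is plain SQL inside the warehouse.
client.query("""
    CREATE OR REPLACE TABLE analytics.daily_active_users AS
    SELECT DATE(event_ts) AS day, COUNT(DISTINCT user_id) AS dau
    FROM analytics_raw.events
    GROUP BY day
""").result()
```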

The personas

image

Data analysts:

  • focus on fulfilling business information needs by designing, maintaining, and transforming data
  • tasks include creating table layouts, reorganizing data, and generating reports with insights
  • skill expansion is crucial, emphasizing deep business domain knowledge and technical proficiency in analyzing large datasets using tools like SQL and BI platforms

Business analysts:

  • domain experts using data for actionable insights
  • cloud-based data warehouses (DWHs) and serverless technologies enable analysts to focus on value-driven analysis rather than technical tasks
  • while proficient in no-code/low-code ML models, they lack skills for complex workflows like ranking or recommendations, necessitating data science teams for advanced tasks

Data Engineers:

  • manage the downstream data pipeline, including loading, integration, governance, and data quality
  • promote ELT processes over ETL, leveraging SQL for post-loading transformations, improving efficiency and accessibility
  • handle complex tasks like batch jobs, mainframe data integration, and real-time analytics using tools like Pub/Sub, Kafka, or Kinesis
  • set up templates for generic tasks (e.g., DQ checks) that analysts can reconfigure (a minimal template sketch follows this list)
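
To make the "templates analysts can reconfigure" idea concrete, here is a minimal sketch of a parameterised data-quality check. The class, table/column names, and the BigQuery-style COUNTIF are my own illustrative assumptions; the point is that engineers own the template and analysts only change the parameters.

```python
# Hypothetical reusable DQ template (engineer-owned, analyst-configured).
from dataclasses import dataclass

@dataclass
class NotNullCheck:
    table: str
    column: str
    max_null_ratio: float = 0.0  # analysts relax or tighten this per dataset

    def to_sql(self) -> str:
        # BigQuery-style COUNTIF; other warehouses would use SUM(CASE WHEN ... THEN 1 ELSE 0 END).
        return (
            f"SELECT COUNTIF({self.column} IS NULL) / COUNT(*) <= {self.max_null_ratio} AS passed "
            f"FROM {self.table}"
        )

# An analyst reconfigures the template without touching the SQL itself.
check = NotNullCheck(table="analytics.orders", column="customer_id", max_null_ratio=0.01)
print(check.to_sql())
```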

The technological framework

There are three fundamental principles underlying the high-level reference architecture for an analysis-driven organization:

  • SQL as a standard
  • from EDW/data lake to a structured data lake
  • schema-on-read first approach

Combining these principles, we can define a high-level architecture that supports a few key data analysis patterns:

  • the ‘traditional’ BI workloads, such as creating a dashboard or report
  • an ad hoc analytics interface that allows for the management of data pipelines through SQL (ELT)
  • enabling data science use cases with ML techniques
  • real-time streaming of data into the DWH and processing of real-time events

image

Data Engineering–Driven Organization

An engineering-driven organization is focused on data integration.

The vision

When your data transformation needs are complex, you need data engineers to play a central role in the company’s data strategy to ensure that you can build reliable systems cost-effectively. Data engineers are at the crossroads between data owners and data consumers.

Data engineers have the following responsibilities:

  • transporting data and enriching data while building integrations between analytical and operational systems (as in the real-time use cases)
  • parsing and transforming messy data coming from business units and external suppliers into meaningful and clean data, with documented metadata
  • applying DataOps — that is, functional knowledge of the business plus software engineering methodologies applied to the data lifecycle; this includes monitoring and maintenance of data feeds
  • deploying ML models and other data science artifacts analyzing or consuming data

Building complex data engineering pipelines is expensive but enables increased capabilities:

  • enriched data
  • better ML training data
  • using unstructured data
  • productionising
  • real-time analytics

The personas

  • knowledge: data warehousing and data lakes, db systems, ETL tools, programming languages, data structures and algorithms, automation and scripting, container technology, orchestration platforms
  • responsibilities: data pipeline development, data integration, DQ assurance, automation, collaboration with data consumers, scalability and performance optimization, monitoring and troubleshooting, deployment and maintenance of data systems, support for machine learning models, data security and compliance

The technological framework

The technological framework emphasizes that data engineers should focus on data preparation (cleaning, transformation, enrichment) and not on maintaining infrastructure. The ingestion layer uses scalable solutions like Kinesis, Event Hubs, or Pub/Sub for real-time data and S3, Azure Data Lake, or Cloud Storage for batch data. These solutions automatically scale based on workload demands.
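
As a small illustration of that ingestion layer, here is a hedged sketch of pushing an event into a Kinesis stream with boto3; the stream name and payload are made up, and Pub/Sub or Event Hubs clients follow the same "publish and let the managed service scale" shape.

```python
# Minimal sketch: a producer hands events to a managed, auto-scaling ingestion service.
# Stream name, region, and payload are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"order_id": "o-123", "amount": 42.5, "ts": "2024-12-27T10:00:00Z"}

kinesis.put_record(
    StreamName="orders-events",       # assumed stream, created elsewhere
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["order_id"],   # controls shard assignment
)
```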

Example reference architecture on AWS:

image

On GCP:

image

The reference architecture’s main benefits include the use of serverless, NoOps technologies that reduce maintenance burdens and allow for flexibility and scalability, such as handling traffic spikes. It also supports custom code development for complex parsing and low-latency business logic, with an emphasis on reusing code through libraries and templates. Additionally, it enables the transformation of existing batch pipelines into streaming pipelines.

Data Science–Driven Organization

These organizations leverage data science and machine learning to gain a competitive advantage. They rely on automated decision-making and strive for high business impact from their data initiatives.

For example, a bank that is data analysis driven would have data analysts assess each commercial loan opportunity, build an investment case, and have an executive sign off on it. A data science–driven fintech, on the other hand, would build a loan approval system that makes decisions on the majority of loans using some sort of automated algorithm.

The vision

Principles to follow:

  • adaptability
  • standardization
  • accountability
  • business impact
  • implementation

A science-driven organization should be built on a technical platform that is highly adaptable in terms of technological openness.

The personas

  • DEs
  • MLEs
  • DSs
  • DAs

The technological framework

The authors strongly recommend standardizing on the ML pipelines infrastructure of your public cloud provider rather than cobbling together one-off training and deployment solutions.

image


Architecting Data and Machine Learning Platforms - Chapter 4 - A Migration Framework

Unless you are at a startup, it is rare that you will build a data platform from scratch. Instead, you will stand up a new data platform by migrating things into it from legacy systems.

Modernise data workflows

Before starting to make a migration plan, we ought to have a comprehensive vision of why we are doing it and what we are migrating toward.

Holistic view

Data modernization transformation should be considered holistically. There are 3 main pillars:

  • what are the business outcomes of the workflows we aim to modernise
  • identify and maybe help educate the stakeholders
  • tech stack

It’s important to not treat the migration as a pure IT or data science project.

Modernise workflows

Modernizing workflows involves focusing on end-to-end tasks, not just upgrading tools. Instead of like-for-like replacements, rethink workflows from first principles to achieve transformational change. Key principles include:

  • automation
  • streaming by default
  • automatic scaling
  • built-in optimization
  • automated ML

Transform the workflow itself

Focus on transforming workflows by questioning whether they need precomputation by data engineers. Prebuilt data pipelines are optimizations, not necessities. Instead, consider enabling self-serve, ad hoc workflows using tools like Looker to move calculations into a declarative semantic layer.

This approach empowers business teams to define and manage metrics, ensuring consistent KPIs and reducing dependency on data engineering resources while fostering a library of reusable measures.

A 4-step migration framework

image

  • Prepare and discover

All stakeholders should conduct a preliminary analysis to identify the list of workloads that need to be migrated and current pain points (e.g., inability to scale, process engine cannot be updated, unwanted dependencies, etc.).

  • Assess and plan

Assess the information collected in the previous stage, define the key measures of success, and plan the migration of each component.

There will be different groups depending on their business value and effort to migrate.

image

  • Execute

For each identified use case, decide whether to decommission it, migrate it entirely (data, schema, downstream and upstream applications), or offload it (by migrating downstream applications to a different source). Afterward, test and validate any migration done.

  • Optimize

Once the process has begun, it can be expanded and improved through continuous iterations. A first modernization step could focus only on core capabilities.

Estimating the overall cost of the solution

It is important to remember that it is not just a matter of the cost of technology — there are always people and process costs that have to be taken into account.

Audit of the existing infrastructure

Can be done:

  • manually by the internal IT/infrastructure team
  • automatically by the internal IT/infrastructure team
  • leveraging a 3rd-party

Request for information/proposal and quotation

Consulting companies do these cost calculations for a living, so we can consult with them as well. We can send some documents to potential vendors/partners and, once we have all the information, pick the best path forward. Docs:

  • request for information (questionnaire used to collect detailed info about vendors’/possible partners’ solutions and services)
  • request for proposal (questionnaire used to collect detailed info about how vendors/partners will leverage their products and services to solve a specific organization problem)
  • request for quotation (questionnaire used to collect detailed info about pricing of different vendors/possible partners based on specific requirements)

PoC/MVP

  • proof of concept: builds a small, testable portion of the solution to assess feasibility, usability, and scalability. It focuses on potential redesign needs without covering all features
  • minimum viable product: develops a fully functional, narrowly scoped product deployable in production, enabling feedback from real users for improvement and cost estimation
  • hybrid: starts with a broad but shallow PoC, transitioning to an MVP to refine and progress toward the complete solution

Setting up security and data governance

Framework

There are three risk factors that security and governance seek to address:

  • unauthorised data access
  • regulatory compliance
  • visibility

Given these risk factors a data governance framework should include:

  • data lineage
  • data classification
  • data catalog
  • data quality management
  • access management
  • auditing
  • data protection

Artifacts

To provide the above capabilities to the organization, a central data governance team needs to maintain the following artifacts:

  • enterprise dictionary
  • data classes
  • policy book
  • use case policies
  • data catalog

Governance over the life of the data

Data governance involves bringing together people, processes, and technology over the life of the data.

The data lifecycle consists of the following stages:

  1. data creation
  2. data processing
  3. data storage
  4. data catalog
  5. data archive
  6. data destruction

When it comes to data, we can have different ‘hats’:

  • legal: ensures that data usage conforms to contractual requirements and government/industry regulations
  • data steward: the owner of the data, who sets the policy for a specific item of data
  • data governor: sets policies for data classes and identifies which class a particular item of data belongs to
  • privacy tsar: ensures that use cases do not leak personally identifiable information
  • data user: typically a data analyst or data scientist who is using the data to make business decisions

Schema, Pipeline, and Data migration

Schema migration

When migrating a legacy application to a new system, it is best to first transfer the existing schema as-is, connecting upstream and downstream processes. After ensuring functionality in the new environment, we need to make schema changes in a second phase to minimize downtime risks. Using the facade pattern, we can create views that mask underlying table complexities, presenting a simplified schema to downstream processes while leveraging target system features. If this is not feasible, data must be transformed and converted before ingestion, handled by dedicated transformation pipelines.
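
As one way to picture the facade pattern: the migrated table keeps its as-is layout, while a view presents the cleaned-up schema to downstream processes. Table and column names below are hypothetical, and the DDL is generic SQL run through any DB-API-style connection to the target system.

```python
# Minimal facade-pattern sketch: a view hides the legacy layout from downstream consumers.
# Table/column names are made up; exact DDL varies by target warehouse.
FACADE_VIEW_DDL = """
CREATE OR REPLACE VIEW analytics.orders AS
SELECT
  order_id,
  CAST(order_ts AS TIMESTAMP)   AS order_timestamp,   -- normalise a legacy string column
  UPPER(country_code)           AS country_code,
  total_amount_cents / 100.0    AS total_amount
FROM staging.legacy_orders_raw
"""

def create_facade_view(conn) -> None:
    """Run the DDL through a DB-API 2.0 connection to the target system."""
    cur = conn.cursor()
    cur.execute(FACADE_VIEW_DDL)
    cur.close()
    conn.commit()
```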

Pipeline migration

There are two strategies:

  • we are offloading the workload: we retain the upstream data pipes feeding our source system and put an incremental copy of the data into the target system. We then update our downstream processes to read from the target system, and continue the offload with the next workload until we reach the end. Once completed, we can start fully migrating the data pipeline
  • we are fully migrating the workload: we migrate everything into the new system and then deprecate the corresponding legacy tables

General data pipeline patterns:

  • ETL
  • ELT
  • EL
  • CDC

Data migration

  • data transfer requires planning. We need to identify and involve the technical owners, the approvers, and the migration (delivery) team

Before proceeding we need to answer:

  • what are the datasets we need to move
  • where does the underlying data sit within the organisation
  • what are the datasets we are allowed to move
  • is there any specific regulatory requirement we have to respect
  • where is the data going to land
  • what is the destination region
  • do we need to perform any transformation before the transfer
  • what are the data access policies we want to apply
  • is it a one-off transfer, or do we need to move data regularly
  • what are the resources available for the data transfer
  • what is the allocated budget
  • do you have adequate bandwidth to accomplish the transfer, and for an adequate period
  • do you need to leverage offline solutions
  • what is the time needed to accomplish the entire data migration

Regional capacity and network to the cloud

  • public internet connection
  • partner interconnect
  • direct interconnect

Transfer option considerations

  • cost
  • time
  • offline vs. online transfer
  • available tools

Migration stages

image

  1. upstream processes feeding current legacy data solutions are modified to feed the target env
  2. downstream processes reading from the legacy env are modified to read from the target env
  3. historical data is migrated in bulk into the target env; at this point, the upstream processes are migrated to also write to the target env
  4. downstream processes are now connected to the target env
  5. data can be kept in sync between the legacy and target envs, leveraging CDC pipelines until the old env is fully decommissioned
  6. downstream processes become fully operational, leveraging the target env

Checks to do to ensure minimum bottlenecks or data transfer issues:

  • perform a functional test
  • perform a performance test
  • perform a data integrity check

Architecting Data and Machine Learning Platforms - Chapter 5 - Architecting a Data Lake

Data lake and the Cloud - a perfect marriage

Challenges with on-premises data lakes

  • TCO
  • scalability
  • governance
  • agility
  • resource-intensive data and analytics processing

Benefits of cloud data lakes

  • no need to store all data in an expensive HDFS cluster
  • Hadoop clusters can be created on demand in a short amount of time
  • hyperscalers usually offer capabilities to leverage less expensive VMs
  • managed VMs
  • separation between storage and compute which gives orgs better flexibility in handling data governance challenges

Design and implementation

Batch and Stream

Whether batch or stream, there are four storage areas in a data lake (a minimal zone-promotion sketch follows this list):

  • raw/landing/bronze (where raw data is collected and ingested directly from source systems)
  • staging/dev/silver (where more advanced users process the data to prepare it for final users)
  • production/gold (where the final data used by prod systems is stored)
  • sensitive (optional)
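
Here is a minimal PySpark sketch of promoting data from the raw/bronze zone to the staging/silver zone; the paths, columns, and cleaning rules are assumptions, and a real pipeline would add DQ checks and orchestration around it.

```python
# Hypothetical bronze -> silver promotion job (paths and columns are made up).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.json("s3://lake/bronze/orders/")   # raw, as ingested

silver = (bronze
          .dropDuplicates(["order_id"])                          # basic cleaning
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .withColumn("order_date", F.to_date("order_ts"))
          .filter(F.col("amount") >= 0))

# Silver is stored columnar and partitioned for the more advanced users downstream.
silver.write.mode("overwrite").partitionBy("order_date").parquet("s3://lake/silver/orders/")
```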

Lambda architecture:

image

Kappa architecture:

image

Data catalog

image

A centralized repository of metadata describing organizational datasets, making scattered data across databases, data warehouses, filesystems, and blob storage more discoverable and accessible. It supports various users like data scientists, engineers, and business analysts by enabling efficient data discovery and rationalization.

  • connects to data lakes and other platforms, ensuring a comprehensive metadata view without duplicating data
  • helps identify and manage duplicated, unused, or similar data, enabling cost savings by focusing on relevant data
  • facilitates data syncing, transformations, and updates in the lake while maintaining clarity on dataset levels
  • uses JSON/YAML files to formalise agreements (data contracts) on data schema, ingestion frequency, ownership, and access levels (a minimal contract sketch follows this list)
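
For illustration, here is a minimal data-contract sketch of the kind of JSON/YAML agreement mentioned above; the field names and values are my own assumptions, read here with plain Python and PyYAML.

```python
# Hypothetical data contract; the catalog (or a CI job) could validate datasets against it.
import yaml  # PyYAML

CONTRACT = yaml.safe_load("""
dataset: analytics.orders
owner: sales-data-team@example.com
ingestion_frequency: hourly
access_level: internal
schema:
  - {name: order_id, type: string, nullable: false}
  - {name: amount, type: float, nullable: false}
  - {name: order_ts, type: timestamp, nullable: false}
""")

required_columns = {c["name"] for c in CONTRACT["schema"] if not c["nullable"]}
print(CONTRACT["dataset"], "requires", sorted(required_columns))
```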

Cloud data lake reference architecture

AWS

image

  • data sources, which are S3 and NoSQL databases
  • storage zones on top of S3 buckets
  • data catalog, data governance, security and process engine orchestrated by AWS Lake Formation
  • analytical services such as Athena, Redshift, and EMR to provide access to the data

Azure

image

GCP

image

When choosing a cloud vendor, you should also consider the datasets that are natively available through their offerings. In the marketing or advertising space, for example, it can be very impactful to implement your business solutions using these datasets. Some examples of datasets include Open Data on AWS, the Google Cloud Public Datasets, and the Azure Open Datasets. If you advertise on Google or sell on Amazon, the ready-built integrations between the different divisions of these companies and their respective cloud platforms can be particularly helpful.

Integrating the data lake: the real superpower

The superpower of a data lake is its ability to connect the data with an unlimited number of process engines to activate potentially any use case you have in mind.

  • there are many data sources that can be accessed through APIs

The evolution of data lakes with Apache Iceberg, Apache Hudi, and Delta Lake

  • ACID compliance, giving users the certainty that the info they are querying is consistent
  • overcoming the inherent small-file limitations of HDFS - with this approach, even small files work well
  • logging of every single change made to the data, guaranteeing a complete audit trail if ever necessary and enabling time travel queries (see the sketch after this list)
  • no difference in handling batch and streaming ingestions and elaboration
  • full compatibility with the Spark engine
  • storage based on Parquet that can achieve a high level of compression
  • stream processing enablement
  • ability to treat the object store as a db
  • applying governance at the row and column level
  • partition evolution
  • schema evolution
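
To show what time travel and these table-format features look like from a query engine, here is a hedged PySpark + Delta Lake sketch; the path and version number are made up, and Iceberg/Hudi expose equivalent snapshot reads.

```python
# Minimal sketch: reading a Delta table at its latest and at an older snapshot.
# Assumes the delta-spark package is available; the path/version are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-time-travel")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

latest = spark.read.format("delta").load("s3://lake/gold/orders")

# Every committed write produces a new table version, which enables audits and time travel.
as_of_v3 = (spark.read.format("delta")
            .option("versionAsOf", 3)        # or .option("timestampAsOf", "2024-12-01")
            .load("s3://lake/gold/orders"))
```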

Interactive analytics with notebooks

Jupyter general notebook architecture and Spark-based kernel:

image

Democratising data processing and reporting

Build trust in the data

Establishing trust in data is essential for organizations, particularly when transitioning from IT-managed to democratized data access. The example of a retail consultancy highlights the importance of transparency in data extraction, transformation, and reporting. When IT teams lack in-depth knowledge, they may struggle to assure decision-makers of data reliability. Empowering end users with tools to explore, audit, and manage data fosters autonomy and trust.

  • identify the key stakeholders who have the right information on a timely basis
  • defend the data from any kind of exfiltration, both internal and external, with a focus on personnel
  • cooperate with other people inside or outside the company to unlock the value of data

Data ingestion is still an IT matter

Data ingestion is a critical step in maintaining an efficient data lake and preventing it from becoming a data swamp filled with unused, stale, and inconsistent data. Poor planning during ingestion can lead to wasted storage, errors, and inefficiencies. To avoid this, organizations should follow some best practices:

  • file formats (JSON, Avro, Parquet)
  • file sizes (in Hadoop, larger files are better; for many small files, it is a good idea to use a streaming engine to consolidate the data into a few large files)
  • data compression to save space
  • directory structure (tailor the structure to the use case; see the partitioned-write sketch after this list)
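
A minimal sketch of those file-layout best practices in PySpark: columnar format, compression, and a partitioned directory structure. Paths, columns, and the consolidation strategy are assumptions.

```python
# Hypothetical write that applies the best practices above:
# Parquet format, snappy compression, date-partitioned directories, few large files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-layout").getOrCreate()
events = spark.read.json("s3://lake/bronze/events/2024-12-27/")

(events
 .repartition("event_date")          # consolidate many small files per partition
 .write
 .mode("append")
 .option("compression", "snappy")
 .partitionBy("event_date")          # directory structure: .../event_date=2024-12-27/
 .parquet("s3://lake/silver/events/"))
```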

Watching projects from Zach Wilson’s Analytics Eng boot camp

1st project

Big Bag Data - build an automated pipeline that scrapes, cleans, and transforms data on sold handbags - then build a website that presents this data in a meaningful way

image

2nd project

F1 Insights: Real-Time Replay & Historical Analytics - github

image

3rd project

Global historical climatology network daily

Data model:

image

4th project

Real-time sports betting analytics platform - github

Real-time pipe:

image

Batch pipe:

image

5th project

Chess.com analytics project

6th project

Meme coin tracker - understand who the profitable traders are and find what factors get a coin promoted to better markets

image


Architecting Data and Machine Learning Platforms - Chapter 6 - Innovating with an EDW

A modern data platform

Organisational goals

  • no silos
  • democratisation
  • discoverability
  • stewardship
  • single source
  • security and compliance
  • ease of use
  • data science
  • agility

Tech challenges

  • size and scale
  • complex data and use cases
  • integration
  • real time

Tech trends and tools

  • separation of compute and storage
  • multitenancy
  • separation of authentication and authorisation
  • analytics hubs
  • multicloud semantic layer
  • consistent admin interface
  • cross-cloud control plane
  • multicloud platforms
  • convergence of data lakes and DWHs
  • built-in ML
  • streaming ingest
  • streaming analytics
  • CDC
  • embedded analytics

Hub-and-Spoke architecture

image

This architecture is an ideal design for modern cloud data platforms centered around a DWH. The DWH acts as the central hub, collecting and managing all enterprise data, while the spokes—custom applications, dashboards, ML models, and more—interact with the DWH via standard interfaces like APIs. This structure enables seamless data access without duplication and offers scalability, flexibility, and resilience.

  • well-suited for startups, organizations seeking transformation, and enterprises with minimal legacy constraints
  • efficient data ingestion is achieved through mechanisms like scheduled exports for SaaS data, real-time CDC tools for operational databases, and dynamic queries for federated datasets
  • tech advancements like the separation of compute and storage, stateless ML APIs, and unified batch/stream processing have made this architecture a powerful alternative to older paradigms that relied on data marts, ETL, and sharded storage

Data ingest

3 ingest mechanisms:

  • prebuilt connectors

image

Third-party vendors such as Fivetran can automatically handle landing data from a wide range of sources into a cloud DWH such as BigQuery, Redshift, or Snowflake.

  • real-time data capture

image

  • federated querying (a minimal sketch below)
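
A hedged sketch of federated querying: the DWH queries an operational database in place instead of ingesting it first. This uses BigQuery's EXTERNAL_QUERY with a made-up connection ID and tables; other warehouses offer similar federation features.

```python
# Hypothetical federated query: join warehouse data with an operational table in place.
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT w.customer_id, w.total_amount, ops.segment
    FROM analytics.order_totals AS w
    JOIN EXTERNAL_QUERY(
        'my-project.eu.crm-connection',                 -- assumed connection resource
        'SELECT customer_id, segment FROM customers'    -- runs in the operational DB
    ) AS ops
    USING (customer_id)
""").to_dataframe()
```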

BI

image

  • make sure that the BI tool pushes all queries down to the enterprise DWH and doesn't run them against OLAP cubes (extracts of the database/DWH)
  • visualisations
  • offer embedded analytics
  • use a semantic layer

Transformations

Rather than repeating transformations in every downstream tool, a better option is to define them within the DWH using SQL to create views or tables, making the results available to all use cases. Another method, reverse ETL, uses event-driven serverless functions to transform and push data into backend systems.

For handling large datasets, ELT (extract, load, transform) can be used instead of traditional ETL.

Alternatively, scheduled queries periodically extract and aggregate data, which reduces query costs but can result in outdated data.

Materialized views offer a balanced approach by storing precomputed results, which are automatically updated in modern DWHs, improving query performance.
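
As a small example of these "transform inside the DWH" options, here is a hedged sketch that creates a view and a materialized view with the google-cloud-bigquery client; the dataset/table names are made up and the exact materialized-view support and syntax differ between warehouses.

```python
# Hypothetical transformations defined inside the warehouse rather than in each tool.
from google.cloud import bigquery

client = bigquery.Client()

# A plain view: always fresh, recomputed on every query.
client.query("""
    CREATE OR REPLACE VIEW analytics.order_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM analytics.orders
    GROUP BY customer_id
""").result()

# A materialized view: precomputed results, refreshed by the warehouse itself.
client.query("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.daily_revenue AS
    SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
    FROM analytics.orders
    GROUP BY day
""").result()
```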

Organisational structure

In many organizations, there are many more business analysts than engineers. Often, this ratio is 100:1. A hub-and-spoke architecture is ideal for organizations that wish to build a data and ML platform that primarily serves business analysts. Because the hub-and-spoke architecture assumes that business analysts are capable of writing ad hoc SQL queries and building dashboards, some training may be necessary.

image

The central data engineering team is responsible for the filled shapes, while the business unit is responsible for the unfilled shapes.

DWH to enable data scientists

Query interface

Before automating decision making, DSs need to do EDA and lots of experimentation. Thankfully, tools like Jupyter notebooks allow executing both Python and SQL in their cells (two common languages for DS).

Storage API

Using the Storage API that ML frameworks support to read data efficiently and in parallel out of the DWH (e.g., Spark and TensorFlow can read through such an API)

The query interface and the Storage API are the two methods DSs use to access data in the DWH for analysis, but nowadays there is a trend of DWHs being able to implement, train, and use an ML algorithm directly inside them.
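
A minimal sketch of the query-interface path from a notebook: the SQL runs in the DWH and only a small, aggregated result comes back as a DataFrame for EDA. Shown with the google-cloud-bigquery client and hypothetical table names; the Storage API path would instead stream the full table out in parallel.

```python
# Hypothetical EDA cell: push the heavy lifting to the warehouse, pull back a small result.
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT DATE(order_ts) AS day, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM analytics.orders
    GROUP BY day
    ORDER BY day
""").to_dataframe()

df.describe()
```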


Architecting Data and Machine Learning Platforms - Chapter 7 - Converging to a Lakehouse

The need for a unique architecture

Data lakes and DWHs emerged to meet the needs of different users. An organization with both types of users is faced with an unappealing choice.

User personas

Traditional DWH users are BI analysts closer to the business, focusing on deriving insights from data; they tend to be proficient in SQL.

Data lake users include DAs, DEs and DSs. They not only transform the data to make it more accessible by the business but also experiment with it and use it for ML.

Antipattern: Disconnected systems

Different IT departments or teams will manage the DWH and the data lake, so resources are spent on operational aspects rather than on business insight and cannot be allocated to the key business drivers.

Maintaining two separate systems can also cause DQ and consistency problems.

Antipattern: Duplicated data

Why not simply connect the two systems with data being synced by the platform team? We will end up with something like:

image

Drawbacks of such architecture:

  • proliferation of the data (data exists in various forms and maybe even outside the boundaries of the platform, which may cause security and compliance risks, data freshness issues, or data versioning issues)
  • slowdown in time to market (orgs may spend up to 2 weeks implementing a minor change to a report for management because they must first request that DEs modify the ETLs in order to access the data necessary to complete the task)
  • limitations in data size
  • infrastructure and operational cost

Converged Architecture

Two forms

The common storage can take one of two forms:

  • data is stored in an open source format (Parquet, Avro, etc.) in the cloud and we can use Spark to process it (using Python, SQL, Scala, or R); we can also use open table formats like Delta Lake and Apache Iceberg, or
  • the common storage can be a highly optimised DWH format where we can natively leverage SQL

We need to choose based on user skills and do a proper evaluation based on some criteria.

Lakehouse on cloud storage

  • cloud storage allows compute resources (like Spark clusters) to be scaled independently from storage, and compute can move to the data rather than the other way around. This is a major advantage over on-premises systems
  • instead of separate engines for data lakes and DWHs, a single analytical engine accesses all data, streamlining analysis and eliminating silos
  • cloud storage’s high throughput and the ability to bring compute to data result in faster query speeds
  • data democratization: enables self-service queries for data engineers, data scientists, and even business users
  • centralized data governance and security, ensuring consistent policies
  • supports both schema-on-read (data lake) and schema-on-write (DWH) approaches, and easily integrates with streaming data using ACID transactions with technologies like Apache Iceberg, Apache Hudi, and Delta Lake
  • by using a single storage layer, this architecture reduces the need to download or copy data for offline processing

Migration strategy:

  • iterative approach
  • gain org buy-in by showing a quick solution
  • phased decommissioning of the old system
  • start with new workloads

SQL-first lakehouse

This enables users to execute analytics and BI, but also carry out orchestration and ML.

  • SQL’s wide familiarity makes data access and analysis easier for business users, not just technical staff
  • SQL handles most transformations, while Spark is used for complex tasks and ML
  • business users can train and deploy ML models with SQL, empowering data-driven decisions (a minimal sketch follows this list)
  • provides a single, integrated platform for data processing, analytics, and ML
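
To illustrate "train and deploy ML models with SQL", here is a hedged BigQuery ML-style sketch; the table, features, and label are my own assumptions, and other SQL-first lakehouses expose similar CREATE MODEL / PREDICT constructs.

```python
# Hypothetical BigQuery ML-style training and prediction, driven entirely by SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Train a model inside the warehouse; the data never leaves it.
client.query("""
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM analytics.customer_features
""").result()

# Score new rows through the same SQL interface.
preds = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL `analytics.churn_model`,
                    (SELECT * FROM analytics.customer_features_latest))
""").to_dataframe()
```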

Migration:

  • iterative approach
  • direct DWH ingestion
  • elevate DWH SQL engine
  • insource ML workloads into the DWH env
  • emphasize SQL-based data transformation skills, potentially requiring a new role (analytics engineers) or skillset within your organization

The benefits of convergence

  • quicker time to market
  • reduced risk
  • predictive analytics
  • data sharing
  • ACID transactions
  • multimodal data support
  • unified env
  • schema and governance
  • streaming analytics

Flink Forward Jakarta '24 talks

The first thing that caught my attention was the new ForSt DB ('For Streaming' DB) that will be used in Flink 2.0

image

Next, stream-batch unification

Learned about Apache Paimon - a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

image

(above pic is from another Paimon talk (links below))

image

image

They introduce this materialized table:

image

image

ML models in Flink SQL

image



That is all for today!

See you tomorrow :)