Hello :) Today is Day 341!
A quick summary of today:
- long stream
- starting the book Stream Processing with Apache Flink
Streamed for ~3.5hrs
Found and worked through this nice free course by Confluent on Kafka and Confluent Platform fundamentals.
It went in depth on how Kafka works and its core components - producers, brokers, and consumers. The videos are only ~1.5hrs in total, but there are also practical exercises where I spun up a VM and ran different commands in the terminal to play with producers, consumers, and topics.
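As a code-level counterpart to those CLI exercises, here's a minimal Kafka producer sketch in Scala (my own example, not from the course; the broker address and topic name are made up):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object HelloProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // assumed: a single broker running locally
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "test-topic" is a hypothetical topic created beforehand
    producer.send(new ProducerRecord[String, String]("test-topic", "key1", "hello kafka"))
    producer.flush()
    producer.close()
  }
}
```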
The second part of the course is an optional mini-test with 23 questions, and after passing it I got:
Stream Processing with Apache Flink - Chapter 1 - Introduction to Stateful Stream Processing
Traditional data infrastructures
- Transactional processing
- Analytical processing
Stateful stream processing
Stateful stream processing is a design pattern for handling unbounded streams of events, such as user interactions, orders, or sensor data. It enables applications to store and access intermediate data (state) for complex computations, with storage options including program variables, local files, or databases. Apache Flink is a distributed system that manages state locally in memory or embedded databases, protecting against data loss through periodic checkpoints stored remotely.
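To make the "state" idea concrete, here is a toy sketch of keyed state in Flink's Scala DataStream API (my own example, not from the book): a running sum per key, kept in a ValueState that Flink manages and checkpoints for you.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object StatefulSum {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env
      .fromElements(("a", 1L), ("b", 2L), ("a", 3L))
      .keyBy(_._1)            // state below is scoped per key
      .flatMap(new RunningSum)
      .print()                // per key: ("a",1), ("a",4), ("b",2)
    env.execute("stateful running sum")
  }
}

// keeps one Long of state per key and emits the updated sum for each event
class RunningSum extends RichFlatMapFunction[(String, Long), (String, Long)] {
  @transient private var sum: ValueState[Long] = _

  override def open(parameters: Configuration): Unit =
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[Long]("sum", classOf[Long]))

  override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
    val updated = sum.value() + in._2 // empty state unboxes to 0L in Scala
    sum.update(updated)
    out.collect((in._1, updated))
  }
}
```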
Applications often connect to event logs, such as Apache Kafka, which ensure durable, ordered streams that can be replayed. This connection allows state recovery after failures, updates, bug fixes, or A/B testing by restoring state and replaying events.
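In Flink, turning on that checkpoint-based recovery is mostly configuration. A minimal sketch (the interval and path are made-up values, and FsStateBackend is the book-era API, replaced in newer releases):

```scala
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // snapshot all operator state every 10s to durable storage; on failure,
    // Flink restores the latest checkpoint and the source (e.g. a Kafka
    // consumer) rewinds to the offsets recorded in that checkpoint
    env.enableCheckpointing(10000L)
    env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints")) // illustrative path
    env.fromElements(1, 2, 3).map(_ * 2).print()
    env.execute("checkpointed job")
  }
}
```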
Stateful stream processing is used across three primary application categories:
- event-driven applications
- data pipeline applications
- data analytics applications
Real-world use cases often combine aspects of these categories.
Event-driven applications
Typical use cases for event-driven applications include:
- real-time recommendations
- pattern detection or complex event processing
- anomaly detection
Event-driven applications are an evolution of microservices. They communicate via event logs instead of REST calls, and they hold application data as local state instead of writing it to and reading it from an external datastore, such as a relational database or key-value store.
Event-driven applications are connected by event logs: one application emits its output to an event log, and other applications consume those events. The event log decouples senders and receivers and provides asynchronous, nonblocking event transfer. Each application can be stateful and manage its own state locally without accessing external datastores. Applications can also be operated and scaled individually.
Event-driven applications demand robust stream processors with advanced API expressiveness, reliable state management, and strong event-time processing support. Key requirements include exactly-once state consistency and scalability. Apache Flink meets these criteria, making it an excellent choice for running event-driven applications.
Data pipelines
Modern IT architectures involve multiple data stores optimized for specific access patterns, often replicating the same data across systems like databases, caches, and search indexes. Synchronizing these data stores is crucial. While periodic ETL jobs have traditionally been used, they often fail to meet the low-latency demands of modern applications.
An alternative is using event logs to distribute updates, with consumers applying transformations before syncing data to target systems. Stateful stream processing, such as with Apache Flink, excels in building data pipelines, enabling low-latency ingestion, transformation, and synchronization of large data volumes across diverse storage systems.
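A rough sketch of such a pipeline in the Scala DataStream API, using the Kafka source connector from the book's Flink generation (the broker address, topic, group id, and lowercase "transformation" are all made up; a real pipeline would end in a proper sink connector such as Elasticsearch or JDBC rather than print()):

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object ClickPipeline {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumed local broker
    props.setProperty("group.id", "click-pipeline")

    // ingest raw events from a Kafka topic with low latency
    val raw = env.addSource(
      new FlinkKafkaConsumer[String]("clicks", new SimpleStringSchema(), props))

    // transform before syncing to the target system(s)
    raw.map(_.toLowerCase).print()

    env.execute("kafka click pipeline")
  }
}
```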
Streaming analytics
A streaming analytics application continuously ingests streams of events and updates its result by incorporating the latest events with low latency. This is similar to the maintenance techniques database systems use to update materialized views. Typically, streaming applications store their result in an external data store that supports efficient updates, such as a database or key-value store.
Streaming analytics offers faster integration of events into analytical results compared to traditional pipelines, which rely on separate ETL processes, storage systems, and schedulers. Stream processors like Apache Flink handle all these steps, including ingestion, computation, state maintenance, and result updates, with features like exactly-once state consistency, event-time processing, and dynamic resource scaling.
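For instance, a continuously updated count per key over one-minute windows looks roughly like this (toy data; timeWindow is the book-era shorthand, later deprecated in favor of explicit window assigners; a real app would write to a database or key-value store instead of print()):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object ClicksPerPage {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env
      .fromElements(("home", 1), ("cart", 1), ("home", 1))
      .keyBy(_._1)
      .timeWindow(Time.minutes(1)) // tumbling one-minute windows
      .sum(1)                      // result updated as events arrive
      .print()
    env.execute("clicks per page per minute")
  }
}
```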
A quick look at Flink
Apache Flink is a third-generation distributed stream processor with a competitive feature set. It provides accurate stream processing with high throughput and low latency at scale. Features that make it stand out:
- event-time and processing-time semantics. Event-time semantics provide consistent and accurate results despite out-of-order events (see the event-time sketch after this list). Processing-time semantics can be used for applications with very low latency requirements
- exactly-once state consistency guarantees
- millisecond latencies while processing millions of events per second. Flink applications can be scaled to run on thousands of cores
- layered APIs with varying tradeoffs for expressiveness and ease of use. The book covers the DataStream API and process functions, which provide primitives for common stream processing operations, such as windowing and asynchronous operations, and interfaces to precisely control state and time
- connectors to the most commonly used storage systems such as Apache Kafka, Apache Cassandra, Elasticsearch, JDBC, Kinesis, and (distributed) filesystems such as HDFS and S3
- ability to run streaming applications 24/7 with very little downtime due to its highly available setup (no single point of failure), tight integration with Kubernetes, YARN, and Apache Mesos, fast recovery from failures, and the ability to dynamically scale jobs
- ability to update the application code of jobs and migrate jobs to different Flink clusters without losing the state of the application
- detailed and customizable collection of system and application metrics to identify and react to problems ahead of time
- last but not least, Flink is also a full-fledged batch processor
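Here is the event-time sketch referenced in the list above: out-of-order readings still land in the correct window thanks to watermarks. This is my own toy example, not from the book - the Reading type and all values are made up, and setStreamTimeCharacteristic / BoundedOutOfOrdernessTimestampExtractor are the book-era (Flink 1.x) APIs, deprecated in newer releases:

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(sensorId: String, timestamp: Long, value: Double)

object EventTimeExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // use event-time semantics instead of the processing-time default
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    env
      .fromElements(
        Reading("s1", 1000L, 20.1),
        Reading("s1", 3000L, 20.5),
        Reading("s1", 2000L, 20.3)) // out of order, still windowed correctly
      .assignTimestampsAndWatermarks(
        // tolerate events arriving up to 5 seconds late
        new BoundedOutOfOrdernessTimestampExtractor[Reading](Time.seconds(5)) {
          override def extractTimestamp(r: Reading): Long = r.timestamp
        })
      .keyBy(_.sensorId)
      .timeWindow(Time.seconds(10))
      .maxBy("value")
      .print()

    env.execute("event-time windows")
  }
}
```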
At the end of the chapter there were instructions on installing Flink locally ~
Exciting! (This is a Scala build, as per the book's suggestion.) I'm excited to learn some Scala.
Submitted a job:
And once the job is running I can see logs:
That is all for today!
See you tomorrow :)