(Day 242) PySpark day

Ivan Ivanov · August 30, 2024

Hello :) Today is Day 242!

A quick summary of today:

  • reading about ‘the modern data stack’
  • successful interview for the Hannam Design Factory

Big Data on K8s: Chapter 4 - The Modern Data Stack

The Lambda architecture

  • Batch layer: Responsible for managing the master dataset. This layer ingests and processes data in bulk at regular intervals, typically every 24 hours. Once processed, the batch views are considered immutable and stored.
  • Speed layer: Responsible for recent data that has not yet been processed by the batch layer. This layer processes data in real time as it arrives to provide low-latency views.
  • Serving layer: Responsible for responding to queries by merging views from both the batch and speed layers.

image

The Kappa architecture

The Kappa architecture handles all data processing through a single stream processing pathway. The key components include the following:

  • Stream processing layer: Responsible for ingesting and processing all data as streams. This layer handles both historical data (via replay of logs/files) and new incoming data.
  • Serving layer: Responsible for responding to queries by accessing views produced by the stream processing layer.

image

Lambda vs Kappa

Feature      | Lambda Architecture                                | Kappa Architecture
Complexity   | Manages separate batch and real-time systems       | Consolidates processing through streams
Reprocessing | Reprocesses historical batches                     | Relies on stream replay and algorithms
Latency      | Lower latencies for recent data in the speed layer | Same latency for all data
Throughput   | Leverages batch engines optimized for throughput   | Processes all data as streams

In practice, modern data architectures often blend these approaches. For example, a batch layer on Lambda may run only weekly or monthly while real-time streams fill the gap. Kappa may leverage small batches within its streams to optimize throughput. The core ideas around balancing latency, throughput, and reprocessing are shared.

For data lakes, Lambda provides a proven blueprint while Kappa offers a powerful alternative. While some may argue that Kappa offers simpler operations, it is hard to implement and its costs can grow rapidly with scale. Another advantage of Lambda is that it is fully adaptable: we can implement only the batch layer if no data streaming is necessary (or financially viable).

Data lake builders should understand the key principles of each to craft optimal architectures aligned to their businesses, analytics, and operational needs. By leveraging the scale and agility of cloud infrastructure, modern data lakes can implement these patterns to handle today’s data volumes and power advanced analytics.

Data lake design for big data

Data Warehouses

Data warehouses have been essential for business intelligence by storing structured data from various sources optimized for reporting. However, they face limitations such as difficulties with unstructured data, rigid schema-on-write designs, and latency due to batch processing. This makes them less flexible in handling large, diverse, and real-time data, leading to the rise of more adaptive architectures.

The Rise of Big Data and Data Lakes

Data lakes address the limitations of traditional data warehouses by enabling large-scale storage of diverse data types (structured, unstructured, and semi-structured) in native formats. They use a schema-on-read approach, allowing flexibility in data usage. However, data lakes face challenges like governance issues, complex data preparation, and inefficiencies in handling data modifications and schema management.

The Rise of the Data Lakehouse

Data lakehouses combine the strengths of both data warehouses and data lakes. They support diverse data types, offer SQL analytics, and ensure ACID transactions. Lakehouses allow schema enforcement, data updates, and querying within a unified system, streamlining the analytics life cycle and improving agility.

Lakehouse Storage Layers

  • Bronze Layer: Raw data stored as received, acting as the source of truth.
  • Silver Layer: Refined and standardized data, ready for analysis.
  • Gold Layer: Aggregated and curated data models for business intelligence.
  • Landing Zone (optional): Prepares raw data before entering the bronze layer.

Implementing the lakehouse architecture

Below is a possible implementation of a data lakehouse architecture in a Lambda design.

image

The first group on the left represents the possible data sources to work with this architecture. One of the key advantages of this approach is its ability to ingest and store data from a wide variety of sources and in diverse formats. As shown in the diagram, the data lake can connect to and integrate structured data from databases as well as unstructured data such as API responses, images, videos, XML, and text files.

This schema-on-read approach allows the raw data to be loaded quickly without needing upfront modeling, making the architecture highly scalable. When analysis is required, the lakehouse layer enables querying across all these datasets in one place using schema-on-query. This makes it simpler to integrate data from disparate sources to gain new insights.

The separation of loading from analysis also enables iterative analytics as a new understanding of the data emerges. Overall, the modern data lakehouse is optimized for rapidly landing multi-structured and multi-sourced data while also empowering users to analyze it flexibly.
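As a rough sketch of how the bronze, silver, and gold layers could be populated with PySpark (my own example, not from the book; the storage paths and column names are made up, and a real lakehouse would typically write to a table format such as Delta Lake or Apache Iceberg rather than plain Parquet):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Bronze: land raw JSON exactly as received (schema-on-read, hypothetical paths)
raw = spark.read.json("s3a://my-lake/landing/events/")
raw.write.mode("append").parquet("s3a://my-lake/bronze/events/")

# Silver: refine and standardize the raw data
bronze = spark.read.parquet("s3a://my-lake/bronze/events/")
silver = (
    bronze.dropDuplicates(["event_id"])
          .withColumn("event_date", F.to_date("event_ts"))
)
silver.write.mode("overwrite").parquet("s3a://my-lake/silver/events/")

# Gold: aggregate into a curated model for BI queries
gold = silver.groupBy("event_date").count()
gold.write.mode("overwrite").parquet("s3a://my-lake/gold/daily_event_counts/")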

Chapter 5 - Big Data Processing with Apache Spark

image

The centerpiece that coordinates the Spark application is called the driver program. The driver program instantiates a SparkSession object that integrates directly with a Spark context. The Spark context connects to a cluster manager that can provision resources across a computing cluster. When running locally, an embedded cluster manager runs within the same Java Virtual Machine (JVM) as the driver program. In production, however, Spark should be configured to use a dedicated cluster resource manager such as YARN or Mesos.

The cluster manager is responsible for allocating computational resources and isolating computations on the cluster. When the driver program requests resources, the cluster manager launches Spark executors to perform the required computations.

Spark executors are processes launched on worker nodes in the cluster by the cluster manager. They run computations and store data for the Spark application. Each application has its own executors that stay up for the duration of the whole application and run tasks in multiple threads. Spark executes code snippets called tasks to perform distributed data processing.

A Spark job is triggered when the Spark program executes an action. Each job gets divided into smaller sets of tasks called stages, which depend on each other.
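To make this concrete, here is a minimal local sketch (my own, not from the book) of a driver program creating a SparkSession and triggering a job; the app name and master setting are only illustrative.

from pyspark.sql import SparkSession

# The driver program starts here: building the SparkSession also creates the
# underlying SparkContext, which talks to the cluster manager.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[*]")   # embedded cluster manager inside the driver JVM
    # .master("yarn")     # in production, point to a dedicated cluster manager instead
    .getOrCreate()
)

# Each action below is a job; Spark splits it into stages made of tasks that
# the cluster manager schedules onto executors.
df = spark.range(1_000_000)
print(df.count())

spark.stop()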

The DataFrame API and the Spark SQL API

Spark provides different APIs built on top of the core RDD API (the native, low-level Spark language) to make it easier to develop distributed data processing applications. The two most popular higher-level APIs are the DataFrame API and the Spark SQL API.

The DataFrames API provides a domain-specific language to manipulate distributed datasets organized into named columns. Conceptually, it is equivalent to a table in a relational database or a DataFrame in Python pandas, but with richer optimizations under the hood. The DataFrames API enables users to abstract data processing operations behind domain-specific terminology such as grouping and joining instead of thinking in map and reduce operations.

The Spark SQL API builds further on top of the DataFrames API by exposing Spark SQL, a Spark module for structured data processing. Spark SQL allows users to run SQL queries against DataFrames to filter or aggregate data. The SQL queries get optimized and translated into native Spark code to be executed. This makes it easy for users familiar with SQL to run ad hoc queries against data.

Both APIs rely on the Catalyst optimizer, which leverages advanced programming techniques such as predicate pushdown, projection pruning, and a variety of join optimizations to build efficient query plans before execution. This differentiates Spark from other distributed data processing frameworks by optimizing queries based on business logic instead of on hardware considerations.
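As a small illustration (my own sketch, not from the book), the same aggregation can be written with the DataFrame API or as a Spark SQL query, and calling .explain() shows the Catalyst-optimized plan behind both; the sales rows and column names are made up.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-demo").getOrCreate()

# Made-up sales data, just for illustration
sales = spark.createDataFrame(
    [("EU", 100.0), ("US", 250.0), ("EU", 80.0)],
    ["region", "amount"],
)

# DataFrame API: domain-specific operations instead of raw map/reduce thinking
by_region_df = sales.groupBy("region").agg(F.sum("amount").alias("total"))

# Spark SQL API: the same aggregation expressed as a SQL query
sales.createOrReplaceTempView("sales")
by_region_sql = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Both run through the Catalyst optimizer and produce equivalent physical plans
by_region_df.explain()
by_region_sql.explain()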

When working with Spark SQL and the DataFrames API, it is important to understand some key concepts that allow Spark to run fast, optimized data processing. These concepts are transformations, actions, lazy evaluation, and data partitioning (a short sketch follows the list below).

  • Transformations define computations that will be done, while actions trigger the actual execution of those transformations
  • While transformations describe operations on DataFrames, actions actually execute the computation and return results
  • Lazy evaluation is a key technique that allows Apache Spark to run efficiently. As mentioned previously, when you apply transformations to a DataFrame, no actual computation happens at that time. Instead, Spark internally records each transformation as an operation to apply to the data. The actual execution is deferred until an action is called
  • Data partitioning - Spark’s speed comes from its ability to distribute data processing across a cluster. To enable parallel processing, Spark breaks up data into independent partitions that can be processed in parallel on different nodes in the cluster. When you read data into a Spark DataFrame or RDD, the data is divided into logical partitions. On a cluster, Spark will then schedule task execution so that partitions run in parallel on different nodes. Each node may process multiple partitions. This allows the overall job to process data much faster than if run sequentially on a single node. Understanding data partitioning in Spark is key to understanding the differences between narrow and wide transformations.
    • Narrow transformations are operations that can be performed on each partition independently without any data shuffling across nodes. Examples include map, filter, and other per-record transformations. These allow parallel processing without network traffic overhead
    • Wide transformations require data to be shuffled between partitions and nodes. Examples include groupBy aggregations, joins, sorts, and window functions. These involve either combining data from multiple partitions or repartitioning data based on a key
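Here is the sketch mentioned above: a minimal PySpark example of my own showing lazy transformations, an action triggering execution, and the narrow/wide distinction.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)   # data split into 8 partitions

# Narrow transformations: each partition is processed independently, and nothing runs yet
evens = df.filter(F.col("id") % 2 == 0).withColumn("bucket", F.col("id") % 10)

# Wide transformation: groupBy shuffles data between partitions
counts = evens.groupBy("bucket").count()

# Still no computation has happened; Spark has only recorded the plan (lazy evaluation)
counts.explain()

# Actions trigger execution of the whole recorded plan
counts.show()
print(counts.count())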

Chapter 6 - Building Pipelines with Apache Airflow

Airflow architecture

Airflow is composed of different components that fit together to provide a scalable and reliable orchestration platform for data pipelines.

At a high level, Airflow has the following:

  • a metadata database that stores state for DAGs, task instances, XComs, and so on
  • a web server that serves the Airflow UI
  • a scheduler that handles triggering DAGs and task instances
  • executors that run task instances
  • workers that execute tasks
  • other components, such as the CLI

image

Airflow relies heavily on the metadata database as the source of truth for state. The web server, scheduler, and worker processes all talk to this database. When you look at the Airflow UI, it is simply querying this database underneath to get the information it displays.

The metadata database is also used to enforce certain constraints. For example, the scheduler uses database locks when examining task instances to determine what to schedule next. This prevents race conditions between multiple scheduler processes.

Benefits of the distributed architecture

  • Separation of concerns: Each component focuses on a specific job. The scheduler handles examining DAGs and scheduling. The worker runs task instances. This separation of concerns keeps components simple and maintainable.
  • Scalability: Components such as the scheduler, worker, and database can be easily scaled out. Run multiple schedulers or workers as your workload grows. Leverage a hosted database for automatic scaling.
  • Reliability: If one scheduler or worker dies, there is no overall outage since components are decoupled. The single source of truth in the database also provides consistency across Airflow.
  • Extensibility: You can swap out certain components, such as the executor or queueing service.

Example

Using astro, an example pipeline is created.

Initialize a new project with astro dev init.

Start it with astro dev start.

This starts an Airflow webserver and a Postgres database.

image

Sample pipeline:

from airflow.decorators import task, dag
from airflow.operators.dummy import DummyOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
import requests
import pandas as pd

default_args = {
    'owner': 'Ney',
    'start_date': datetime(2024, 2, 12)
}

@dag(
        default_args=default_args, 
        schedule_interval="@once", 
        description="Simple Pipeline with Titanic", 
        catchup=False, 
        tags=['Titanic']
)
def titanic_processing():

    # Task Definition
    start = DummyOperator(task_id='start')

    @task
    def first_task():
        print("And so, it begins!")

    @task
    def download_data():
        destination = "/tmp/titanic.csv"
        response = requests.get(
            "https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv", 
            stream=True
        )
        with open(destination, mode="wb") as file:
            file.write(response.content)
        return destination

    @task
    def analyze_survivors(source):
        df = pd.read_csv(source, sep=";")
        res = df.loc[df.Survived == 1, "Survived"].sum()
        print(res)

    @task
    def survivors_sex(source):
        df = pd.read_csv(source, sep=";")
        res = df.loc[df.Survived == 1, ["Survived", "Sex"]].groupby("Sex").count()
        print(res)

    last = BashOperator(
        task_id="last_task",
        bash_command='echo "This is the last task performed with Bash."',
    )

    end = DummyOperator(task_id='end')

    # Orchestration
    first = first_task()
    downloaded = download_data()
    start >> first >> downloaded
    surv_count = analyze_survivors(downloaded)
    surv_sex = survivors_sex(downloaded)

    [surv_count, surv_sex] >> last >> end

execution = titanic_processing()

image

After running it:

image

Real-time events and data ingestion using Kafka

Kafka’s architecture

Kafka’s architecture is distributed and consists of brokers, producers, consumers, topics, partitions, and replicas. Producers publish messages to topics, which are managed by brokers. Consumers subscribe to topics to process messages. Kafka uses Zookeeper for cluster management, including broker coordination and leader election.

Kafka follows a publish-subscribe (PubSub) design where producers write data to topics, and consumers read from them. Topics are split into partitions to allow parallelism and scalability, with partitions replicated across brokers for fault tolerance. Consumers are organized into groups, ensuring each message is processed by one consumer per group. Kafka guarantees ordered, at-least-once message delivery within a partition.

Producers batch messages for efficiency and durability, choosing acknowledgment guarantees. Consumers track their reading position using offsets, which allow them to resume from the correct place after failures. Offsets also enable multiple consumers to read the same partition independently.

Kafka can deliver exactly-once semantics by tightly integrating offset tracking with event delivery, ensuring each record is processed exactly once. This is achieved through careful coordination between storage, processing, and Kafka’s infrastructure.

image
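As a quick sketch of this publish-subscribe flow (my own example using the kafka-python client; the broker address, topic name, and consumer group are made up):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic (batched and flushed for durability)
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("events", key=b"user-1", value=b'{"action": "click"}')
producer.flush()

# Consumer: read from the topic as part of a consumer group, tracking offsets
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    consumer.commit()   # commit the offset so the group can resume after a failure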

Kafka Connect

Kafka Connect is a framework within the Apache Kafka ecosystem that facilitates the integration of Kafka with external systems, such as databases, message queues, and storage platforms. It simplifies the process of streaming data between Kafka and these systems by providing a standardized way to move large volumes of data into and out of Kafka.

  1. Scalability and Distributed Nature:
    • Kafka Connect is designed to be distributed and scalable. It can run as a standalone process or as a distributed cluster, enabling it to handle large-scale data integration tasks.
    • In distributed mode, Kafka Connect automatically balances the data load and handles connector tasks across multiple nodes.
  2. Source and Sink Connectors:
    • Source Connectors: Import data from external systems into Kafka topics. For example, a source connector could pull data from a database and publish it to Kafka.
    • Sink Connectors: Export data from Kafka topics to external systems. For example, a sink connector could send data from Kafka to a data warehouse or file system.
  3. Schema Management:
    • Kafka Connect supports schema management, allowing for the validation and transformation of data as it moves between systems. This ensures data consistency and format are maintained across the pipeline.
  4. Fault Tolerance and Resilience:
    • Kafka Connect is designed for high availability. In distributed mode, it can automatically recover from failures by restarting tasks on other nodes.
    • It also supports offset management, allowing connectors to resume data streaming from where they left off in case of failures.
  5. Configuration and Management:
    • Kafka Connect simplifies configuration and management through its REST API. Administrators can manage connectors, tasks, and workers programmatically using simple HTTP requests (a short sketch follows this list).
  6. Transforms:
    • Kafka Connect supports Single Message Transforms (SMTs), which are lightweight transformations applied to records as they flow through connectors. SMTs allow for modifying, filtering, or enriching data before it is published to Kafka or written to an external system.
  7. Compatibility with Confluent Hub:
    • A wide range of pre-built connectors are available on the Confluent Hub, an online repository of connectors for Kafka Connect. These connectors cover various use cases, from databases to cloud storage, making it easy to integrate Kafka with many different systems.
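Here is the sketch of the REST API point above: registering a connector from Python (my own example; the worker URL, connector name, file path, and topic are made up, and FileStreamSourceConnector is just the simple example connector that ships with Kafka).

import requests

connect_url = "http://localhost:8083/connectors"   # assumed Connect worker address

connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "file-lines",
    },
}

# Register the connector
resp = requests.post(connect_url, json=connector)
resp.raise_for_status()

# List connectors and check the status of the new one
print(requests.get(connect_url).json())
print(requests.get(f"{connect_url}/file-source-demo/status").json())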

Databricks Certified Associate Developer for Apache Spark Using Python: Chapter 1 - learning about the exam

The chapter provides a guide to preparing for the Spark certification exam. It outlines a step-by-step methodology to approach the exam, starting with an overview of the exam structure, which consists of 60 questions to be answered in 120 minutes. To pass, a score of 70% (42 correct answers) is required.

The exam covers three main topics: Spark architecture concepts, Spark architecture applications, and Spark DataFrame API applications, with the last constituting 72% of the exam. The book emphasizes the importance of focusing on the DataFrame API while also covering architecture concepts for easy points.

Resources such as mock exams and Databricks practice exams are recommended. During the exam, access to Spark documentation is provided via Webassessor, and familiarity with this interface is advised.

The chapter also discusses the online proctored exam process, which includes webcam monitoring and specific requirements for the exam environment.

Different types of questions in the exam include theoretical questions (e.g., explanation, connection, scenario, categorization, and configuration questions) and code-based questions (e.g., function identification, fill-in-the-blank, and ordering code).

Databricks Certified Associate Developer for Apache Spark Using Python: Chapter 2 - Understanding Apache Spark and Its Applications

I have no plans to take this certification exam, but covering the material is still a good idea.

Components of Spark

  • Spark Core: the backbone of Spark, provides core functionality and features
  • Spark SQL: provides SQL support for Spark, allows for querying data in RDDs and external sources
  • Spark Streaming: enables real-time data processing and ingestion from multiple sources
  • Spark MLlib: provides a framework for distributed and scalable machine learning
  • GraphX: provides a graph processing API for Spark

Why Choose Apache Spark?

  • Speed: one of the fastest processing frameworks for data
  • Reusability: can join batch and stream data seamlessly
  • In-memory computation: eliminates overhead of reading and writing to disks
  • Unified platform: integrates multiple components for a unified experience

Use Cases of Spark

  • Big data processing: handles high-volume, high-velocity, and high-variety data
  • Machine learning applications: provides a framework for distributed and scalable machine learning
  • Real-time streaming: enables real-time data processing and ingestion
  • Graph analytics: provides a graph processing API for Spark

Spark Users

Data Analysts

  • Gather requirements from stakeholders
  • Identify relevant data sources
  • Collaborate with subject matter experts
  • Slice and dice data
  • Generate reports
  • Share results

Data Engineers

  • Create, maintain, and optimize data pipelines
  • Collaborate with data analysts and data scientists

Data Scientists

  • Create and test hypotheses
  • Transform data
  • Decide on machine learning algorithm
  • Prototype with different machine learning models
  • Create a baseline model
  • Tune the model
  • Tune the data
  • Transition models to production

Machine Learning Engineers

  • Deploy machine learning models to production environments
  • Build machine learning pipelines
  • Monitor model performance
  • Handle data drift and model drift

Sample Qs from Chapter 3: Spark Architecture

Normally, I would briefly talk about it, but given that the previous book I read today already covered this (Chapter 5, to be specific), I do not want to write the same thing twice.

Below are some sample Qs related to Spark’s architecture.

Question 1:

What’s true about Spark’s execution hierarchy?

  1. In Spark’s execution hierarchy, a job may reach multiple stage boundaries.
  2. In Spark’s execution hierarchy, manifests are one layer above jobs.
  3. In Spark’s execution hierarchy, a stage comprises multiple jobs.
  4. In Spark’s execution hierarchy, executors are the smallest unit.
  5. In Spark’s execution hierarchy, tasks are one layer above slots.

Question 2:

What do executors do?

  1. Executors host the Spark driver on a worker-node basis.
  2. Executors are responsible for carrying out work that they get assigned by the driver.
  3. After the start of the Spark application, executors are launched on a per-task basis.
  4. Executors are located in slots inside worker nodes.
  5. The executors’ storage is ephemeral and as such it defers the task of caching data directly to the worker node thread.

Chapter 4: Spark DataFrames and their Operations

This chapter just introduced the basic PySpark API for the following (a short sketch follows the list):

  • The Spark DataFrame API
  • Creating DataFrames
  • Viewing DataFrames
  • Manipulating DataFrames
  • Aggregating DataFrames
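Here is the sketch mentioned above: a minimal PySpark example of my own (the rows and column names are made up) that touches each of these points.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# Creating a DataFrame from in-memory rows
df = spark.createDataFrame(
    [("Alice", "HR", 3000), ("Bob", "IT", 4000), ("Cara", "IT", 5000)],
    ["name", "dept", "salary"],
)

# Viewing
df.show()
df.printSchema()

# Manipulating: select, filter, and add a derived column
high_paid = (
    df.select("name", "dept", "salary")
      .filter(F.col("salary") > 3500)
      .withColumn("bonus", F.col("salary") * 0.1)
)

# Aggregating: average salary per department
high_paid.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()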

Sample Q:

Which of the following operations will trigger evaluation?

  1. DataFrame.filter()
  2. DataFrame.distinct()
  3. DataFrame.intersect()
  4. DataFrame.join()
  5. DataFrame.count()

MLP by hand

Today I found this nice little Excel sheet that teaches doing forward propagation in a neural network by hand, basically to practice matrix multiplication.

image
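To mirror the spreadsheet exercise in code, here is a tiny NumPy sketch of a forward pass with made-up weights; the layer sizes and numbers are arbitrary.

import numpy as np

# Tiny forward pass by hand: 2 inputs -> 3 hidden units -> 1 output
x = np.array([[1.0, 2.0]])             # shape (1, 2): one sample, two features

W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])       # shape (2, 3)
b1 = np.array([[0.1, 0.1, 0.1]])       # shape (1, 3)

W2 = np.array([[0.7], [0.8], [0.9]])   # shape (3, 1)
b2 = np.array([[0.2]])                 # shape (1, 1)

def relu(z):
    return np.maximum(0.0, z)

h = relu(x @ W1 + b1)   # hidden layer: (1, 2) @ (2, 3) -> (1, 3)
y = h @ W2 + b2         # output layer: (1, 3) @ (3, 1) -> (1, 1)
print(h, y)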

Hannam Design Factory interview

This morning, I went to the department and met with the team leader, Ricky Park. We had a short talk about myself, my background in programming and the projects I have developed, and finally some info about the projects I might be doing when I take the practice course in the department. The course itself is worth 9 credits (three times the credits of a usual course) and is about taking on projects needed by actual companies and making them a reality. So from next week, every Friday from 9am to 6pm, I will be doing this ‘class’. 🥳


That is all for today!

See you tomorrow :)