(Day 334) Starting to learn about Apache Iceberg

Ivan Ivanov · November 30, 2024

Hello :) Today is Day 334!

A quick summary of today:

  • more DQ fundamentals
  • learning about Apache Iceberg
  • streamed

DQ Fundamentals - Chapter 4 - Monitoring and Anomaly Detection for Your Data Pipelines

When it comes to understanding when data breaks, our best course of action is to lean on monitoring, specifically anomaly detection techniques that identify when our expected thresholds for volume, freshness, distribution, and other values don’t meet expectations.

Up until recently, anomaly detection was considered a nice-to-have—not a need-to-have—for many data teams. Now, as data systems become increasingly complex and companies empower employees across functions to use data, it’s imperative that teams take both proactive and reactive approaches to solving for data quality.

Knowing your known unknowns and unknown unknowns


Unknown unknowns refer to data downtime that even the most comprehensive testing can’t account for, issues that arise across our entire data pipeline, not just the sections covered by specific tests. Unknown unknowns might include:

  • a distribution anomaly in a critical field that causes our Tableau dashboard to malfunction
  • a JSON schema change made by another team that turns 6 columns into 600
  • an unintended change to ETL leading to tests not running and bad data being missed
  • incomplete or stale data that goes unnoticed until several weeks later, affecting key marketing metrics
  • a code change that causes an API to stop collecting data feeding an important new product
  • data drift over time, which can be challenging to catch, particularly if our tests only look at the data being written at the time of our ETL jobs and don’t account for data that is already in a given table

While testing and circuit breakers can handle many of our known unknowns, monitoring and anomaly detection can cover our bases when it comes to unknown unknowns.

Building an anomaly detection algorithm

The first pillar of data observability we monitor for is freshness, which can give us a strong indicator of when critical data assets were last updated. If a report that is regularly updated on the hour suddenly looks very stale, this type of anomaly should give us a strong indication that something is inaccurate or otherwise wrong.

We also want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “How often is my data NULL?” In many cases, some level of incomplete data is acceptable—but if a 10% null rate turns into 90%, we’ll want to know.
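To make the freshness and null-rate checks concrete, here is a minimal Python sketch of the idea (not a production monitor): the sample rows, the one-hour staleness limit, and the 10% null threshold are all invented for illustration, and a real monitor would pull this information from the warehouse or an observability tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample of the most recent rows from a monitored table.
rows = [
    {"updated_at": datetime.now(timezone.utc) - timedelta(hours=3), "customer_id": "a1"},
    {"updated_at": datetime.now(timezone.utc) - timedelta(hours=2), "customer_id": None},
    {"updated_at": datetime.now(timezone.utc) - timedelta(hours=2), "customer_id": "b7"},
]

def freshness_alert(rows, max_staleness=timedelta(hours=1)):
    """Flag the table if the newest record is older than the allowed staleness."""
    newest = max(r["updated_at"] for r in rows)
    return datetime.now(timezone.utc) - newest > max_staleness

def null_rate_alert(rows, field, max_null_rate=0.10):
    """Flag the field if its share of NULLs exceeds the expected rate."""
    nulls = sum(1 for r in rows if r[field] is None)
    return nulls / len(rows) > max_null_rate

if freshness_alert(rows):
    print("freshness anomaly: table looks stale")
if null_rate_alert(rows, "customer_id"):
    print("distribution anomaly: null rate above threshold")
```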

Building monitors for schema and lineage

Good anomaly detection is certainly part of the data observability puzzle, but it’s not everything. Equally important is context. Knowing that a data anomaly occurred is a start, but where did it occur? Which upstream pipelines may be the cause? Which downstream dashboards will be affected? And has the formal structure of my data changed? Good data observability hinges on our ability to properly leverage metadata to answer these questions.
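To make the “has the formal structure of my data changed?” question concrete, here is a small hedged sketch of a schema-change check. The column names and types are invented; in practice the observed schema would come from the warehouse’s information schema or the catalog.

```python
# Minimal sketch of a schema-change monitor: compare the last recorded schema
# snapshot against the schema observed now. Both dicts are hypothetical.
expected_schema = {"order_id": "BIGINT", "amount": "DECIMAL", "created_at": "TIMESTAMP"}
observed_schema = {"order_id": "BIGINT", "amount": "VARCHAR", "created_at": "TIMESTAMP",
                   "coupon_code": "VARCHAR"}

added = observed_schema.keys() - expected_schema.keys()
removed = expected_schema.keys() - observed_schema.keys()
retyped = {col for col in expected_schema.keys() & observed_schema.keys()
           if expected_schema[col] != observed_schema[col]}

if added or removed or retyped:
    print(f"schema change detected: added={added}, removed={removed}, retyped={retyped}")
```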

Scaling anomaly detection with Python and ML

  • setting up monitoring tools
  • accounting for false positives and false negatives
  • improving precision and recall
  • F-scores (a quick sketch of these metrics follows this list)
  • being careful with model accuracy
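Since the chapter leans on precision, recall, and F-scores, here is a small hedged sketch of how they relate to false positives and false negatives; the labels are made up for illustration.

```python
# A detector's predictions against ground truth (invented labels).
actual    = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 1 = real anomaly
predicted = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]  # 1 = detector flagged an anomaly

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false alarms
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # missed anomalies

precision = tp / (tp + fp)   # of everything flagged, how much was real?
recall = tp / (tp + fn)      # of all real anomalies, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```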

Other useful anomaly detection approaches

  • rule definitions and hard thresholding
  • autoregressive models for time-series anomaly detection
  • exponential smoothing (combined with hard thresholding in the sketch after this list)
  • clustering
  • hyperparameter tuning
  • ensemble method approaches
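As one illustration of two approaches from this list combined (exponential smoothing plus a hard threshold on the residual), here is a minimal Python sketch; the series, alpha, and threshold are invented.

```python
# Smooth the series with simple exponential smoothing, then flag points whose
# deviation from the smoothed value exceeds a hard threshold.
series = [100, 102, 99, 101, 103, 180, 102, 100, 98, 101]  # e.g. daily row counts
alpha, threshold = 0.3, 30

smoothed = series[0]
for i, value in enumerate(series[1:], start=1):
    if abs(value - smoothed) > threshold:
        print(f"anomaly at index {i}: value={value}, expected ~{smoothed:.1f}")
    # Note: updating the baseline with an anomalous point drags it toward the
    # anomaly; a stricter variant would skip the update when a point is flagged.
    smoothed = alpha * value + (1 - alpha) * smoothed
```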

Designing DQ monitors for warehouses vs lakes

It’s important to distinguish whether we’re working with structured, monolithic data from a warehouse or entering the wild west of the modern data lake ecosystem.

The primary differences between designing anomaly detection algorithms for warehouses and lakes:

  • the number of entrypoints we have to account for
  • how the metadata is collected and stored
  • how we can access that metadata

First, data lake systems tend to have high numbers of entrypoints, meaning one should assume high heterogeneity in data entering from different sources. In monitoring, say, null rates in tabular data entering from Postgres, application logs, and a web API, a data scientist might notice clusters of table behavior corresponding to the different endpoints. More likely than not, different model architectures (and different hyperparameters) may work better at predicting anomalies in each different format. One way to do that is to condition on the endpoint of the data itself, forming a new feature for input into the machine learning model. Another is to use an ensemble model architecture, or simply to have separate models for each of our use cases.
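For the “condition on the endpoint” idea, a hedged sketch might look like the following; the source names and features are invented, and a real feature pipeline would be far richer.

```python
# Encode the data's entrypoint (e.g. Postgres, application logs, a web API) as a
# one-hot feature so a single anomaly model can condition on the source.
SOURCES = ["postgres", "app_logs", "web_api"]

def featurize(row_count, null_rate, source):
    """Numeric table-health features plus a one-hot encoding of the source endpoint."""
    one_hot = [1.0 if source == s else 0.0 for s in SOURCES]
    return [float(row_count), float(null_rate)] + one_hot

print(featurize(12_000, 0.02, "web_api"))  # -> [12000.0, 0.02, 0.0, 0.0, 1.0]
```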

Second, metadata collected straight into a data lake may need varying levels of preprocessing before we can expect an anomaly detection algorithm to derive anything of value from it. Types may need coercion, schemas may need alignment, and we may find ourselves deriving entirely new features from the data before running the detector’s training task.

Chapter 5 - Architecting for data reliability

Measuring and maintaining high data reliability at ingestion

Key practices include data cleaning, enrichment, and testing. Data cleaning removes irrelevant, incomplete, or duplicate data, while data enrichment adds value by merging first- or third-party data. Data testing validates assumptions about data through methods like unit, functional, and integration testing, checking aspects like null values, freshness, and distribution. Automating these processes using tools like Apache Griffin, Great Expectations, or dbt improves efficiency and reliability. Both proactive and reactive approaches can be used to enhance DQ, combining testing with monitoring to address unexpected data issues in production.

Measuring and maintaining DQ in the pipeline

Increasingly, data teams are coming to rely on similar principles of observability and monitoring to track data quality in production pipelines, with companies developing their own unique methodology for how to measure it, depending on the needs of the business.

Data observability can be broken down into five major pillars:

  • Freshness - Is the data recent? When was the last time it was generated? What upstream data is included/omitted?
  • Distribution - Is the data within accepted ranges? Is it properly formatted? Is it complete?
  • Volume - Has all the data arrived?
  • Schema - What is the schema, and how has it changed? Who has made these changes and for what reasons?
  • Lineage - For a given data asset, what are the upstream sources and downstream assets that are impacted by it? Who are the people generating this data, and who is relying on it for decision making?

The five pillars of data observability serve as key measurements for understanding the health of data at each stage in its life cycle, and provide a fresh lens with which to view the quality of our data.
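One way to picture the five pillars is as a per-table measurement record. The sketch below is illustrative only, with invented field names; real observability tools track much more than this.

```python
from dataclasses import dataclass, field

# Hedged sketch: one measurement record per table, roughly one field per pillar.
@dataclass
class TableObservability:
    table: str
    hours_since_last_update: float              # freshness
    null_rate_by_column: dict[str, float]       # distribution
    row_count: int                              # volume
    schema_version: str                         # schema
    upstream_sources: list[str] = field(default_factory=list)      # lineage (inputs)
    downstream_consumers: list[str] = field(default_factory=list)  # lineage (outputs)

snapshot = TableObservability(
    table="analytics.orders",
    hours_since_last_update=1.5,
    null_rate_by_column={"customer_id": 0.01},
    row_count=1_204_553,
    schema_version="v12",
    upstream_sources=["raw.orders"],
    downstream_consumers=["orders_dashboard"],
)
print(snapshot.table, snapshot.row_count)
```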

Understanding DQ downstream

DQ issues often become apparent downstream, in the analytics layer or applications. To address this, teams use tools like data reliability dashboards to monitor metrics such as time to detection (TTD) and time to resolution (TTR). Service-level agreements (SLAs), indicators (SLIs), and objectives (SLOs) help define and measure data reliability based on stakeholder needs. Data quality is traditionally evaluated on six dimensions: completeness, timeliness, validity, accuracy, consistency, and uniqueness. For data engineers, practical measures include tracking data errors, null values, timeliness, duplicates, and consistency, with a focus on aligning with stakeholders to prioritize critical data.

Developing trust in your data

Importance of trust in data

  • advanced systems are ineffective without reliable data
  • trust enables accurate insights and business value

Data observability

  • observability ensures data health across freshness, distribution, volume, schema, and lineage
  • borrowed from DevOps principles: monitoring, alerting, and triaging data issues
  • tools provide end-to-end lineage and automate monitoring without security risks

Measuring the impact of poor DQ

  • consequences include wasted time, revenue loss, compliance risks, and erosion of trust
  • key metrics to measure impact
    • time to detection (TTD): time taken to identify issues
    • time to resolution (TTR): time taken to fix issues
  • downtime costs formula
    • (TTD hours + TTR hours) × downtime hourly cost (see the worked example after this list)
  • factors affecting cost
    • labor for engineers
    • compliance risks
    • opportunity cost of lost trust
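A quick worked example of the downtime-cost formula above, with invented numbers: a four-hour time to detection, an eight-hour time to resolution, and an estimated $500 per downtime hour.

```python
def downtime_cost(ttd_hours: float, ttr_hours: float, hourly_cost: float) -> float:
    """Downtime cost = (TTD hours + TTR hours) x downtime hourly cost."""
    return (ttd_hours + ttr_hours) * hourly_cost

# Invented numbers: 4h to detect + 8h to resolve at $500/hour -> $6,000.
print(downtime_cost(ttd_hours=4, ttr_hours=8, hourly_cost=500))  # 6000
```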

Setting data reliability goals

  • inspired by DevOps principles like SLAs, SLOs, and SLIs
  • steps to establish reliability:
    • define reliability through inventory and stakeholder input
    • measure indicators like update frequency and error distribution
    • track objectives to monitor and improve reliability

Managing data incidents

  • effective incident resolution minimizes impact on stakeholders
  • communication strategies ensure alignment during issue resolution
  • borrowed practices from site reliability engineering streamline processes

Outcomes

  • improved data quality and reduced downtime
  • enhanced stakeholder trust and business alignment

Apache Iceberg: The Definitive Guide - Chapter 1 - Introduction to Apache Iceberg

Data Warehouses

DWs act as centralised repositories for large volumes of data ingested from various sources. They are designed to support analytical workloads like Business Intelligence (BI) reports and dashboards. They comprise storage, file format, table format, storage engine, catalog, and compute engine, all tightly coupled within a single system.

  • Pros: data warehouses are a single source of truth for data, enable fast query performance on vast amounts of historical data, and provide effective data governance policies
  • Cons: they often lock data into vendor-specific systems, can be expensive for storage and computation, primarily support structured data, and are not natively suited for advanced analytical workloads like ML

Data Lakes

Data lakes emerged as a cheaper alternative to data warehouses, capable of storing vast amounts of both structured and unstructured data in open formats. Data lakes utilise distributed filesystems like HDFS or cloud object storage for storage, rely on open file formats like Parquet or Avro, and employ table formats like Hive to organise data. They also leverage distributed query engines like Spark or Presto for computation.

  • Pros: offer lower storage and compute costs compared to data warehouses, store data in open formats, can handle unstructured data, and support ML use cases
  • Cons: can suffer from performance issues due to the decoupled nature of their components, require significant configuration effort, and lack built-in ACID transaction guarantees, leading to potential data inconsistency issues

Data Lakehouses

Data lakehouses represent the latest evolution, combining the best aspects of data warehouses and data lakes. They provide data warehouse-like functionalities (ACID transactions, performance) on top of data lake storage. They leverage data lake table formats like Apache Iceberg, Hudi, or Delta Lake to provide a metadata layer that enables ACID transactions, better performance, and features like time travel and schema evolution directly on data lake storage.

Benefits:

  • Fewer data copies: reducing data drift and costs
  • Faster query performance: leading to faster insights
  • Historical data snapshots: enabling rollback capabilities and reducing the impact of errors
  • Affordable architecture: due to reduced data duplication and lower storage and compute costs
  • Open architecture: avoiding vendor lock-in by using open formats like Iceberg and Parquet

Apache Iceberg

Apache Iceberg is a popular open table format for data lakehouses. It addresses limitations of older table formats like Hive by defining tables as a list of individual files instead of directories.

Key Features

  • ACID transactions: ensuring data consistency
  • Partition evolution: allowing changes to partitioning without rewriting the entire table
  • Hidden partitioning: simplifying queries and improving performance
  • Row-level table operations: providing flexibility for handling updates and deletes
  • Time travel: enabling queries on historical data snapshots (sketched after this list)
  • Version rollback: allowing reversion to previous table states
  • Schema evolution: supporting changes to table schemas
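As a hedged sketch of a few of these features in action, the snippet below issues Iceberg SQL from PySpark. It assumes a SparkSession already configured with the Iceberg runtime and SQL extensions, a catalog named demo, and a table demo.db.orders; the snapshot ID, timestamp, and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes Iceberg's Spark runtime and SQL extensions are configured for this
# session, along with a catalog named `demo` containing a table `demo.db.orders`.
spark = SparkSession.builder.appName("iceberg-features-sketch").getOrCreate()

# Time travel: query the table as of an earlier snapshot ID or timestamp.
spark.sql("SELECT COUNT(*) FROM demo.db.orders VERSION AS OF 123456789").show()
spark.sql("SELECT COUNT(*) FROM demo.db.orders TIMESTAMP AS OF '2024-11-01 00:00:00'").show()

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN coupon_code STRING")

# Partition evolution: change how new data is partitioned, without rewriting the table.
spark.sql("ALTER TABLE demo.db.orders ADD PARTITION FIELD day(order_ts)")
```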

Apache Iceberg: The Definitive Guide - Chapter 2. The Architecture of Apache Iceberg

Iceberg’s architecture is designed to overcome the limitations of traditional table formats like Hive, enabling efficient data management and analysis in data lakes.

Three-Layered Structure:


Apache Iceberg tables consist of three main layers:

  • Data layer: this stores the actual data files, typically in formats like Parquet, ORC, or Avro. Iceberg is file format agnostic, providing flexibility and future-proofing
  • Metadata layer: this crucial layer manages metadata about the table’s data and operations. It uses a tree structure comprising manifest files, manifest lists, and metadata files. This layered approach ensures efficient data access and enables key features like time travel and schema evolution
  • Catalog layer: this acts as a central registry, pointing to the current metadata file for each Iceberg table. It provides a single entry point for engines and tools to locate and interact with Iceberg tables. Various backends like Amazon S3, Hive Metastore, or Nessie can serve as an Iceberg catalog

Components of the Data Layer:

  • Datafiles: these files contain the actual data of the table, supporting various file formats like Parquet, ORC, and Avro. Parquet, with its columnar structure, is the most widely used format due to its performance advantages for OLAP workloads
  • Delete Files: these files track records that have been deleted from the dataset. They facilitate the Merge-on-Read (MOR) approach for updates and deletes, where changes are tracked separately and merged during query execution. Two types of delete files exist:
    • Positional delete files: identify deleted rows by their exact position in the data file
    • Equality delete files: identify deleted rows based on the values of specific fields

Components of the Metadata Layer:

  • Manifest files: these track the data files and delete files within the data layer, including their location and statistics such as record counts and column bounds. Maintaining these statistics at the file level enables Iceberg to overcome the limitations of Hive’s table format, leading to significant performance improvements
  • Manifest lists: a manifest list represents a snapshot of the Iceberg table at a specific point in time. It contains a list of all active manifest files, their locations, partition information, and statistics
  • Metadata files: these store high-level metadata about the Iceberg table, including schema information, partition specifications, snapshots, and the current snapshot. Every change to the table results in a new metadata file, ensuring a linear history of table commits and facilitating concurrent writes
  • Puffin files: while datafiles and metadata files already contain statistics to enhance query performance, Puffin files introduce additional structures for advanced performance optimization. These files store specialized statistics and indexes, such as Theta sketches from the Apache DataSketches library, to accelerate queries that require distinct value counts or other specific computations
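The metadata layer described above is also directly queryable: Iceberg exposes metadata tables (such as snapshots, manifests, and files) alongside each table. A hedged sketch, reusing the assumed demo.db.orders table from the earlier snippet:

```python
from pyspark.sql import SparkSession

# Assumes the same Iceberg-enabled Spark session and `demo.db.orders` table as before.
spark = SparkSession.builder.appName("iceberg-metadata-sketch").getOrCreate()

# Snapshots recorded in the metadata files (one row per table snapshot).
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.orders.snapshots").show()

# Manifest files tracked by the current snapshot's manifest list.
spark.sql("SELECT path, added_data_files_count FROM demo.db.orders.manifests").show()

# Datafiles in the data layer, with per-file statistics.
spark.sql("SELECT file_path, record_count, file_size_in_bytes FROM demo.db.orders.files").show()
```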

Stream

I streamed and read/watched a few things about Apache Iceberg. It seems very intriguing, and I will keep looking into it.


I kind of fell asleep and forgot to post this ~ so here it is at 1am 😆

That is all for today!

See you tomorrow :)