(Day 338) Going deeper into data lakehouses and Iceberg

Ivan Ivanov · December 4, 2024

Hello :) Today is Day 338!

A quick summary of today:

  • Apache Iceberg: The Definitive Guide - Chapter 5 - Iceberg Catalogs
  • Data Quality Fundamentals - Chapter 7 - Building End-to-End Lineage
  • Streamed - creating a local data lakehouse

Apache Iceberg: The Definitive Guide - Chapter 5 - Iceberg Catalogs

Requirements of an Iceberg catalog

Iceberg catalogs require implementations of basic functions: listing, creating, dropping, checking existence, and renaming tables. Catalogs must map a table path (e.g. db1.table1) to the filepath of the current metadata file, with different implementations achieving this differently, such as using a version-hint.text file in filesystem-based catalogs or a location property in Hive Metastore.

While these basic requirements enable functionality, production usage demands additional safeguards, particularly support for atomic operations. This ensures consistent table states for readers and writers and prevents data loss during concurrent writes, a critical feature for avoiding business impacts in production environments.
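
To make that contract concrete, here is a rough, purely illustrative sketch of what a catalog implementation has to provide - the class and method names below are my own, not the actual Iceberg API; the important part is the atomic compare-and-swap of the current metadata pointer.

```python
from abc import ABC, abstractmethod

class TableCatalog(ABC):
    """Illustrative sketch of the Iceberg catalog contract (not the real API)."""

    @abstractmethod
    def list_tables(self, namespace: str) -> list[str]: ...

    @abstractmethod
    def create_table(self, identifier: str, metadata_location: str) -> None: ...

    @abstractmethod
    def drop_table(self, identifier: str) -> None: ...

    @abstractmethod
    def table_exists(self, identifier: str) -> bool: ...

    @abstractmethod
    def rename_table(self, from_identifier: str, to_identifier: str) -> None: ...

    @abstractmethod
    def current_metadata_location(self, identifier: str) -> str:
        """Map a table path like 'db1.table1' to its current metadata file."""

    @abstractmethod
    def commit(self, identifier: str, expected_location: str, new_location: str) -> bool:
        """Atomically swap the metadata pointer only if it still equals
        expected_location - this is what prevents lost updates when
        several writers commit concurrently."""
```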

Catalog comparison

Hadoop

Simplest to start with, uses filesystem-based storage.

  • Pros: no external systems needed; works with various filesystems (HDFS, S3, ADLS, GCS)
  • Cons:
    • limited production suitability
    • requires atomic rename support from the underlying filesystem (e.g., S3 needs a DynamoDB-based workaround)
    • restricted to one warehouse directory
    • performance issues with many namespaces/tables
    • dropping tables also deletes data

Best for quick starts, experimentation, or non-production environments.
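
For reference, wiring up a Hadoop catalog in Spark looks roughly like this - the catalog name and warehouse path are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

# Rough sketch: a filesystem-based (Hadoop) catalog named "local".
spark = (
    SparkSession.builder
    .appName("iceberg-hadoop-catalog")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse")  # placeholder path
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db1.table1 (id BIGINT, name STRING) USING iceberg")
```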

Hive Catalog

Uses Hive Metastore for metadata management.

  • Pros:
    • compatible with many tools
    • cloud-agnostic
  • Cons:
    • requires user-managed Hive Metastore
    • no support for multitable transactions

Suitable if Hive Metastore is already part of the infrastructure and multitable transactions aren’t required.
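
Switching the same setup to a Hive catalog is mostly a matter of changing the type and pointing at the metastore - a rough sketch, with the catalog name and thrift URI as placeholders:

```python
# Rough sketch of a Hive Metastore-backed catalog; pass these via
# SparkSession.builder.config(...) or spark-defaults.conf.
hive_catalog_conf = {
    "spark.sql.catalog.hive_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hive_cat.type": "hive",
    "spark.sql.catalog.hive_cat.uri": "thrift://metastore-host:9083",  # placeholder
}
```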

AWS Glue Catalog

Managed service leveraging AWS Glue Data Catalog.

  • Pros:
    • reduces operational overhead
    • tight integration with AWS services
  • Cons:
    • no multitable transaction support
    • AWS-specific, complicating multicloud scenarios

Ideal for AWS-centric setups without multitable transaction needs.
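
A rough sketch of the Glue variant, which uses a catalog-impl instead of a built-in type - the bucket and catalog name are placeholders, and AWS credentials plus the Iceberg AWS bundle are assumed to be in place:

```python
# Rough sketch of an AWS Glue catalog configuration; names are placeholders.
glue_catalog_conf = {
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue.warehouse": "s3://my-bucket/warehouse/",
    "spark.sql.catalog.glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}
```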

Nessie

Git-like version control for data, enabling multitable transactions.

  • Pros:
    • supports multitable/multistatement transactions
    • data-as-code paradigm
    • cloud-agnostic
  • Cons:
    • limited engine/tool support
    • requires user-managed infrastructure (or hosted options like Dremio)

Excellent for cloud-agnostic setups requiring version control and transactional consistency.
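
A rough sketch of the Nessie variant - the URI, ref, and warehouse are placeholders, and branch/merge SQL additionally needs the Nessie Spark SQL extensions:

```python
# Rough sketch of a Nessie catalog configuration; values are placeholders.
nessie_catalog_conf = {
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://localhost:19120/api/v1",
    "spark.sql.catalog.nessie.ref": "main",  # the branch to work against
    "spark.sql.catalog.nessie.warehouse": "s3a://warehouse/",
}
```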

REST Catalog

Interface-based catalog using RESTful services.

  • Pros:
    • minimal dependencies
    • highly flexible and customizable
    • supports multitable transactions
    • cloud-agnostic
  • Cons:
    • requires self-managed process and data store
    • no public backend implementation; must build or use hosted services (e.g., Tabular)
    • limited engine/tool support

Best for flexibility, cloud-agnostic goals, and multitable transaction needs.
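
A rough sketch of a REST catalog configuration - the catalog name and endpoint are placeholders; the backend behind that endpoint is whatever service you build or buy:

```python
# Rough sketch of a REST catalog configuration; values are placeholders.
rest_catalog_conf = {
    "spark.sql.catalog.rest": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.rest.type": "rest",
    "spark.sql.catalog.rest.uri": "http://localhost:8181",
}
```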

JDBC Catalog

Uses JDBC-compliant databases (e.g. MySQL, PostgreSQL).

  • Pros:
    • easy setup with existing infrastructure
    • high availability with cloud-hosted databases
    • cloud-agnostic
  • Cons:
    • no multitable transaction support
    • tools must package JDBC drivers

Suitable for environments with existing JDBC databases, focusing on simplicity and availability.
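
A rough sketch of the JDBC variant against PostgreSQL - connection details are placeholders, and the JDBC driver jar has to be on the classpath:

```python
# Rough sketch of a JDBC catalog backed by PostgreSQL; values are placeholders.
jdbc_catalog_conf = {
    "spark.sql.catalog.jdbc_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.jdbc_cat.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.jdbc_cat.uri": "jdbc:postgresql://db-host:5432/iceberg_catalog",
    "spark.sql.catalog.jdbc_cat.jdbc.user": "iceberg",
    "spark.sql.catalog.jdbc_cat.jdbc.password": "change-me",
    "spark.sql.catalog.jdbc_cat.warehouse": "s3a://warehouse/",
}
```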

Catalog migration

Migrating catalogs in Apache Iceberg is straightforward due to the table data residing in data lake storage. This flexibility reduces vendor lock-in risks, future-proofs data, and allows for seamless transitions to better or more cost-effective catalog solutions. Common scenarios for migration include moving from an experimental catalog to a production-ready one, utilizing advanced features of a new catalog, or shifting environments (e.g. on-premises to cloud).

  1. lightweight migration: remapping the table path to the existing metadata file without duplicating any data; however, surrounding factors like active jobs and tool dependencies must be managed
  2. testing before full migration: tables can be registered in the new catalog for testing while still active in the current one; avoid making changes through the new catalog during this phase to prevent conflicts

Migration Methods

  • using the Iceberg CLI tool
  • using an engine (like Apache Spark)

Both methods retain the table’s full history, enabling operations like time travel and metadata inspection post-migration. Proper planning ensures smooth transitions and avoids data integrity issues.
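
For the engine route, Iceberg's register_table Spark procedure points the new catalog at the table's existing metadata file without copying any data - roughly like this, reusing the SparkSession from the Hadoop-catalog sketch above, with the catalog name, table identifier, and metadata path as placeholders:

```python
# Rough sketch: register an existing table in a new catalog (here "nessie") by
# pointing it at the table's current metadata file; the path is a placeholder.
spark.sql("""
    CALL nessie.system.register_table(
        table => 'db1.table1',
        metadata_file => 's3a://warehouse/db1/table1/metadata/00002-<uuid>.metadata.json'
    )
""")
```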

Hands-on with Iceberg - Chapters 6, 7, 8, 9

Chapter 6 was just following this notebook on the book’s repo after setting up MinIO, Dremio, and spark-iceberg. The chapter showed the SparkSQL and DataFrame APIs for a variety of READ/WRITE/UPDATE/DELETE queries - all of the SparkSQL queries are in the notebook.
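
To give a flavour of the queries from the notebook, something along these lines - reusing the SparkSession sketched earlier; the table names are placeholders:

```python
# Rough sketch of the kind of SparkSQL from the chapter; "local.db1.orders" is a placeholder.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db1.orders (id BIGINT, status STRING, amount DOUBLE)
    USING iceberg
""")
spark.sql("INSERT INTO local.db1.orders VALUES (1, 'NEW', 9.99), (2, 'SHIPPED', 19.99)")
spark.sql("UPDATE local.db1.orders SET status = 'CANCELLED' WHERE id = 1")
spark.sql("DELETE FROM local.db1.orders WHERE id = 2")
spark.sql("SELECT * FROM local.db1.orders").show()

# DataFrame API flavour of a read and a write
orders = spark.table("local.db1.orders")
orders.writeTo("local.db1.orders_copy").createOrReplace()
```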

Chapter 7 was just executing queries in Dremio’s query engine - showing its flexibility and that we can run queries there as well.

Chapter 8 - some basic setup of AWS Glue jobs for Spark ETL; however, it was very short (suspiciously short, to be honest) and it was about doing things in the AWS console, which I cannot really follow at the moment.

Chapter 9 - Flink setup and basic READ/WRITE/UPDATE/DELETE operations.
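
The PyFlink equivalent looks roughly like this - the catalog type, warehouse path, and table names are placeholders, and the Iceberg Flink runtime jar is assumed to be on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Rough sketch of Flink SQL against an Iceberg Hadoop-type catalog via PyFlink.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

t_env.execute_sql("""
    CREATE CATALOG iceberg_cat WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg-warehouse'
    )
""")
t_env.execute_sql("CREATE TABLE IF NOT EXISTS iceberg_cat.db1.events (id BIGINT, payload STRING)")
t_env.execute_sql("INSERT INTO iceberg_cat.db1.events VALUES (1, 'hello')").wait()
t_env.execute_sql("SELECT * FROM iceberg_cat.db1.events").print()
```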

I think these two videos are way better for getting practice - Building a Data Lakehouse with Apache Iceberg, Spark, Dremio, Nessie & Minio and Ingesting Data Using Apache Flink into Apache Iceberg with a Nessie Catalog (and I bookmarked them to check out on stream some time).


Data Quality Fundamentals - Chapter 7 - Building End-to-End Lineage

Building E2E field-level lineage for modern data systems

(Image: example table-level lineage with upstream and downstream connections between objects in the data warehouse and tables)

Due to the complexity of even the most basic SQL queries, however, building data lineage from scratch is no easy feat. Historically, lineage has been parsed manually, requiring an almost encyclopedic knowledge of a company’s data environment and of how its components interact with one another.

Basic lineage requirements

To enable fast value realization, data teams need field-level data lineage to quickly identify and address impacts on downstream fields and reports. A secure architecture should prioritize metadata access over direct user data or PII handling. Automation is essential for updating data assets dynamically, reducing manual effort. Effective solutions must integrate with popular data tools across the entire pipeline, including warehouses, transformation tools, and BI platforms. Column-level lineage is critical for understanding data anomalies and requires detailed parsing beyond table-level lineage approaches.

In an analytics capacity, data lineage can be used for a variety of applications, including:

  • reviewing suspicious numbers in a revenue report
  • reducing data debt
  • managing PII

Data lineage design

At a foundational level, the structure of most lineages incorporates three elements:

  • the destination table, stored in the downstream report
  • the destination fields, stored in the destination table
  • the source tables, stored in the data warehouse

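A toy sketch of how those three elements might be represented in code - the class and field names are mine, not the book's:

```python
from dataclasses import dataclass, field

# Illustrative field-level lineage record mirroring the three elements above;
# the structure and names are my own, not the book's schema.
@dataclass
class FieldLineage:
    destination_table: str                # table backing the downstream report
    destination_fields: list[str]         # fields populated in the destination table
    source_tables: dict[str, list[str]] = field(default_factory=dict)  # source table -> fields read

lineage = FieldLineage(
    destination_table="analytics.revenue_report",
    destination_fields=["monthly_revenue"],
    source_tables={"raw.orders": ["amount", "created_at"]},
)
```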

Like any agile development process, building field-level lineage is an exercise in prioritization, listening, and quick iteration. Here are some best practices when building lineage:

  • listen to your teammates and take everyone’s advice into consideration
  • invest in prototyping
  • ship and iterate

Case study: Architecting for data reliability at Fox

Alex Tverdohleb, VP of Data Services at Fox Networks, transformed the organization’s data architecture to enable self-serve analytics while ensuring reliability and trust. He developed a hybrid decentralized approach, termed “controlled freedom,” that allows internal stakeholders to access and use trustworthy data independently, without relying heavily on centralized data teams.

Key strategies include:

  • controlled freedom: setting strict parameters for data ingestion, security, and optimization, while granting users freedom to explore data for ad hoc analytics
  • decentralized team structure: five specialized teams (data tagging, engineering, analytics, science, and architecture) collaborate to meet business needs, with analysts bridging the gap between business requirements and engineering execution
  • purpose-driven technology: adopting a lakehouse architecture optimized for self-service, avoiding unnecessary “shiny” tools. Data flows through three layers—raw, optimized, and published—to ensure usability and consistency
  • data observability: leveraging tools like Monte Carlo for QA, validation, and alerting to monitor data quality and proactively address incidents, ensuring trust and transparency

Streamed - creating a local data lakehouse

I followed the blog version of the Building a Data Lakehouse with Apache Iceberg, Spark, Dremio, Nessie & Minio video and explored Dremio, Nessie, and MinIO, and looked at how data files, metadata files, and manifest files get stored after running some insert queries.
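
For my own notes, the object store layout after a couple of inserts looked roughly like this (names are illustrative placeholders):

```
warehouse/
└── db1/
    └── table1/
        ├── data/
        │   └── 00000-0-<uuid>.parquet            # data files added by commits
        └── metadata/
            ├── 00000-<uuid>.metadata.json        # one metadata file per commit
            ├── 00001-<uuid>.metadata.json
            ├── snap-<snapshot-id>-1-<uuid>.avro  # manifest list for a snapshot
            └── <uuid>-m0.avro                    # manifest listing the data files
```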

After the stream I casually watched this series of videos by the same author as the above blog, who is also one of the authors of the Apache Iceberg book. There are 16 in total and they are very well made, but they have fewer than 1000 views each :(


That is all for today!

See you tomorrow :)