(Day 264) Day 5 of the DeepLearning.AI Data Engineering Professional Certificate course

Ivan Ivanov · September 21, 2024

data-eng theory applying-knowledge

Hello :) Today is Day 264!

A quick summary of today:

continued with weeks 2 and 3 of course 3

Week 2

Data Warehouses

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions.

Data warehouse implementation

Data lakes

Shortcomings of data lakes 1.0

Data lakes became data swamps: no proper data management, no data cataloging, no discovery tools, and no guarantees about data integrity or quality.
Data lake 1.0 was effectively write-only: data manipulation operations (such as updating or deleting records) were hard to implement, which made it difficult to comply with regulations like GDPR.
No schema management or data modeling (joins were a huge headache).

Next-gen data lakes

In response to the limitations of data lake 1.0, engineers developed strategies to improve data management and retrieval. This part of the course discusses three main approaches: data zones, partitioning, and data catalogs.

Data Zones

Data can be organized into different zones within a data lake, often following a three-zone model:

Landing/Raw Zone: Stores raw data ingested from source systems, providing a permanent record.
Cleaned/Transformed Zone: Contains data that has been cleaned, validated, and standardized, with any PII removed.
Curated/Enriched Zone: Holds data modeled with business logic, ready for consumption and stored in efficient open formats like Parquet, Avro, or ORC.

While the number and naming of zones can vary, they allow for appropriate data governance and ensure data quality for users.
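To make the three-zone flow concrete for myself, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration (the sample records, the `PII_FIELDS` set, and the function names are hypothetical, not from the course), and a real lake would write files to object storage rather than keep lists in memory.

```python
# Hypothetical raw records, kept exactly as they arrived from the source system.
RAW_ZONE = [
    {"order_id": "1", "email": "ana@example.com", "amount": " 19.90 ", "country": "bg"},
    {"order_id": "2", "email": "bob@example.com", "amount": "5.10", "country": "BG"},
    {"order_id": "3", "email": "eve@example.com", "amount": "not-a-number", "country": "de"},
]

PII_FIELDS = {"email"}  # assumption: which fields count as PII in this example


def to_cleaned_zone(raw_records):
    """Validate and standardize records and drop PII; invalid rows are skipped."""
    cleaned = []
    for record in raw_records:
        try:
            amount = float(record["amount"].strip())
        except ValueError:
            continue  # fails validation; the row still exists in the raw zone
        row = {k: v for k, v in record.items() if k not in PII_FIELDS}
        row["order_id"] = int(row["order_id"])
        row["amount"] = amount
        row["country"] = row["country"].upper()
        cleaned.append(row)
    return cleaned


def to_curated_zone(cleaned_records):
    """Apply (hypothetical) business logic: total revenue per country."""
    totals = {}
    for row in cleaned_records:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return {country: round(total, 2) for country, total in totals.items()}


cleaned = to_cleaned_zone(RAW_ZONE)
curated = to_curated_zone(cleaned)
print(cleaned)
print(curated)
```

The raw list is never modified, which mirrors the idea of the landing zone as a permanent record: if the cleaning rules change, the later zones can be rebuilt from it.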

Data Partitioning

To improve query performance, data in the cleaned or curated zones can be partitioned based on criteria such as time, date, or location. This technique enables query engines to scan only relevant partitions, resulting in faster performance.
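A rough sketch of the idea, using only the Python standard library. It assumes the common `key=value` directory layout (often called Hive-style partitioning) and CSV files for simplicity; the helper names `write_partitioned` and `read_partition` and the sample events are made up for this example. Real engines work on formats like Parquet and do the pruning for you.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write each row under a key=value/ subdirectory of root."""
    root = Path(root)
    for row in rows:
        part_dir = root / f"{partition_key}={row[partition_key]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "data.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(row))
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def read_partition(root, partition_key, value):
    """Read only the files in one partition; other directories are never opened."""
    part_dir = Path(root) / f"{partition_key}={value}"
    rows = []
    for path in sorted(part_dir.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


# Hypothetical events, partitioned by date.
events = [
    {"event_date": "2024-09-20", "user": "a", "action": "login"},
    {"event_date": "2024-09-21", "user": "b", "action": "login"},
    {"event_date": "2024-09-21", "user": "a", "action": "purchase"},
]

lake_root = tempfile.mkdtemp()  # stand-in for an object-store prefix
write_partitioned(events, lake_root, "event_date")
todays_rows = read_partition(lake_root, "event_date", "2024-09-21")
print(todays_rows)
```

A query filtered on `event_date` only touches one directory, so the amount of data scanned grows with the size of that partition rather than the whole dataset.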

Data Catalogs

A data catalog serves as a centralized collection of metadata about datasets, allowing users to search for information based on data owner, source, partitioning details, and column definitions. It maintains schemas and records changes over time, fostering a common understanding of data structure across the organization.
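As a toy illustration of what a catalog stores, here is a small in-memory version. The `CatalogEntry` and `DataCatalog` classes, their fields, and the sample dataset are all hypothetical; managed catalog services expose much richer APIs, so treat this only as a sketch of the concept.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    source: str
    partitioned_by: list
    schema_versions: list = field(default_factory=list)  # oldest first

    @property
    def current_schema(self):
        return self.schema_versions[-1]


class DataCatalog:
    """Minimal metadata store: search by attribute, keep schema history."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def get(self, name):
        return self._entries[name]

    def update_schema(self, name, new_schema):
        # Append instead of overwrite so earlier versions stay on record.
        self._entries[name].schema_versions.append(new_schema)

    def search(self, **criteria):
        """Return names of datasets whose metadata matches every criterion."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if all(getattr(entry, key) == value for key, value in criteria.items())
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders_cleaned",
    owner="sales-team",
    source="orders_db",
    partitioned_by=["event_date"],
    schema_versions=[{"order_id": "int", "amount": "float"}],
))
catalog.update_schema(
    "orders_cleaned",
    {"order_id": "int", "amount": "float", "country": "string"},
)

print(catalog.search(owner="sales-team"))
print(catalog.get("orders_cleaned").current_schema)
```

Keeping every schema version, rather than only the latest, is what lets the catalog record how a dataset's structure changed over time.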

The Data Lake House Architecture

Despite these advancements, organizations often relied on multiple storage systems to meet different needs, combining the low-cost storage of data lakes with the superior query performance of data warehouses. Moving data between the two systems required costly ETL processes, which introduced risks of data quality issues.

To address these challenges, the Data Lake House architecture was developed, merging the benefits of data warehouses and data lakes into a unified solution.