(Day 264) Day 5 of the DeepLearning.AI Data Engineering Professional Certificate course

Ivan Ivanov · September 21, 2024

Hello :) Today is Day 264!

A quick summary of today:

  • continued with weeks 2 and 3 of course 3

Week 2

Data Warehouses

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions.

image

image

image

Data warehouse implementation

image

image

Data lakes

image

Shortcomings of data lakes 1.0

  • data lakes became data swamps: no proper data management, no data cataloging, no discovery tools, no guarantee on the data integrity and quality
  • data lakes 1.0 were write-only: data manipulation operations (such as updating or deleting records) were hard to implement, which made it difficult to comply with regulations (like GDPR)
  • no schema management and data modelling (joins were a huge headache)

Next-gen data lakes

In response to the limitations of data lakes 1.0, engineers developed strategies to improve data management and retrieval. The video discussed three main approaches: data zones, partitioning, and data catalogs.

Data Zones

Data can be organized into different zones within a data lake, often following a three-zone model:

  1. Landing/Raw Zone: Stores raw data ingested from source systems, providing a permanent record.
  2. Cleaned/Transformed Zone: Contains data that has been cleaned, validated, and standardized, with any PII removed.
  3. Curated/Enriched Zone: Holds data modeled with business logic, ready for consumption and stored in efficient open formats like Parquet, Avro, or ORC.

While the number and naming of zones can vary, they allow for appropriate data governance and ensure data quality for users.
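To make the three-zone flow concrete for myself, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the record fields, the `to_cleaned` and `to_curated` helpers, and the choice of "revenue per country" as the business logic. In a real data lake the zones would be storage locations (for example, prefixes in object storage), not Python lists.

```python
# Hypothetical raw records, kept exactly as they arrived from the source system.
RAW_ZONE = [
    {"order_id": "1", "email": "ana@example.com", "amount": "20.50", "country": "us"},
    {"order_id": "2", "email": "bo@example.com", "amount": "4.50", "country": "US"},
    {"order_id": "3", "email": "cy@example.com", "amount": "not-a-number", "country": "de"},
]


def to_cleaned(record):
    """Validate and standardize one raw record; return None if it is invalid."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None  # fails validation, so it never reaches the cleaned zone
    # The email field (PII) is deliberately not copied over.
    return {
        "order_id": int(record["order_id"]),
        "amount": amount,
        "country": record["country"].upper(),
    }


def to_curated(cleaned_records):
    """Apply (hypothetical) business logic: total revenue per country."""
    totals = {}
    for rec in cleaned_records:
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + rec["amount"]
    return totals


cleaned_zone = [c for c in (to_cleaned(r) for r in RAW_ZONE) if c is not None]
curated_zone = to_curated(cleaned_zone)
print(curated_zone)
```

Note that the raw zone is never modified: the invalid record stays there as part of the permanent record, and only the downstream zones drop it.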

Data Partitioning

To improve query performance, data in the cleaned or curated zones can be partitioned based on criteria such as time, date, or location. This technique enables query engines to scan only relevant partitions, resulting in faster performance.
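A small sketch of the idea, using only the Python standard library. The `date=<value>/` folder naming, the CSV format, and the `write_partitioned` / `read_partition` helpers are my own assumptions for illustration; real query engines do this pruning for you over formats like Parquet.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(root, rows):
    """Write rows into one CSV file per event_date, under date=<value>/ folders."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row["event_date"], []).append(row)
    for event_date, group in by_date.items():
        part_dir = Path(root) / f"date={event_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["event_date", "user", "clicks"])
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, event_date):
    """Read only the folder for one date; return (rows, number of files opened)."""
    files = sorted((Path(root) / f"date={event_date}").glob("*.csv"))
    rows = []
    for path in files:
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows, len(files)


rows = [
    {"event_date": "2024-09-19", "user": "a", "clicks": 3},
    {"event_date": "2024-09-20", "user": "b", "clicks": 5},
    {"event_date": "2024-09-20", "user": "c", "clicks": 1},
    {"event_date": "2024-09-21", "user": "a", "clicks": 2},
]

with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, rows)
    total_files = len(list(Path(root).glob("date=*/*.csv")))
    selected, files_scanned = read_partition(root, "2024-09-20")

print(f"scanned {files_scanned} of {total_files} files, got {len(selected)} rows")
```

A query for a single date only opens the files in that date's folder and skips the rest, which is where the speed-up comes from.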

Data Catalogs

A data catalog serves as a centralized collection of metadata about datasets, allowing users to search for information based on data owner, source, partitioning details, and column definitions. It maintains schemas and records changes over time, fostering a common understanding of data structure across the organization.
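As a rough mental model, a catalog can be sketched as a searchable dictionary of metadata that also keeps a history of schemas. The `CatalogEntry` and `DataCatalog` classes below are hypothetical and only illustrate the idea; they are not the API of any real catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    source: str
    partitioned_by: list
    # Each schema version maps column name -> type; the last one is current.
    schema_versions: list = field(default_factory=list)

    def current_schema(self):
        return self.schema_versions[-1]


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def get(self, name):
        return self._entries[name]

    def update_schema(self, name, new_schema):
        """Record a schema change while keeping the older versions."""
        self._entries[name].schema_versions.append(new_schema)

    def search(self, **criteria):
        """Return names of datasets whose metadata matches all criteria."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if all(getattr(entry, key) == value for key, value in criteria.items())
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders", owner="sales-team", source="shop-db",
    partitioned_by=["order_date"],
    schema_versions=[{"order_id": "int", "amount": "float"}],
))
catalog.register(CatalogEntry(
    name="clicks", owner="web-team", source="clickstream",
    partitioned_by=["event_date"],
    schema_versions=[{"user": "string", "url": "string"}],
))
catalog.update_schema("orders", {"order_id": "int", "amount": "float", "currency": "string"})

print(catalog.search(owner="sales-team"))
print(catalog.get("orders").current_schema())
```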

The Data Lakehouse Architecture

Despite these advancements, organizations often relied on multiple storage systems to meet different needs, using data lakes for their low-cost storage and data warehouses for their superior query performance. Moving data between the two required costly ETL processes, which introduced the risk of data quality issues.

image

To address these challenges, the data lakehouse architecture was developed, merging the benefits of data warehouses and data lakes into a unified solution.

Data lakehouse

image

image

image

image

I have seen complaints online and memes about AWS’ many services. And the above pic from the course definitely confirms this meme 😆

Summary

image

Week 3

Batch queries

There was some SQL practice like:

image

There was also a very long SQL lab, where I had to create CTE on top of CTE on top of CTE for different cases.
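For anyone who has not stacked CTEs before, here is a tiny example of the pattern, run against an in-memory SQLite database from Python. The `orders` table and its data are made up for illustration and are not from the lab; each CTE simply builds on the one before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ana', 10), ('ana', 30), ('bo', 5), ('cy', 50), ('cy', 70);
""")

query = """
WITH customer_totals AS (      -- step 1: total spend per customer
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
),
overall AS (                   -- step 2: average of those totals
    SELECT AVG(total) AS avg_total
    FROM customer_totals
),
above_average AS (             -- step 3: customers above the average
    SELECT c.customer, c.total
    FROM customer_totals AS c
    CROSS JOIN overall AS o
    WHERE c.total > o.avg_total
)
SELECT customer, total FROM above_average ORDER BY customer;
"""

result = conn.execute(query).fetchall()
print(result)  # [('cy', 120.0)]
```

The totals are 40, 5 and 120, so the average is 55 and only one customer is above it.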

image

Sometimes we can reduce query time by creating an index on the columns used for filtering or joining.
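A quick way to see the effect is to compare the query plan before and after adding an index. The sketch below uses SQLite through Python's `sqlite3` module, with a made-up `events` table and a hypothetical index name `idx_events_user_id`; the course lab used a different database, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(i % 100, "click") for i in range(10_000)],
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


sql = "SELECT COUNT(*) FROM events WHERE user_id = 42"

plan_before = plan(sql)  # without an index: a full scan of the table
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = plan(sql)   # with the index: a search that uses idx_events_user_id

print("before:", plan_before)
print("after: ", plan_after)
```

The trade-off is that every index takes extra storage and has to be updated on each write, so it only pays off for columns that are queried often.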

image

Streaming queries

image

image

This part introduced AWS’ managed Apache Flink services, and the lab was about using and deploying them.

image

image


That is all for today!

See you tomorrow :)