(Day 267) Continuing with the DE course by Joe Reis

Ivan Ivanov · September 24, 2024

Hello :) Today is Day 267!

A quick summary of today:

  • covered weeks 1-3 of the final course in the DE specialisation

Course 4 week 1 continued

Data Modelling Techniques

image

image

image

image

From normalised model to Star schema

Transforming Normalized Data into a Star Schema

In many cases, data engineers need to convert normalized data stored in a relational database into a star schema for easier querying, often in preparation for department-specific data marts. This process can follow Kimball’s four steps for designing a star schema:

  1. Understand Business Needs: begin by identifying which business processes to model in the fact tables and define the grain, that is, the level of detail each row in the fact table represents. In the example, the grain is set to the individual product items in sales transactions

  2. Identify Facts and Dimensions: based on requirements gathered from a data analyst, dimensions such as stores, items, and dates are identified. For example, the stores dimension includes store details like store_id, name, city, and zip code

  3. Create Dimension Tables: surrogate keys are used for dimension tables to uniquely identify each row. Hash functions, such as MD5, can generate surrogate keys from natural keys like store_id and SKU

  4. Build the Fact Table: the fact table contains metrics (e.g. quantity sold, item price) and foreign keys that link to the dimension tables. For example, the fact table stores store_key, item_key, and date_key alongside transaction details (a rough sketch of steps 3 and 4 follows below)
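
The lab does this with dbt, but to make steps 3 and 4 concrete, here is a minimal PySpark sketch of the same idea; the table names and the columns beyond store_id, SKU, and the *_key columns are made up for illustration.

```python
# Minimal sketch of steps 3 and 4: build dimension tables with MD5 surrogate
# keys and a fact table that references them. Table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star_schema_sketch").getOrCreate()

stores = spark.table("normalized.stores")        # natural key: store_id
items = spark.table("normalized.items")          # natural key: sku
sales = spark.table("normalized.sales_items")    # one row per product item sold

# Step 3: dimension tables with an MD5 hash of the natural key as surrogate key
dim_store = stores.select(
    F.md5(F.col("store_id").cast("string")).alias("store_key"),
    "store_id", "name", "city", "zip_code",
)
dim_item = items.select(
    F.md5(F.col("sku").cast("string")).alias("item_key"),
    "sku", "item_name", "category",
)

# Step 4: fact table at the grain of one product item per sales transaction,
# carrying the metrics plus foreign keys to the dimensions
fact_sales = (
    sales
    .withColumn("store_key", F.md5(F.col("store_id").cast("string")))
    .withColumn("item_key", F.md5(F.col("sku").cast("string")))
    .withColumn("date_key", F.date_format("sold_at", "yyyyMMdd"))
    .select("store_key", "item_key", "date_key", "quantity", "item_price")
)

dim_store.write.mode("overwrite").saveAsTable("star.dim_store")
dim_item.write.mode("overwrite").saveAsTable("star.dim_item")
fact_sales.write.mode("overwrite").saveAsTable("star.fact_sales")
```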

Once the star schema is built, tools like dbt or AWS Glue can be used to automate the transformations. dbt focuses on transforming data within a single warehouse, while AWS Glue offers more flexibility, connecting multiple sources and destinations

The lab for this week uses dbt to model normalised data into a star schema.

There was a short talk between Joe and dbt’s co-founder Drew Banin. Drew said that dbt does not pull data out of the warehouse, run checks, transform it, and put it back in; it just instructs the data warehouse what to do, so the data never leaves the warehouse. This is something I did not know about dbt, and it is another reason it is so awesome.

Data Vault (another data modelling technique)

Dan Linstedt introduced Data Vault as a different approach to modelling data in a warehouse.

image

image

Sample data vault modelled data

image

One Big Table

The above approaches were developed in the past, when data warehouses were expensive, on-premises, and heavily resource-constrained, with tightly coupled compute and storage. While batch data modelling has traditionally been associated with one of these strict approaches, more relaxed approaches such as one big table (OBT) are becoming more common.

image

image

Joe Reis suggests using a wide table when we have a lot of data that needs more flexibility than a traditional data modelling approach might provide. When it comes to modelling our data, there is no one-size-fits-all solution, so we have to understand the trade-offs between the possible approaches in terms of flexibility, data integrity, and ease of use by downstream stakeholders, and choose the best approach for our use case.
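
As a rough illustration of what an OBT looks like in practice, here is a minimal PySpark sketch that denormalises the fact table by joining its dimensions back in; it reuses the hypothetical star schema names from the earlier sketch.

```python
# Minimal "one big table" sketch: join the (hypothetical) fact table with its
# dimensions into a single wide table that downstream users can query directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obt_sketch").getOrCreate()

obt_sales = (
    spark.table("star.fact_sales")
    .join(spark.table("star.dim_store"), "store_key", "left")
    .join(spark.table("star.dim_item"), "item_key", "left")
)

obt_sales.write.mode("overwrite").saveAsTable("analytics.obt_sales")
```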

Lab

I will just share the successful dbt run after creating the 5 dimension table models and the fact table model:

image

Week 2: Modelling and processing data for ML

image

As a DE, I might not be responsible for the models in the ML pipelines, but I am responsible for setting up and maintaining the pipelines that serve data to support these phases.

This week was about preprocessing data (tabular/image/text) and getting it ready for an ML model.
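
As a tiny example of the tabular case (not the course lab, and the columns are made up), here is a minimal scikit-learn sketch that scales numeric features and one-hot encodes a categorical one:

```python
# Minimal tabular preprocessing sketch: scale numeric columns and one-hot
# encode a categorical column so the data is ready for an ML model.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data
df = pd.DataFrame({
    "price": [9.99, 4.50, 12.00],
    "quantity": [2, 5, 1],
    "category": ["toys", "food", "toys"],
})

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["price", "quantity"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

X = preprocess.fit_transform(df)  # feature matrix ready to feed into a model
print(X)
```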

Week 3: Batch and Streaming transformations using Spark

image

image

Hadoop

  • integrates compute and persistent storage via HDFS (the Hadoop Distributed File System)

Apache Spark

Computing engine designed for processing large datasets

  • retains intermediate results in memory
  • limits disk I/O interactions, enabling faster computation

The graded assignment this week was to use EMR (AWS’s managed Hadoop cluster environment) to do some data transformations with PySpark.
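
Not the graded assignment itself, but a minimal PySpark sketch of the kind of batch transformation this involves, with caching to keep an intermediate result in memory (the S3 paths and columns are made up):

```python
# Minimal batch transformation sketch: clean a raw dataset, cache the
# intermediate result in memory, then reuse it for two aggregations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch_transform_sketch").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical input

cleaned = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .cache()  # keep the intermediate result in memory, avoiding re-reads from disk
)

daily_revenue = cleaned.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
top_items = cleaned.groupBy("item_id").agg(F.count("*").alias("n_orders"))

daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/transformed/daily_revenue/")
top_items.write.mode("overwrite").parquet("s3://my-bucket/transformed/top_items/")
```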

Technical considerations

image

image

The 2nd lab for this week was for streaming practice

image

  • Debezium: Distributed platform for CDC; it connects to the source system and captures changes based on the configuration one sets up
  • Apache Zeppelin: Web-based notebook environment; it allows us to run Flink in a notebook alongside other possible interpreters
  • Apache Flink: Distributed processing engine for data streams; we use it to connect to the Kafka topics and update the target database accordingly
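
Not the lab’s code, but a minimal PyFlink sketch of the general pattern: read a Debezium CDC topic from Kafka and keep a table in a target database up to date (the topic name, connection details, and schema are all made up):

```python
# Minimal CDC streaming sketch: consume Debezium change events from Kafka with
# Flink SQL and continuously upsert aggregated results into a target database.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: Debezium change events for a hypothetical "orders" table, via Kafka
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id STRING,
        amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'postgres.public.orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'debezium-json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: upsert aggregated totals into a target Postgres database
t_env.execute_sql("""
    CREATE TABLE order_totals (
        order_id STRING,
        total DOUBLE,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/analytics',
        'table-name' = 'order_totals'
    )
""")

# Continuously keep the target table in sync with the change stream
t_env.execute_sql("""
    INSERT INTO order_totals
    SELECT order_id, SUM(amount) AS total
    FROM orders_cdc
    GROUP BY order_id
""").wait()
```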

There were some cool-looking new technologies in this lab, but I should not share more.


There is only week 4 left, the final week of the whole DE specialisation. However, from the deeplearning.ai online blog I saw that there are problems with the final assignments, and they will (hopefully) get fixed by Friday. So until then, I guess I will continue with some other study items on my list.

That is all for today!

See you tomorrow :)