(Day 262) Day 3 of the DeepLearning.AI Data Engineering Professional Certificate course

Ivan Ivanov · September 19, 2024

Hello :) Today is Day 262!

A quick summary of today:

Finishing Course 2 Week 1

From yesterday, the last thing from week 1 was the lab and quiz. Below is a simple overview of the issues tackled in the lab as I should not share detailed info.

  • Fixing Database Connectivity Issues

So I had to look at the networking setup:

image

The issue was that I was trying to connect to the DB from Cloud9, but the two were launched in different VPCs.

  • Fixing Permission Issues (making sure Cloud9 and the DB are in the same security group and can communicate with each other)

image

We covered:

  • how to connect to data sources
  • basics of networking
  • importance of IAM in ensuring security in source systems

image

Week 2: Data Ingestion

Summary: Batch vs. Stream Processing in Data Engineering

Throughout the course, the focus has been on understanding batch and stream processing, particularly in data engineering contexts. Almost all data can be seen as continuous streams of events—such as stock price changes or user interactions—that generate digital data when recorded.

Data Ingestion: Stream vs. Batch

  • Stream ingestion processes events individually in real-time as they occur.
  • Batch ingestion imposes boundaries, grouping data by time, size, or number of records, processing it as a single unit.

Both methods exist on a continuum, with micro-batch processing blurring the lines between batch and streaming. The choice between batch and streaming depends on the source system and the business use case.

Tools and Technologies

  • Databases: Connectors like JDBC or ODBC facilitate regular or event-based ingestion (a rough sketch of a batch pull follows this list).
  • Serverless Tools: Services like AWS Glue ETL automate ingestion.
  • APIs: Custom connections may be needed, though API client libraries simplify the process.
  • Files: Sometimes ingestion is manual, using protocols like SFTP or SCP.
  • IoT/Sensor Data: Message queues or streaming systems are often required.
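
To make the databases bullet a bit more concrete, here is a minimal sketch of a time-bounded batch pull over a standard connector, assuming SQLAlchemy and pandas on the ingestion side; the connection string, table and time window are all made up:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string pointing at the source system.
engine = create_engine("postgresql+psycopg2://user:password@db-host:5432/shop")

# Pull one time-bounded batch of records and process it as a single unit.
query = """
    SELECT order_id, customer_id, amount, created_at
    FROM orders
    WHERE created_at >= NOW() - INTERVAL '1 day'
"""
daily_orders = pd.read_sql(query, engine)
print(f"Ingested {len(daily_orders)} rows in this batch")
```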

Practical Application

Hands-on use cases include batch ingestion from APIs and streaming ingestion from Kinesis, helping learners become familiar with various ingestion patterns and challenges.

Batch

image

Advantages of ELT

  • faster to implement
  • makes data available more quickly to end users
  • transformations can still be done efficiently

Disadvantages of ELT

  • if we don’t plan how we will use the data, our pipeline can end up being just EL, leaving a ‘data swamp’ of unorganised, unmanageable and essentially useless data

ETL vs. ELT, by aspect:

  • History
    ETL: In the 80s and 90s, data warehouses were very expensive (millions of dollars), so engineers wanted to be very intentional about the data they were about to load into the data warehouse. Data volume was still manageable.
    ELT: Cloud data warehouses reduced the cost of storing and processing data significantly (from millions of dollars to hundreds or thousands), and data volume exploded.
  • Processing (transformation)
    ETL: Data is transformed into a predetermined format before it is loaded into a data repository, so data engineers have to carefully model the data and transform it into its final form. Transformations rely on the processing power of the tool used to ingest the data (unrelated to the target destination).
    ELT: Raw data is loaded into the target destination and transformed just before analysis (works well with well-defined data requests). Transformation relies on the processing power of the data repository, such as the data warehouse.
  • Maintenance time
    ETL: If the transformation is found to be inadequate, the data needs to be re-loaded.
    ELT: The original data is intact and already loaded, and can be used for additional transformation when necessary; less time is required for data maintenance.
  • Load time & transformation time
    ETL: Load time typically takes longer as it uses a staging area and system. Transformation time depends on the data size, the transformation complexity and the tool used to perform the transformation.
    ELT: Load time is shorter since there is no transformation involved and the data is loaded directly into the destination system. Transformation time is typically faster because it relies on the processing power and parallelization of modern data warehouses (generally considered more efficient).
  • Flexibility (data types)
    ETL: Typically designed to handle structured data.
    ELT: Can handle all types of data: structured, semi-structured, unstructured. Once the data is loaded into the target system, you can transform it.
  • Cost
    For both, it depends on which ETL/ELT tool is used, which target system the data is loaded into and, of course, the data volume.
  • Scalability
    ETL: Nowadays most cloud tools are scalable, but with lots of data sources and targets you need a lot of effort to manage the code and handle data from multiple sources.
    ELT: Uses the scalable processing power of the data warehouse to enable transformation at large scale.
  • Data quality/security
    ETL: Ensures data quality by cleaning the data first; transformations can also include masking personal information.
    ELT: The data needs to be transferred to the target system first, before transformations that enhance data quality or security are applied.
There’s a sub-pattern called EtLT, where the small ‘t’ does not refer to business modeling but to transformation with limited scope (masking sensitive data, deduplicating rows).

Batch lab

  • extract data from the spotify API
  • explore pagination
  • send an API request that requires authorisation (a rough sketch of this flow is below)
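
Since the lab code itself should not be shared, here is a rough, generic sketch of the same flow against the public Spotify Web API: get a token via the client-credentials grant, then follow the paginated 'next' links. The credentials and chosen endpoint are placeholders, not the lab's actual code:

```python
import requests

CLIENT_ID = "your-client-id"          # hypothetical credentials
CLIENT_SECRET = "your-client-secret"

# 1. Authorisation: exchange client credentials for an access token.
token_resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
access_token = token_resp.json()["access_token"]
headers = {"Authorization": f"Bearer {access_token}"}

# 2. Pagination: keep requesting pages until the API stops returning a 'next' URL.
url = "https://api.spotify.com/v1/browse/new-releases?limit=50"
albums = []
while url:
    page = requests.get(url, headers=headers).json()["albums"]
    albums.extend(page["items"])
    url = page["next"]                # None on the last page

print(f"Extracted {len(albums)} albums across all pages")
```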

Streaming ingestion

image

image

In Amazon Kinesis Data Streams, the terminology is: stream (instead of topic) and shard (instead of partition).
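
For illustration, a minimal producer-side sketch with boto3; the stream name, region and payload are made up:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": 42, "action": "page_view", "ts": "2024-09-19T10:00:00Z"}

kinesis.put_record(
    StreamName="user-events",                 # the "stream" (vs. a Kafka topic)
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),       # determines which shard receives it
)
```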

image

Summary: Change Data Capture (CDC)

What is CDC?

Change Data Capture (CDC) is a method used to track and extract changes (insert, update, delete) in a database, ensuring data synchronization across systems. It allows data updates to be captured incrementally, rather than reloading entire datasets.

Strategies for Data Updates:
  1. Full Load (Snapshot): Reloads the entire dataset, replacing old data with the new. It guarantees consistency but is resource-intensive, making it inefficient for high-volume or frequently updated data.
  2. Incremental Load: Loads only the data that has changed since the last update, using mechanisms like last_updated_at columns. This approach is faster for large datasets and is commonly known as CDC.

Use Cases for CDC:

  • Data Synchronization: Continuously syncs data across systems (e.g., from a transactional database to a data warehouse).
  • Historical Change Tracking: Captures every change for audit or regulatory purposes, preserving historical data.
  • Microservices Integration: Tracks changes in a source database to trigger updates in dependent services, such as shipment or customer service systems.

CDC Approaches:

  1. Push: The source system pushes data updates to the target system in real-time. However, if the target is unavailable, data may be lost.
  2. Pull: The target system polls the source database for changes at regular intervals. While simpler, it introduces a time lag.

CDC Implementation Patterns:

  1. Batch-oriented/Query-based CDC (Pull-based): Relies on querying the source database for changes, using columns like updated_at. It can be computationally expensive due to table scans (a sketch of this pattern follows the list).
  2. Continuous/Log-based CDC (Pull-based): Utilizes the database’s log to capture changes in real-time with minimal overhead. Tools like Debezium can extract log events and send them to streaming platforms like Apache Kafka.
  3. Trigger-based CDC (Push-based): Uses database triggers to notify CDC systems of changes. However, excessive use of triggers can degrade database performance.
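
As a rough illustration of the first (query-based) pattern, the sketch below polls a source table for rows changed since a stored watermark; the table, columns and connection details are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical source connection; in practice the watermark would be persisted
# by the pipeline (e.g. in a state table), not hard-coded.
engine = create_engine("postgresql+psycopg2://user:password@source-db:5432/shop")
last_watermark = "2024-09-18 00:00:00"

query = text("""
    SELECT customer_id, email, status, updated_at
    FROM customers
    WHERE updated_at > :watermark
    ORDER BY updated_at
""")

changes = pd.read_sql(query, engine, params={"watermark": last_watermark})

# Advance the watermark only after the changes are safely loaded downstream.
if not changes.empty:
    last_watermark = changes["updated_at"].max()
```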

Choosing an ingestion tool can depend on the data's characteristics:

  • data type and structure
  • data volume
  • latency requirements
  • data quality
  • changes in schema

Also consider the tool’s reliability and durability.

Advice: evaluate the tradeoff between the cost of losing data and the cost of building an appropriate level of redundancy.

Streaming lab overview

image

Week 3 - DataOps

A reminder: DataOps is a set of practices and cultural habits centred around building robust data systems and delivering high-quality data products.

Automation with IaC

image image

  • creating an EC2 instance
  • defining variables and outputs
  • defining data sources and modules

Terraform lab

image

The lab is about setting up and configuring all these Terraform files.

image

Observability

Data issues are inevitable and can occur at any stage of your data pipeline. The earlier you detect them, the less damage they cause to the organization. To detect data issues, you first need to choose metrics that assess data quality, similar to how software teams monitor metrics that assess the health of their software infrastructure.

In her book (Data Quality Fundamentals), Barr Moses suggests starting with the following questions:

  • Is the data up-to-date?

  • Is the data complete?

  • Are fields within expected ranges?

  • Is the null rate higher or lower than it should be?

  • Has the schema changed?

She formulated these questions into 5 pillars for data observability, which aim to fully describe the state of the data.

Barr Moses' 5 Pillars

  1. Distribution / Internal Quality: The quality pillar refers to the internal characteristics of the data, and checks metrics such as the percentage of NULL elements, the percentage of unique elements, summary statistics and whether your data is within the expected range. It helps you ensure that your data can be trusted based on your data expectations.

  2. Freshness: Data freshness refers to how “fresh” or “up-to-date” the data is within the final asset (table, BI report), i.e., when the data was last updated, and how frequently it is updated. Stale data results in wasted time and money.

  3. Volume: Data volume refers to checking the amount of data ingested and looking for unexpected spikes or drops. Sudden drops in data volume can indicate issues like lost data or system outages, and sudden increases may indicate unexpected surges in usage.

  4. Lineage: According to Barr, “When data breaks, the first question is always ‘where?’” Data lineage helps you trace the data journey from its source to its destination, visualizing how data was transformed and where it was stored. This way, you can identify the source of errors or anomalies.

  5. Schema: Data schema refers to monitoring changes in data structure or types. This pillar helps avoid the failure of the data pipeline.
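
As a toy illustration of how a few of these pillars translate into concrete checks, here is a small pandas sketch computing a null rate, a range check, freshness lag and row volume; the file, column names and thresholds are made up:

```python
import pandas as pd

# Hypothetical dataset with naive (timezone-less) timestamps.
df = pd.read_csv("orders.csv", parse_dates=["updated_at"])

# Quality / distribution: null rate and share of values in the expected range.
null_rate = df["amount"].isna().mean()
in_range = df["amount"].between(0, 10_000).mean()

# Freshness: how long ago the newest record was updated.
lag = pd.Timestamp.now() - df["updated_at"].max()

# Volume: row count for this load, to compare against expected levels.
row_count = len(df)

print(f"null rate={null_rate:.2%}, in range={in_range:.2%}, "
      f"freshness lag={lag}, volume={row_count} rows")
```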

image

Metrics can vary depending on the project.

image

When it comes to data accuracy and completeness, we can assume that these are important for every project.

image

Great Expectations workflow

image

When working with Great Expectations, a typical workflow involves the following core components:

1. Data Context

  • Purpose: The entry point for the Great Expectations API.
  • Function: Allows you to create objects to connect to data sources, create expectations, and validate data.
  • Role: Manages objects and metadata for your project.

2. Data Sources

  • Purpose: Specify the location of the data you wish to test.
  • Sources: Can be a SQL database, local file system, S3 bucket, or a pandas DataFrame.

3. Data Assets & Batch Requests

  • Data Asset: A collection of records in a data source (e.g., a table, file, or query asset).
  • Batches: Data can be partitioned into smaller batches (e.g., by month or store IDs).
  • Batch Request: Primary way to retrieve data for validation from a data asset.

4. Expectations

  • Purpose: Statements used to verify if data meets certain conditions (e.g., column values are not null, or values are unique).
  • Expectation Suite: A collection of multiple expectations applied to a data asset.

5. Validator

  • Purpose: Validates data against a batch request and its corresponding expectation suite.
  • Manual/Automated Validation: Can be performed directly with the validator or streamlined using a checkpoint.

6. Checkpoints

  • Purpose: Automates the validation process by providing batch requests and expectation suites to a validator.
  • Role: Simplifies generating validation results.

7. Stores

  • Expectation Store: Stores expectation suites.
  • Validation Store: Stores objects generated during data validation.
  • Checkpoint Store: Stores checkpoint configurations.
  • Data Docs Store: Stores reports on expectations, checkpoints, and validation results.

Typical Workflow Steps (a rough code sketch follows the list):

  1. Instantiate a Data Context: Set up the context for your Great Expectations project.
  2. Declare a Data Source: Specify from where to get the data for validation.
  3. Create a Batch Request: Retrieve data batches from the data asset.
  4. Define Expectations: Specify conditions your data must meet.
  5. Create an Expectation Suite: Group multiple expectations for a data asset.
  6. Validate Data: Use a validator or checkpoint to run the validation.
  7. Store and Access Metadata: Access the results and configurations through various stores.
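
Great Expectations' API has changed a lot between releases, so the sketch below loosely follows the pre-1.0 "fluent" style and should be read as an outline of the steps rather than a version-exact example; the data file and suite names are made up:

```python
import great_expectations as gx
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical dataset to validate

# 1. Data context: entry point that manages project objects and metadata.
context = gx.get_context()

# 2-3. Data source, data asset and a batch request for the data to validate.
datasource = context.sources.add_pandas(name="local_pandas")
asset = datasource.add_dataframe_asset(name="orders")
batch_request = asset.build_batch_request(dataframe=df)

# 4-5. Expectation suite plus a validator bound to that batch request.
context.add_or_update_expectation_suite(expectation_suite_name="orders_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite",
)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
validator.save_expectation_suite(discard_failed_expectations=False)

# 6-7. A checkpoint automates validation; results end up in the configured stores.
checkpoint = context.add_or_update_checkpoint(
    name="orders_checkpoint", validator=validator
)
results = checkpoint.run()
```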

GX (Great Expectations) seems like a good validation tool. I got some experience using it in one of the labs. (It seems to have some similarities with Evidently, but GX is mainly focused on data.)

AWS CloudWatch for monitoring

image

image
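
One common pattern (not shown in the course materials verbatim) is to push custom data-quality metrics, like the ones computed above, into CloudWatch so they can be graphed and alarmed on. A hypothetical boto3 sketch, with a made-up namespace and metric name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="DataPipeline/Quality",           # made-up namespace
    MetricData=[
        {
            "MetricName": "NullRatePercent",
            "Dimensions": [{"Name": "Table", "Value": "orders"}],
            "Value": 1.7,
            "Unit": "Percent",
        }
    ],
)
```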

Week 4: Orchestration

Before orchestrators, there was Cron

image

Cron is good:

  • to schedule simple and repetitive tasks (e.g. regular data downloads)
  • when we are in the prototyping phase (e.g. testing aspects of our data pipelines)

image

Airflow

  • the final part of week 4 (and of course 2) covers Airflow basics
  • the lab is about using Airflow as well

image

1. Web Server

  • Purpose: Hosts the Airflow User Interface (UI).
  • Function: Provides a graphical interface where users can monitor, trigger, and troubleshoot DAGs and their tasks.
  • Role: Displays the status of DAGs and tasks by extracting information from the metadata database.

2. DAG Directory

  • Purpose: Stores the Python scripts that define your DAGs.
  • Function: The folder where DAG files are placed; it is automatically connected to the web server.
  • Role: Any DAG added to the directory is visualized in the Airflow UI.

3. Scheduler

  • Purpose: Constantly monitors DAGs and their tasks.
  • Function: Triggers tasks based on a schedule or when dependencies are met. Pushes tasks to a queue.
  • Role: Determines when a task should run and transitions tasks from scheduled to queued.

4. Workers

  • Purpose: Executes tasks from the queue.
  • Function: Executes tasks assigned by the scheduler and reports the task’s success or failure.
  • Role: Handles the actual execution of tasks in a distributed fashion.

5. Metadata Database

  • Purpose: Stores metadata about DAGs and tasks.
  • Function: Keeps records of the status and state of DAGs and tasks.
  • Role: The web server retrieves information from this database to display in the UI.

Workflow Example (a minimal DAG file sketch follows these steps):

  1. DAGs are stored in the DAG Directory.
  2. The Scheduler monitors DAGs and their tasks, and based on schedules or task dependencies, triggers tasks.
  3. The triggered tasks are placed in a queue and executed by Workers.
  4. Workers update the status of tasks (e.g., running, success, failed) in the Metadata Database.
  5. The Web Server retrieves these task statuses from the metadata database and displays them in the UI for monitoring and management.
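
To tie this together, here is a minimal sketch of what a DAG file dropped into the DAG directory might look like, using the standard Airflow 2.x API; the task logic and names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data from the source system")


def load():
    print("loading data into the target system")


with DAG(
    dag_id="simple_batch_pipeline",
    start_date=datetime(2024, 9, 1),
    schedule="@daily",   # the scheduler triggers one run per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # load runs only after extract succeeds
```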

Airflow best practices docs

image

Orchestration options on AWS

image

Course 2 - completed

image

Of course, I cannot get the certificate now as I am in a free trial period.


Tomorrow I continue with course 3

That is all for today!

See you tomorrow :)