(Day 260) DeepLearning.AI Data Engineering Professional Certificate got released

Ivan Ivanov · September 17, 2024

Hello :) Today is Day 260!

A quick summary of today:

  • the DE course by Joe Reis, author of the DE holy book - Fundamentals of Data Engineering - is live

I thought the course release date was the 19th, but I got an email saying I have access starting today. Since I pre-registered, I got a 14-day free trial, which means I have 14 days to complete the course

Yesterday I was planning to continue with the neo4j courses, but given that this DE course is free for a limited time, it takes priority now.

Here is an overview of the course

image

This program was designed by Joe Reis in partnership with DeepLearning.AI and AWS to cover the fundamentals of data engineering: both the underlying theory and frameworks for thinking like a data engineer, and the practical skills for building data engineering solutions on the cloud.

Today I started the 1st course: Introduction to Data Engineering


Week 1

image

As DEs, we will focus on the big square in the middle, but we will also look at the other two sides.

A Brief History of Data Engineering

Data is foundational to information, existing in various forms such as numbers, words, or even phenomena like photons or wind. While data has always existed, digital data—recorded and processed through computers—gained prominence in the 1960s with the advent of computerized databases, followed by relational databases in the 1970s and the development of SQL. The 1980s introduced data warehouses for analytical decision-making, while the 1990s saw the rise of data systems for business intelligence. The dotcom boom in the mid-1990s led to rapid growth in web applications, supported by backend systems like servers and databases.

The 2000s saw the emergence of big data, driven by companies like Yahoo, Google, and Amazon. Big data refers to datasets characterized by velocity, variety, and volume, and its era began with innovations like Google’s MapReduce (2004) and Apache Hadoop (2006). Simultaneously, Amazon introduced cloud services like EC2 and S3, revolutionizing data storage and processing. AWS became a major player, followed by Google Cloud and Microsoft Azure.

By the late 2000s, the transition to real-time data streaming and cloud-first tools simplified big data processing. Today, the term “big data” has lost relevance, as scalable data tools are accessible to all companies, and data engineers focus on building scalable, business-driven data systems. The role of the modern data engineer is to leverage innovations and tools developed over the past decades to create data solutions that drive business value.

image

The Data Engineer Among Other Stakeholders

image

System Requirements

image

image

The first step in any data engineering project is gathering system requirements, typically from downstream stakeholders who have business goals in mind. These goals often aren’t expressed as clear system requirements, so it’s the data engineer’s job to translate them. The process starts with conversations tailored to each stakeholder’s technical background and role. In upcoming videos, experts Sol Rashidi and Jordan Morrow will provide advice on communicating with different stakeholders and effective requirements gathering. These videos are optional, and you can skip ahead to see a practical example involving an e-commerce company.

Translate Stakeholder Needs into Specific Requirements

image

Example conversation with a DS

image

image

image

image

Key tactic: ask stakeholders what action they plan to take with the data (it's not the same as asking what they need)

image

Thinking like a DE

image

image

DE on the Cloud (AWS)

The 2nd part of week 1 was an intro to the cloud and AWS’ services.

5 core services: compute, network, storage, databases and security (there are more but these are good to start with)
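
As a rough orientation (my own mapping, not an official list from the course), each category has a flagship service you can reach through boto3: EC2 for compute, VPC for networking, S3 for storage, RDS for databases, and IAM for security.

```python
# Hedged sketch: one representative AWS service per core category, via boto3.
# Assumes AWS credentials are already configured in the environment.
import boto3

s3 = boto3.client("s3")      # storage
ec2 = boto3.client("ec2")    # compute, and also the VPC (networking) APIs
rds = boto3.client("rds")    # managed relational databases
iam = boto3.client("iam")    # security: users, roles, policies

# A couple of read-only calls just to show the clients in action.
print("Buckets:", [b["Name"] for b in s3.list_buckets()["Buckets"]])
print("VPCs:", [v["VpcId"] for v in ec2.describe_vpcs()["Vpcs"]])
```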

image

image

image

Databases:

image

image

Week 2

This part was very similar to his (Joe’s) book (unsurprisingly). And it’s nice seeing a video of the material I read in the book.

image

Data generation in Source systems

image

The first stage of the data engineering lifecycle involves generating data from various source systems. As a data engineer, you’ll work with data from diverse sources like internal databases, APIs, and IoT devices. These systems, often maintained by other teams or external vendors, are out of your control, but understanding how they work is vital for building reliable pipelines. Common sources include relational databases, NoSQL systems, files, APIs, and data-sharing platforms.

In reality, source systems are unpredictable—schemas may change, or systems may go down, causing downstream issues. Establishing good relationships with source system owners is crucial for navigating changes and ensuring smooth data workflows. Next, the process moves to data ingestion, which varies depending on the project.

Ingestion

Ingestion is moving raw data from source systems into your data pipeline for further processing.
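
To make that concrete, here is a minimal batch-ingestion sketch of my own (the API endpoint and bucket name are made up, not from the course): pull raw records from a source system and land them, untouched, in an S3 "raw" prefix for downstream stages.

```python
# Hypothetical example: ingest a batch of orders from a REST API into S3.
import json
from datetime import datetime, timezone

import boto3
import requests

SOURCE_URL = "https://example.com/api/orders"  # placeholder source system
RAW_BUCKET = "my-raw-data-bucket"              # placeholder bucket

def ingest_batch() -> str:
    records = requests.get(SOURCE_URL, timeout=30).json()
    # Partition the landing path by ingestion time so reruns don't overwrite each other.
    key = f"orders/ingested_at={datetime.now(timezone.utc):%Y-%m-%d-%H%M%S}/orders.json"
    boto3.client("s3").put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(records))
    return key

if __name__ == "__main__":
    print(f"Landed raw data at s3://{RAW_BUCKET}/{ingest_batch()}")
```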

image

Storage

image

As a data engineer, it's entirely possible that you will spend much of your time operating at or near the top of this hierarchy, meaning that you won't be required to think about the details of exactly how your data is moving between different storage components and systems. However, you will be most effective in your work if you take the time to understand the inner workings, capabilities, and limitations of your entire storage solution, right down to the raw ingredients. The truth is that many practicing data engineers today do not deeply understand the details of the storage systems they build. This leads to unfortunate consequences when it comes to things like performance and cost.

Transformation

The transformation stage in the data engineering life cycle is where value is added by turning raw data into something useful for downstream users like business analysts and data scientists. Transformation involves three parts: queries, modeling, and transformation itself. Queries retrieve data from storage, often using SQL, and must be carefully written to avoid issues like performance problems or row explosion. Data modeling structures the data to reflect real-world relationships and make it easier for stakeholders to use, often requiring denormalization. Finally, transformation involves manipulating and enriching data at various stages, ensuring it’s in the correct format for downstream use, whether for reporting, analytics, or machine learning.
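
As a tiny illustration of those three parts (my own made-up tables, not the course's), the pandas sketch below "queries" two small tables, denormalizes them with a join guarded against row explosion, and produces an aggregate that downstream users could consume directly.

```python
import pandas as pd

# Pretend these came back from queries against storage.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "amount": [99.0, 15.5, 42.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "country": ["DE", "BG"],
})

# Modeling/denormalization: join the customer dimension onto the orders facts.
# validate="m:1" raises if customer_id were duplicated, guarding against row explosion.
denormalized = orders.merge(customers, on="customer_id", how="left", validate="m:1")

# Transformation for serving: e.g. total sales per country.
sales_by_country = denormalized.groupby("country", as_index=False)["amount"].sum()
print(sales_by_country)
```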

Serving data

  • analytics (BI, operational analytics, embedded analytics)
  • ML
  • reverse ETL

Undercurrents

  • security - use the principle of least privilege; adopt a defensive mindset
  • data management - the development, execution, and supervision of plans, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their life cycles. Data governance is first and foremost a data management function to ensure the quality, integrity, security, and usability of the data collected from an organization

image

  • data architecture - a roadmap or blueprint for our data systems; the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs

Principles of Good Data Architecture

  1. Choose common components wisely
  2. Plan for failure!
  3. Architect for scalability
  4. Architecture is leadership
  5. Always be architecting
  6. Build loosely coupled systems
  7. Make reversible decisions
  8. Prioritize security
  9. Embrace FinOps
  • DataOps - aims to improve the development process and quality of data products. DataOps is first and foremost a set of cultural habits and practices that you can adopt. These include things like prioritizing communication and collaboration with other business stakeholders, continuously learning from your successes and failures, and taking an approach of rapid iteration to work toward improvements to your systems and processes. These are also the cultural habits and practices of DevOps, and they're borrowed directly from the agile methodology

image

  • orchestration

image

image

Practical examples on AWS

Below are the AWS options for each part of the DE lifecycle

image

image

image

image

image

image

Below are the undercurrents on AWS

image

image

image

Orchestrator - Airflow (but any of the other popular ones would work)
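
For a feel of what that looks like in code, here is a minimal Airflow DAG sketch of my own (not the course's code; assumes Airflow 2.4+): three tasks mirroring ingest, transform, and serve, wired together with dependencies.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data from the source system")

def transform():
    print("turn raw data into something useful")

def serve():
    print("publish the data for analytics / ML")

with DAG(
    dag_id="de_lifecycle_example",
    start_date=datetime(2024, 9, 1),
    schedule="@daily",   # Airflow 2.4+ style; older 2.x versions use schedule_interval
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_serve = PythonOperator(task_id="serve", python_callable=serve)

    t_ingest >> t_transform >> t_serve
```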

image

image

Next is an assignment in a hosted AWS environment

Below is a summary of the assignment, as I do not want to give away too much of the actual thing.

image

1. Setup Review

image

  • set up Cloud9 (an AWS IDE)

image

2. Exploring the Source System

  • Access the source system database (MySQL) using the AWS RDS service.
  • Retrieve the database’s endpoint and connection details from the AWS console.
  • Establish a MySQL connection using the endpoint, username, password, and port.
  • Explore the database structure using SQL commands (SHOW TABLES, etc.); a rough connection sketch follows this list.
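
For illustration, here is a rough Python sketch of that connection step (the lab uses its own endpoint and credentials; everything below is a placeholder):

```python
import pymysql

# Placeholder connection details; in the lab these come from the RDS console.
conn = pymysql.connect(
    host="my-db-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="example-password",
    port=3306,
    database="example_db",
)

with conn.cursor() as cur:
    cur.execute("SHOW TABLES;")
    for (table_name,) in cur.fetchall():
        print(table_name)

conn.close()
```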

3. Resource Creation

  • Use Terraform files to create and configure AWS resources (AWS Glue, S3).
  • Initialize the Terraform environment and run commands (terraform init, terraform apply) to deploy resources.
  • Review .tf files for understanding Glue, S3, and networking configurations.

4. Running the Glue Job

  • Start the AWS Glue job using a command provided in the lab (a rough boto3 equivalent is sketched after this list).
  • Monitor the job progress in the AWS Glue console (ETL Jobs section).
  • After the Glue job completes, verify the transformed data in the S3 bucket.
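
Roughly the equivalent of the start-and-monitor step in Python with boto3 (the job name below is a placeholder, not the lab's):

```python
import time

import boto3

glue = boto3.client("glue")
JOB_NAME = "my-glue-etl-job"  # placeholder job name

run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]

# Poll until the job reaches a terminal state.
while True:
    state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    print("Glue job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
```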

5. Querying the Data

  • Use a Jupyter notebook to run analytical queries on the transformed data stored in S3.
  • Utilize AWS Wrangler and Amazon Athena to extract data from S3.
  • Perform queries, such as finding total sales by country and other insights (a rough Athena query sketch follows this list).
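
A rough sketch of that kind of query with awswrangler (the AWS SDK for pandas) and Athena; the database, table, and column names below are made up, not the lab's:

```python
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="""
        SELECT country, SUM(order_total) AS total_sales
        FROM curated_orders
        GROUP BY country
        ORDER BY total_sales DESC
    """,
    database="example_curated_db",  # placeholder Glue/Athena database
)
print(df.head())
```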

Tbh, this is amazing. The way this whole environment is set up really makes me feel like I am using the cloud and doing something real.

The end result is creating an interactive dashboard in Jupyter:

image

My score after submitting the lab:

image

Week 3

Data Architecture

image

Conway's law - any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure

image

To Be Continued …


Tbh, I was expecting high quality from DeepLearning.AI and Joe Reis, and the course so far absolutely delivers on those expectations.

That is all for today!

See you tomorrow :)