(Day 263) Day 4 of the DeepLearning.AI Data Engineering Professional Certificate course

Ivan Ivanov · September 20, 2024

Hello :) Today is Day 263!

A quick summary of today:

Structure provided by the course

This course consists of 3 weeks of content and covers these main learning objectives:

  • Explain how data is physically stored on disk and in memory

  • Compare how data is stored and queried in object, block, and file storage systems

  • Explain how data is stored in row-oriented vs column-oriented databases

  • Explain how graph and vector databases store and retrieve data

  • Explain the key architectural features of data warehouses, data lakes, and data lakehouses

  • Implement a data lake using AWS Glue

  • Implement a data lakehouse with a medallion-like architecture using Lake Formation and Iceberg

  • Explain the stages of the life of a query

  • Implement advanced SQL queries

  • Explain the role of an index and its impact on query performance

  • Summarize approaches for processing aggregate and join queries

  • Compare the execution times of aggregate queries between row and columnar storage

  • List some strategies for enhancing query performance

  • Aggregate and join streaming data

Week 1

Raw storage

Data storage systems rely on various “raw ingredients,” including magnetic disk drives (HDDs), solid-state drives (SSDs), RAM, and CPU cache, each with its own performance, durability, and cost characteristics.

1. Magnetic Disk Drives (HDDs)

  • Mechanism: Use rotating platters with magnetic film to store data. Similar to record players, a read/write head locates the right track and encodes/reads binary data.
  • Performance: Mechanical operations limit speed, with latency around 4ms and a few hundred IOPS. Data transfer rate is around 300MB/s.
  • Use Case: Suitable for cost-efficient storage when data access is infrequent and speed is less critical.
  • Cost: HDDs are 2-3 times cheaper than SSDs.

2. Solid-State Drives (SSDs)

  • Mechanism: Store data as electrical charges in flash memory cells, allowing faster, purely electronic operations.
  • Performance: SSDs handle tens of thousands of IOPS with a data fetch latency of ~0.1ms. They can transfer data up to 10 times faster than HDDs.
  • Use Case: Ideal for OLTP systems and other workloads requiring fast random access, but costlier for large analytical workloads.
  • Cost: More expensive than HDDs, especially for large data volumes.

3. Random Access Memory (RAM)

  • Mechanism: Volatile memory that holds data loaded from persistent storage while it is being processed.
  • Performance: Very fast, with transfer speeds of ~100GB/s and latency of 0.1 microseconds, performing millions of IOPS.
  • Use Case: Used for caching, data processing, and indexing, but expensive and unsuitable for long-term storage.
  • Cost: 30-50 times more expensive than SSDs; being volatile, its contents are lost when power is lost.

4. CPU Cache

  • Mechanism: Located directly on the CPU, offering ultrafast data retrieval.
  • Performance: Latency of 1 nanosecond and transfer speeds of 1TB/s.
  • Use Case: Best for storing frequently accessed data during processing.
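
To put these numbers in perspective, here is a quick back-of-envelope sketch (pure Python, using only the approximate transfer rates quoted above) of how long reading 1 GB sequentially takes at each tier:

```python
# Rough time to read 1 GB sequentially at each tier, using the
# approximate transfer rates quoted above (ballpark figures, not benchmarks).
transfer_rates_mb_s = {
    "HDD": 300,              # ~300 MB/s
    "SSD": 3_000,            # up to ~10x HDD
    "RAM": 100_000,          # ~100 GB/s
    "CPU cache": 1_000_000,  # ~1 TB/s
}

for tier, rate in transfer_rates_mb_s.items():
    ms = 1_024 / rate * 1_000  # 1 GB = 1,024 MB
    print(f"{tier:>9}: {ms:10.2f} ms to read 1 GB")
```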

Cloud storage options

As a data engineer, you’re faced with various cloud storage options. These systems operate at a higher abstraction level and include block storage, file storage, and object storage. Each has unique performance and scalability trade-offs.

File Storage

File storage systems organize files into directory trees, much like your computer’s folder structure. Metadata such as names, owners, modification dates, and permissions help manage these files. Managed services like Amazon Elastic File System (EFS) provide centralized access to shared files over a network, abstracting much of the complexity. However, file storage is generally slower due to the need to manage file hierarchies.

Block Storage

Block storage splits data into small, fixed-size blocks stored on disks like SSDs or magnetic drives. It offers high performance and low latency, and is often used for transactional databases and virtual machines (e.g., EC2 with Amazon Elastic Block Store, EBS). Block storage provides a flexible, distributed architecture but is limited in scalability compared to object storage.
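
As a rough sketch of how such a block device is provisioned programmatically with boto3 (the region, availability zone, and instance ID below are placeholders, not values from the course):

```python
import boto3

# Minimal sketch: provision a 100 GiB gp3 volume with Amazon EBS and
# attach it to an EC2 instance. Region, AZ, and instance ID are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,            # size in GiB
    VolumeType="gp3",    # general-purpose SSD-backed block storage
)

# Wait until the volume is ready, then expose it to the instance as a
# raw block device; the OS formats and mounts it like a local disk.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```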

Object Storage

Object storage scales beyond block storage, often into petabytes. It decouples storage from compute, allowing you to handle massive data sets cost-effectively. Data is stored as immutable objects in a flat structure (e.g., AWS S3), and changes require rewriting the entire object. While object storage excels in scalability and parallel processing, it’s not suited for transactional workloads.
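
A minimal boto3 sketch of the object model (the bucket and key names are made up): objects are written and read as a whole, so “updating” one means overwriting the entire object.

```python
import boto3

# Objects in S3 live in a flat namespace: a bucket plus a key.
# The bucket and key names here are placeholders.
s3 = boto3.client("s3")

# Write an object. There is no in-place update: changing even one byte
# later means rewriting (re-uploading) the whole object.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2024-09-20.json",
    Body=b'{"event": "page_view"}',
)

# Read it back in full.
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2024-09-20.json")
print(obj["Body"].read())
```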

Summary

  • File Storage: Simple data sharing, low performance needs.
  • Block Storage: High performance, low-latency workloads.
  • Object Storage: Massive datasets, read-heavy analytical queries.

Storage tiers

Storage is commonly organized into tiers (hot, warm, and cold) that trade access speed against cost: frequently accessed data lives on faster, more expensive media, while rarely accessed data is moved to cheaper, slower storage.

How DBs store data

A Database Management System (DBMS) facilitates interaction with databases and consists of a transport system, query processor, execution engine, and storage engine. The storage engine is responsible for physically storing data on disk, handling tasks like serialization, arrangement, and indexing. Modern storage engines are optimized for SSDs and can manage complex data structures.

For large-scale data retrieval, using indexes can significantly improve query performance. For example, querying a relational database for rows based on country can be sped up by creating an index, allowing for binary search instead of scanning the entire dataset. This reduces time complexity from O(N) to O(log N).
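
As a small illustration (SQLite here, though the idea carries over to any relational database; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ana", "BG"), (2, "Bob", "US"), (3, "Cho", "KR")],
)

# Without an index, this query scans the whole table: O(N).
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE country = 'BG'"
).fetchall())

# With an index on country, the engine can do a tree search instead: O(log N).
conn.execute("CREATE INDEX idx_users_country ON users (country)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE country = 'BG'"
).fetchall())
```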

In-memory databases, such as Memcached and Redis, use RAM for ultra-fast data retrieval. However, they are volatile and should not be used for persistent storage. They are ideal for caching and applications like real-time bidding or gaming leaderboards. Redis, in particular, supports more complex data types and has persistence mechanisms.
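
For example, a gaming leaderboard maps naturally onto a Redis sorted set. A minimal sketch with the redis-py client, assuming a Redis server on localhost:

```python
import redis

# Assumes a Redis server running on localhost:6379 (redis-py client).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A sorted set keeps members ordered by score, which maps naturally
# onto a leaderboard.
r.zadd("leaderboard", {"ana": 3200, "bob": 2750, "cho": 4100})
r.zincrby("leaderboard", 150, "bob")  # bob scores 150 more points

# Top 3 players, highest score first.
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```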

Row vs Column storage

  • Row-oriented storage suits transactional workloads where individual records must be read and written with low latency.
  • Column-oriented storage suits analytical workloads where aggregate operations are applied over whole columns (a small sketch contrasting the two layouts follows this list).
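
A toy pure-Python illustration of why the layout matters: summing one field scans a single contiguous list in the columnar layout, but hops through every record in the row layout.

```python
import random
import time

n = 1_000_000
amounts = [random.random() for _ in range(n)]

# Row-oriented: each record is stored together.
rows = [{"id": i, "country": "BG", "amount": amounts[i]} for i in range(n)]

# Column-oriented: each column is stored together.
columns = {"id": list(range(n)), "country": ["BG"] * n, "amount": amounts}

t0 = time.perf_counter()
total_rows = sum(r["amount"] for r in rows)  # hops through every record
t1 = time.perf_counter()
total_cols = sum(columns["amount"])          # scans one contiguous list
t2 = time.perf_counter()

print(f"row layout:    {t1 - t0:.3f}s")
print(f"column layout: {t2 - t1:.3f}s")
# Real engines gain far more than this toy shows, through vectorized
# execution and per-column compression.
```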

Parquet and ORC (Optimized Row Columnar) are file formats that take a hybrid approach, aiming to get the best of both worlds: rows are partitioned into row groups, and each row group is stored in a column-wise format.
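
A minimal pyarrow sketch of this hybrid layout (the file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["BG", "US", "KR", "BG"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Rows are partitioned into row groups; within each group, values are
# laid out column by column (tiny groups here just for illustration).
pq.write_table(table, "sales.parquet", row_group_size=2)

# Reading a single column skips the bytes of all other columns.
print(pq.read_table("sales.parquet", columns=["amount"]).to_pydict())
# {'amount': [10.0, 20.0, 30.0, 40.0]}
```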

Lab

The lab for this week was about Neo4j and writing Cypher queries.
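
The post doesn't reproduce the lab's queries, but as a flavor of Cypher, here is a hedged sketch using the official neo4j Python driver (the connection details and graph schema are placeholders, not the lab's actual setup):

```python
from neo4j import GraphDatabase

# Placeholder connection details; the lab's actual setup differs.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Create two people and a friendship between them (idempotent via MERGE).
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Ana", b="Bob",
    )
    # Traverse the relationship back out.
    result = session.run(
        "MATCH (p:Person)-[:FRIENDS_WITH]->(f:Person) "
        "RETURN p.name AS person, f.name AS friend"
    )
    for record in result:
        print(record["person"], "->", record["friend"])

driver.close()
```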


That is all for today!

See you tomorrow :)