Hello :) Today is Day 225!
A quick summary of today:
- read a bit more of the Fundamentals of Data Eng book
- started the finance voice assistant app
Fundamentals of Data Engineering Chapter 5: Data Generation in Source Systems
Source Systems: Main Ideas
Files and Unstructured Data
- Files: Sequences of bytes stored on disk, used to store parameters, events, logs, images, and audio.
- Data Exchange: Files are a universal medium; common formats include Excel, CSV, TXT, JSON, and XML.
- Structure: Files can be structured (Excel, CSV), semistructured (JSON, XML), or unstructured (TXT).
APIs
- APIs: A standard method for data exchange between systems; they simplify exchange but can still add complexity to data ingestion.
- Maintenance: Data engineers often need to manage custom API connections.
Application Databases (OLTP Systems)
- OLTP Systems: High-rate read/write databases, supporting low latency and high concurrency, used in applications like banking.
- ACID Properties: Ensure atomicity, consistency, isolation, and durability, simplifying application development but not always necessary for performance.
- Atomic Transactions: A set of operations committed as a unit, crucial in applications like banking.
OLTP and Analytics
- Challenges: Running analytics on OLTP systems can degrade performance; hybrid systems combining transactions and analytics are emerging.
Online Analytical Processing Systems (OLAP)
- OLAP Systems: Designed for large-scale analytics queries, not efficient for individual lookups.
- Usage: Often source systems for analytics or integrated in workflows like reverse ETL.
Change Data Capture (CDC)
- CDC: Method for extracting database change events (insert, update, delete) for real-time replication or event streaming.
Logs
- Logs: Capture event information in systems, valuable for data analysis, ML, and automation.
- Sources: Include operating systems, applications, servers, containers, networks, and IoT devices. Logs track the “who,” “what,” and “when” of events.
CRUD
CRUD stands for Create, Read, Update, and Delete, and represents the four essential operations used in managing persistent storage within programming. This pattern is foundational for storing application state in a database and is ubiquitous in APIs and databases. The CRUD operations ensure that data can be created, read, updated, and eventually deleted as needed, providing a basic framework for data interaction.
A common use of CRUD can be seen in web applications, where CRUD operations underpin RESTful HTTP requests, facilitating the storage and retrieval of data from databases. For example, when an application requires the creation of a new record, a corresponding Create operation is performed, followed by Read operations to retrieve the data, Update operations to modify it, and finally, Delete operations to remove the data if necessary.
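To make this concrete, here is a minimal sketch of the four CRUD operations against a local SQLite table (the `users` table and its columns are just placeholders I made up for illustration):

```python
import sqlite3

# In-memory SQLite database; the "users" table is a placeholder schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Create
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Ada", "ada@example.com"))

# Read
row = conn.execute("SELECT id, name, email FROM users WHERE name = ?", ("Ada",)).fetchone()

# Update
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("ada@newmail.com", row[0]))

# Delete
conn.execute("DELETE FROM users WHERE id = ?", (row[0],))

conn.commit()
conn.close()
```

In a REST API, these four operations typically map onto POST, GET, PUT/PATCH, and DELETE requests.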
Insert-Only
The Insert-Only pattern is a database operation strategy where instead of updating records, new records are inserted with a timestamp indicating when they were created. This pattern is beneficial for maintaining historical data directly in a table. For example, if a customer changes their address, rather than updating the existing record, a new record is inserted with the new address and a timestamp. To find the current address, the latest record for that customer ID would be retrieved.
This pattern effectively maintains a log within the table, making it ideal for applications that require access to historical data, such as banking systems that need to track changes over time. However, Insert-Only can lead to large table sizes, especially when data changes frequently. To mitigate this, records may be purged based on specific criteria, such as a sunset date, or limited to a maximum number of versions.
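A small sketch of the Insert-Only pattern, continuing the SQLite example (table and column names are my own placeholders): instead of updating the address, every change becomes a new row, and the current value is just the latest row per customer.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_address (customer_id INTEGER, address TEXT, recorded_at TEXT)"
)

def record_address(customer_id: int, address: str) -> None:
    # Never UPDATE; always INSERT a new version with a timestamp.
    conn.execute(
        "INSERT INTO customer_address VALUES (?, ?, ?)",
        (customer_id, address, datetime.now(timezone.utc).isoformat()),
    )

record_address(42, "1 Old Street")
record_address(42, "2 New Avenue")

# The current address is simply the most recent row for that customer ID.
current = conn.execute(
    "SELECT address FROM customer_address WHERE customer_id = ? "
    "ORDER BY recorded_at DESC LIMIT 1",
    (42,),
).fetchone()
print(current[0])  # 2 New Avenue
```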
Messages and Streams
In event-driven architecture, two key concepts are Message Queues and Streaming Platforms. Although these terms are sometimes used interchangeably, they represent different mechanisms for handling data. A message is a unit of data sent from one system to another, often through a message queue. Once delivered, the message is typically removed from the queue. This approach is common in systems that require discrete, one-time communication between components.
For example, in an IoT system, a device might send a temperature reading to a message queue, which is then processed to determine whether to adjust a thermostat. Once the action is taken, the message is removed.
In contrast, a stream is an append-only log of event records. Events in a stream are accumulated in a sequence, often ordered by time or ID. Streams are used when tracking a series of events is important, and data is retained over an extended period. Streaming platforms enable complex operations on this data, such as aggregating multiple events or replaying data from a specific point in time.
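A toy illustration of the difference (not a real broker, just in-memory stand-ins): a queue hands each message out once and forgets it, while a stream is an append-only log that consumers read by offset and can replay.

```python
from collections import deque

# --- Message queue: a message is delivered once and then removed ---
queue = deque()
queue.append({"device_id": "sensor-1", "temp_c": 22.5})
msg = queue.popleft()          # delivered to a subscriber
# after processing, the message no longer exists in the queue

# --- Stream: an append-only log; consumers track their own offset ---
stream = []                    # records stay in the log
stream.append({"device_id": "sensor-1", "temp_c": 22.5})
stream.append({"device_id": "sensor-1", "temp_c": 23.1})

consumer_offset = 0            # each consumer remembers where it is
while consumer_offset < len(stream):
    event = stream[consumer_offset]
    consumer_offset += 1       # the record remains, so it can be replayed later

consumer_offset = 0            # replay from the beginning at any time
```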
Types of Time
When dealing with streaming data, understanding the different types of time is crucial. These include:
- Event Time: The time when the event is generated within the source system.
- Ingestion Time: The time when the event is ingested into a message queue, database, or another storage system.
- Process Time: The time when the event is processed, usually after ingestion.
- Processing Time: The duration taken to process the event.
Tracking these times is essential for accurate data pipeline monitoring and performance analysis. Each stage of data movement should be logged with timestamps to ensure transparency and facilitate debugging or optimization efforts.
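A small sketch of what stamping each stage might look like (the field names here are my own, not from the book):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
import time

def utc_now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class TrackedEvent:
    payload: dict
    event_time: datetime                         # when the event happened in the source system
    ingestion_time: Optional[datetime] = None    # when it landed in the queue/storage
    process_time: Optional[datetime] = None      # when processing of the event started
    processing_seconds: Optional[float] = None   # how long processing took

def ingest(event: TrackedEvent) -> TrackedEvent:
    event.ingestion_time = utc_now()
    return event

def process(event: TrackedEvent) -> TrackedEvent:
    event.process_time = utc_now()
    start = time.monotonic()
    # ... actual processing work would go here ...
    event.processing_seconds = time.monotonic() - start
    return event

evt = process(ingest(TrackedEvent(payload={"temp_c": 22.5}, event_time=utc_now())))
```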
Source System Practical Details
Interacting with modern source systems requires a thorough understanding of various technologies like databases and APIs. This section provides critical knowledge for data engineers, focusing on the types of databases you might encounter and the major considerations for working with them.
Databases
Databases are categorized into Relational and Nonrelational types, each with its own use cases and architectural considerations.
- Relational Databases (RDBMS): These are widely used for storing structured data with relationships between different data entities. They support ACID properties (Atomicity, Consistency, Isolation, Durability) and are ideal for applications requiring transaction consistency and data integrity. Relational databases are typically normalized to reduce data redundancy and support complex queries through SQL.
- Nonrelational Databases (NoSQL): These databases are designed to handle a wide variety of data models, including key-value, document, wide-column, graph, and others. NoSQL databases often trade off some of the ACID properties to achieve greater scalability and performance. They are ideal for applications with large-scale, unstructured, or semi-structured data, where flexibility and speed are prioritized.
Key-Value Stores
Key-value databases are a type of NoSQL database optimized for quick retrieval of values using a unique key. These databases are akin to hash maps or dictionaries in programming languages and are used in scenarios requiring fast access to large volumes of data, such as caching sessions for web applications.
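A dict-based stand-in for the session-caching use case (in practice this would be a system like Redis, but the key-to-value idea with a time-to-live is the same):

```python
import time

class SessionCache:
    """A toy key-value store: unique key -> value, with a time-to-live."""

    def __init__(self, ttl_seconds: float = 1800):
        self._store = {}
        self._ttl = ttl_seconds

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (value, time.monotonic() + self._ttl)

    def get(self, key: str):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:   # expired sessions behave as missing
            del self._store[key]
            return None
        return value

cache = SessionCache(ttl_seconds=60)
cache.set("session:abc123", {"user_id": 42, "cart": ["sku-1"]})
print(cache.get("session:abc123"))
```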
Document Stores
Document stores are another type of key-value database where the value is a document, often in JSON format. These databases are highly flexible, allowing for the storage of complex nested data structures. They do not enforce a strict schema, which makes them adaptable but also potentially more challenging to manage.
Wide-Column Databases
Wide-column databases are designed for high transaction rates and massive scalability. They store data in columns rather than rows, making them suitable for use cases like real-time analytics and handling large datasets with low latency.
Graph Databases
Graph databases store data in a graph structure, with nodes representing entities and edges representing relationships. These databases are particularly effective for applications that need to analyze connections between data points, such as social networks or recommendation engines.
Third-Party Data Sources
With the widespread adoption of technology, companies across industries, including government agencies, are increasingly offering their data to customers and users. This data can be a part of their service offerings or available as a separate subscription. For example, the US Bureau of Labor Statistics provides labor market data, NASA shares research data, and Facebook offers data to businesses advertising on its platform.
The strategic reason behind making data available is that data is inherently “sticky,” meaning it encourages greater user engagement and integration. This creates a flywheel effect—users who integrate third-party data into their systems tend to generate more data, which in turn can be further integrated and utilized, leading to increased user adoption and engagement. This cycle leads to a near-infinite supply of third-party data sources.
Access to third-party data is typically facilitated through APIs, cloud platforms, or direct data downloads. APIs are particularly popular for enabling deep integrations, allowing customers to push and pull data seamlessly. For instance, Customer Relationship Management (CRM) systems often offer APIs that allow users to integrate CRM data into their own applications. A common workflow involves retrieving data from a CRM, processing it through a customer scoring model, and then using reverse ETL to feed the refined data back into the CRM, thereby helping sales teams target more qualified leads.
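Roughly what that pull–score–push loop could look like as code. This is only a sketch: the CRM endpoints, the API key, and the scoring rule are all hypothetical placeholders, not a real API.

```python
import requests

CRM_API = "https://crm.example.com/api"   # hypothetical CRM endpoint
API_KEY = "YOUR_API_KEY"                  # hypothetical credential
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def pull_contacts() -> list:
    # Pull raw contact data out of the CRM.
    resp = requests.get(f"{CRM_API}/contacts", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

def score(contact: dict) -> float:
    # Stand-in for a real customer-scoring model.
    return 1.0 if contact.get("recent_purchases", 0) > 3 else 0.1

def push_scores(scored: list) -> None:
    # "Reverse ETL": write the derived scores back into the CRM.
    resp = requests.post(f"{CRM_API}/contacts/scores", headers=HEADERS, json=scored)
    resp.raise_for_status()

contacts = pull_contacts()
push_scores([{"id": c["id"], "lead_score": score(c)} for c in contacts])
```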
Message Queues and Event-Streaming Platforms
Event-driven architectures are becoming more prevalent in software development, driven by the ease of managing message queues and event-streaming platforms in cloud environments. These architectures are particularly well-suited for data applications that require real-time analytics because events can trigger application workflows and support near real-time data processing.
Streaming data, like messages and streams, often crosses multiple stages of the data engineering lifecycle. Unlike relational databases directly tied to applications, streaming data can blur the lines between different lifecycle stages due to its transient nature. For example, an event-streaming platform can serve as both a source system for message passing in event-driven applications and as a tool for real-time data processing during ingestion and transformation.
Message Queues
Message queues are systems that enable asynchronous data transmission between discrete systems using a publish-subscribe model. In this setup, data is published to a message queue and delivered to one or more subscribers, who acknowledge receipt, thereby removing the message from the queue.
Message queues are vital for decoupling applications in microservices architectures. They buffer messages to handle transient load spikes and ensure message durability through distributed architectures with replication. Key considerations when working with message queues include:
- Message Ordering and Delivery: Message queues typically use a First-In, First-Out (FIFO) model, but in distributed systems, strict FIFO ordering can be challenging to maintain. For instance, Amazon SQS offers both standard and FIFO queues, with the latter providing stronger ordering guarantees at the cost of additional overhead.
- Delivery Frequency: Messages can be delivered “exactly once” or “at least once.” Exactly-once delivery ensures the message is not duplicated, while at-least-once delivery allows for potential redundancy, which can be acceptable in certain scenarios. Idempotent systems, which produce the same result regardless of how many times a message is processed, are ideal for handling delivery complexities (see the sketch after this list).
- Scalability: Modern message queues are horizontally scalable, operating across multiple servers. This scalability allows them to dynamically adjust to load variations, but it also introduces challenges such as message duplication and fuzzy ordering.
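Building on the idempotency point above, here is a minimal sketch of a consumer that tolerates at-least-once delivery by deduplicating on a message ID (the assumption is that every message carries a unique `id` field):

```python
processed_ids = set()   # in production this would be durable (a database table, Redis, etc.)

def handle(message: dict) -> None:
    """Process a message at most once, even if the queue delivers it again."""
    msg_id = message["id"]        # assumes each message carries a unique ID
    if msg_id in processed_ids:
        return                    # duplicate delivery: safely ignored
    # ... the actual side effect goes here (write a row, call an API, ...) ...
    processed_ids.add(msg_id)

# At-least-once delivery may hand us the same message twice; the end state is identical.
handle({"id": "evt-1", "temp_c": 22.5})
handle({"id": "evt-1", "temp_c": 22.5})
```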
Event-Streaming Platforms
Event-streaming platforms extend the functionality of message queues by providing a mechanism to ingest and process data in an ordered log of records. Unlike message queues, which primarily route messages, event-streaming platforms retain data for a set period, enabling the replay of past events.
In an event-streaming platform, events are typically composed of a key, value, and timestamp. For instance, an e-commerce order event might include a unique order ID (key), product details (value), and the time of purchase (timestamp).
Key features of event-streaming platforms include:
- Topics: Producers stream events to topics, which are collections of related events. For example, a topic might contain customer orders, and different consumers, such as fulfillment and marketing departments, can subscribe to this topic to trigger actions or analyze data.
- Stream Partitions: Streams can be divided into multiple partitions, allowing for parallel processing and higher throughput. The partition key determines how messages are distributed across partitions. For example, using a device ID as a partition key ensures that all messages from a particular device are processed together. However, it’s crucial to avoid “hotspotting,” where some partitions receive a disproportionate number of messages (a small partition-key sketch follows this list).
- Fault Tolerance and Resilience: Event-streaming platforms are distributed, meaning that if one node fails, another can take over, ensuring data continuity. This resilience makes them suitable for applications requiring reliable data production, storage, and ingestion.
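The partition-key idea in a few lines: hash the key and take it modulo the partition count, so every message with the same key (e.g., the same device ID) lands on the same partition. This is a simplified sketch of how Kafka-style platforms assign partitions, not any platform’s exact algorithm; hotspotting is what happens when one key dominates the traffic.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the partition key and map it onto a partition; all messages with the
    # same key land on the same partition, preserving their relative order.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for device_id in ["device-1", "device-2", "device-3", "device-1"]:
    print(device_id, "-> partition", partition_for(device_id))
```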
Fundamentals of Data Engineering Chapter 6: Storage
Storage Abstractions
- Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale. It can handle raw data in its native format until it’s needed.
- Data Lakehouse: A hybrid architectural model that combines the capabilities of data lakes and data warehouses. It supports both large-scale data storage and advanced analytics.
- Data Platform: An integrated suite of tools and technologies that manage the data lifecycle, from ingestion to processing and analysis. It includes the infrastructure to handle data storage, governance, and access.
- Cloud Data Warehouse: A type of data warehouse that is hosted in the cloud. It allows for scalable, on-demand data storage and computing power to analyze large volumes of data.
Storage Systems
- HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines in a large cluster. It is designed to handle large volumes of data with fault tolerance and scalability.
- RDBMS (Relational Database Management System): A database system that organizes data into tables and uses SQL (Structured Query Language) for database management and querying.
- Cache/Memory-based Storage: Storage that resides in memory (RAM) to provide fast access to frequently accessed data, reducing latency compared to disk-based storage.
- Object Storage: A storage architecture that manages data as objects rather than as a file hierarchy. It is commonly used for storing unstructured data like images, videos, and backups.
- Streaming Storage: A storage system designed to handle real-time data streams, enabling immediate processing and analysis of data as it is generated.
Raw Ingredients
- HDD (Hard Disk Drive): A traditional storage device that uses spinning disks to read/write data. It is known for its large storage capacity and relatively lower cost but slower speed compared to SSDs.
- SSD (Solid State Drive): A storage device that uses flash memory to store data, providing faster read/write speeds than HDDs. SSDs have no moving parts, making them more reliable and quicker.
- RAM (Random Access Memory): A form of volatile memory used to store working data and machine code currently in use. It offers high-speed data access but is temporary, losing its contents when power is off.
- Networking: The interconnection of various hardware devices and systems to allow data transfer and communication. It is essential for enabling distributed computing and storage systems to function cohesively.
- Serialization: The process of converting data into a format that can be easily stored or transmitted and then reconstructed. It is crucial for data storage, transfer, and communication across systems.
- Compression: The technique used to reduce the size of data for storage or transmission. It can be lossless or lossy, depending on whether data integrity is preserved.
- CPU (Central Processing Unit): The primary component of a computer that performs most of the processing inside a system. It executes instructions and processes data, playing a critical role in data handling and storage operations.
Data Storage Lifecycle and Data Retention
Fundamentals of Data Engineering Chapter 7: Ingestion
What Is Data Ingestion?
Data ingestion is the process of moving data from one place to another. In the data engineering lifecycle, it implies moving data from source systems into storage, with ingestion as an intermediate step.
Key Engineering Considerations for the Ingestion Phase
- What’s the use case for the data I’m ingesting?
- Can I reuse this data and avoid ingesting multiple versions of the same dataset?
- Where is the data going? What’s the destination?
- How often should the data be updated from the source?
- What is the expected data volume?
- What format is the data in? Can downstream storage and transformation accept this format?
- Is the source data in good shape for immediate downstream use? That is, is the data of good quality? What post-processing is required to serve it?
- What are data-quality risks (e.g., could bot traffic to a website contaminate the data)?
- Does the data require in-flight processing for downstream ingestion if the data is from a streaming source?
Regardless of how often the data is ingested, you’ll want to consider these factors when designing your ingestion architecture:
- bounded versus unbounded
- frequency
- sync vs async
- serialization and deserialization
- throughput and scalability
- reliability and durability
- payload
- push vs pull vs poll patterns
Bounded vs Unbounded data
Unbounded data is data as it exists in reality, as events happen, either sporadically or continuously, ongoing and flowing. Bounded data is a convenient way of bucketing data across some sort of boundary, such as time.
Frequency
One of the critical decisions that data engineers must make in designing data-ingestion processes is the data-ingestion frequency. Ingestion processes can be batch, micro-batch, or real-time.
Synchronous Versus Asynchronous Ingestion
With asynchronous ingestion, dependencies can operate at the level of individual events, much as they would in a software backend built from microservices. Individual events become available in storage as soon as they are ingested.
Serialization and Deserialization
Moving data from source to destination involves serialization and deserialization. As a reminder, serialization means encoding the data from a source and preparing data structures for transmission and intermediate storage stages. When ingesting data, ensure that your destination can deserialize the data it receives. We’ve seen data ingested from a source but then sitting inert and unusable in the destination because the data cannot be properly deserialized.
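A minimal JSON round trip as an illustration; real pipelines often use formats like Avro, Parquet, or Protobuf, but the point is the same: the destination has to be able to decode whatever the source encoded, or the data just sits there unusable.

```python
import json
from datetime import datetime, timezone

# Source side: serialize a record into bytes for transmission / intermediate storage.
record = {
    "order_id": 1001,
    "amount": 49.90,
    "created_at": datetime.now(timezone.utc).isoformat(),
}
payload = json.dumps(record).encode("utf-8")

# Destination side: deserialize back into a usable structure.
restored = json.loads(payload.decode("utf-8"))
assert restored["order_id"] == 1001
```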
Throughput and Scalability
In theory, your ingestion should never be a bottleneck. In practice, ingestion bottlenecks are pretty standard. Data throughput and system scalability become critical as your data volumes grow and requirements change.
Another thing to consider is your ability to handle bursty data ingestion. Data generation rarely happens at a constant rate and often ebbs and flows. Built-in buffering is required to collect events during rate spikes to prevent data from getting lost.
Reliability and Durability
Reliability and durability are vital in the ingestion stages of data pipelines. Reliability entails high uptime and proper failover for ingestion systems. Durability entails making sure that data isn’t lost or corrupted.
Payload
A payload is the dataset you’re ingesting; its characteristics include kind, shape, size, schema and data types, and metadata.
Push vs. Pull vs. Poll Patterns
A push strategy involves a source system sending data to a target.
A pull strategy entails a target reading data directly from a source.
Another pattern related to pulling is polling for data. Polling involves periodically checking a data source for any changes. When changes are detected, the destination pulls the data as it would in a regular pull situation.
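A toy polling loop to make the pattern concrete: periodically check the source for rows beyond a watermark, and pull whenever something new shows up. The `fetch_new_rows` function is a stand-in for a real query against a database or API.

```python
import time

last_seen_id = 0   # watermark: the highest ID we've already pulled

def fetch_new_rows(since_id: int) -> list:
    """Stand-in for querying the source, e.g. SELECT * FROM t WHERE id > :since_id."""
    return []      # a real implementation would hit a database or API here

def poll_forever(interval_seconds: float = 60.0) -> None:
    global last_seen_id
    while True:
        rows = fetch_new_rows(last_seen_id)      # check the source for changes
        if rows:                                 # changes detected: pull as usual
            last_seen_id = max(r["id"] for r in rows)
            # ... hand the rows to the ingestion pipeline ...
        time.sleep(interval_seconds)             # wait before checking again
```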
Batch Ingestion Considerations
- snapshot or differential extraction
- file-based export and ingestion
- ETL vs ELT
- inserts, updates, and batch size
- data migration
Message and Stream Ingestion Considerations
- schema evolution
- late-arriving data
- ordering and multiple delivery
- replay
- time to live
- message size
- error handling and dead-letter queues
- consumer pull and push
Finance Voice Assistant
After reading today, I thought about going back to complete one of my project ideas - creating a finance voice assistant.
A working version is available on my GitHub. I ended up using llama index for the pipeline, inspired by this amazing tutorial that showed how to use llama index to query a database using natural language.
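The general shape of that kind of text-to-SQL setup looks roughly like this. This is a sketch, not my exact code: import paths vary between llama-index versions, the database and table names are placeholders, and it assumes an LLM (e.g., OpenAI via an API key) is already configured for llama-index.

```python
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Placeholder database and table; swap in your own schema.
engine = create_engine("sqlite:///finance.db")
sql_database = SQLDatabase(engine, include_tables=["transactions"])

# Natural-language question -> SQL -> answer, handled by the query engine.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["transactions"])
response = query_engine.query("How much did I spend on groceries last month?")
print(response)
```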
It is a bit late now, but I can share some basic pictures:
RAG DAG:
Streamlit UI:
A very cool thing is this tracing app called Phoenix, where we can see the RAG pipeline’s execution and what the LLM sees throughout.
I will document the whole thing tomorrow and maybe make some more improvements.
That is all for today!
See you tomorrow :)