(Day 327) More of the Designing Data-Intensive Applications book

Ivan Ivanov · November 23, 2024

data-eng theory applying-knowledge

Hello :) Today is Day 327!

A quick summary of today:

Designing Data-Intensive Applications: Chapter 2
some more sql practice

Designing Data-Intensive Applications: Chapter 2. Data Models and Query Languages

I read and listened at the same time the book + audio version (this might be a viable learning strategy 😆), so here is a summary of ch. 2 ~

Relational Model vs. Document Model: A Comparative Overview

Relational model

origin: introduced by edgar codd in 1970, gaining widespread adoption through rdbms implementations by the 1980s
structure: organizes data in tables (rows and columns) with relationships defined via foreign keys
strengths:
- excellent support for joins and complex queries
- enforces strict schemas, ensuring data consistency
- broad adoption and a mature ecosystem
use cases: ideal for transaction processing, batch processing, and general-purpose web applications
challenges:
- impedance mismatch with object-oriented programming
- scalability concerns for massive datasets or high write loads

Document model

origin: emerged during the nosql movement in the 2010s to address scalability and flexibility challenges
structure: stores data in json, xml, or binary formats, allowing nested and flexible schemas
strengths:
- schema flexibility supports dynamic, evolving data structures
- improved performance for read-heavy workloads due to data locality
- simplifies application code for hierarchical or document-like data
use cases: frequently used in web apps, content management systems, and applications with deeply nested data
challenges:
- limited support for joins and many-to-many relationships
- lack of schema enforcement can result in inconsistencies
- performance degrades with large, frequently updated documents

Key Concepts

object-relational mismatch: discrepancy between object-oriented programming and relational models, often addressed with ORM frameworks
normalization: ensures consistency and reduces redundancy but can complicate queries
joins: relational databases excel here, while document databases often struggle
schema flexibility: document databases allow schema-less or schema-on-read approaches, accommodating rapidly changing data needs

Convergence of models

relational databases adopting document features: many RDBMS now support JSON/XML, enabling hybrid approaches
document databases adding relational capabilities: some incorporate joins and relational features
future outlook: the convergence trend suggests databases will increasingly handle both structured and unstructured data seamlessly

Choosing the right model should align with the application’s needs, balancing scalability, flexibility, and query complexity.

Leveraging multiple database types to utilize their strengths in different parts of an application ensures optimal performance and flexibility.

Query languages for data

When relational databases emerged, SQL introduced a declarative approach to querying data, in contrast to imperative methods used by systems like IMS and CODASYL.

Key differences between Declarative and Imperative queries

Imperative programming involves specifying the exact steps to achieve a result. For example, filtering a list in JavaScript would require loops and conditionals
Declarative programming, as seen in SQL, focuses on describing the desired outcome without detailing how to achieve it. The database handles the execution details

Declarative queries offer significant benefits:

conciseness and ease of understanding
independence from implementation details, enabling performance optimizations
support for parallel execution, leveraging modern multi-core processors effectively

Declarative Queries Beyond Databases

Declarative principles extend beyond databases to web technologies like CSS. Styling a web page declaratively with CSS is simpler and more efficient compared to manipulating the DOM imperatively with JavaScript. For example:

a CSS rule applies and updates styles dynamically as elements change
imperative JavaScript styling can result in errors, such as failing to update styles when conditions change

MapReduce and mixed paradigms

MapReduce represents a blend of declarative and imperative approaches. It is widely used for distributed data processing, including in NoSQL databases like MongoDB.

Declarative elements: filters (e.g. family: "Sharks") specify which data to process
Imperative elements: custom JavaScript functions define the map and reduce operations

Example MapReduce Workflow:

Mapping: emits key-value pairs for relevant data points
Reducing: aggregates values for each key, such as summing the counts of observed shark sightings by month

Challenges with MapReduce include complexity in coordinating map and reduce functions and limited query optimization opportunities. To address this, MongoDB introduced the Aggregation Pipeline, a declarative alternative that simplifies query construction while maintaining flexibility.

The progression from MapReduce to declarative aggregation pipelines demonstrates how declarative languages often evolve to balance expressiveness and usability.

Graph-like data models

Graph databases excel at handling complex many-to-many relationships, making them ideal for scenarios like social networks, web links, or road networks where relational databases struggle.

Key Features

structure: data is represented as vertices (nodes) connected by edges (relationships), mimicking real-world connections
schema flexibility: graph databases allow for dynamic connections without rigid schema constraints, unlike relational databases
efficient traversal: they support seamless navigation through data, making them efficient for analyzing interconnected relationships

Graph Database Models

Property graph model:
- stores properties as key-value pairs on vertices and edges
- examples: Neo4j, Titan, InfiniteGraph
Triple-store model:
- stores data as subject-predicate-object triples, akin to RDF
- examples: Datomic, AllegroGraph

Query Languages

Cypher: intuitive, arrow-based syntax for property graphs (Neo4j)
SPARQL: concise querying for triple-stores, heavily used in RDF
Datalog: rule-based, logic-oriented querying for complex derivations

Comparison with CODASYL

Graph databases improve upon the network model (CODASYL) by:

offering schema flexibility for connections
enabling direct vertex access via unique IDs
supporting unordered relationships, simplifying data handling
utilizing declarative query languages for easier queries

RDF and the Semantic Web

RDF triples underpin the semantic web, aiming for a machine-readable internet
despite potential for data integration across sites, its adoption remains limited

Overall, graph databases shine when dealing with highly interconnected data that demands efficient traversal and dynamic schema adaptability, making them invaluable for modern, relationship-centric applications.

Zach Wilson Q&A stream

I saw that Zach did a live Q&A stream. I did not join as it was 5am or something like that for me haha but thankfully the vod is on youtube. My main takeaway is related to showcasing DE projects and the questions we need to answer when building one:

whats the impact of the project? how does it change lives?
what is the robustness of the project? how are you going to improve it?
what is the quality of the project? are we doing quality checks?tests? is it written with code that’s readable? is it documented?
proof - is it visible?

Completed the medium practice SQL questions as well

Here is the link to the practice Qs. Today I finished all medium ones as well and only the hard ones are left. The main things of the medium Qs were window functions, group bys, CTEs and joins.

That is all for today!

See you tomorrow :)