(Day 327) More of the Designing Data-Intensive Applications book

Ivan Ivanov · November 23, 2024

Hello :) Today is Day 327!

A quick summary of today:

  • Designing Data-Intensive Applications: Chapter 2
  • some more sql practice

Designing Data-Intensive Applications: Chapter 2. Data Models and Query Languages

I read and listened at the same time the book + audio version (this might be a viable learning strategy 😆), so here is a summary of ch. 2 ~

Relational Model vs. Document Model: A Comparative Overview

Relational model

  • origin: introduced by edgar codd in 1970, gaining widespread adoption through rdbms implementations by the 1980s
  • structure: organizes data in tables (rows and columns) with relationships defined via foreign keys
  • strengths:
    • excellent support for joins and complex queries
    • enforces strict schemas, ensuring data consistency
    • broad adoption and a mature ecosystem
  • use cases: ideal for transaction processing, batch processing, and general-purpose web applications
  • challenges:
    • impedance mismatch with object-oriented programming
    • scalability concerns for massive datasets or high write loads

Document model

  • origin: emerged during the nosql movement in the 2010s to address scalability and flexibility challenges
  • structure: stores data in json, xml, or binary formats, allowing nested and flexible schemas
  • strengths:
    • schema flexibility supports dynamic, evolving data structures
    • improved performance for read-heavy workloads due to data locality
    • simplifies application code for hierarchical or document-like data
  • use cases: frequently used in web apps, content management systems, and applications with deeply nested data
  • challenges:
    • limited support for joins and many-to-many relationships
    • lack of schema enforcement can result in inconsistencies
    • performance degrades with large, frequently updated documents

Key Concepts

  • object-relational mismatch: discrepancy between object-oriented programming and relational models, often addressed with ORM frameworks
  • normalization: ensures consistency and reduces redundancy but can complicate queries
  • joins: relational databases excel here, while document databases often struggle
  • schema flexibility: document databases allow schema-less or schema-on-read approaches, accommodating rapidly changing data needs

Convergence of models

  • relational databases adopting document features: many RDBMS now support JSON/XML, enabling hybrid approaches
  • document databases adding relational capabilities: some incorporate joins and relational features
  • future outlook: the convergence trend suggests databases will increasingly handle both structured and unstructured data seamlessly

Choosing the right model should align with the application’s needs, balancing scalability, flexibility, and query complexity.

Leveraging multiple database types to utilize their strengths in different parts of an application ensures optimal performance and flexibility.

Query languages for data

When relational databases emerged, SQL introduced a declarative approach to querying data, in contrast to imperative methods used by systems like IMS and CODASYL.

Key differences between Declarative and Imperative queries

  • Imperative programming involves specifying the exact steps to achieve a result. For example, filtering a list in JavaScript would require loops and conditionals
  • Declarative programming, as seen in SQL, focuses on describing the desired outcome without detailing how to achieve it. The database handles the execution details

Declarative queries offer significant benefits:

  • conciseness and ease of understanding
  • independence from implementation details, enabling performance optimizations
  • support for parallel execution, leveraging modern multi-core processors effectively

Declarative Queries Beyond Databases

Declarative principles extend beyond databases to web technologies like CSS. Styling a web page declaratively with CSS is simpler and more efficient compared to manipulating the DOM imperatively with JavaScript. For example:

  • a CSS rule applies and updates styles dynamically as elements change
  • imperative JavaScript styling can result in errors, such as failing to update styles when conditions change

MapReduce and mixed paradigms

MapReduce represents a blend of declarative and imperative approaches. It is widely used for distributed data processing, including in NoSQL databases like MongoDB.

  • Declarative elements: filters (e.g. family: "Sharks") specify which data to process
  • Imperative elements: custom JavaScript functions define the map and reduce operations

Example MapReduce Workflow:

  1. Mapping: emits key-value pairs for relevant data points
  2. Reducing: aggregates values for each key, such as summing the counts of observed shark sightings by month

Challenges with MapReduce include complexity in coordinating map and reduce functions and limited query optimization opportunities. To address this, MongoDB introduced the Aggregation Pipeline, a declarative alternative that simplifies query construction while maintaining flexibility.

The progression from MapReduce to declarative aggregation pipelines demonstrates how declarative languages often evolve to balance expressiveness and usability.

Graph-like data models

Graph databases excel at handling complex many-to-many relationships, making them ideal for scenarios like social networks, web links, or road networks where relational databases struggle.

Key Features

  • structure: data is represented as vertices (nodes) connected by edges (relationships), mimicking real-world connections
  • schema flexibility: graph databases allow for dynamic connections without rigid schema constraints, unlike relational databases
  • efficient traversal: they support seamless navigation through data, making them efficient for analyzing interconnected relationships

Graph Database Models

  1. Property graph model:
    • stores properties as key-value pairs on vertices and edges
    • examples: Neo4j, Titan, InfiniteGraph
  2. Triple-store model:
    • stores data as subject-predicate-object triples, akin to RDF
    • examples: Datomic, AllegroGraph

Query Languages

  • Cypher: intuitive, arrow-based syntax for property graphs (Neo4j)
  • SPARQL: concise querying for triple-stores, heavily used in RDF
  • Datalog: rule-based, logic-oriented querying for complex derivations

Comparison with CODASYL

Graph databases improve upon the network model (CODASYL) by:

  • offering schema flexibility for connections
  • enabling direct vertex access via unique IDs
  • supporting unordered relationships, simplifying data handling
  • utilizing declarative query languages for easier queries

RDF and the Semantic Web

  • RDF triples underpin the semantic web, aiming for a machine-readable internet
  • despite potential for data integration across sites, its adoption remains limited

Overall, graph databases shine when dealing with highly interconnected data that demands efficient traversal and dynamic schema adaptability, making them invaluable for modern, relationship-centric applications.

Zach Wilson Q&A stream

I saw that Zach did a live Q&A stream. I did not join as it was 5am or something like that for me haha but thankfully the vod is on youtube. My main takeaway is related to showcasing DE projects and the questions we need to answer when building one:

  1. whats the impact of the project? how does it change lives?
  2. what is the robustness of the project? how are you going to improve it?
  3. what is the quality of the project? are we doing quality checks?tests? is it written with code that’s readable? is it documented?
  4. proof - is it visible?

Completed the medium practice SQL questions as well

Here is the link to the practice Qs. Today I finished all medium ones as well and only the hard ones are left. The main things of the medium Qs were window functions, group bys, CTEs and joins.


That is all for today!

See you tomorrow :)