Hello :) Today is Day 327!
A quick summary of today:
- Designing Data-Intensive Applications: Chapter 2
- some more SQL practice
Designing Data-Intensive Applications: Chapter 2. Data Models and Query Languages
I read the book while listening to the audio version at the same time (this might be a viable learning strategy 😆), so here is a summary of ch. 2 ~
Relational Model vs. Document Model: A Comparative Overview
Relational model
- origin: introduced by Edgar Codd in 1970, gaining widespread adoption through RDBMS implementations by the 1980s
- structure: organizes data in tables (rows and columns) with relationships defined via foreign keys
- strengths:
- excellent support for joins and complex queries
- enforces strict schemas, ensuring data consistency
- broad adoption and a mature ecosystem
- use cases: ideal for transaction processing, batch processing, and general-purpose web applications
- challenges:
- impedance mismatch with object-oriented programming
- scalability concerns for massive datasets or high write loads
Document model
- origin: emerged during the NoSQL movement in the 2010s to address scalability and flexibility challenges
- structure: stores data in JSON, XML, or binary formats, allowing nested and flexible schemas
- strengths:
- schema flexibility supports dynamic, evolving data structures
- improved performance for read-heavy workloads due to data locality
- simplifies application code for hierarchical or document-like data
- use cases: frequently used in web apps, content management systems, and applications with deeply nested data
- challenges:
- limited support for joins and many-to-many relationships
- lack of schema enforcement can result in inconsistencies
- performance degrades with large, frequently updated documents
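To make the comparison concrete for myself, here is a minimal sketch of the same data in both models. The profile, table names, and column names are all made up for illustration, and SQLite (via Python's built-in sqlite3 module) just stands in for a relational database.

```python
import json
import sqlite3

# Document model: one self-contained record with the one-to-many data nested inside.
profile_doc = {
    "user_id": 251,
    "name": "Ada",
    "positions": [
        {"title": "Engineer", "org": "Acme"},
        {"title": "Analyst", "org": "Initech"},
    ],
}
# Round-trip through JSON to show the whole profile travels as a single document.
loaded = json.loads(json.dumps(profile_doc))
doc_titles = [p["title"] for p in loaded["positions"]]

# Relational model: the same data split across two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE positions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        title TEXT NOT NULL,
        org TEXT NOT NULL
    );
    INSERT INTO users VALUES (251, 'Ada');
    INSERT INTO positions (user_id, title, org) VALUES
        (251, 'Engineer', 'Acme'),
        (251, 'Analyst', 'Initech');
""")
# Reassembling the profile needs a join across the two tables.
rows = conn.execute(
    """
    SELECT p.title
    FROM users u JOIN positions p ON p.user_id = u.user_id
    WHERE u.user_id = ?
    ORDER BY p.id
    """,
    (251,),
).fetchall()
sql_titles = [row[0] for row in rows]
conn.close()
```

The document version reads everything in one lookup (the locality advantage), while the relational version needs a join but keeps each position as its own row that other tables could reference.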
Key Concepts
- object-relational mismatch: discrepancy between object-oriented programming and relational models, often addressed with ORM frameworks
- normalization: ensures consistency and reduces redundancy but can complicate queries
- joins: relational databases excel here, while document databases often struggle
- schema flexibility: document databases allow schema-less or schema-on-read approaches, accommodating rapidly changing data needs
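Schema-on-read is easier to see in code. This is a small sketch, assuming a hypothetical user document whose format changed from a single name field to separate first_name and last_name fields; the function name is my own.

```python
def first_name(user: dict) -> str:
    """Interpret old- and new-format user documents at read time."""
    if "first_name" not in user and "name" in user:
        # Old-format document written before the field was split:
        # derive the value on the fly instead of migrating stored data.
        return user["name"].split(" ")[0]
    return user["first_name"]


old_doc = {"name": "Ada Lovelace"}
new_doc = {"first_name": "Grace", "last_name": "Hopper"}
names = [first_name(old_doc), first_name(new_doc)]
```

With schema-on-write (the relational approach), the equivalent change would typically be a schema migration that updates the stored rows instead of a branch in application code.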
Convergence of models
- relational databases adopting document features: many RDBMS now support JSON/XML, enabling hybrid approaches
- document databases adding relational capabilities: some incorporate joins and relational features
- future outlook: the convergence trend suggests databases will increasingly handle both structured and unstructured data seamlessly
Choosing the right model should align with the application’s needs, balancing scalability, flexibility, and query complexity.
Using different database types for different parts of an application can play to each one's strengths, at the cost of running and maintaining more than one system.
Query languages for data
When relational databases emerged, SQL introduced a declarative approach to querying data, in contrast to imperative methods used by systems like IMS and CODASYL.
Key differences between Declarative and Imperative queries
- Imperative programming involves specifying the exact steps to achieve a result. For example, filtering a list in JavaScript would require loops and conditionals
- Declarative programming, as seen in SQL, focuses on describing the desired outcome without detailing how to achieve it. The database handles the execution details
Declarative queries offer significant benefits:
- conciseness and ease of understanding
- independence from implementation details, enabling performance optimizations
- support for parallel execution, leveraging modern multi-core processors effectively
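The two styles can be contrasted with a small sketch. The animal data and function names below are made up, and SQLite (through Python's built-in sqlite3 module) stands in for the database.

```python
import sqlite3

animals = [
    ("Great white", "Sharks"),
    ("Dolphin", "Delphinidae"),
    ("Hammerhead", "Sharks"),
]


def sharks_imperative(rows):
    """Imperative: spell out the loop, the condition, and the result list."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result


def sharks_declarative(rows):
    """Declarative: state the condition and let the database decide how to run it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    found = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    ).fetchall()
    conn.close()
    return [row[0] for row in found]
```

The SQL version says nothing about iteration order or which index to use, which is exactly what leaves the database free to optimize or parallelize the query.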
Declarative Queries Beyond Databases
Declarative principles extend beyond databases to web technologies like CSS. Styling a web page declaratively with CSS is simpler and more efficient compared to manipulating the DOM imperatively with JavaScript. For example:
- a CSS rule applies and updates styles dynamically as elements change
- imperative JavaScript styling can result in errors, such as failing to update styles when conditions change
MapReduce and mixed paradigms
MapReduce represents a blend of declarative and imperative approaches. It is widely used for distributed data processing, including in NoSQL databases like MongoDB.
- Declarative elements: filters (e.g. family: "Sharks") specify which data to process
- Imperative elements: custom JavaScript functions define the map and reduce operations
Example MapReduce Workflow:
- Mapping: emits key-value pairs for relevant data points
- Reducing: aggregates values for each key, such as summing the counts of observed shark sightings by month
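The workflow above can be sketched in plain Python. This is not MongoDB's API, just an in-memory imitation of the idea: the observation documents, field names, and the map_reduce helper are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical observation documents.
observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1996-01-03", "family": "Sharks", "num_animals": 2},
    {"date": "1996-01-05", "family": "Whales", "num_animals": 9},
]


def map_fn(doc):
    """Emit a (year-month, count) pair for one document."""
    yield doc["date"][:7], doc["num_animals"]


def reduce_fn(key, values):
    """Combine all values emitted for the same key."""
    return sum(values)


def map_reduce(docs, query, map_fn, reduce_fn):
    """Filter declaratively, then run the imperative map and reduce functions."""
    groups = defaultdict(list)
    for doc in docs:
        # Declarative part: keep only documents matching every field in the filter.
        if all(doc.get(field) == value for field, value in query.items()):
            for key, value in map_fn(doc):
                groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


sightings_per_month = map_reduce(observations, {"family": "Sharks"}, map_fn, reduce_fn)
```

Because map_fn and reduce_fn are arbitrary code, the framework can distribute them but cannot look inside them to optimize, which is the limitation mentioned next.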
Challenges with MapReduce include complexity in coordinating map and reduce functions and limited query optimization opportunities. To address this, MongoDB introduced the Aggregation Pipeline, a declarative alternative that simplifies query construction while maintaining flexibility.
The progression from MapReduce to declarative aggregation pipelines demonstrates how declarative languages often evolve to balance expressiveness and usability.
Graph-like data models
Graph databases excel at handling complex many-to-many relationships, making them ideal for scenarios like social networks, web links, or road networks where relational databases struggle.
Key Features
- structure: data is represented as vertices (nodes) connected by edges (relationships), mimicking real-world connections
- schema flexibility: graph databases allow for dynamic connections without rigid schema constraints, unlike relational databases
- efficient traversal: they support seamless navigation through data, making them efficient for analyzing interconnected relationships
Graph Database Models
- Property graph model:
- stores properties as key-value pairs on vertices and edges
- examples: Neo4j, Titan, InfiniteGraph
- Triple-store model:
- stores data as subject-predicate-object triples, akin to RDF
- examples: Datomic, AllegroGraph
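To see how the two models differ, here is a rough sketch of the same facts stored both ways using plain Python data structures. The people, places, labels, and helper functions are made up for illustration; real graph databases have their own storage formats.

```python
# Property graph: vertices and edges each carry a label and key-value properties.
vertices = {
    "lucy": {"label": "Person", "properties": {"name": "Lucy"}},
    "idaho": {"label": "Location", "properties": {"name": "Idaho", "type": "state"}},
    "usa": {"label": "Location", "properties": {"name": "United States", "type": "country"}},
}
edges = [
    {"tail": "lucy", "head": "idaho", "label": "BORN_IN", "properties": {}},
    {"tail": "idaho", "head": "usa", "label": "WITHIN", "properties": {}},
]


def outgoing(vertex_id, label):
    """Return the vertices reached from vertex_id along edges with this label."""
    return [e["head"] for e in edges if e["tail"] == vertex_id and e["label"] == label]


# Triple store: every fact, property or relationship, is one
# (subject, predicate, object) statement.
triples = {
    ("lucy", "type", "Person"),
    ("lucy", "name", "Lucy"),
    ("lucy", "bornIn", "idaho"),
    ("idaho", "name", "Idaho"),
    ("idaho", "within", "usa"),
    ("usa", "name", "United States"),
}


def objects(subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)
```

In the triple version, a property ("name") and a relationship ("bornIn") look the same; the only difference is whether the object is a plain value or another subject.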
Query Languages
- Cypher: intuitive, arrow-based syntax for property graphs (Neo4j)
- SPARQL: concise querying for triple-stores, heavily used in RDF
- Datalog: rule-based, logic-oriented querying for complex derivations
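What these languages save you from writing is the traversal itself. Below is a sketch of the kind of "follow an edge zero or more times" logic that a graph query can express in a single pattern or rule. The data and function names are hypothetical, and each place is assumed to sit within at most one parent to keep the loop simple.

```python
# Hypothetical edges: person -> birthplace, and place -> enclosing place.
born_in = {"lucy": "idaho", "alain": "beaune"}
within = {
    "idaho": "usa",
    "usa": "north_america",
    "beaune": "burgundy",
    "burgundy": "france",
    "france": "europe",
}


def is_within(location, target):
    """Follow zero or more 'within' edges from location, looking for target."""
    seen = set()
    while location is not None and location not in seen:
        if location == target:
            return True
        seen.add(location)  # guard against accidental cycles in the data
        location = within.get(location)
    return False


def people_born_in(target):
    """Find everyone whose birthplace is the target or lies somewhere inside it."""
    return sorted(p for p, place in born_in.items() if is_within(place, target))
```

The number of hops is not known in advance (a city may be nested in a region, a country, a continent), which is what makes this awkward to express with a fixed number of relational joins.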
Comparison with CODASYL
Graph databases improve upon the network model (CODASYL) by:
- offering schema flexibility for connections
- enabling direct vertex access via unique IDs
- supporting unordered relationships, simplifying data handling
- utilizing declarative query languages for easier queries
RDF and the Semantic Web
- RDF triples underpin the semantic web, aiming for a machine-readable internet
- despite potential for data integration across sites, its adoption remains limited
Overall, graph databases shine when dealing with highly interconnected data that demands efficient traversal and dynamic schema adaptability, making them invaluable for modern, relationship-centric applications.
Zach Wilson Q&A stream
I saw that Zach did a live Q&A stream. I did not join as it was 5am or something like that for me haha, but thankfully the VOD is on YouTube. My main takeaway is related to showcasing DE projects and the questions we need to answer when building one:
- what's the impact of the project? how does it change lives?
- what is the robustness of the project? how are you going to improve it?
- what is the quality of the project? are we doing quality checks? tests? is the code readable? is it documented?
- proof - is it visible?
Completed the medium practice SQL questions as well
Here is the link to the practice Qs. Today I finished all the medium ones, so only the hard ones are left. The medium Qs mainly covered window functions, GROUP BYs, CTEs and joins.
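As a reminder to myself of how those pieces fit together, here is a small sketch combining a CTE, a GROUP BY, and a window function. It is not one of the practice questions: the orders table and its data are made up, and it assumes the SQLite bundled with Python is version 3.25 or newer (needed for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 50), ('ann', 30), ('bob', 20), ('bob', 70), ('cat', 40);
""")

query = """
WITH totals AS (                      -- CTE: total spend per customer
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer,
       total,
       RANK() OVER (ORDER BY total DESC) AS spend_rank  -- window function
FROM totals
ORDER BY spend_rank, customer
"""
ranked = conn.execute(query).fetchall()
conn.close()
```

The CTE does the aggregation first, and the window function then ranks the aggregated rows without collapsing them further.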
That is all for today!
See you tomorrow :)