(Day 332) Have to do projects + network

Ivan Ivanov · November 28, 2024

Hello :) Today is Day 332!

A quick summary of today:

  • read and listened to Designing Data-Intensive Applications - Chapter 5 - Replication
  • listened to a data careers podcast by Joe Reis
  • more SQL practice

Designing Data-Intensive Applications - Chapter 5 - Replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network.

Leaders and Followers

Leader-based replication overview

In leader-based replication, one replica (the leader) handles all write operations and updates followers through a replication log. Followers can only serve read requests. This setup is common in relational databases like PostgreSQL, MySQL, and SQL Server, as well as non-relational systems like MongoDB and Kafka.

Sync vs. Async replication

  • sync: ensures followers have up-to-date data, improving durability but increasing the risk of delays if followers are unresponsive
  • async: allows faster writes by not waiting for follower confirmation, but data loss is possible if the leader fails before replication completes
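
To make the trade-off concrete, here is a minimal sketch, in Python rather than any real database's API, of a leader that either blocks on follower acknowledgment or returns immediately. The Leader and Follower classes are invented purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Follower:
    name: str
    log: list = field(default_factory=list)
    reachable: bool = True

    def apply(self, record):
        if not self.reachable:
            raise TimeoutError(f"{self.name} is unreachable")
        self.log.append(record)

class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers

    def write_sync(self, record):
        """Block until every follower confirms: durable, but stalls if any follower is down."""
        self.log.append(record)
        for follower in self.followers:
            follower.apply(record)   # raises if a follower is unreachable
        return "committed"

    def write_async(self, record):
        """Acknowledge immediately; replicate on a best-effort basis."""
        self.log.append(record)
        for follower in self.followers:
            try:
                follower.apply(record)
            except TimeoutError:
                pass                 # the follower catches up later from the log
        return "committed"           # but the write is lost if the leader dies first
```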

Setting up followers

Followers are initialized using a snapshot of the leader’s data, followed by replaying all changes since the snapshot. This process ensures followers eventually catch up without downtime.
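
As a toy illustration of that snapshot-plus-replay process (all names and the log format here are made up), a follower could be brought up to date like this:

```python
def initialize_follower(snapshot, leader_log, snapshot_position):
    data = dict(snapshot)                    # 1. start from the snapshot
    for position, (key, value) in enumerate(leader_log):
        if position > snapshot_position:     # 2. replay only changes after the snapshot
            data[key] = value
    return data                              # 3. the follower has caught up

snapshot = {"a": 1}                          # taken at log position 0
log = [("a", 1), ("b", 2), ("a", 3)]         # positions 0, 1, 2
print(initialize_follower(snapshot, log, snapshot_position=0))  # {'a': 3, 'b': 2}
```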

Handling Failures

  • follower failures: resolved through ‘catch-up recovery’ where the follower syncs missed changes after reconnecting
  • leader failures: require failover, promoting a follower to leader and reconfiguring clients. Challenges include avoiding data loss and preventing “split brain” (two active leaders)

Replication Methods

  1. statement-based: sends SQL statements to followers, but can fail with nondeterministic functions (e.g. NOW() or RAND()) or statements that depend on existing data
  2. Write-Ahead Log (WAL) shipping: transmits low-level storage changes but ties replication to storage engine internals
  3. logical log replication: uses higher-level, row-based logs, decoupling replication from storage formats and supporting more flexible upgrades
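
A quick sketch of why statement-based replication is fragile: replaying the statement re-evaluates nondeterministic functions on each follower, while a logical (row-based) log ships the concrete values the leader produced. This is illustrative Python, not real replication code:

```python
import random
import time

# A "statement" whose result depends on when and where it runs
statement = lambda: {"id": 1, "token": random.random(), "ts": time.time()}

on_leader = statement()      # evaluated once on the leader
on_follower = statement()    # re-evaluated on a follower: different values!
assert on_leader["token"] != on_follower["token"]

# A logical (row-based) log entry ships the leader's concrete row instead,
# so every follower applies exactly the same values.
row_log_entry = dict(on_leader)
assert row_log_entry == on_leader
```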

Leader-based replication balances performance, durability, and consistency but requires careful handling of failures and trade-offs between synchronous and asynchronous modes.

Problems with Replication Lag

Replication in distributed systems is crucial for fault tolerance, scalability, and reducing latency. However, when using asynchronous replication, replication lag can introduce inconsistencies. Here are three key issues and potential solutions:

Reading your own writes

Problem: Users may not see their updates immediately after submitting them, leading to frustration.

Solution:

  • read recent writes from the leader while routing other reads to followers
  • use timestamps to ensure replicas serving reads reflect the user’s most recent updates
  • implement cross-device consistency by centralizing metadata and routing all user requests to the same datacenter
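
A minimal sketch of the first idea, routing recent writers to the leader; the `last_write` table and the one-minute window are my assumptions, not anything from the book:

```python
import time

LEADER, FOLLOWER = "leader", "follower"
last_write: dict[str, float] = {}    # user id -> time of their last write

def record_write(user_id: str) -> None:
    last_write[user_id] = time.monotonic()

def choose_replica(user_id: str, window_seconds: float = 60.0) -> str:
    """Route recent writers to the leader; everyone else reads from a follower."""
    wrote_at = last_write.get(user_id)
    if wrote_at is not None and time.monotonic() - wrote_at < window_seconds:
        return LEADER                # guarantees they see their own write
    return FOLLOWER                  # a lagging follower is fine here
```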

Monotonic reads

Problem: Users may observe inconsistent states (e.g. seeing newer data followed by older data).

Solution:

  • ensure users consistently read from the same replica
  • reroute to a fallback replica only if the primary replica fails
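
One common way to pin each user to a replica is a stable hash of the user ID. In this sketch the replica names and the health check are hypothetical, and it falls back to another replica only on failure, per the second bullet:

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str, healthy: set[str]) -> str:
    """Pin each user to one replica so their reads never go back in time."""
    if not healthy:
        raise RuntimeError("no replicas available")
    digest = hashlib.sha256(user_id.encode()).digest()
    primary = REPLICAS[digest[0] % len(REPLICAS)]
    if primary in healthy:
        return primary
    return sorted(healthy)[0]        # fall back only if the primary failed
```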

Consistent prefix reads

Problem: Causal relationships between updates may appear out of order.

Solution:

  • write causally related updates to the same partition
  • use algorithms that track and maintain causal dependencies
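
The first idea can be as simple as hashing a causal grouping key, for example a conversation ID (assumed here for illustration), so that related writes land on one partition whose single leader applies them in order:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(conversation_id: str) -> int:
    """Same conversation -> same partition -> one leader applies its writes in order."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:2], "big") % NUM_PARTITIONS

# A question and its answer share a conversation id, so they can never be
# replicated out of order relative to each other.
assert partition_for("chat-42") == partition_for("chat-42")
```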

General recommendations

If replication lag severely impacts user experience:

  • design the system to offer stronger guarantees, like read-after-write consistency
  • simplify application logic by leveraging transactional guarantees when possible

While asynchronous replication is scalable, pretending it behaves like synchronous replication can lead to subtle, hard-to-debug issues.

Multi-Leader Replication

Multi-leader replication enables multiple nodes to accept writes, addressing some limitations of single-leader systems but introducing complexities. In this setup, each leader acts as a follower to the others, forwarding changes to maintain consistency. This approach suits certain use cases but comes with challenges:

  • multi-datacenter operation: having leaders in different datacenters improves performance and fault tolerance. Writes are processed locally and replicated asynchronously, minimizing inter-datacenter latency. However, handling concurrent changes in different datacenters can lead to conflicts

  • offline operation: systems like mobile apps rely on asynchronous synchronization, allowing local writes while disconnected. Conflicts must be resolved when syncing

  • collaborative editing: real-time applications, like Google Docs, use techniques resembling multi-leader replication. Conflict resolution is crucial when multiple users edit the same content simultaneously

  • conflict resolution: conflicts occur when concurrent changes affect the same data. Solutions include (see the sketch after this list):
    • prioritizing changes by timestamps or replica IDs
    • merging conflicting values
    • recording conflicts for later resolution, either automatically or through user input
  • topologies: multi-leader systems use communication paths like all-to-all, circular, or star configurations. Each has trade-offs in fault tolerance and replication speed
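
Here is a toy version of the "prioritize by timestamps or replica IDs" rule from the conflict-resolution bullet; the dict representation of a version is my own assumption:

```python
def resolve(conflicting_versions):
    """Highest timestamp wins; replica id breaks exact ties deterministically,
    so every leader independently picks the same winner."""
    return max(conflicting_versions, key=lambda v: (v["ts"], v["replica_id"]))

versions = [
    {"value": "draft A", "ts": 100, "replica_id": "dc-east"},
    {"value": "draft B", "ts": 100, "replica_id": "dc-west"},
]
print(resolve(versions)["value"])    # 'draft B' on every leader
```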

Multi-leader replication offers flexibility for distributed systems but requires careful handling of conflicts and topology design to ensure reliability and consistency.

Leaderless replication

A high-availability strategy

Leaderless replication, also known as Dynamo-style replication, allows any replica in a distributed system to accept write requests. Unlike leader-based replication, where a single leader dictates write order, this approach prioritizes high availability and resilience to node failures. However, it requires careful handling of concurrent writes and the potential for reading stale data.

How leaderless replication handles node outages

Leaderless systems operate without a designated leader, enabling any replica to process writes. When nodes are offline, for example during maintenance, leaderless replication remains operational. Clients send writes to several replicas in parallel, and the system deems a write successful once enough of them acknowledge it, even if some replicas are unreachable.

This flexibility, however, introduces challenges. An unavailable node may miss updates while offline, leading to stale data when it rejoins the network. To address this, read requests often query multiple replicas, comparing version numbers to identify and repair outdated values, a process called read repair.
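
A toy read-repair pass might look like this; the replica objects and their get/put methods are assumptions for illustration:

```python
def read_with_repair(key, replicas):
    """Read from several replicas, return the newest value, and push it
    back to any replica holding a stale version."""
    responses = [(r, r.get(key)) for r in replicas]   # r.get -> (version, value)
    newest_version, newest_value = max(resp for _, resp in responses)
    for replica, (version, _) in responses:
        if version < newest_version:
            replica.put(key, newest_version, newest_value)   # the repair step
    return newest_value
```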

Quorums: balancing consistency and availability

Leaderless systems rely on quorum-based mechanisms to ensure consistency amidst node failures. Quorums are defined using three parameters:

  • n: total number of replicas
  • w: minimum number of replicas that must confirm a write for it to succeed
  • r: minimum number of replicas queried for a successful read

The relationship w + r > n ensures that at least one of the replicas read from contains the most recent write. This balances consistency and availability, with common configurations like n = 3, w = 2, r = 2. Adjusting ‘w’ or ‘r’ can prioritize performance or consistency; for instance, reducing ‘w’ or ‘r’ improves speed but increases the risk of stale reads.
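
The quorum condition is small enough to state in code. In the worst case a write set and a read set share max(0, w + r − n) nodes, so w + r > n guarantees at least one up-to-date replica per read:

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True if any read set of r replicas must intersect any write set of w."""
    return w + r > n

assert quorum_overlap(n=3, w=2, r=2)        # the common configuration above
assert not quorum_overlap(n=3, w=1, r=1)    # fast, but reads can miss writes
```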

Limitations of quorum consistency

While quorums reduce the chances of stale data, they do not eliminate them entirely. Key challenges include:

  • sloppy quorums: during network disruptions, writes may temporarily be stored on nodes outside the designated replica set, delaying updates
  • concurrent writes: simultaneous writes to the same key by multiple clients can lead to conflicting versions
  • edge cases: timing issues during node failures, write retries, or network delays may also result in stale values

Leaderless systems are thus best suited for scenarios where eventual consistency is acceptable, ensuring data convergence over time.

Resolving concurrent writes in leaderless systems

Concurrent writes are a major challenge, as they can arrive at replicas in different orders due to network delays. Common resolution strategies include:

  • Last Write Wins (LWW): using timestamps to determine the latest write, though this can lead to data loss
  • merging values: for instance, merging shopping cart updates from multiple replicas by taking their union. However, this requires handling deletions carefully, often using tombstones (markers indicating removed items)
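
A sketch of the union-merge idea with tombstones; the two-set cart representation is my assumption (it is essentially a simple CRDT-style set):

```python
def merge_carts(replica_a, replica_b):
    """Union both the additions and the tombstones, then subtract."""
    added = replica_a["added"] | replica_b["added"]
    removed = replica_a["removed"] | replica_b["removed"]   # tombstones
    return {"added": added, "removed": removed, "visible": added - removed}

a = {"added": {"milk", "eggs"}, "removed": set()}
b = {"added": {"milk"}, "removed": {"milk"}}   # this replica deleted milk
print(merge_carts(a, b)["visible"])            # {'eggs'}: the deletion survives
```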

Version vectors for conflict management

To manage concurrent writes without a central leader, leaderless systems use version vectors. These track version numbers for each key across all replicas, helping differentiate between overwrites and concurrent writes. Advanced techniques like dotted version vectors, as seen in Riak, are particularly effective.

By using version vectors, systems can safely read from one replica and write to another, with no risk of data loss when conflicts are correctly resolved. For automation, Conflict-Free Replicated Data Types (CRDTs) are often employed to merge data without requiring application logic.
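
A minimal version-vector comparison, using the standard per-replica counter rule; the plain dict encoding here is just one possible representation:

```python
def compare(vv_a: dict, vv_b: dict) -> str:
    """Per-replica counter comparison: dominance means overwrite, otherwise concurrent."""
    keys = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(k, 0) >= vv_b.get(k, 0) for k in keys)
    b_ge = all(vv_b.get(k, 0) >= vv_a.get(k, 0) for k in keys)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a overwrites b"       # b happened before a
    if b_ge:
        return "b overwrites a"
    return "concurrent: keep both and merge"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))   # a overwrites b
print(compare({"r1": 2}, {"r2": 1}))                     # concurrent: keep both and merge
```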

Jobs in data with Joe Reis podcast

I found this podcast episode, released yesterday(?), where Joe Reis talks with a guest about careers in data and the data job market. My main takeaways:

  • do projects to learn, rather than just to pad your CV
  • network

More SQL practice on interviewquery.com

Today I finished the last 15 or so easy Qs (out of 40 total) and moved on to the medium ones, of which I did 5.

So far it’s just practicing GROUP BYs, CTEs, window functions, and general logic for problems. But the medium ones are where I will most likely learn a tonne.


That is all for today!

See you tomorrow :)