Designing Data-Intensive Applications notes

22 Dec 2021

Chapter 1 Reliable, Scalable, and Maintainable Applications

Three concerns that are important in most software systems: Reliability, Scalability, and Maintainability

Scalability

Load parameters: numbers that describe load. E.g., for Twitter, the distribution of followers per user is a key load parameter for discussing scalability. Twitter's hybrid of the two approaches: most users' tweets continue to be fanned out to home timelines, while tweets from a small number of celebrities are fetched separately and merged with the user's home timeline when it is read.

Use percentiles (e.g., p50, p95, p99 response times) rather than averages to measure performance.
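
A minimal sketch of computing such percentiles, assuming response times have been collected as a list of durations in milliseconds (nearest-rank method):

```python
# Minimal sketch: computing response-time percentiles (p50/p95/p99)
# from a list of measured request durations (assumed to be in ms).

def percentile(sorted_values, p):
    """Return the value below which p percent of observations fall
    (nearest-rank method)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, round(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

response_times_ms = sorted([12, 15, 17, 20, 22, 25, 30, 45, 120, 800])
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
```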

If you are working on a fast-growing service, you will need to rethink your architecture on every order of magnitude load increase – or perhaps more often than that.

Keep your database on a single node (scale up) until scaling cost or high-availability requirements force you to make it distributed.

Chapter 2 Data Models and Query Languages

Why NoSQL: greater scalability, specialized query operations that are not well supported by SQL, restrictiveness of relational schemas, better performance due to locality.

If data is stored in relational tables, a translation layer is required between the objects in the application code and the DB model of tables, rows and columns.

Relational model provides better support for joins, and many-to-one and many-to-many relationships.

Document Model

If your application does use many-to-many relationships, the document model becomes less appealing.

For highly interconnected data, the document model is awkward, the relational model is acceptable, and the graph models are the most natural.

If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.

The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future.

Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.

Convergence of Document and relational DB

SQL is a declarative query language. You just specify the pattern of the data you want, but not how to achieve that goal.

Many commonly used programming languages are imperative.
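
A small illustration of the contrast, using a hypothetical animals example: the imperative version spells out how to loop and filter, while the declarative SQL string only states what data is wanted.

```python
# Contrast between imperative and declarative styles (hypothetical example).

animals = [
    {"name": "shark", "family": "Lamniformes"},
    {"name": "seal", "family": "Pinnipedia"},
    {"name": "dolphin", "family": "Delphinidae"},
]

# Imperative: we tell the computer HOW to find sharks, step by step.
sharks = []
for animal in animals:
    if animal["family"] == "Lamniformes":
        sharks.append(animal)

# Declarative (SQL): we only state WHAT we want; the query optimizer
# decides how to execute it.
query = "SELECT * FROM animals WHERE family = 'Lamniformes';"
```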

Graph Data Model

Facebook maintains a single graph with many types of vertices and edges: vertices represent people, locations, events, checkins and comments

Graphs are good for evolvability

Chapter 3 Storage and Retrieval

Well-chosen indexes speed up read queries, but every index slows down writes

An append-only log is good (compared with updating the file in place) for several reasons:

SSTable: Sorted String Table, key-value pairs are sorted by key

LSM-Tree: Log-Structured Merge Tree: SSTable + memtable

Bloom filter: tells you if a key does not appear in the DB, which saves many unnecessary disk reads for nonexistent keys
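
A toy sketch of the LSM-tree idea under heavy simplifications (a dict as the memtable, sorted lists as SSTable segments, and a plain set standing in for the per-segment Bloom filter):

```python
# Toy LSM-tree sketch: writes go to an in-memory memtable; when it grows
# too large it is flushed to an immutable, key-sorted segment (SSTable).
# Reads check the memtable first, then segments from newest to oldest.
# A real engine would use a Bloom filter per segment to skip segments
# that cannot contain the key; a plain set stands in for it here.

class ToyLSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []          # list of (sorted key/value pairs, key set)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        segment = sorted(self.memtable.items())   # SSTable: sorted by key
        self.segments.append((segment, set(self.memtable)))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment, keys in reversed(self.segments):
            if key not in keys:                    # "Bloom filter" check
                continue
            for k, v in segment:                   # real SSTables use binary search
                if k == key:
                    return v
        return None
```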

The log-structured indexes break the DB down into variable-size segments, typically several MB or more in size, and always write a segment sequentially. B-trees break the DB down into fixed-size blocks or pages, traditionally 4KB in size. Each page can be identified using an address or location, which allows one page to refer to another – similar to a pointer, but on disk instead of in memory

WAL: write-ahead log, an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself (restore B-tree back after crash)

LSM-trees are faster for writes while B-trees are faster for reads. B-trees have higher write amplification. LSM-trees can be compressed better.

In MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index; in SQL Server, you can specify one clustered index per table.

The performance advantage of in-memory DBs is not due to the fact that they don’t need to read from disk (the operating system caches recently used disk blocks, so a disk-based storage engine may never need to read from disk if there is enough memory). They can be faster because they avoid the overheads of encoding in-memory data structures in a form that can be written to disk.

OLTP vs. OLAP

OLTP (online transaction processing): handles raw data, interactive. Records are inserted or updated based on the user’s input

OLAP (online analytic processing): aggregate raw data

Data warehouse: a separate DB to run the analytics without affecting OLTP operations. The process of getting data into the warehouse is known as ETL (extract-transform-load)

The data model of a data warehouse is most commonly relational.

Chapter 2 covered a wide range of data models used in transaction processing; in analytics, however, there is much less diversity of data models: the star schema (aka dimensional modeling, with a fact table at the center). A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. Snowflake schemas are more normalized than star schemas.

In a typical data warehouse, fact tables often have over 100 columns, but a typical query only accesses 4 or 5 of them at a time, so column-oriented storage is a good fit.
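
A tiny sketch of the row-oriented vs. column-oriented layout, using a made-up fact table:

```python
# Row-oriented vs column-oriented layout for a made-up fact table.
# A column store only needs to read the few columns a query touches,
# instead of every wide row.

rows = [
    {"date": "2021-12-01", "product": "apple",  "store": 1, "quantity": 3, "price": 2.5},
    {"date": "2021-12-01", "product": "banana", "store": 2, "quantity": 5, "price": 1.0},
    {"date": "2021-12-02", "product": "apple",  "store": 1, "quantity": 2, "price": 2.5},
]

# Column-oriented: one list (column file) per column, same row order in each.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# "SELECT SUM(quantity * price)" only needs two of the columns.
revenue = sum(q * p for q, p in zip(columns["quantity"], columns["price"]))
print(revenue)  # 17.5
```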

Store the same data sorted in several different ways, since the data needs to be replicated to multiple machines anyway. (Downside: it makes writes more difficult.)

Data cube/OLAP cube: cache aggregates that queries use most often. One way of creating such a cache is a materialized view. Not often used in OLTP DBs because keeping the cached aggregates up to date makes writes more expensive; in a read-heavy data warehouse they can make more sense.

Why OLAP workloads are so different from OLTP: when queries require sequentially scanning across a large number of rows, indexes are much less relevant.

Chapter 4 Encoding and Evolution

Binary encoding: Apache Thrift, Protobuf, Apache Avro

Data outlives code

Chapter 5 Replication

scale up vs. scale out

Synchronous vs. Asynchronous Replication

In practice, if you enable synchronous replication, it usually means one of the followers is sync, and the others are async. If the sync follower becomes unavailable or slow, one of the async followers is made sync. (also called semi-synchronous) (Azure Storage uses chain replication which is a variant of sync replication)

Single Leader Replication

Set up new follower: take a consistent snapshot of the leader’s DB, copy snapshot to the new follower, follower requests all the data changes that have happened since the snapshot was taken

Handle Node outages

Implementation of Replication logs

Eventual consistency, replication lag

Multi-leader Replication:

Allow more than one node to accept writes. Each leader simultaneously acts as a follower to the other leaders (master-master or active/active replication). Used in multi-datacenter operation.

Often a retrofitted feature, and it can make other DB features (autoincrementing keys, triggers) problematic. Should be avoided if possible.

Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the Internet

The biggest problem with multi-leader replication is that write conflicts can occur.

Conflict Resolutions: The simplest strategy is to avoid them: If the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur.

Last write wins (LWW): prone to data loss

Merge the conflicting values, or let application code resolve the conflict.

Another problem is that writes may arrive in the wrong order at some replicas (similar to “Consistent Prefix Reads”). Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events; a technique called version vectors can be used.

Leaderless Replication

Read requests are also sent to several nodes in parallel. Appealing for use cases that require high availability and low latency and that can tolerate occasional stale reads, also suitable for multi-datacenter operation.

How does an unavailable node catch up on the writes that it missed?

Quorum reads (r) and writes (w): r + w > n (replicas)
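
A quick sketch of the quorum condition, assuming a Dynamo-style setup with n replicas per key:

```python
# Quorum condition sketch: with n replicas, w write acks and r read
# responses, r + w > n guarantees the read set and write set overlap in
# at least one node, so a read sees at least one up-to-date copy.

def quorum_overlap(n, w, r):
    return r + w > n

print(quorum_overlap(n=3, w=2, r=2))  # True: typical setting for n=3
print(quorum_overlap(n=3, w=1, r=1))  # False: reads may miss the latest write
```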

What if we cannot reach a quorum of w or r nodes?

Sloppy quorum: writes and reads still require w and r successful response, but those may include nodes that are not among the designated n home nodes for a value. Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes (hinted handoff)

Quorum Consistency cannot guarantee to get the latest value. Stronger guarantees require transactions or consensus.

Detect Concurrent Writes

LWW (last write wins) achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client, only one of the writes will survive and others will be silently discarded. Moreover, LWW may even drop writes that are not concurrent.

The only safe way of using a DB with LWW is to ensure a key is only written once and thereafter treated as immutable.

Version vectors (vector clock):
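
A minimal sketch of comparing two version vectors to decide whether writes are causally ordered or concurrent, assuming each value carries a dict of per-node counters:

```python
# Version vector comparison sketch: each value carries {node_id: counter}.
# One write happened-before another if its vector dominates it in every
# component; if neither dominates the other, the writes are concurrent
# (siblings that must be merged or resolved).

def dominates(a, b):
    """True if vector a >= vector b in every component."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a happened after b"
    if dominates(b, a):
        return "b happened after a"
    return "concurrent"

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a happened after b
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3}))  # concurrent
```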

Chapter 6 Partitioning

Consistent Hashing: a way of evenly distributing load across an internet-wide system of caches.
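
A compact sketch of a consistent hashing ring: nodes and keys are hashed onto the same space, and a key belongs to the first node at or after its position (wrapping around). This is simplified to one point per node; real systems use virtual nodes for smoother balance.

```python
# Consistent hashing sketch: adding or removing a node only moves the keys
# in the affected segment of the ring, rather than rehashing everything.

import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect_left(positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```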

Cassandra achieves a compromise between two partitioning strategies (by key range and by hash of key). If the primary key is chosen to be (user_id, update_timestamp), the 1st key (user_id) is hashed to determine the partition, the second key (update_timestamp) is used as a concatenated index. Different users may be stored on different partitions, but within each user, the updates are stored ordered by timestamp on a single partition.

Hashing a key cannot avoid hot spots entirely. If one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key.

Partitioning secondary indexes by Document vs. by Term

Rebalancing partitions

Fixed number of partitions (partition numbers > nodes) vs. Dynamic partitioning vs. Fixed number of partitions per node

Operations: Automatic or Manual Rebalancing

Automation can be dangerous in combination with automatic failure detection: if an overloaded node is slow to respond, other nodes may conclude it is dead and automatically rebalance load away from it. This puts additional load on the overloaded node, other nodes and the network – making the situation worse and potentially causing a cascading failure.

Service Discovery: how does the component making the routing decision (which may be one of the nodes, or the routing tier, or the client) learn about changes in the assignment of partitions to nodes?

Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.

HBase, SolrCloud and Kafka also use ZooKeeper to track partition assignments.

Cassandra and Riak use a gossip protocol among the nodes. Requests can be sent to any node, and that node forwards them to the appropriate node.

Chapter 7 Transactions

Systems that do not meet the ACID criteria are sometimes called BASE: Basically Available, Soft state, and Eventual consistency.

The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true.

Atomicity, isolation and durability are properties of the DB, whereas consistency (in the ACID sense) is a property of the application. It’s the application’s responsibility to define its transactions correctly; this is not something that the DB can guarantee.

The idea of ACID isolation is that concurrently running transactions shouldn’t interfere with each other.

A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.

Many distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions. Datastores with leaderless replication work much more on a “best effort” basis

Weak isolation levels (vs. serializable isolation which means the DB guarantees that transactions have the same effect as if they ran serially):

Read Committed

No dirty reads: a transaction cannot see uncommitted data. (How to implement: use locks, or remember both the old committed value and the new value.)

No dirty writes: a transaction cannot overwrite uncommitted data. (How to implement: use locks.)

Snapshot Isolation and Repeatable Read

Read skew, an example of nonrepeatable read: a client sees different parts of the DB at different points in time.

Each transaction reads from a consistent snapshot of the database - the transaction sees all the data that was committed in the DB at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.

A key point of snapshot isolation is that readers never block writers, and writers never block readers.

(how to implement) DB must keep several different committed versions of an object (multi-version concurrency control, MVCC). A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction. When a transaction is started, it is given a unique, always-increasing transaction ID (txid)
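
A simplified sketch of the MVCC visibility rule under snapshot isolation. The representation (created_by/deleted_by txids per row version, plus the set of transactions in progress when the snapshot was taken) is an assumption, and handling of aborted transactions is omitted:

```python
# MVCC visibility sketch (snapshot isolation): every row version records the
# txid that created it and, if deleted/overwritten, the txid that deleted it.
# A version is visible to a reader if it was created by a transaction that
# had already committed when the reader's snapshot was taken, and is not
# deleted by such a transaction.

def visible(version, snapshot_txid, in_progress_at_snapshot):
    def committed_before_snapshot(txid):
        return (txid is not None
                and txid <= snapshot_txid
                and txid not in in_progress_at_snapshot)

    created_visible = committed_before_snapshot(version["created_by"])
    deleted_visible = committed_before_snapshot(version["deleted_by"])
    return created_visible and not deleted_visible

versions = [
    {"value": 500, "created_by": 3, "deleted_by": 13},
    {"value": 400, "created_by": 13, "deleted_by": None},
]
# Transaction 12 took its snapshot before txid 13 committed, so it still
# sees the old value 500.
print([v["value"] for v in versions if visible(v, 12, in_progress_at_snapshot=set())])
```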

The reason for this naming confusion (Oracle: serializable; PostgreSQL and MySQL: repeatable read) is that the SQL standard doesn’t have the concept of snapshot isolation. IBM DB2 uses “repeatable read” to refer to serializability.

Prevent Lost Updates

Lost update problem: two transactions concurrently perform a read-modify-write cycle, and one of the modifications is lost.

Many DBs provide atomic update operations, which remove the need to implement read-modify-write cycles in application code (but they won’t help if application code still uses read-modify-write cycles). Alternatively, use locking; application code needs to take the lock explicitly.

An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.

Write Skew and Phantoms

Write skew: neither a dirty write nor a lost update, because the two transactions are updating two different objects.

phantom: a write in one transaction changes the result of a search query in another transaction

Serializability

Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. It protects against all of these concurrency issues.

Actual Serial Execution

To execute only one transaction at a time, in serial order, on a single thread.

Encapsulating transactions in stored procedures. Systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions.

Cons:

Two-Phase Locking (2PL)

The only widely used algorithm for serializability for around 30 years. Used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in DB2.

Implementation: acquire shared lock before read; acquire exclusive lock before write; upgrade shared lock to exclusive lock if first reads then writes; hold the lock until the end of the transaction (commit or abort. This is where the name comes from: 1st phase is when the locks are acquired, 2nd phase is when all the locks are released)
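
A very small sketch of the lock modes involved, assuming a single in-process lock table; blocking/waiting and deadlock detection are omitted (a request that would have to wait just returns False here):

```python
# 2PL lock-table sketch: readers take a shared lock, writers take an
# exclusive lock, and a reader that then writes upgrades its lock. Locks
# are held until the transaction commits or aborts (the "second phase").

class LockTable:
    def __init__(self):
        self.locks = {}   # object -> {"mode": "shared"|"exclusive", "holders": set}

    def acquire(self, txid, obj, mode):
        entry = self.locks.setdefault(obj, {"mode": mode, "holders": set()})
        if not entry["holders"]:
            entry["mode"] = mode
        elif mode == "shared" and entry["mode"] == "shared":
            pass                                      # shared locks are compatible
        elif entry["holders"] == {txid}:
            entry["mode"] = "exclusive"               # lock upgrade by the same txn
        else:
            return False                              # would have to wait (or deadlock)
        entry["holders"].add(txid)
        return True

    def release_all(self, txid):
        for entry in self.locks.values():
            entry["holders"].discard(txid)

locks = LockTable()
print(locks.acquire(1, "row:42", "shared"))     # True
print(locks.acquire(2, "row:42", "shared"))     # True: shared/shared compatible
print(locks.acquire(2, "row:42", "exclusive"))  # False: txn 1 also holds the lock
```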

Can cause deadlocks. The DB automatically detects deadlocks and aborts one of the deadlocked transactions; the aborted transaction needs to be retried by the application.

Unlike snapshot isolation, reading an old version of an object is not acceptable under 2PL.

Quite unstable latency, very slow at high percentiles.

Serializable Snapshot Isolation (SSI)

First described in 2008.

Optimistic concurrency control. It lets transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the DB checks whether anything bad happened; if so, the transaction is aborted and has to be retried.

Based on snapshot isolation: on top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort.

How to detect:

Performance:

Chapter 8 The Trouble with Distributed Systems

Partial failures in a distributed system are nondeterministic. You may not even know whether something succeeded or not.

Rather than using configured constant timeouts, systems can continually measure response times and their variability (jitter)

A monotonic clock is suitable for measuring a duration (time interval), since it is guaranteed to always move forward, whereas a time-of-day clock is unsuitable because it may jump backward in time.

Fundamental problems of LWW:

Physical clock (time-of-day and monotonic clock, measure actual elapsed time) vs. logical clock (based on incrementing counters, are safer for ordering events)

The truth is defined by the majority

Fencing token: a number that increases every time a lock is granted
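
A sketch of how a storage service might check fencing tokens (names and interface are assumptions): the lock service hands out a token that increases with every grant, and writes carrying an older token than one already seen are rejected.

```python
# Fencing-token sketch: the storage service remembers the highest token it
# has seen per resource and rejects writes carrying an older token, so a
# client that paused (e.g. for GC) while holding a stale lease cannot
# clobber data written by the newer lock holder.

class StorageService:
    def __init__(self):
        self.highest_token = {}

    def write(self, resource, token, value):
        if token < self.highest_token.get(resource, -1):
            raise PermissionError(f"stale fencing token {token}, write rejected")
        self.highest_token[resource] = token
        print(f"wrote {value!r} to {resource} with token {token}")

storage = StorageService()
storage.write("file", token=34, value="B")       # newer lock holder writes first
try:
    storage.write("file", token=33, value="A")   # old holder wakes up from a pause
except PermissionError as e:
    print(e)
```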

Byzantine fault/Byzantine General Problem: nodes may “lie” (send arbitrary faulty or corrupted response)

Byzantine fault-tolerant

Chapter 9 Consistency and Consensus

Transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

Linearizability (aka atomic consistency, strong consistency, immediate consistency, or external consistency):

The basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.

Linearizability vs. Serializability

A DB may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability. 2PL or actual serial execution are linearizable, but SSI is not linearizable, as it does not include writes that are more recent than the snapshot.

Linearizability Scenarios: leader election, uniqueness constraints (username or email address must be unique)

CAP theorem: Consistency, Availability, Partition tolerance – pick 2 out of 3. (A better phrasing of CAP would be: either Consistent (linearizable) or Available when Partitioned, because partitions aren’t something about which you have a choice: they will happen whether you like it or not.)

Applications that don’t require linearizability can be more tolerant of network problems.

If you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems.

Linearizability imposes a total order of operations, whereas causality defines a partial order; there are no concurrent operations in a linearizable datastore. Linearizability implies causality: any system that is linearizable will preserve causality correctly. In many cases, systems that appear to require linearizability in fact only really require causal consistency.

Lamport Timestamps

Generating sequence numbers that are consistent with causality.

A pair of (counter, node ID)

Every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.
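
A minimal sketch of Lamport timestamps as (counter, node ID) pairs:

```python
# Lamport timestamp sketch: each node keeps a counter and includes its
# current maximum with every message; on receipt, a node bumps its counter
# to at least the received value. Timestamps compare by counter first, then
# by node ID as a tiebreaker, giving a total order consistent with causality.

class LamportNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def on_message(self, received_counter):
        self.counter = max(self.counter, received_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportNode("A"), LamportNode("B")
t1 = a.local_event()          # (1, 'A')
t2 = b.on_message(t1[0])      # (2, 'B') -- causally after t1
t3 = a.local_event()          # (2, 'A') -- concurrent with t2, yet still totally ordered
print(sorted([t1, t2, t3]))   # [(1, 'A'), (2, 'A'), (2, 'B')]
```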

Version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering, so you cannot tell whether two operations are concurrent or causally dependent.

Not sufficient to solve many common problems in distributed systems (for example, unique usernames: this approach works for determining the winner after the fact, but not when a node has just received a request from a user to create a username and needs to decide right now).

Total Order broadcast: Knowing when your total order is finalized

Delivering a message is like appending to the log. A node is not allowed to retroactively insert a message into an earlier position in the order if subsequent messages have already been delivered. This fact makes total order broadcast stronger than timestamp ordering.

The sequence number can then serve as a fencing token, because it is monotonically increasing. In ZooKeeper, this sequence number is called zxid.

It can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus.

Distributed Transactions and Consensus

Two-phase Commit (2PC): implementing atomic commit (for distributed system)

Coordinator failure:

The only way 2PC can complete is by waiting for the coordinator to recover (In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol)
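
A bare-bones sketch of the basic 2PC flow (the participant interface is an assumption); durable logging of the decision, retries, and in-doubt recovery are omitted, and they are exactly the parts that make participants block while the coordinator is down:

```python
# Bare-bones two-phase commit sketch. Phase 1: the coordinator asks every
# participant to prepare (vote). Phase 2: only if ALL vote yes does it send
# commit; any "no" or failure means abort. A real coordinator must durably
# log its decision before phase 2 and keep retrying it after a crash.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):            # promise to commit if asked, even after a crash
        return self.will_commit

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]            # phase 1
    decision = all(votes)
    for p in participants:                                 # phase 2
        p.commit() if decision else p.abort()
    return decision

print(two_phase_commit([Participant("db1"), Participant("db2", will_commit=False)]))
```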

Three-phase Commit

2PC is called a blocking atomic commit protocol because it can become stuck waiting for the coordinator to recover.

Nonblocking atomic commit requires a perfect failure detector – i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector.

Distributed Transactions in Practice

Many cloud services choose not to implement distributed transactions due to the operational problems.

Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions.

Database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.

XA (extended architecture) is a standard for implementing two-phase commit across heterogeneous technologies. It’s not a network protocol, merely a C/Java Transaction API for interfacing with a transaction coordinator.

Limitations of distributed transactions

Coordinator logs are required in order to recover in-doubt transactions after a crash, which means the application servers are no longer stateless.

Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. (cannot detect deadlocks across different systems, does not work with SSI)

They have a tendency to amplify failures (if any part of the system is broken, the transaction fails).

Fault-Tolerant Consensus (Paxos, Raft)

A consensus algorithm cannot simply sit around and do nothing forever; it must make progress. Even if some nodes fail, the other nodes must still reach a decision.

Any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination. That majority forms a quorum

Total order broadcast is equivalent to repeated rounds of consensus

We have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent leader election.

Difference between fault-tolerant consensus and 2PC:

In 2PC the coordinator is not elected. Fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected.

Membership and Coordination Service

ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory.

ZooKeeper is modeled after Google’s Chubby lock service, implementing not only total order broadcast (hence consensus), but also: Linearizable atomic operations, total ordering of operations, failure detection, change notification

ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients

Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on IP address 10.1.1.23 is the leader for partition 7”

ZooKeeper, etcd, and Consul are often used for service discovery – that is, to find out which IP address you need to connect to in order to reach a particular service

Causal consistency (a weaker consistency model, as some operations can be concurrent) does not have the coordination overhead of linearizability and is much less sensitive to network problems.

Chapter 10 Batch Processing

Services (online systems) vs. Batching processing systems (offline systems) vs. Stream processing systems (near-real-time systems)

Distributed batch processing engines have a deliberately restricted programming model:

Sushi principle: raw data is better

Alternatives for batch processing (besides MapReduce)

Dataflow engines, e.g., Spark (distributed batch computations): handle an entire workflow as one job, rather than breaking it up into independent subjobs. It is often sufficient to keep intermediate state between operators in memory, instead of materializing it (writing the intermediate state out to files, as MapReduce does).

MapReduce does not account for the iterative nature of such algorithms; it always reads the entire input dataset and produces a completely new output dataset (vs. graph models and iterative processing).

A difference between batch processing and stream processing

Chapter 11 Stream Processing

When moving toward continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage. It is better for consumers to be notified when new events appear (push; messaging systems).

Batch processing systems provide strong reliability guarantees (failed tasks are automatically retried, etc.), whereas messaging systems may lose messages (direct messaging). Whether message loss is acceptable depends on the application: for example, with sensor readings, an occasionally missing data point is perhaps not important, since an updated value will be sent a short time later.

UDP multicast is widely used in the financial industry for streams such as stock market feeds, where low latency is important.

Message brokers (message queue) vs. Direct Messaging (UDP, etc.)

When multiple consumers read messages, two main patterns:

Log-based message brokers: producer sends a message by appending it to the end of the log. Combine the durable storage approach with low-latency notification facilities (Apache Kafka)

Chapter 12 The Future of Data Systems

Lambda Architecture

Combine both batch processing and stream processing

Separation of application code and state

Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, but the state has to go somewhere: a database

request/response interaction vs. publish/subscribe dataflow (more responsive user interfaces and better offline support): an option of subscribing to changes, not just querying the current state