Translation of Spanner: Google's Globally-Distributed Database

The last article in the distributed systems translation series; the original paper is at https://research.google.com/archive/spanner-osdi2012.pdf

Google Cloud now offers Cloud Spanner; some details about its features can be found in the product documentation.

The translation turned out rough: many sentences keep the English word order, which makes them hard to read in Chinese. It might be better not to do full translations in the future.

My language skills are still too poor, whether in English or Chinese. 😢

Introduction#

Spanner is a scalable, globally distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data across many sets of Paxos state machines located in data centers spread around the world. Replication is used for global availability and geographic locality, and clients automatically fail over between replicas. As the amount of data or the number of servers changes, Spanner automatically reshards data across machines, and it automatically migrates data across machines (even across data centers) to balance load and to handle failures. Spanner is designed to scale to millions of machines across hundreds of data centers and trillions of rows of data. Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents.

Our first customer was F1, a rewrite of Google's advertising backend. F1 uses five replicas spread across the United States. Most other applications will likely replicate their data across three to five data centers within one geographic region, which have relatively independent failure modes. In other words, most applications will likely prefer low latency over higher availability, as long as they can survive one or two data center failures. Spanner's main focus is managing replicated data across data centers, but we have also spent a great deal of time designing and implementing important database features on top of our distributed-systems infrastructure.

Although many projects are happy to use Bigtable, we have also consistently received complaints from users that Bigtable can be difficult to use for certain kinds of applications: those with complex, evolving schemas, or those that require strong consistency when replicating across multiple regions.

Many of Google's applications have chosen to use Megastore because of its semi-relational data model and its support for synchronous replication, even though its write throughput is relatively poor. As a consequence, Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schema-based semi-relational tables; data is versioned, and each version is automatically assigned its commit timestamp; old versions of data are subject to a configurable garbage-collection policy; and applications can read data at old timestamps. Spanner supports general-purpose transactions and provides a SQL-based query language.

As a globally distributed database, Spanner provides several interesting features. First, applications can dynamically control and configure data replication at a fine granularity: applications can specify constraints to control which data centers contain which data, how far the data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between data centers to balance resource usage. Second, Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent reads and writes, and it provides globally consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions, and atomic schema updates, all at global scale and even in the presence of ongoing transactions.

These features are enabled by the fact that Spanner assigns globally meaningful commit timestamps to transactions, even though the transactions may be distributed, and these timestamps reflect the serialization order of the transactions. In addition, the serialization order satisfies external consistency (or, equivalently, linearizability): if a transaction $T_1$ commits before another transaction $T_2$ starts, then the commit timestamp of $T_1$ is smaller than that of $T_2$. Spanner is the first system to provide such guarantees at global scale.

The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees Spanner's timestamps provide on serialization order depend on the bounds that the TrueTime implementation provides (note: clocks are uncertain, so the TrueTime implementation returns an interval rather than a single instant). If the uncertainty is large, Spanner slows down to wait it out. Google's cluster-management software provides the implementation of the TrueTime API. This implementation keeps the uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).

Implementation#

This section first describes the structure and basic principles of Spanner's implementation, then describes the directory abstraction used to manage replicas and locality, which is also the basic unit of data movement, and finally describes our data model, why Spanner is more like a relational database than a key-value store, and how applications control data locality.

A Spanner deployment is called a universe. Given that Spanner manages data globally, there are only a handful of running universes. We currently run a test/playground universe, a development/production universe, and a production-only universe. Spanner is organized as a set of zones, where each zone is the rough analog of a deployment of Bigtable servers. Zones are the unit of administrative deployment, and the set of zones is also the set of locations across which data can be replicated.

Zones can be added to or removed from a running system as new data centers are brought into service and old ones are turned off. Zones are also the unit of physical isolation: there may be one or more zones in a single data center, for example when different applications' data must be partitioned across different sets of servers within the same data center.

Organization Structure

The figure above shows the different servers in a Spanner universe. A zone has one zonemaster and between one hundred and several thousand spanservers; the former assigns data to spanservers, while the latter serve data to clients. The per-zone location proxies are used by clients to locate the spanservers that serve their data. The universe master and the placement driver are currently both singletons. The universe master is primarily a console for interactive debugging that displays status information about all the zones. The placement driver handles automated movement of data across zones on a timescale of minutes, periodically communicating with spanservers to find data that should be moved to satisfy updated replication constraints or to balance load. For space reasons, we will only describe the spanserver in detail.

Spanserver Software Stack#

This section focuses on the implementation of spanserver to illustrate how to build replicas and distributed transactions on top of a Bigtable-based implementation. The software stack is shown in the figure:

Software Stack

At the bottom, each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. A tablet here is conceptually similar to Bigtable's tablet abstraction, in that it implements a multiset (bag) of the following mappings:

(key:string, timestamp:int64) -> string

Unlike Bigtable, Spanner assigns timestamps to data directly, which is one important way in which Spanner is more like a multi-version database than a key-value store. A tablet's state is stored in a set of B-tree-like files and a write-ahead log (WAL), all on a distributed file system called Colossus (the successor to GFS). To support replication, each spanserver implements a single Paxos state machine on top of each tablet (earlier versions of Spanner supported multiple Paxos state machines per tablet, which allowed more flexible replication configurations, but the complexity of that design led us to abandon it). Each state machine stores its metadata and its log in its corresponding tablet. Our Paxos implementation supports long-lived leaders based on time-based leases, with a default lease length of 10 seconds. The current Spanner implementation logs every Paxos write twice: once in the tablet's log and once in the Paxos log. This choice was made out of expediency and may be improved in the future. Our Paxos implementation is pipelined, so as to improve Spanner's throughput in the presence of WAN latencies; however, writes are applied by Paxos in order.
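To make the versioned mapping concrete, here is a minimal sketch in Go of a tablet-like structure that stores timestamped versions per key and serves a read at a given timestamp. This is illustrative only; the names and in-memory representation are assumptions, not Spanner's actual code, which persists state in B-tree-like files plus a WAL.

```go
package tablet

import "sort"

// Version is one timestamped value of a key, mirroring the
// (key:string, timestamp:int64) -> string mapping described above.
type Version struct {
	Timestamp int64
	Value     string
}

// Tablet is an in-memory stand-in for the on-disk state a spanserver
// keeps on Colossus (B-tree-like files plus a write-ahead log).
type Tablet struct {
	data map[string][]Version // versions kept sorted by ascending timestamp
}

func New() *Tablet {
	return &Tablet{data: make(map[string][]Version)}
}

// Write records a new version of key at the given commit timestamp.
func (t *Tablet) Write(key string, ts int64, value string) {
	versions := append(t.data[key], Version{Timestamp: ts, Value: value})
	sort.Slice(versions, func(i, j int) bool {
		return versions[i].Timestamp < versions[j].Timestamp
	})
	t.data[key] = versions
}

// ReadAt returns the newest version of key whose timestamp is <= ts,
// i.e. what a snapshot read of key at time ts would observe.
func (t *Tablet) ReadAt(key string, ts int64) (string, bool) {
	versions := t.data[key]
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].Timestamp <= ts {
			return versions[i].Value, true
		}
	}
	return "", false
}
```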

The Paxos state machine is used to implement a consistently replicated multiset of mappings. The key-value mapping state of each replica is stored in the corresponding tablet. Write operations must initiate the Paxos protocol at the leader; read operations directly access the state of any sufficiently new replica's underlying tablet.

A set of replicas forms a Paxos group. At every leader replica, each spanserver implements a lock table for concurrency control. The lock table contains the state for two-phase locking: it maps ranges of keys to lock states ($K_{0..n} \Rightarrow lock$); having a long-lived Paxos leader is critical to managing the lock table efficiently. In both Bigtable and Spanner, we designed for long-lived transactions (for example, report generation, which may take minutes), and such transactions perform poorly under optimistic concurrency control in the presence of conflicts. Operations that require synchronization, such as transactional reads, acquire locks in the lock table; other operations bypass the lock table.
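As a rough illustration of the lock table's role, the hypothetical Go sketch below maps key ranges to a shared/exclusive lock state that transactional reads and writes must acquire. It ignores overlapping ranges, waiter queues, and the wound-wait policy mentioned later, so it is a simplification rather than Spanner's implementation.

```go
package locktable

import "sync"

// LockMode distinguishes shared (read) locks from exclusive (write) locks,
// as used by two-phase locking.
type LockMode int

const (
	Shared LockMode = iota
	Exclusive
)

// KeyRange is a contiguous range of keys [Start, End).
type KeyRange struct {
	Start, End string
}

// lockState records which transactions hold this range and in what mode.
type lockState struct {
	mode    LockMode
	holders map[uint64]bool // transaction IDs
}

// LockTable lives only on the Paxos leader, so it is purely in-memory
// and is rebuilt whenever leadership moves.
type LockTable struct {
	mu     sync.Mutex
	ranges map[KeyRange]*lockState
}

func New() *LockTable {
	return &LockTable{ranges: make(map[KeyRange]*lockState)}
}

// TryAcquire grants the lock if the range is free or the modes are
// compatible (shared with shared); a real lock table would also detect
// overlapping ranges, queue waiters, and apply wound-wait.
func (lt *LockTable) TryAcquire(txn uint64, r KeyRange, mode LockMode) bool {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	st, ok := lt.ranges[r]
	if !ok {
		lt.ranges[r] = &lockState{mode: mode, holders: map[uint64]bool{txn: true}}
		return true
	}
	if st.mode == Shared && mode == Shared {
		st.holders[txn] = true
		return true
	}
	return false
}
```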

At every leader replica, each spanserver also implements a transaction manager to support distributed transactions. The transaction manager is used to implement a participant leader; the other replicas in the group act as participant followers. If a transaction involves only one Paxos group (as most transactions do), it can bypass the transaction manager, since the lock table and Paxos together provide sufficient transactionality. If a transaction involves more than one Paxos group, the leaders of those groups coordinate to perform a two-phase commit. One of the participating groups is chosen as the coordinator; the leader of that group is called the coordinator leader, and the followers of that group are the coordinator followers. The state of each transaction manager is stored in the underlying Paxos group (and is therefore replicated).

Directories and Placement#

On top of the multiset of key-value mappings, Spanner's implementation supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix (the term directory is a historical artifact; a better term would be bucket). Supporting directories allows applications to control the locality of their data by choosing keys carefully. A directory is the unit of data placement: all data in a directory has the same replication configuration, and data is moved between Paxos groups directory by directory, as shown in the figure:

Directory Movement

Spanner might move a directory out of a Paxos group to shed load, put directories that are frequently accessed together into the same group, or move a directory closer to its accessors. Directories can be moved while client operations are ongoing; a 50MB directory can be moved in a few seconds.

A Paxos group may contain multiple directories, which is one way a Spanner tablet differs from a Bigtable tablet: the former is not necessarily a single lexicographically contiguous partition of the row space. Instead, a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that multiple directories that are frequently accessed together can be colocated.

Movedir is the background task used to move directories between Paxos groups. Movedir is also used to add replicas to and remove replicas from Paxos groups, because Spanner does not yet support Paxos-based configuration changes. Movedir is not implemented as a single transaction, so that it does not block ongoing reads and writes during a bulky data move. Instead, movedir registers that it is starting to move data and moves it in the background; when it has moved all but a nominal amount of the data, it uses a transaction to atomically move that remaining small amount and update the metadata of the two Paxos groups.

A directory is also the smallest unit for which an application can specify geographic replication properties (or placement policies). The design of our placement-specification language separates the responsibilities for managing replication configurations. Administrators control two dimensions: the number and types of replicas, and the geographic placement of those replicas. They create a menu of named options in these two dimensions (for example: North America, five replicas plus one witness) (note: witness replicas generally participate in leader-election voting and arbitration, and help prevent split brain; see the Cloud documentation). Applications control how data is replicated by tagging each database and/or individual directories with a combination of these options. For example, an application might store each end user's data in its own directory, so that user A's data has three replicas in Europe while user B's data has five replicas in North America.
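As a hypothetical illustration of such a placement menu (the option names and fields below are invented for this post and are not Spanner's actual configuration format):

```go
package placement

// ReplicaType distinguishes full replicas from witness replicas; witnesses
// vote in leader election but do not serve data.
type ReplicaType int

const (
	Full ReplicaType = iota
	Witness
)

// ReplicaSpec says how many replicas of which type go in which location.
type ReplicaSpec struct {
	Location string
	Type     ReplicaType
	Count    int
}

// Option is one named entry in the administrators' placement menu,
// e.g. "North America, five replicas plus one witness".
type Option struct {
	Name  string
	Specs []ReplicaSpec
}

// Applications then tag each database or directory with an option name:
// user A's directory gets three replicas in Europe while user B's
// directory gets five replicas in North America.
var directoryPolicy = map[string]string{
	"users/A": "eu-3-replicas",
	"users/B": "na-5-replicas",
}
```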

For clarity of exposition, we have oversimplified. In fact, if a directory grows too large, Spanner shards it into multiple fragments, and fragments may be served by different Paxos groups (and therefore by different servers). Movedir actually moves fragments between groups, not whole directories.

Data Model#

Spanner exposes the following data characteristics to applications: a schema-based semi-relational table data model, a query language, and general transactions. The shift towards supporting these features is driven by various factors.

The need for schema-based semi-relational tables and synchronous replication is supported by the popularity of Megastore. At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable's and because it supports synchronous replication across data centers (Bigtable only supports eventually consistent replication across data centers). Well-known applications that use Megastore include Gmail, Picasa, Calendar, Android Market, and AppEngine. Given the popularity of Dremel as an interactive data-analysis tool, the need for a SQL-like query language in Spanner is also clear.

Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; building Percolator was in part a response to this. Some authors have claimed that general two-phase commit is too expensive to support, because of the performance and availability problems it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

The application's data model is built on a key-value mapping with directories as buckets. An application can create one or more databases within a universe. Each database can contain an unlimited number of structured tables. Tables are similar to relational data tables, containing rows, columns, and versioned values. We will not delve into the details of Spanner's query language. It resembles SQL and has some extensions to support protocol-buffer value fields.

Spanner's data model is not purely relational because each row must have a name. More accurately, each table needs to have a set of one or more ordered primary key columns. This requirement makes Spanner still look like a key-value store: the primary key constitutes the name of the row, and each table defines a mapping from primary key columns to non-primary key columns. A row only exists when a value is defined for the primary key (even if it is NULL). Enforcing this structure is useful because it allows applications to control data locality by selecting keys.

Spanner Example Structure

The figure above contains an example Spanner schema: photo metadata stored on a per-user, per-album basis. The schema language is similar to Megastore's, with the additional requirement that every Spanner database must be partitioned by clients into one or more hierarchies of tables.

Client applications declare the hierarchies in database schemas via the INTERLEAVE IN declaration. The table at the top of a hierarchy is a directory table. Each row in the directory table with key K, together with all of the rows in descendant tables that start with K in lexicographic order, forms a directory.

ON DELETE CASCADE says that deleting a row in the directory table also deletes any associated child rows. The figure also illustrates the interleaved layout of the example database: for instance, Albums(2,1) denotes the row in the Albums table for user_id == 2 and album_id == 1. This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.
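For reference, the example schema in the figure is roughly the following (reconstructed from the paper, so treat the details as approximate): Users is the directory table, and Albums rows are interleaved under their parent Users rows.

```
CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;
```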

TrueTime#

This section sketches the TrueTime API and its implementation. We leave most of the details for another paper; the goal of this section is to demonstrate the power that such an API brings us.

| Method | Returns |
| --- | --- |
| TT.now() | TTinterval: [earliest, latest] |
| TT.after(t) | true if t has definitely passed |
| TT.before(t) | true if t has definitely not arrived |

The table above lists the methods of the TrueTime API. TrueTime explicitly represents time as a $TTinterval$, an interval with bounded time uncertainty (unlike standard library time interfaces, which give clients no notion of uncertainty). The endpoints of a $TTinterval$ are of type $TTstamp$. The $TT.now()$ method returns a $TTinterval$ that is guaranteed to contain the absolute time at which $TT.now()$ was invoked. The time epoch is analogous to UNIX time with leap-second smearing. Define the instantaneous error bound as $\varepsilon$, which is half of the interval's width, and the average error bound as $\bar{\varepsilon}$. The $TT.after()$ and $TT.before()$ methods are convenience wrappers around $TT.now()$.

Let $t_{abs}(e)$ denote the absolute time at which an event $e$ occurs. More formally, TrueTime guarantees that for any invocation $tt = TT.now()$, $tt.earliest \le t_{abs}(e_{now}) \le tt.latest$, where $e_{now}$ is the invocation event.
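A minimal Go sketch of the interface the table describes; the type names and the time.Time representation are assumptions made for illustration (the real API is internal to Google):

```go
package truetime

import "time"

// Interval is a TTinterval: a bound on the absolute time at which
// Now() was invoked, i.e. Earliest <= t_abs(e_now) <= Latest.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock abstracts a TrueTime source; a real implementation would be
// backed by the GPS and atomic-clock time masters described below.
type Clock interface {
	Now() Interval
}

// After reports whether timestamp t has definitely passed.
func After(c Clock, t time.Time) bool {
	return t.Before(c.Now().Earliest)
}

// Before reports whether timestamp t has definitely not arrived yet.
func Before(c Clock, t time.Time) bool {
	return c.Now().Latest.Before(t)
}
```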

The underlying time references used by TrueTime are GPS and atomic clocks. TrueTime uses two forms of time reference because they have different failure modes. Vulnerabilities of GPS reference sources include antenna and receiver failures, local radio interference, correlated failures (e.g., design faults such as incorrect leap-second handling, and spoofing), and GPS system outages. Atomic clocks can fail in ways uncorrelated with GPS and with each other, and over long periods of time can drift significantly due to frequency error.

TrueTime is implemented by a set of time master machines in each data center and timeslave processes on each machine. Most masters are equipped with GPS receivers with dedicated antennas; these masters are physically isolated to reduce the impact of antenna failures, radio interference, and spoofing. The remaining masters (which we call Armageddon masters) are equipped with atomic clocks. An atomic clock is not that expensive: the cost of an Armageddon master is comparable to that of a GPS master.

All master time references are periodically compared with each other, and each master cross-checks the advancement rate of its reference time against its local clock, and will exit if there is a significant deviation. Between two synchronizations, Armageddon masters declare a slowly increasing time uncertainty, conservatively estimated based on clock drift in the worst-case scenario. The uncertainty declared by GPS masters is generally close to zero.

Each daemon polls multiple masters to reduce the vulnerability of relying on any single master node. Some are GPS masters selected from nearby data centers; the remaining GPS masters come from more distant data centers; and of course, there are some Armageddon masters. The daemon uses a variant of the Marzullo algorithm to detect and reject false information, synchronizing the local machine clock with non-liars. To prevent local clock failures, machines with frequency offsets exceeding the worst-case error bounds for the corresponding components and operating environments are excluded.

Between synchronizations, a daemon advertises a slowly increasing time uncertainty. $\varepsilon$ is derived from conservatively applied worst-case local clock drift, and it also depends on the time masters' uncertainty and the communication delay to the time masters. In our production environment, $\varepsilon$ is typically a sawtooth function of time, varying from about 1ms to 7ms over each polling interval, so $\varepsilon$ is about 4ms most of the time. The daemon's polling interval is currently 30 seconds, and the currently applied drift rate is 200 microseconds/second, which together account for the sawtooth range of 0 to 6ms; the remaining 1ms comes from the communication delay to the time masters. In failure scenarios, the sawtooth can shift: for example, occasional time-master unavailability can cause data-center-wide increases in $\varepsilon$, and overloaded machines or network links can cause occasional local spikes.

Concurrency Control#

This section describes how TrueTime is used to guarantee the correctness properties around concurrency control, and how those properties are used to implement features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads of the past. These features enable, for example, the guarantee that a whole-database audit read at a timestamp $t$ will see exactly the effects of every transaction committed as of $t$.

It is important to distinguish between writes seen by Paxos (subsequently referred to as Paxos writes) and Spanner client writes. For example, a two-phase commit generates a Paxos write during the prepare phase but does not correspond to a Spanner client write.

Timestamp Management#

| Operation | Concurrency Control | Replica Required |
| --- | --- | --- |
| Read-Write Transaction | pessimistic | leader |
| Read-Only Transaction | lock-free | leader for timestamp; any for read |
| Snapshot Read, client-provided timestamp | lock-free | any |
| Snapshot Read, client-provided bound | lock-free | any |

The table above lists the types of operations that Spanner supports. The Spanner implementation supports read-write transactions, read-only transactions (pre-declared snapshot-isolation transactions), and snapshot reads. Standalone writes are implemented as read-write transactions; non-snapshot standalone reads are implemented as read-only transactions. Both are retried internally (clients do not need to write their own retry loops).

A read-only transaction is a kind of transaction that has the performance benefits of snapshot isolation. A read-only transaction must be pre-declared as not having any writes; it is not simply a read-write transaction without any writes. Reads in a read-only transaction execute lock-free at a system-chosen timestamp, so they do not block incoming writes. Reads in a read-only transaction can execute on any replica that is sufficiently up to date.

A snapshot read is a read in the past that also executes without locking. A client can either specify a timestamp for a snapshot read, or provide an upper bound on the desired timestamp's staleness and let Spanner choose a timestamp. In either case, the snapshot read executes at a replica that is sufficiently up to date.

For both read-only transactions and snapshot reads, once a timestamp has been chosen, the commit is inevitable, unless the data at that timestamp has been garbage collected. As a result, clients can avoid buffering results inside a retry loop. When a server fails, the client can internally continue the query on a different server by repeating the timestamp and the current read position.

Paxos Leader Leases#

Spanner's Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default). A potential leader sends requests for timed lease votes; upon receiving a quorum of lease votes, it knows it has a lease. A replica implicitly extends its lease vote on a successful write, and the leader requests lease-vote extensions if they are close to expiring.

Define a leader's lease interval: from the moment it discovers it has received a sufficient quorum of lease votes until it no longer has a sufficient quorum of lease votes (because some votes have expired).

Spanner relies on the following disjointness invariant: for each Paxos group, the lease intervals of each Paxos leader and every other leader do not overlap (Appendix A describes how to maintain this invariant).

Spanner's implementation permits a Paxos leader to abdicate by releasing its lease votes. To preserve the disjointness invariant, Spanner constrains when abdication is allowed. Define $s_{max}$ as the maximum timestamp used by a leader; later sections describe when $s_{max}$ is advanced. Before abdicating, a leader must wait until $TT.after(s_{max})$ is true.

Assigning Timestamps to RW Transactions#

Read-write transactions use two-phase locking. As a result, they can be assigned timestamps at any time after all locks have been acquired but before any locks have been released. For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write representing the transaction's commit.

Spanner relies on the following monotonicity invariant: within each Paxos group, Spanner assigns timestamps to Paxos writes in a monotonically increasing order, even across leaders (note: the leader may change to another node). A single leader replica can easily assign timestamps in monotonically increasing order. By leveraging the disjointness invariant, this property can be maintained among various leaders: a leader can only assign timestamps within its leader lease interval.

Note that whenever a timestamp $s$ is assigned, $s_{max}$ is advanced to $s$ to preserve disjointness. Spanner also enforces the following external-consistency invariant: if the start of a transaction $T_2$ occurs after the commit of a transaction $T_1$, then the commit timestamp of $T_2$ must be greater than the commit timestamp of $T_1$.

Define the start and commit events of a transaction $T_i$ as $e_i^{start}$ and $e_i^{commit}$, and its commit timestamp as $s_i$; then the invariant becomes:

$$t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \Rightarrow s_1 < s_2$$

The protocol for executing transactions and assigning timestamps obeys two rules, which together guarantee this invariant, as follows. Define the arrival event of the commit request for a write transaction $T_i$ at the coordinator leader as $e_i^{server}$.

  1. Start: The coordinator leader assigns a commit timestamp $s_i$ to the write transaction $T_i$ that is no less than the value of $TT.now().latest$ computed after $e_i^{server}$. Note that the participant leaders do not matter here; their involvement in the implementation of the second rule is described later.
  2. Commit Wait: The coordinator leader ensures that no client can see any data committed by $T_i$ until $TT.after(s_i)$ is true. Commit wait ensures that $s_i$ is less than the absolute commit time of $T_i$, i.e., $s_i < t_{abs}(e_i^{commit})$; the implementation of commit wait is described later. The proof sketch is shown below:

Proof
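Reconstructed from the paper's proof sketch (treat this as an approximation of the original figure), the argument chains the two rules together with causality and transitivity:

$$
\begin{aligned}
s_1 &< t_{abs}(e_1^{commit}) && \text{(commit wait)} \\
t_{abs}(e_1^{commit}) &< t_{abs}(e_2^{start}) && \text{(assumption)} \\
t_{abs}(e_2^{start}) &\le t_{abs}(e_2^{server}) && \text{(causality)} \\
t_{abs}(e_2^{server}) &\le s_2 && \text{(start)} \\
s_1 &< s_2 && \text{(transitivity)}
\end{aligned}
$$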

Serving Reads at a Timestamp#

The monotonicity invariant described above allows Spanner to correctly determine whether a replica's state is sufficiently up to date to satisfy a read. Every replica tracks a value called safe time, $t_{safe}$, which is the maximum timestamp at which the replica is up to date. A replica can satisfy a read at a timestamp $t$ if $t \le t_{safe}$.

Define $t_{safe} = \min(t_{safe}^{Paxos}, t_{safe}^{TM})$, where each Paxos state machine has a safe time $t_{safe}^{Paxos}$ and each transaction manager has a safe time $t_{safe}^{TM}$. $t_{safe}^{Paxos}$ is simpler: it is the timestamp of the highest-applied Paxos write. Because timestamps increase monotonically and writes are applied in order, with respect to Paxos no further writes will occur at or below $t_{safe}^{Paxos}$.

If there are no prepared-but-not-committed transactions at the replica (i.e., transactions in between the two phases of two-phase commit), then $t_{safe}^{TM}$ is $\infty$ (for a participant follower, $t_{safe}^{TM}$ actually refers to the transaction manager of the replica's leader, whose state the follower can infer from metadata passed on Paxos writes). If such transactions do exist, their effect on the state is indeterminate: a participant replica does not yet know whether such a transaction will commit. As described earlier, the commit protocol ensures that every participant knows a lower bound on a prepared transaction's timestamp. Every participant leader for a group $g$ assigns a prepare timestamp $s_{i,g}^{prepare}$ to the transaction's prepare record, and the coordinator leader ensures that the transaction's commit timestamp satisfies $s_i \ge s_{i,g}^{prepare}$ over all participant groups $g$. Therefore, over all transactions $T_i$ prepared at $g$, $t_{safe}^{TM} = \min_i(s_{i,g}^{prepare}) - 1$.
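A simplified Go sketch of how a replica might combine the two quantities above (hypothetical code; it assumes prepare timestamps are tracked per in-flight transaction and ignores the fine-grained refinement described later):

```go
package safetime

import "math"

// Replica tracks the inputs to t_safe described above.
type Replica struct {
	lastAppliedPaxosTS int64   // t_safe^Paxos: max timestamp of applied Paxos writes
	preparedTS         []int64 // prepare timestamps of prepared-but-uncommitted transactions
}

// SafeTime returns t_safe = min(t_safe^Paxos, t_safe^TM).
func (r *Replica) SafeTime() int64 {
	tTM := int64(math.MaxInt64) // stands in for infinity when nothing is prepared
	for _, p := range r.preparedTS {
		if p-1 < tTM {
			tTM = p - 1 // t_safe^TM = min_i(s_i,g^prepare) - 1
		}
	}
	if r.lastAppliedPaxosTS < tTM {
		return r.lastAppliedPaxosTS
	}
	return tTM
}

// CanServeRead reports whether a read at timestamp t can be served here.
func (r *Replica) CanServeRead(t int64) bool {
	return t <= r.SafeTime()
}
```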

Assigning Timestamps to RO Transactions#

The execution of a read-only transaction proceeds in two phases: assign a timestamp $s_{read}$, and then execute the transaction's reads as snapshot reads at $s_{read}$. The snapshot reads can execute at any replica that is sufficiently up to date. The simple assignment $s_{read} = TT.now().latest$, made at any time after the transaction starts, preserves external consistency by an argument analogous to the one given for writes above. However, such a timestamp may force the reads at $s_{read}$ to block if $t_{safe}$ has not advanced far enough (in addition, choosing a value for $s_{read}$ may also advance $s_{max}$ to preserve disjointness). To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency. Section 4.2.2 explains how such a timestamp can be chosen.

Details#

This section explains some of the practical details of read-write and read-only transactions that were glossed over earlier, as well as a special transaction type used to implement atomic schema changes. It then describes some refinements of the basic schemes presented so far.

Read-Write Transaction#

Like Bigtable, writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction's writes. This design works well in Spanner because a read returns the timestamp of any data it reads, and uncommitted writes have not yet been assigned timestamps.

Reads within read-write transactions use wound-wait to avoid deadlocks. The client issues reads to the leader of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keepalive messages to prevent participant leaders from timing out its transaction. When a client has completed all reads and buffered all writes, it begins two-phase commit. The client chooses a coordinator group and sends a commit message to the leader of each participant, with the identity of the coordinator and any buffered writes. Having the client drive two-phase commit avoids sending data twice across wide-area links.

A non-coordinator participant leader first acquires write locks, then chooses a prepare timestamp that must be greater than any timestamp it has assigned to previous transactions (to preserve monotonicity), and logs a prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.

The coordinator leader also first acquires write locks, but skips the prepare phase. After hearing from all other participant leaders, it chooses a timestamp for the entire transaction. The commit timestamp $s$ must be greater than or equal to all prepare timestamps (to satisfy the constraint described above), greater than $TT.now().latest$ at the time the coordinator received its commit message, and greater than any timestamp the leader has assigned to previous transactions (again, to preserve monotonicity). The coordinator leader then logs a commit record through Paxos (or aborts if it timed out while waiting for the other participants).

Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until $TT.after(s)$ is true, so as to obey the commit-wait rule described in Section 4.1.2. Because the coordinator leader chose $s$ based on $TT.now().latest$ and now waits until that timestamp is guaranteed to be in the past (note: it must wait until the current absolute time is definitely greater than $s$), the expected wait is at least $2\bar{\varepsilon}$ (note: twice the average error bound from Section 3). This wait is typically overlapped with the Paxos communication. After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders, and each participant leader logs the transaction's outcome through Paxos. All participants apply at the same timestamp and then release their locks.
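Putting the timestamp rules together, here is a hypothetical Go sketch of the coordinator-leader logic for choosing the commit timestamp and performing commit wait; the interface and names are invented for illustration and only mirror the rules stated in the text.

```go
package commit

import "time"

// TT abstracts the two TrueTime operations the coordinator needs.
type TT interface {
	NowLatest() time.Time   // TT.now().latest
	After(t time.Time) bool // TT.after(t)
}

// maxTime returns the later of two timestamps.
func maxTime(a, b time.Time) time.Time {
	if a.After(b) {
		return a
	}
	return b
}

// ChooseCommitTimestamp picks s >= every participant's prepare timestamp,
// >= TT.now().latest at the time the coordinator received its commit
// message, and > every timestamp this leader has previously assigned.
func ChooseCommitTimestamp(tt TT, prepareTS []time.Time, lastAssigned time.Time) time.Time {
	s := maxTime(tt.NowLatest(), lastAssigned.Add(time.Nanosecond))
	for _, p := range prepareTS {
		s = maxTime(s, p)
	}
	return s
}

// CommitWait blocks until TT.after(s) is true, so that no client can see
// data committed at s before s is guaranteed to be in the past. The
// expected wait is about twice the average error bound.
func CommitWait(tt TT, s time.Time) {
	for !tt.After(s) {
		time.Sleep(time.Millisecond)
	}
}
```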

Read-Only Transactions#

Assigning a timestamp requires a negotiation phase among all of the Paxos groups involved in the reads. As a result, Spanner requires every read-only transaction to have a scope expression that summarizes the set of keys the whole transaction will read; Spanner automatically infers the scope for standalone queries. If the scope's values are served by a single Paxos group, the client issues the read-only transaction to that group's leader (the current Spanner implementation only chooses a timestamp for a read-only transaction at a Paxos leader). That leader assigns $s_{read}$ and then executes the reads. For a single-site read, Spanner has a better way to choose a timestamp than simply calling $TT.now().latest$. Define $LastTS()$ as the timestamp of the last committed write at the Paxos group. If there are no prepared transactions, the assignment $s_{read} = LastTS()$ trivially satisfies external consistency: the transaction will see the result of the last write, and is therefore ordered after it.

If the scope's values are served by multiple Paxos groups, there are several options. The most complex option is to do a round of communication with all of the groups' leaders and negotiate $s_{read}$ based on $LastTS()$. Spanner currently implements a simpler choice that spares the client a negotiation round: it simply has its reads execute at $s_{read} = TT.now().latest$ (which may have to wait for the safe time to advance). All reads in the transaction can then be sent to replicas that are sufficiently up to date.
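A hypothetical sketch of the timestamp choice just described; LastTS and the TrueTime value are passed in as stand-ins for state the leader would actually track.

```go
package rotxn

import "time"

// ChooseReadTimestamp picks s_read for a read-only transaction.
//   - If the scope is served by a single Paxos group with no prepared
//     transactions, the group leader can simply use LastTS(), the
//     timestamp of the last committed write.
//   - Otherwise Spanner avoids a negotiation round and uses
//     TT.now().latest, which may force reads to wait for safe time
//     to catch up.
func ChooseReadTimestamp(singleGroup, hasPrepared bool, lastTS, ttNowLatest time.Time) time.Time {
	if singleGroup && !hasPrepared {
		return lastTS
	}
	return ttNowLatest
}
```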

Schema-Change Transactions#

TrueTime enables Spanner to support atomic schema changes. Using a standard transaction would be infeasible, because the number of participants (the number of groups in a database) could be in the millions. Bigtable supports atomic schema changes within one data center, but its schema changes block all operations. In contrast, a Spanner schema-change transaction is a generally non-blocking variant of a standard transaction.

First, it is explicitly assigned a timestamp in the future, which is registered in the prepare phase; as a result, schema changes across thousands of servers can complete with minimal disruption to other concurrent activity. Second, reads and writes implicitly depend on the schema.

Any reads and writes, which implicitly depend on the schema, synchronize with any registered schema-change timestamp $t$: they may proceed if their timestamps precede $t$, but they must block behind the schema-change transaction if their timestamps are after $t$ (note: since their timestamps are greater than $t$, commit wait requires them to wait for the schema-change transaction to complete). Without TrueTime (a stable, globally consistent time API), defining the schema change to happen at time $t$ would be meaningless.

Refinements#

$t_{safe}^{TM}$ as defined above has a weakness: a single prepared transaction prevents $t_{safe}$ from advancing. As a result, no reads can occur at later timestamps, even if the reads do not conflict with that transaction. Such false conflicts can be removed by augmenting $t_{safe}^{TM}$ with a fine-grained mapping from key ranges to prepared-transaction timestamps. This information can be stored in the lock table, which already maps key ranges to lock metadata. When a read arrives, it only needs to be checked against the fine-grained safe time for the key ranges with which it conflicts.

$LastTS()$ as defined above has a similar weakness: if a transaction has just committed, a non-conflicting read-only transaction must still be assigned $s_{read}$ so as to follow that transaction. As a result, the read may be delayed (note: the just-committed transaction may not yet have been applied at every replica, so reading from a replica may require waiting). This weakness can be remedied similarly by augmenting $LastTS()$ with a fine-grained mapping from key ranges to commit timestamps in the lock table (we have not yet implemented this optimization). When a read-only transaction arrives, its timestamp can be assigned as the maximum value of $LastTS()$ over the key ranges with which it conflicts, unless there is a conflicting prepared transaction (which can be determined from the fine-grained safe time).

$t_{safe}^{Paxos}$ as defined above has a weakness: it cannot advance in the absence of Paxos writes (note: $t_{safe}^{Paxos}$ is the timestamp of the last applied write). That is, a snapshot read at time $t$ cannot execute at a Paxos group whose last write happened before $t$. Spanner exploits the disjointness of leader-lease intervals to address this problem. Each Paxos leader advances $t_{safe}^{Paxos}$ by keeping a threshold above which future writes' timestamps will occur: it maintains a mapping $MinNextTS(n)$ from Paxos sequence number $n$ to the minimum timestamp that may be assigned to Paxos sequence number $n+1$. A replica can advance its $t_{safe}^{Paxos}$ to $MinNextTS(n) - 1$ once it has applied through sequence number $n$.

A single leader can enforce its $MinNextTS()$ promises easily, because the timestamps promised by $MinNextTS()$ lie within the leader's lease, and the disjointness invariant (note: leaders' lease intervals do not overlap) enforces $MinNextTS()$ promises across leaders. If a leader wishes to advance $MinNextTS()$ beyond the end of its leader lease, it must first extend its lease. Note that $s_{max}$ is always advanced to the highest value in $MinNextTS()$ to preserve disjointness.

By default, a leader advances $MinNextTS()$ values every 8 seconds, so in the absence of prepared transactions, healthy followers in an idle Paxos group can serve reads at timestamps that are, in the worst case, no more than 8 seconds old. A leader may also advance $MinNextTS()$ values on demand from followers.
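A rough Go sketch of this mechanism (hypothetical code; the data structures and names are invented, and the lease and $s_{max}$ bookkeeping mentioned above is reduced to comments):

```go
package minnextts

import "sync"

// Leader tracks MinNextTS promises; replicas consult them to advance
// t_safe^Paxos on idle groups.
type Leader struct {
	mu         sync.Mutex
	minNextTS  map[uint64]int64 // Paxos sequence number n -> MinNextTS(n)
	nextAssign int64            // lower bound for the next write's timestamp
}

// Promise records MinNextTS(n): no write at sequence number n+1 will be
// assigned a timestamp below ts. The leader must stay within its lease
// (or extend it) before promising, and s_max must be advanced past ts
// to preserve disjointness.
func (l *Leader) Promise(n uint64, ts int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.minNextTS == nil {
		l.minNextTS = make(map[uint64]int64)
	}
	l.minNextTS[n] = ts
	if ts > l.nextAssign {
		l.nextAssign = ts
	}
}

// SafeTimeAfterApply is what a replica may advance t_safe^Paxos to after
// applying Paxos sequence number n: MinNextTS(n) - 1.
func (l *Leader) SafeTimeAfterApply(n uint64) (int64, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	ts, ok := l.minNextTS[n]
	if !ok {
		return 0, false
	}
	return ts - 1, true
}
```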

Evaluation#

Omitted...

Related Work#

(Note: This section mostly compares Spanner with other databases; citation links can be found in the original paper.)

Megastore and DynamoDB already provide storage services with consistency guarantees across data centers. DynamoDB presents a key-value interface and only replicates within a region. Spanner follows Megastore in providing a semi-relational data model, and even a similar schema language. Megastore does not achieve high performance: it is layered on top of Bigtable, which imposes high communication costs, and it does not support long-lived leaders, so multiple replicas may initiate writes. Writes from different replicas necessarily conflict in the Paxos protocol, even if they do not logically conflict, and the throughput of a Paxos group collapses at several writes per second. Spanner provides higher-performance general-purpose transactions and external consistency.

The idea of layering transactions on top of replicated storage dates back at least to Gifford's dissertation. Scatter is a recent DHT-based key-value store that layers transactions on top of consistent replication; Spanner focuses on providing a higher-level interface than Scatter does. Gray and Lamport describe a non-blocking commit protocol based on Paxos; their protocol incurs more messaging than two-phase commit, which would aggravate the cost of commit over widely distributed groups. Walter provides a variant of snapshot isolation that works within, but not across, data centers. In contrast, our read-only transactions provide more natural semantics, because we provide external consistency for all operations.

There has recently been a great deal of work on reducing or eliminating locking overhead. Calvin eliminates concurrency control: it pre-assigns timestamps and then executes all transactions in timestamp order. H-Store and Granola each support their own classification of transaction types, some of which can avoid locking. None of these systems provides external consistency. Spanner addresses the contention issue by providing support for snapshot isolation.

VoltDB is an in-memory sharded database that supports asynchronous master-slave replication for disaster recovery but does not support more general replication configurations. This is an example of NewSQL and a market-driven effort to support scalable SQL.

Many databases have implemented the functionality of reading in the past, such as MarkLogic and Oracle's Total Recall; Lomet and Li describe a strategy for implementing such temporal databases.

Farsite derives bounds on clock uncertainty (much looser than TrueTime's) relative to a trusted clock reference: server leases in Farsite are maintained in the same way that Spanner maintains Paxos leases. Loosely synchronized clocks have been used for concurrency control in prior work; we have shown that TrueTime lets one reason about global time across sets of Paxos state machines.

Future Work#

We spent most of last year working with the F1 team to migrate Google's advertising backend from MySQL to Spanner. We are actively improving its monitoring and support tools, as well as tuning its performance. In addition, we have been improving the functionality and performance of our backup/restore system. We are currently implementing the Spanner schema language, automatic maintenance of secondary indexes, and automatic load-based resharding. Longer term, there are several features we plan to investigate. Optimistically doing reads in parallel may be a valuable strategy to pursue, but initial experiments indicate that a correct implementation is non-trivial. In addition, we plan to eventually support direct changes to Paxos configurations.

Given that we expect many applications to replicate their data across data centers that are relatively close to each other, TrueTime $\varepsilon$ may noticeably affect performance. We see no insurmountable obstacle to reducing $\varepsilon$ below 1ms: time-master query intervals can be reduced, and better quartz clocks are relatively cheap. Time-master query latency can be improved with better networking technology, or perhaps even avoided through alternative time-distribution technology.

Finally, there are obvious areas for improvement. Although Spanner is scalable in the number of nodes, the node-local data structures have relatively poor performance on complex SQL queries because they were designed for simple key-value accesses. Algorithms and data structures from the database literature could improve single-node performance a great deal. Second, moving data automatically between data centers in response to changes in client load has long been a goal of ours, but to make that goal effective we would also need the ability to move client application processes between data centers in an automated, coordinated fashion. Moving processes raises the even more difficult problem of managing resource acquisition and allocation between data centers.

Conclusions#

In summary, Spanner combines and extends research from two communities: it gains a familiar, user-friendly semi-relational interface, transactions, and SQL-based query language from the database community; and scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution from the systems community.

We have spent more than five years iterating to the current design and implementation since Spanner was started. Part of the reason for this long iteration is that we slowly came to realize that Spanner should not only tackle the problem of a globally replicated namespace, but should also focus on the database features that Bigtable was missing.

Our design indicates that the key factor in implementing various features of Spanner is TrueTime. We have demonstrated that materializing clock uncertainty in a time API allows for building distributed systems with more robust temporal semantics. Furthermore, as the underlying systems impose stronger constraints on clock uncertainty, the overhead of implementing strong semantics will also decrease. As a community, we should no longer rely on loosely synchronized clocks and weak time APIs when designing distributed algorithms.
