Continuing the translation of the paper: Chubby is an important component that many distributed systems within Google rely on. See the original text here
Design#
Rationale#
First, the paper compares two approaches: a client-side Paxos library and a centralized lock service.
A client Paxos library would depend on no other services and would give programmers a standard framework, but developers rarely plan for high availability during early prototyping, and their programs are not structured around consensus protocols. A lock service makes it easier to keep existing program structure and communication patterns, and it does not require a client to maintain a quorum of its own replicas in order to make progress, which reduces the number of servers needed to build a reliable client system. For these reasons the author chose a lock service, and additionally allowed it to store small files so that it can be used for tasks such as electing a primary and advertising the result. The author also describes several decisions driven by expected use and environment: support for large numbers of clients observing a file, an event-notification mechanism, client-side caching of files, and security mechanisms.
System structure#
Chubby mainly consists of two components that communicate via RPC: the server and the client library. All RPC communication between the client and server is coordinated through the client library.
A Chubby cell is a collection of multiple (usually five) servers (replicas) placed in different locations (such as different racks) to reduce the likelihood of correlated failures. The replicas use a distributed consensus algorithm to elect a master, which must obtain votes from the majority of replicas and a commitment that no other master will be elected within a certain time (master lease). As long as the master continues to win the majority of votes from the replicas, the lease will be renewed periodically.
Each replica maintains a copy of a simple database, but only the master can perform read and write operations on the database, while other replicas copy updates from the master through the consensus protocol. Clients find the master by sending a master location request to the replicas listed in DNS. Non-master nodes receiving the request will return the identifier of the master. Once the client locates the master, all requests will be sent to it until the node stops responding or identifies itself as no longer the master.
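As a rough sketch of the master-location step just described, a client might resolve the cell's DNS name to its replicas and ask each one in turn for the master. The askForMaster helper and the cell name below are hypothetical, and a real client would issue an RPC rather than return a placeholder.

```go
// Hypothetical sketch of master location; names, types, and the
// askForMaster helper are illustrative, not Chubby's real client API.
package main

import (
	"errors"
	"fmt"
	"net"
)

// askForMaster sends a master-location request to one replica.
// A non-master replica is assumed to answer with the master's identity.
func askForMaster(replica string) (master string, err error) {
	// Placeholder: a real client would issue an RPC here.
	return replica, nil
}

// findMaster resolves the cell's DNS name to its replicas and asks
// each one for the current master until an answer is obtained.
func findMaster(cellDNSName string) (string, error) {
	addrs, err := net.LookupHost(cellDNSName)
	if err != nil {
		return "", err
	}
	for _, replica := range addrs {
		if master, err := askForMaster(replica); err == nil {
			return master, nil
		}
	}
	return "", errors.New("no replica answered the master-location request")
}

func main() {
	master, err := findMaster("chubby.example.com") // hypothetical cell name
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("requests will be sent to", master)
}
```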
Write requests are propagated to all replicas via the consensus protocol and are acknowledged only once the request has reached a majority of the replicas in the cell. Read requests are served by the master alone; this is safe as long as the master lease has not expired, because no other master can exist at the same time.
If a master fails, other replicas will execute the election protocol when their master lease expires, and a new master will typically be elected within a few seconds.
If a replica fails and does not recover within a few hours, a simple replacement system selects a fresh machine from an idle pool, starts the lock server process on it, and updates the corresponding replica's IP in the DNS table. The current master polls DNS periodically, eventually notices the change, and updates the cell's member list in the database; this list is kept consistent across all members through the normal replication protocol. Meanwhile, the new replica obtains a recent copy of the database from backups stored on file servers and updates from the active replicas. Once the new replica has processed a request that the current master is waiting to commit, it is permitted to vote in elections for a new master.
Files, directories, and handles#
Chubby provides a file system-like interface, but it is simpler than the UNIX interface. The entire system consists of a tree of slash-separated names, such as /ls/foo/wombat/pouch, where /ls is the prefix common to all names and stands for the lock service. The second component, foo, is the name of the Chubby cell and is resolved via DNS to one or more Chubby servers. A special cell name, local, indicates that the client should use the local cell, which is usually in the same building as the client. The remaining part, /wombat/pouch, is defined by the client.
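A small sketch of how such a name splits into its parts; the splitName helper is invented for illustration and is not part of the Chubby library.

```go
// Hypothetical sketch: splitting a Chubby name into its cell name and the
// client-defined remainder.
package main

import (
	"fmt"
	"strings"
)

// splitName returns the cell name and in-cell path of a Chubby name.
func splitName(name string) (cell, rest string, err error) {
	parts := strings.Split(name, "/") // "", "ls", cell, rest...
	if len(parts) < 3 || parts[0] != "" || parts[1] != "ls" {
		return "", "", fmt.Errorf("%q is not a Chubby name", name)
	}
	return parts[2], "/" + strings.Join(parts[3:], "/"), nil
}

func main() {
	cell, rest, _ := splitName("/ls/foo/wombat/pouch")
	fmt.Println(cell, rest) // foo /wombat/pouch
}
```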
Using this file system-like structure eliminates the need to implement basic browsing and namespace operation tools, reducing the cost of educating users.
Chubby's design departs from a traditional UNIX file system in ways that ease distribution. To allow files in different directories to be served by different Chubby masters, Chubby does not support moving files from one directory to another, does not maintain directory modification times, and avoids path-dependent permissions (the permissions on a file are controlled by the file itself, independent of the directory it sits in).
To simplify caching of file metadata, the system does not expose last-access times. The namespace contains only files and directories, collectively referred to as nodes. Each node has exactly one name within its cell; symbolic and hard links are not supported.
Nodes may be permanent or ephemeral, and any node can be deleted explicitly. Ephemeral nodes are deleted when no client has them open (or, for directories, when they are empty); ephemeral files can therefore be used as temporary files that indicate to others that a client is alive.
All nodes can act as advisory read/write locks. Each node has various metadata, including the names of three ACLs (access control lists) used to control reading, writing, and changing the node's ACL names.
Unless overridden, a node inherits the ACL names of its parent directory when it is created. ACLs are themselves files located in an ACL directory that is part of the cell's local namespace; these ACL files consist of simple lists of principal names. So if a file F's write ACL name is foo and the ACL directory contains a file named foo that includes the entry bar, then user bar is allowed to write F. Users are authenticated by a mechanism built into the RPC system.
Since Chubby's ACLs are simple files, they are automatically available for other services that want to use similar permission control mechanisms.
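As a toy illustration of the check just described (file F with write ACL name foo, ACL file foo listing bar), here is a minimal sketch; the in-memory aclFiles map and the mayWrite helper are assumptions made for the example, not part of Chubby.

```go
// Hypothetical sketch of Chubby-style ACL checking: an ACL name maps to
// a file in the cell's ACL directory whose contents list principal names.
package main

import "fmt"

// aclFiles stands in for the contents of the ACL directory.
var aclFiles = map[string][]string{
	"foo": {"bar", "baz"},
}

// mayWrite reports whether principal appears in the ACL file named by
// the node's write-ACL metadata.
func mayWrite(writeACLName, principal string) bool {
	for _, p := range aclFiles[writeACLName] {
		if p == principal {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(mayWrite("foo", "bar")) // true: user bar may write file F
}
```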
The metadata for each node contains four monotonically increasing 64-bit numbers that allow clients to easily check for changes:
- Instance number: greater than the instance number of any previous node with the same name (it increases when a node is deleted and re-created under the same name).
- Content generation number (only for files): incremented when writing file content
- Lock generation number: incremented when the node lock transitions from idle to held
- ACL generation number: incremented when the node's ACL name is written
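For concreteness, the per-node metadata described above might be represented roughly as in the sketch below; the struct and field names are assumptions, not Chubby's actual data layout.

```go
// Illustrative sketch of per-node metadata; field names are invented.
package main

import "fmt"

// NodeMeta holds the monotonically increasing counters a node carries,
// plus the three ACL names that govern access to it.
type NodeMeta struct {
	InstanceNum   uint64 // bumped when a node of the same name is re-created
	ContentGen    uint64 // files only: bumped on every content write
	LockGen       uint64 // bumped when the node's lock goes from free to held
	ACLGen        uint64 // bumped when the node's ACL names are written
	ReadACL       string // name of the ACL file controlling reads
	WriteACL      string // name of the ACL file controlling writes
	ChangeACLName string // name of the ACL file controlling ACL changes
}

func main() {
	m := NodeMeta{InstanceNum: 1, WriteACL: "foo"}
	fmt.Printf("%+v\n", m)
}
```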
Chubby also provides a 64-bit file content checksum, allowing clients to know if the file has changed. Clients open files to obtain handles similar to UNIX file descriptors. Handles contain:
- Check digits that prevent clients from creating or guessing handles, so full access-control checks are needed only when a handle is created (much as UNIX checks permission bits only at open time).
- A sequence number that allows the master to determine whether this handle was generated by itself or a previous master node.
- Mode information provided at open time, allowing a newly restarted master to recreate its state when presented with an old handle.
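Similarly, a handle could be sketched as a small record carrying the fields listed above; again the names and types are illustrative only.

```go
// Illustrative sketch of the information a handle carries; names and
// types are assumptions, not the real wire format.
package main

import "fmt"

// Handle is what a client gets back from Open.
type Handle struct {
	CheckDigits uint64 // makes handles hard to forge or guess
	Sequence    uint64 // lets a master tell its own handles from a predecessor's
	Mode        uint32 // open mode, so a new master can rebuild state from an old handle
}

func main() {
	h := Handle{Sequence: 42, Mode: 1} // values are arbitrary
	fmt.Println(h)
}
```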
Locks and sequencers#
Each Chubby file and directory behaves like a read-write lock: a client handle holds the lock in exclusive mode, or multiple client handles hold the lock in shared mode.
Here the locks are advisory: they conflict only with other attempts to acquire the same lock. Holding a lock called F is neither required to access the file F nor does it prevent other clients from doing so. Mandatory locks, by contrast, make the locked object inaccessible to any client not holding the lock; we rejected them for the following reasons:
- The resources protected by Chubby locks are implemented by other services, not just lock-related files. If mandatory locks need to be enforced, those services must be modified.
- We do not want to force users to shut down applications when they need to access locked files for debugging or management purposes (to remove the lock).
- Error checking is done in the conventional way, by writing assertions such as "lock X is held", so developers gain little from mandatory locking.
- Bugs or malicious actions have many opportunities to corrupt data without holding a lock, so we find that the additional guarantees provided by mandatory locks are not very valuable.
In Chubby, acquiring any type of lock requires write permission, so unauthorized readers cannot block writers' operations.
Locks in distributed systems are complex because communication is generally uncertain, and processes may fail independently. Thus, a process holding lock L may issue an R request and then fail, while another process may acquire lock L and perform some operations before the R request reaches its destination. If R arrives later, it may operate on inconsistent data without the protection of lock L. The problem of out-of-order message reception has been extensively studied, with solutions including virtual time and virtual synchrony, which avoid this problem by ensuring messages are processed in a consistent order as observed by all participants.
Introducing sequence numbers into every interaction of an existing complex system is costly, so Chubby provides a way to introduce them only into those interactions that use locks. At any time, a lock holder may request a sequencer: an opaque byte string describing the state of the lock immediately after acquisition. It contains the lock's name, the lock mode (exclusive or shared), and the lock generation number.
If a client wants a server to perform an operation protected by the lock, it passes the sequencer to that server. The server checks that the sequencer is still valid and has the appropriate mode; otherwise it rejects the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not want to maintain a session with Chubby, against the most recent sequencer it has observed. The sequencer mechanism requires only that a string be added to affected messages, and it is easy to explain to developers.
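Below is a minimal sketch of the check a file server might perform when it receives a request accompanied by a sequencer, under the assumption that the sequencer decodes into a lock name, mode, and generation number as described above; the Sequencer type and checkSequencer function are invented for the example.

```go
// Hypothetical sketch of sequencer checking on a server that receives
// requests from a Chubby lock holder. Types and fields are illustrative.
package main

import (
	"errors"
	"fmt"
)

// Sequencer is the opaque byte string from the paper, shown here as a
// decoded struct for clarity.
type Sequencer struct {
	LockName   string
	Exclusive  bool   // lock mode
	Generation uint64 // lock generation number
}

// checkSequencer compares a request's sequencer against the latest one
// the server has observed (or learned from its Chubby cache) for the lock.
func checkSequencer(req, observed Sequencer) error {
	if req.LockName != observed.LockName {
		return errors.New("sequencer names a different lock")
	}
	if req.Generation < observed.Generation {
		return errors.New("stale sequencer: lock has been reacquired since")
	}
	if !req.Exclusive {
		return errors.New("operation requires the lock in exclusive mode")
	}
	return nil
}

func main() {
	observed := Sequencer{LockName: "/ls/foo/service-lock", Exclusive: true, Generation: 7}
	stale := Sequencer{LockName: "/ls/foo/service-lock", Exclusive: true, Generation: 6}
	fmt.Println(checkSequencer(stale, observed)) // stale sequencer: ...
}
```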
Although sequencers are simple to use, important protocols evolve slowly. Chubby therefore also provides an imperfect but simpler mechanism that reduces the risk of delayed or re-ordered requests for servers that have not been modified to use sequencers.
If a client normally releases a lock, other clients can immediately acquire that lock. However, if a lock is released due to the holder crashing or being inaccessible, the lock server will prevent other clients from acquiring that lock for a period known as lock-delay. Clients can specify the delay time, but the maximum is one minute, to avoid a crashed client causing the lock (and its associated resources) to be unavailable for an arbitrary length of time. Although not perfect, lock-delay protects unmodified servers and clients from the everyday problems caused by message delays and restarts.
Events#
Chubby clients can subscribe to a series of events when creating handles, which are asynchronously dispatched to clients through higher-level calls in the Chubby library. Events include:
- File content modification: often used to monitor service location (service discovery) broadcasted through files.
- Addition, removal, or modification of subnodes: used for mirroring.
- Chubby master failover: warns clients that other events may be lost, so they need to rescan data.
- A handle and its associated lock expiration: generally indicates communication issues.
- Lock acquisition: used to determine whether a primary has been elected.
- Conflicting lock requests from other clients: allows lock caching.
Events are dispatched after the corresponding actions occur, so if a client is notified of file content changes, it can ensure that the data read afterward is new. The last two events mentioned are rarely used and could be omitted in hindsight. For example, after an election, clients typically need to communicate with the new primary, rather than just knowing that a primary exists, so they wait for a modification event indicating that the new primary has written its address to a file.
Conflicting lock events theoretically allow clients to cache data held by other servers, using Chubby locks to maintain cache consistency. A lock conflict request notifies the client that it should complete pending operations, flush modifications, discard cached data, and then release the lock (lock conflict means another client wants to modify the data). However, no one has adopted this approach so far.
API#
Clients see Chubby handles as opaque pointers that support various operations. Handles are created only by Open and destroyed only by Close.
Open opens a file or directory with a specified name, producing a handle similar to a UNIX file descriptor. Only Open uses a name; all other operations are based on handles. The name is interpreted relative to an existing directory handle, and the library provides a handle on "/" that is always valid, representing the root directory. Directory handles avoid the issues that arise from using a program-wide current directory in multi-threaded programs with multiple layers of abstraction.
Clients are provided with various options:
- How the handle will be used (read/write/lock/change ACL) - the client must have appropriate permissions for the handle to be created successfully.
- Events that should be dispatched.
- Whether a new file or directory should (or must) be created. If a file is created, the caller may supply initial contents and initial ACL names, and the return value indicates whether the file was in fact created.
Close() closes an open handle, after which the handle cannot be used, and this call will never fail. A related call, Poison(), causes the handle to fail all pending and subsequent operations without needing to close the handle. This allows a client to cancel Chubby calls executed by other threads without worrying about releasing the memory accessed by those calls.
The main calls executed on a handle are:
- GetContentsAndStat() returns the file's content and metadata, with the file content being atomically read in full. We avoid partial reads and writes to prevent issues with large files. A related call, GetStat(), only returns the file's metadata, while ReadDir() returns the names and metadata of subnodes in the directory.
- SetContents() writes content to a file, allowing the client to provide a file generation number to simulate a file's compare-and-swap; the content will only be modified if the generation number matches the current one. Write operations are also atomically performed as a whole. Similarly, there is a SetACL() call that performs similar operations on the node's associated ACL name.
- Delete() calls to delete the node when it has no subnodes.
- Acquire(), TryAcquire(), Release() to acquire and release locks.
- GetSequencer() returns a sequencer describing any lock held by this handle.
- SetSequencer() associates the handle with a sequencer; if the sequencer has expired, all related sequencer operations will fail.
- CheckSequencer() checks whether the sequencer is valid.
If the node is deleted after the handle is created, the call will fail, even if the file is recreated later. This is because the handle is always associated with a file instance rather than a name. Although Chubby can perform access control checks on all calls, it only checks the Open call in practice.
Every call takes an operation parameter in addition to any call-specific arguments; the operation parameter holds data and control information associated with the call. In particular, via the operation parameter clients can:
- Provide a callback function to make the call asynchronous.
- Wait for a call to finish.
- Obtain extended error and diagnostic information.
Clients can use this API to perform a primary election as follows: all candidate primaries open the lock file and attempt to acquire the lock; one succeeds and becomes the primary while the others act as replicas. The primary then writes its identity into the lock file with SetContents(), so that other clients and replicas can discover it via GetContentsAndStat(), possibly in response to a file-modification event. Ideally, the primary obtains a sequencer with GetSequencer() and passes it along in subsequent communication with servers, which call CheckSequencer() to confirm it is still the primary. For services that cannot check sequencers, the lock-delay described earlier can be used.
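The election recipe above can be sketched in code. The sketch assumes a hypothetical Go-style binding of the calls named in this section (TryAcquire, SetContents, GetContentsAndStat, GetSequencer); the Handle interface and the in-memory fakeHandle exist only so the example compiles and runs, and are not Chubby's real client library.

```go
// Sketch of the primary-election recipe against a hypothetical binding.
package main

import "fmt"

type Sequencer string

type Handle interface {
	TryAcquire() (bool, error)           // try to take the lock in exclusive mode
	SetContents(data []byte) error       // write the whole file atomically
	GetContentsAndStat() ([]byte, error) // read the whole file atomically
	GetSequencer() (Sequencer, error)    // describe the lock held by this handle
}

// runElection attempts to become primary; on failure it returns the
// identity of whoever currently holds the lock.
func runElection(lockFile Handle, myAddr string) (isPrimary bool, primary string, seq Sequencer, err error) {
	won, err := lockFile.TryAcquire()
	if err != nil {
		return false, "", "", err
	}
	if won {
		// Advertise ourselves so clients and replicas can find us ...
		if err := lockFile.SetContents([]byte(myAddr)); err != nil {
			return false, "", "", err
		}
		// ... and fetch a sequencer to pass along on later requests.
		seq, err = lockFile.GetSequencer()
		return true, myAddr, seq, err
	}
	data, err := lockFile.GetContentsAndStat()
	return false, string(data), "", err
}

// fakeHandle is a minimal in-memory stand-in so the sketch runs.
type fakeHandle struct{ contents []byte }

func (f *fakeHandle) TryAcquire() (bool, error)           { return true, nil }
func (f *fakeHandle) SetContents(d []byte) error          { f.contents = d; return nil }
func (f *fakeHandle) GetContentsAndStat() ([]byte, error) { return f.contents, nil }
func (f *fakeHandle) GetSequencer() (Sequencer, error)    { return "lockname:exclusive:gen7", nil }

func main() {
	isPrimary, primary, seq, _ := runElection(&fakeHandle{}, "primary-0:8080")
	fmt.Println(isPrimary, primary, seq)
}
```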
Caching#
To reduce read traffic, Chubby clients cache file data and node metadata in memory. The cache is maintained by a lease mechanism, kept consistent through invalidation requests sent by the master. This protocol ensures that clients either see a consistent Chubby state or encounter an error.
When file data or metadata changes, modification operations will block until the master sends invalidation requests to all clients that may have cached data. This mechanism is built on top of KeepAlive RPCs. Upon receiving an invalidation request, clients clear invalid states and confirm during the next KeepAlive call. The modification process continues until the server knows that each client has invalidated its cache: either because the client confirmed the invalidation message or because the client's lease allowing its cache to remain valid has expired.
Only one round of invalidation is needed because the master treats the node as uncachable while invalidations remain unacknowledged. This allows reads to proceed without delay, which is valuable because reads greatly outnumber writes. An alternative would be to block all access to the node during invalidation; this would prevent overly eager clients from bombarding the master with uncached reads during the invalidation, at the cost of occasional delays. If that were a problem, one could imagine a hybrid scheme that switches tactics when overload is detected.
The caching protocol is simple: it invalidates cached data when it changes and never updates it. Updating might seem simpler, but an update-only protocol can be arbitrarily inefficient: a client that accessed a file could receive an unbounded stream of updates, most of them unnecessary (after an invalidation, the client can simply fetch the latest data when it needs it).
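A sketch of the write path implied by the invalidation scheme above: the master blocks a modification until every session that may cache the node has either acknowledged the invalidation (via its next KeepAlive) or let its lease expire. The session type, its fields, and the polling loop are assumptions for illustration; a real master would be event-driven rather than polling.

```go
// Illustrative sketch of the invalidation discipline described above.
package main

import (
	"fmt"
	"time"
)

type session struct {
	name         string
	leaseExpiry  time.Time
	acknowledged bool // set when the client's next KeepAlive confirms the flush
}

// waitForInvalidations returns once every session that may cache the node
// has acknowledged the invalidation or has let its lease run out.
func waitForInvalidations(cachers []*session, poll time.Duration) {
	for {
		pending := 0
		for _, s := range cachers {
			if !s.acknowledged && time.Now().Before(s.leaseExpiry) {
				pending++
			}
		}
		if pending == 0 {
			return
		}
		time.Sleep(poll)
	}
}

func main() {
	s := &session{name: "client-1", leaseExpiry: time.Now().Add(50 * time.Millisecond)}
	waitForInvalidations([]*session{s}, 10*time.Millisecond)
	fmt.Println("modification may proceed")
}
```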
Although providing strict consistency is costly, we also reject using weak consistency models because they are difficult for programmers to use. Similarly, for example, virtual synchrony mechanisms require clients to exchange sequence numbers in all messages in an environment with pre-existing diverse communication protocols, which is not appropriate.
In addition to caching data and metadata, Chubby clients cache open handles, so if a client opens a file it has opened before, only the first Open() needs to be sent to the master. This caching is restricted in minor ways to keep the semantics observed by clients unchanged: handles on ephemeral files cannot be held open once the application has closed them, and handles that permit locking may be reused but cannot be used concurrently by multiple application handles. The latter restriction exists because a client may use Close() or Poison() for the side effect of cancelling an Acquire() call in progress at the master.
The Chubby protocol allows clients to cache locks, meaning locks can be held longer than necessary and reused on the same client. When other clients request conflicting locks, an event is notified to the lock holder, allowing the holder to release the lock when needed elsewhere.
Sessions and KeepAlives#
A Chubby session is a relationship between a Chubby cell and a client, maintained over time by periodic handshakes called KeepAlives. Unless it informs the master otherwise, a client's handles, locks, and cached data all remain valid as long as its session remains valid (however, the protocol for session maintenance may require the client to acknowledge cache invalidations to keep its session alive).
A client requests a new session when it first contacts a Chubby cell's master. It ends the session implicitly, either by terminating or when the session has been idle (no open handles and no calls for a minute).
Each session has an associated lease: an interval of time extending into the future during which the master guarantees not to terminate the session unilaterally. The end of this interval is called the session lease timeout. The master is free to advance the timeout further into the future, but may never move it backwards in time.
The master will postpone the lease timeout in the following three scenarios: when the session is created, when the master fails over, and when responding to the client's KeepAlive RPC. Upon receiving a KeepAlive request, the master generally blocks the RPC until the client's previous lease interval is close to expiration, then allows the RPC to return to the client and informs the client of the new lease timeout. The master may extend the timeout for any duration, with the default extension being 12 seconds, but a heavily loaded master may use a longer duration to reduce the number of KeepAlive calls it needs to handle. The client will immediately initiate a new KeepAlive upon receiving the previous response, ensuring that there is almost always a KeepAlive blocking at the master.
In addition to extending leases, KeepAlive responses can also be used to transmit events and invalidate caches. The master allows early returns of KeepAlives during event dispatching or invalidation, ensuring that clients cannot maintain sessions before confirming cache invalidation, and that all Chubby RPCs flow from clients to the master. This simplifies the client and allows the protocol to run on firewalls that only allow unidirectional connection initialization.
The client maintains a local lease timeout that is a conservative approximation of the master's, because it must allow for the time KeepAlive replies spend in flight and for the possibility that the master's clock is running fast. To maintain consistency, we require that the server's clock advance no faster than a known constant factor relative to the client's.
If the client's local lease times out, it cannot tell whether the master has terminated its session. The client disables and empties its cache and enters an uncertain state, waiting for a further interval called the grace period (45 seconds by default). If the client manages to exchange a KeepAlive before the grace period ends, it re-enables its cache; otherwise it assumes the session has expired. This is done so that Chubby API calls do not block indefinitely when a Chubby cell becomes inaccessible; calls return an error if communication is not re-established before the grace period ends.
When the grace period begins, the Chubby library delivers a jeopardy event to the application; a safe event follows if the session survives the communication problem, and an expired event if the session times out. This information lets the application quiesce itself while its session state is in doubt, and recover without restarting after a transient problem. Avoiding such restarts matters for services with large startup overhead.
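The client-side behaviour described in the last two paragraphs can be summarised as a small state machine over the jeopardy, safe, and expired events; the sketch below is an assumption about structure, not the real library.

```go
// Sketch of the client-side session states described above.
package main

import (
	"fmt"
	"time"
)

type sessionState int

const (
	stateSafe sessionState = iota
	stateJeopardy
	stateExpired
)

// nextState decides what to do when the local lease timer fires or a
// KeepAlive exchange completes during the grace period.
func nextState(cur sessionState, keepAliveOK bool, graceLeft time.Duration) sessionState {
	switch cur {
	case stateSafe:
		if !keepAliveOK {
			return stateJeopardy // lease ran out locally: disable cache, start grace timer
		}
		return stateSafe
	case stateJeopardy:
		if keepAliveOK {
			return stateSafe // exchanged a KeepAlive in time: re-enable the cache
		}
		if graceLeft <= 0 {
			return stateExpired // grace period (45 s by default) ran out
		}
		return stateJeopardy
	default:
		return stateExpired // expired sessions stay expired
	}
}

func main() {
	s := nextState(stateSafe, false, 45*time.Second)
	fmt.Println(s == stateJeopardy) // true
}
```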
If a client holds a handle H on a node and any operation on H fails because the associated session has expired, all subsequent operations on H (except Close) fail in the same way. Clients can therefore rely on a network or server outage causing only the loss of a suffix of a sequence of operations, rather than an arbitrary subsequence, which makes it possible to mark a complex change as committed with a final write (if the final write succeeded, all preceding operations must have succeeded as well).
Fail-overs#
When a master fails or otherwise loses mastership, it discards its in-memory state, including sessions, handles, and locks. The authoritative session lease timer runs on the master, so until a new master is elected the timer is effectively stopped, which in practice extends the clients' leases. If the election finishes quickly, clients can contact the new master before their local (conservative) lease timers expire. If the election takes a long time, clients empty their caches and wait out the grace period while looking for the new master; the grace period thus lets sessions survive fail-overs that exceed the normal lease timeout.
The figure above shows the sequence of events in a lengthy master fail-over during which the client must use its grace period to preserve its session. The original master grants lease M1 for the client, which keeps a conservative local estimate C1. The master then grants M2 and the client extends its estimate to C2. The master then dies before replying to the next KeepAlive. Some time passes before a new master is elected, and eventually the client's estimate C2 expires, so the client flushes its cache and starts the grace-period timer.
During this period the client cannot be sure whether its lease has expired at the master. It does not tear down its session, but it blocks all application-level API calls so that the application cannot observe inconsistent data. At the start of the grace period, the Chubby library sends a jeopardy event to the application so that it can quiesce itself until it learns the fate of its session.
Eventually a new master is elected. It initially uses a conservative approximation M3 of the session lease its predecessor may have held. The first KeepAlive request from the client to the new master is rejected because it carries an out-of-date master epoch number (described below). The retried request (6) succeeds, but typically does not extend the master lease further because M3 was already conservative. Its reply (7) allows the client to extend its lease to C3 and to inform the application that its session is no longer in jeopardy. Because the grace period was long enough to cover the interval between the end of C2 and the start of C3, the client saw nothing but a delay; had the grace period been shorter than that interval, the client would have abandoned the session and reported an error to the application.
Once the client has contacted the new master, the client library and the master cooperate to provide the application with the illusion that no failure occurred. To achieve this, the new master must reconstruct a conservative approximation of the previous master's in-memory state, partly by reading state stored stably on disk (replicated via the normal database replication protocol), partly by obtaining state from clients, and partly by making conservative assumptions. The database records each session, held lock, and ephemeral file.
The newly elected master performs the following steps:
- The master picks a new client epoch number, which clients must present on every call. The master rejects calls carrying older epoch numbers and supplies the current one, which ensures that the new master does not respond to packets that were sent to a previous master, even one that ran on the same machine. (A sketch of this check appears after the list.)
- The master may respond to master-location requests but will not initially process incoming session-related requests.
- The master reconstructs the data structures for sessions and locks recorded in the database in memory, extending session leases to the maximum duration that the previous master may have used.
- The master begins allowing clients to perform KeepAlives but does not execute other session-related operations.
- The master sends a fail-over event to all sessions, prompting clients to clear their caches (as they may have missed some invalidation notifications) and warning the application layer that some events may have been lost.
- The master waits for all sessions to acknowledge the fail-over event or for them to expire.
- The master allows processing of all operations.
- If a client uses a handle created before the failover (determined by the sequence number in the handle), the master will recreate the handle's representation in memory and respond to the call. If this newly created handle is closed, the master will record it in memory, ensuring it is not recreated within the current epoch, thus preventing a delayed or duplicate network packet from accidentally recreating an already closed handle. A failed client can recreate a closed handle in a later epoch, which is harmless.
- After some interval (a minute, say), the master deletes ephemeral files that have no open handles. Clients should therefore refresh handles on ephemeral files during this interval after a fail-over. An unfortunate consequence of this mechanism is that if the last client using an ephemeral file loses its session during the fail-over, the file may not disappear promptly.
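As referenced in the first step above, a sketch of the epoch-number check might look like the following; the request and master types are invented for the example.

```go
// Sketch of the epoch-number check described in the first step above.
package main

import (
	"errors"
	"fmt"
)

type request struct {
	epoch uint64
	op    string
}

type master struct {
	epoch uint64 // chosen afresh by each newly elected master
}

var errStaleEpoch = errors.New("stale master epoch")

// admit rejects any call stamped with an older epoch, so packets meant
// for a previous master are never acted upon, and tells the caller the
// current epoch so it can retry.
func (m *master) admit(r request) error {
	if r.epoch < m.epoch {
		return fmt.Errorf("epoch %d rejected, current is %d: %w",
			r.epoch, m.epoch, errStaleEpoch)
	}
	return nil
}

func main() {
	m := &master{epoch: 4}
	fmt.Println(m.admit(request{epoch: 3, op: "KeepAlive"})) // rejected
	fmt.Println(m.admit(request{epoch: 4, op: "KeepAlive"})) // nil
}
```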
Database implementation#
The first version of Chubby used the replicated version of Berkeley DB as its database. Berkeley DB provides B-trees that map byte-string keys to arbitrary byte-string values. Nodes are keyed by their full path name, and we installed a key-comparison function that sorts first by the number of components in a path, which keeps sibling nodes adjacent in the sort order. Because Chubby does not use path-based permissions, a single lookup in the database suffices for each file access.
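A sketch of the key ordering just described, sorting first by the number of path components and then lexicographically so that siblings stay adjacent; the comparator here is an illustration, not the actual comparison function installed in Berkeley DB.

```go
// Sketch of a key comparison that sorts first by the number of path
// components and then lexicographically, keeping siblings adjacent.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// less orders two node path names the way the text describes.
func less(a, b string) bool {
	na, nb := strings.Count(a, "/"), strings.Count(b, "/")
	if na != nb {
		return na < nb
	}
	return a < b
}

func main() {
	keys := []string{"/ls/cell/a/x", "/ls/cell/b", "/ls/cell/a", "/ls/cell/a/y"}
	sort.Slice(keys, func(i, j int) bool { return less(keys[i], keys[j]) })
	fmt.Println(keys) // [/ls/cell/a /ls/cell/b /ls/cell/a/x /ls/cell/a/y]
}
```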
Berkeley DB uses a distributed consensus protocol to replicate the database logs across multiple servers. Once the master lease is added, this aligns with Chubby's design, making implementation straightforward.
Berkeley DB's B-tree code is widely used and very mature, but the replication code had been added recently and had few users. Software maintainers must give priority to maintaining and improving their most popular product features. Berkeley DB's maintainers solved the problems we ran into, but we felt that relying on the replication code exposed us to more risk than we were willing to bear. In the end we wrote a simple database using write-ahead logging and snapshotting; as before, the database log is distributed among the replicas with a distributed consensus protocol. Chubby used only a small fraction of Berkeley DB's features, so this rewrite allowed significant simplification of the whole system; for example, we need atomic operations but not general transactions.
Backup#
Every few hours, the master of each Chubby cell writes a snapshot of its database to a GFS file server in a different building. Using a separate building both ensures that the backup survives building damage and avoids introducing cyclic dependencies into the system: a GFS cell in the same building could potentially depend on its Chubby cell for electing its master.
Backups provide both disaster recovery and a way to initialize the database of a newly replaced replica without placing load on replicas that are in service.
Mirroring#
Chubby allows a set of files to be mirrored from one cell to another. The mirroring operation is fast because the files are small, and the event mechanism can immediately notify the mirroring code when files are added, deleted, or modified. Assuming no network issues, changes will reflect in dozens of mirrors within less than a second. If a mirror becomes unreachable, it will remain unchanged until the connection is restored, with file updates identified by comparing their checksums.
Mirroring is most often used to copy configuration files to different computing clusters distributed around the world. A special cell named global contains a subtree /ls/global/master that is mirrored to the subtree /ls/cell/slave in all other cells. The global cell is special because its five replicas are located in widely separated parts of the world, making it accessible from most places.
The files mirrored from the global cell include Chubby's own access control lists, files in which Chubby cells and other systems advertise their presence to monitoring services, pointers that allow clients to locate large data sets such as Bigtable cells, and configuration files for many other systems.
Mechanisms for scaling#
Chubby's clients are individual processes, so Chubby must handle more clients than one might expect; we have seen more than 90,000 clients communicating directly with a single Chubby master, involving even more machines. Because there is only one master per cell, and its machine is identical to the clients' machines, the clients can overwhelm the master by a huge margin. The most effective scaling techniques are therefore those that reduce communication with the master by a large factor; assuming the master has no serious performance bugs, minor improvements in request processing on the master have little effect. We use several approaches:
- We can create any number of Chubby cells, but clients almost always use the nearest cell to avoid relying on remote machines. Our typical deployment is to use one Chubby cell in a data center with thousands of machines.
- The master can increase its lease time from 12 seconds to a maximum of 60 seconds when under high load, allowing it to handle fewer KeepAlive RPCs (KeepAlive is the primary request type, and failing to process requests in a timely manner is a typical failure mode for overloaded servers; clients are less sensitive to delays in other calls).
- Chubby clients cache file data, metadata, missing files, and open handles to reduce calls made to the server.
- Use protocol conversion servers to convert the Chubby protocol into simpler protocols like DNS, etc.
Here we describe two familiar mechanisms, proxies and partitioning, that we expect will allow Chubby to scale further. We do not yet use them in production, but they have been designed and may be used soon. We currently see no need to scale beyond a factor of five: first, there are limits on the number of machines one would wish to place in a data center or make dependent on a single service instance; second, because we use similar machines for Chubby clients and servers, hardware improvements that increase the number of clients per machine also increase the capacity of each server.
Proxies#
Chubby's protocol can be proxied by trusted processes, which can reduce server load by handling KeepAlive and read requests, though proxy caching cannot reduce write traffic. Even with aggressive client caching, however, write traffic constitutes much less than one percent of Chubby's normal workload, so proxies allow a significant increase in the number of clients.
If a proxy handles the KeepAlive traffic of N clients, it reduces the KeepAlive load on the master by a factor of N, which might be 10,000 or more. At best, proxy caching can reduce read traffic by the mean degree of read sharing, roughly a factor of 10. But because reads constitute well under 10% of Chubby's load, the saving in KeepAlive traffic is by far the more important effect. Proxies add an extra RPC to writes and to first-time reads, and one might expect them to make a cell temporarily unavailable at least twice as often as before, since each proxied client depends on two machines that may fail: its proxy and the Chubby master. Observant readers will note that the fail-over strategy described earlier is not ideal for proxies.
Partitioning#
Chubby's interface was chosen so that a cell's namespace can be partitioned between servers. Although we do not need this yet, the code can partition the namespace by directory. If enabled, a Chubby cell would consist of N partitions, each with its own set of replicas and its own master. Every node D/C in directory D would be stored on partition P(D/C) = hash(D) mod N. Note that the metadata for D itself may be stored on a different partition P(D) = hash(D′) mod N, where D′ is the parent directory of D.
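The placement rule can be sketched as follows; the hash function (FNV here) and helper names are arbitrary choices for the example, not what a real partitioned Chubby would use.

```go
// Sketch of the partition assignment formula above: a node is placed by
// hashing the name of its parent directory.
package main

import (
	"fmt"
	"hash/fnv"
	"path"
)

// partition returns P(name) = hash(parent(name)) mod n.
func partition(name string, n uint32) uint32 {
	parent := path.Dir(name) // e.g. /ls/cell/D for node /ls/cell/D/C
	h := fnv.New32a()
	h.Write([]byte(parent))
	return h.Sum32() % n
}

func main() {
	const n = 4
	// Children of the same directory land on the same partition...
	fmt.Println(partition("/ls/cell/D/C1", n), partition("/ls/cell/D/C2", n))
	// ...while the directory's own metadata may live elsewhere.
	fmt.Println(partition("/ls/cell/D", n))
}
```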
Partitioning aims to create large Chubby cells with minimal communication between partitions. Although Chubby lacks hard links, directory modification times, and cross-directory renaming operations, some operations still require cross-partition communication:
- ACLs themselves are files, so one partition may need to use another partition for permission checks. However, ACL files are easy to cache, and only Open() and Delete() calls require ACL checks, with most clients reading publicly accessible files that do not require ACLs.
- When a directory is deleted, cross-partition calls may need to confirm that the directory is empty.
Because most of the calls handled by a partition do not require communication with other partitions, we expect the effect of inter-partition communication on performance and availability to be modest. Unless the number of partitions N is large, each client would still contact most partitions, so partitioning reduces the read and write traffic on any given partition by a factor of N but does not necessarily reduce KeepAlive traffic.
If it becomes necessary for Chubby to handle more clients, our strategy includes a combination of proxies and partitioning.
Use, surprises, and design errors#
...Omitted relevant data and usage
Use as a name service#
Although Chubby was designed as a lock service, we found that its most popular use is as a name server. Naming on the ordinary Internet uses DNS entries with a time-to-live (TTL); DNS data is discarded if it is not refreshed within that period. Usually it is straightforward to pick a suitable TTL, but if failed services are to be replaced promptly, the TTL can become small enough to overload the DNS servers.
For example, it is common for our developers to run jobs involving thousands of processes, each of which communicates with every other, leading to a quadratic number of DNS lookups. We might wish to use a 60-second TTL, which would allow misbehaving clients to be replaced without excessive delay and is not regarded as an unreasonably short replacement time in our environment.
In that case, merely maintaining the DNS caches of a single job with as few as 3,000 clients would require 150,000 lookups per second; larger jobs create worse problems, and several jobs may run at once. Before Chubby was introduced, the variability of our DNS load had been a serious problem for Google.
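A back-of-the-envelope check of that figure, assuming each of the roughly 3,000 processes refreshes its cached entry for every other process once per 60-second TTL:

```latex
% Lookup rate implied by the quadratic communication pattern above.
\[
  \frac{3000 \times 3000 \ \text{lookups}}{60\ \text{s}} \approx 150{,}000\ \text{lookups/s}
\]
```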
In contrast, Chubby's caching uses invalidation, so in the absence of changes, a constant rate of KeepAlive requests can indefinitely maintain any number of cached entries on the client. A Chubby master with 2 cores at 2.6GHz can handle 90k clients directly connected, including clients from the large jobs mentioned above. The ability to provide rapid name updates without polling each name individually is so appealing that Chubby now provides name service for many systems within the company.
Although Chubby's caching allows a single cell to sustain a large number of clients, load spikes can still be a problem. When we first deployed the Chubby-based name service, starting a 3,000-process job (thereby generating 9 million name requests) could overwhelm the master. To resolve this, we batch groups of name entries so that a single lookup returns, and caches, the name mappings for a large number (typically 100) of related processes within a job.
The caching semantics Chubby provides are more precise than a name service needs; name resolution requires only timely notification rather than full consistency. There was therefore an opportunity to reduce the load on Chubby by introducing a simple protocol-conversion server designed specifically for name lookups. Had we foreseen this use of Chubby, we might have implemented full proxies sooner and avoided the need for this simple, but useful, workaround.
One further protocol-conversion server exists: the Chubby DNS server, which makes the naming data stored in Chubby available to DNS clients. This server is important both for easing the transition from DNS names to Chubby names and for accommodating existing applications, such as browsers, that cannot easily be converted.
Lessons learned#
Developers rarely consider availability. We found that our developers rarely think about failure and tend to treat a service like Chubby as if it were always available. For example, developers built a system employing hundreds of machines that initiated a recovery procedure taking tens of minutes whenever Chubby elected a new master. This magnified the consequences of a single failure a hundredfold, in both time and the number of machines affected. We would prefer developers to plan for short Chubby outages so that such events have little or no effect on their applications.
Developers also fail to appreciate the difference between a service being up and that service being available to their applications. For example, the global cell is essentially always up, because it is rare for more than two geographically distant data centers to be down at the same time. Yet its observed availability for a given client is usually lower than that of the client's local cell: first, the local cell is less likely to be partitioned from the client; and second, although the local cell may often be down for maintenance, the same maintenance affects the client directly, so the client does not observe Chubby's unavailability.
Our API choices also influence how developers handle Chubby interruptions. For example, Chubby provides events that allow clients to discover master failovers, originally intended to let clients check for possible changes, as other events may be lost. Unfortunately, many developers crash their applications directly upon receiving this event, significantly reducing the availability of their systems. We might have been better off sending redundant file change events or ensuring that no events are lost during failovers.
Currently, we use three mechanisms to prevent developers from being overly optimistic about Chubby's availability, especially regarding the global cell. First, as previously mentioned, we check how engineering teams plan to use Chubby and advise them against using technologies that tightly couple their system's availability with Chubby. Second, we now provide libraries to perform some high-level tasks that automatically isolate developers from Chubby interruptions. Third, we conduct post-mortems on each Chubby interruption, not only to eliminate bugs in Chubby and our operations but also to reduce applications' sensitivity to Chubby availability, both of which can improve overall system availability.
Fine-grained locks can be overlooked. As described in earlier chapters, a design that allows clients to use fine-grained locks was proposed. Surprisingly, we have not needed to implement such a service so far. Our developers have found that to optimize applications, they must eliminate unnecessary communication, which often means finding ways to use coarse-grained locks.
Poor API choices have unexpected effects. On the whole our API has evolved well, but one mistake stands out. Our means of cancelling long-running calls are the Close() and Poison() RPCs, which also discard the handle's state on the server. This prevents handles that can acquire locks from being shared, for example among multiple threads. We may add a Cancel() RPC to allow open handles to be shared.
RPC use affects transport protocols. KeepAlives are used both to refresh a client's session and to carry events and cache invalidations from the master to the client. This design has the desirable consequence that a client cannot refresh its session lease without acknowledging cache invalidations, but it required care in protocol design. TCP's back-off policies pay no attention to higher-level timeouts such as Chubby's leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion. We were forced to send KeepAlive RPCs via UDP rather than TCP; UDP has no congestion-avoidance mechanisms, so we would prefer to use UDP only when higher-level time bounds must be met.
We may augment the protocol with an additional TCP-based GetEvent() RPC, used in the normal case to communicate events and invalidations in the same way as KeepAlives; the KeepAlive reply would still contain a list of unacknowledged events, so that events are eventually acknowledged.