Translation: Bigtable: A Distributed Storage System for Structured Data

Abstract#

Bigtable is a distributed storage system for managing structured data, designed to scale to very large sizes: petabytes of data distributed across thousands of commodity servers. Many internal Google projects use Bigtable to store data, including web indexing, Google Earth, and Google Finance. These applications impose very different requirements on Bigtable, both in data size (from URLs to web pages to satellite imagery) and latency (from background batch processing to real-time data serving). Despite these varied demands, Bigtable successfully provides a flexible, high-performance solution for all of these Google products.

Introduction#

Over the past two and a half years, we have designed, implemented, and deployed a distributed storage system called Bigtable for managing structured data at Google. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: broad applicability, scalability, high performance, and high availability. Bigtable is used in over 60 products and projects at Google, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, ranging from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, storing up to hundreds of terabytes of data.

In many ways, Bigtable resembles a database and shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it offers clients a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed by row and column names, which can be arbitrary strings. Bigtable treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data by carefully choosing their data formats. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Data Model#

Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map indexed by row key, column key, and timestamp, where each value in the map is an uninterpreted array of bytes.

(row: string, column: string, time: int64) -> string

After examining a variety of potential uses for a Bigtable-like system, we settled on this data model. As one concrete example, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this table Webtable. In Webtable, we would use URLs as row keys, various attributes of the web pages (such as page contents, anchors, and so on) as column keys, and store the page contents fetched at different times in the corresponding cells, as shown below:

(Figure: a slice of the example Webtable.)
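The data model is easy to picture as nested sorted maps. Below is a minimal single-process sketch that assumes nothing about Bigtable's actual storage layout: a std::map keyed by row, then column, then timestamp, with versions ordered newest-first.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Versions sort by decreasing timestamp, so begin() is the newest value.
using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;
using Row = std::map<std::string, Versions>;  // column key -> versions
using Table = std::map<std::string, Row>;     // row key -> row (lexicographic)

int main() {
  Table webtable;
  webtable["com.cnn.www"]["contents:"][1] = "<html>...v1...</html>";
  webtable["com.cnn.www"]["contents:"][2] = "<html>...v2...</html>";
  webtable["com.cnn.www"]["anchor:cnnsi.com"][1] = "CNN";

  // Most recent version of a cell: the first entry of the version map.
  const Versions &cell = webtable["com.cnn.www"]["contents:"];
  std::cout << cell.begin()->second << "\n";  // prints the timestamp-2 contents
}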

Rows#

Row keys in a table can be arbitrary strings (currently up to 64KB in size). Every read or write of data under a single row key is atomic, which makes it easier for clients to reason about the system's behavior when concurrently updating the same row. Bigtable maintains data in lexicographic order by row key. The row range of a table is dynamically partitioned, and each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reading a short range of rows is efficient and typically requires communication with only a small number of machines. Clients can exploit this property by choosing their row keys so that they get good locality for their data accesses. For example, in Webtable, by reversing the hostname components of URLs, we group pages from the same domain together: we store the data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
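A hypothetical helper (not part of any Bigtable API) that performs this row-key transformation might look as follows:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Builds the reversed-hostname row key described above, e.g.
// "maps.google.com/index.html" -> "com.google.maps/index.html".
std::string ReverseHostname(const std::string &url) {
  std::string::size_type slash = url.find('/');
  std::string host = url.substr(0, slash);
  std::string rest = (slash == std::string::npos) ? "" : url.substr(slash);

  // Split the hostname on '.' and reassemble the components in reverse.
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);

  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + rest;
}

int main() {
  std::cout << ReverseHostname("maps.google.com/index.html") << "\n";
  // prints: com.google.maps/index.html
}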

Column Families#

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used.

Our intent is that the number of distinct column families in a table be small (in the hundreds at most) and that families rarely change during operation; in contrast, a table may have an unbounded number of columns. A column key is named using the syntax family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. For example, one column family in Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each page's language ID. Another useful column family in this table is anchor, where each column key represents a single anchor (as shown in the figure above): the qualifier is the name of the referring site, and the cell content is the link text.

Access control, disk, and memory accounting are all managed at the column family level. In our Webtable example, this control allows us to manage various applications: some add new base data, some read base data and create derived column families, and some are only allowed to view currently existing data (for privacy reasons, they may not even view all column families).

Timestamps#

Each cell in Bigtable can contain multiple versions of the same data, indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent real time in microseconds, or be explicitly assigned by client applications, which must themselves generate unique timestamps to avoid collisions. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

To alleviate the burden of managing different versions of data, we support two column family-level settings that allow Bigtable to perform automatic garbage collection: clients can specify to keep only the most recent n versions or only store sufficiently recent versions (e.g., data written in the last seven days).
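Below is a sketch of these two settings, combining both into one policy for brevity; GcPolicy and CollectGarbage are illustrative names, not Bigtable's API.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct GcPolicy {
  int max_versions;       // keep at most the n newest versions
  int64_t min_timestamp;  // and drop versions older than this cutoff
};

// Versions are ordered newest-first, matching the storage order above.
using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;

void CollectGarbage(Versions &versions, const GcPolicy &policy) {
  int kept = 0;
  for (auto it = versions.begin(); it != versions.end();) {
    if (kept >= policy.max_versions || it->first < policy.min_timestamp) {
      it = versions.erase(it);  // too old, or over the version-count limit
    } else {
      ++kept;
      ++it;
    }
  }
}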

In the Webtable example, we set the timestamps of the crawled pages stored in the contents column family to the times at which those page versions were actually crawled, and the garbage-collection mechanism described above lets us keep only the three most recent versions of every page.

API#

The Bigtable API provides functionality for creating and deleting tables and column families, as well as modifying cluster, table, and column family metadata, such as access control rights.

Client applications can write to and delete values in Bigtable, look up values by individual rows, or iterate over a subset of data within the table. Below is C++ code that uses the RowMutation abstraction to perform a series of data updates (some unrelated code is omitted). The Apply call performs an atomic change on Webtable: it adds an anchor on the www.cnn.com row and deletes a different anchor.

// Open table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write new anchor and delete old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Below is C++ code that uses the Scanner abstraction to iterate over all anchors for a specified row. Clients can iterate over multiple column families and have many mechanisms to limit the number of rows, columns, and timestamps produced by the scan. For example, we can limit the scan to only produce columns for anchors that match the regular expression anchor:*.cnn.com, or only produce anchors with timestamps falling within the last ten days.

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Bigtable supports many other features that allow users to handle data in more complex ways. First, Bigtable supports single-row transactions, which can be used for atomic execution of read-modify-write sequences under a single row key. Bigtable currently does not support general cross-row transactions, although it provides clients with an interface for cross-row batch writes. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports executing client-provided scripts in the server's address space, using a data processing language developed by Google called Sawzall. Currently, our Sawzall-based API does not support client scripts writing to Bigtable but supports many forms of data transformation, filtering based on arbitrary expressions, and aggregation through various operations.

Bigtable can also be used with MapReduce, a framework developed by Google for running large-scale parallel computations. We have created a series of wrappers that allow Bigtable to be used as an input source and output target for MapReduce jobs.

Building Blocks#

Bigtable is built on several other pieces of Google infrastructure, using GFS to store logs and data files. A Bigtable cluster typically runs in a shared pool of machines on which many other distributed applications also run, and Bigtable processes often share machines with processes from other applications. Bigtable relies on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

Bigtable uses the Google SSTable file format to store internal data. SSTable provides a persistent, ordered, immutable key-value mapping table, where keys and values are arbitrary byte strings. SSTable provides operations for looking up the value corresponding to a specified key and for iterating over all key-value pairs within a specified key range.

Internally, each SSTable contains a sequence of blocks (typically 64KB in size, though this is configurable). A block index, stored at the end of the SSTable, is used to locate blocks; it is loaded into memory when the SSTable is opened. A lookup can then be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read that block from disk. Optionally, an SSTable can be completely mapped into memory, which allows lookups and scans to be performed without touching disk.
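A sketch of that lookup path, under assumed types (BlockHandle and SSTableIndex are illustrative, not the real implementation): a binary search over the in-memory index picks the one block that could hold the key, after which a single disk read suffices.

#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct BlockHandle {
  uint64_t offset;  // position of the block in the SSTable file
  uint64_t size;
};

class SSTableIndex {
 public:
  // Maps the last key of each block to the block's location; this index
  // is read into memory when the SSTable is opened.
  std::map<std::string, BlockHandle> index;

  // Binary search (via the sorted map) for the single block that could
  // contain the key; the caller then performs one disk read for it.
  std::optional<BlockHandle> FindBlock(const std::string &key) const {
    auto it = index.lower_bound(key);
    if (it == index.end()) return std::nullopt;  // key is past the last block
    return it->second;
  }
};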

Bigtable relies on a highly available, persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected master and actively serves requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace consisting of directories and small files; each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with the Chubby service; a client's session expires if it is unable to renew its session lease before the lease runs out. When a client's session expires, it loses all of its locks and open file handles. Chubby clients can also register callbacks on Chubby files and directories to be notified of changes or session expiration.

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period, Bigtable becomes unavailable as well. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable service time during which some data stored in Bigtable was unavailable due to Chubby unavailability (caused by Chubby outages or network issues) was 0.0047%; the highest percentage for any single cluster was 0.0326%.

Implementation#

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added to or removed from a cluster to accommodate changes in workload.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet server load, and garbage collecting files in GFS. In addition, it handles schema changes such as table and column family creation.

Each tablet server manages a set of tablets (typically ten to a thousand tablets per tablet server), handling read and write requests for the tablets it manages and splitting overly large tablets.

As in many single-master distributed storage systems, client data does not move through the master: clients read and write data by communicating directly with tablet servers. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master at all. As a result, the master is very lightly loaded in practice.

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all the data associated with a row range. Initially, each table consists of just one tablet; as the table grows, it is automatically split into multiple tablets, each typically 100-200MB in size.

Tablet Location#

We use a three-level hierarchy similar to a B+ tree to store tablet location information.

(Figure: the three-level tablet location hierarchy.)

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all the tablets of a special METADATA table. Each METADATA tablet contains the locations of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but it is treated specially: it is never split, which ensures that the tablet location hierarchy never exceeds three levels.

The METADATA table stores the location of a tablet under a row key that encodes the tablet's table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. With a modest limit of 128MB per METADATA tablet, our three-level location scheme is sufficient to address 2^34 tablets, or 2^61 bytes of data. (Note: a 128MB METADATA tablet holds 128MB / 1KB = 2^17 entries, so the root tablet can point to 2^17 METADATA tablets, each of which can in turn point to 2^17 user tablets, for 2^17 x 2^17 = 2^34 tablets in total; at roughly 128MB = 2^27 bytes per tablet, that is 2^34 x 2^27 = 2^61 bytes.)

The client library caches tablet locations. If a client does not know the location of a tablet, or discovers that its cached location information is incorrect, it recursively moves up the tablet location hierarchy. If the client's cache is empty, the location algorithm requires three network round trips, including one read from Chubby. If the client's cache is stale, the algorithm can take up to six round trips (assuming that METADATA tablets do not move very frequently), because stale cache entries are discovered only upon misses: in the worst case, the client wastes one round trip at each of the three levels before descending the hierarchy again with three fresh lookups.

Since tablet locations are stored in memory, no GFS accesses are required. We further reduce this cost by having the client library prefetch tablet locations: whenever it reads the METADATA table, it reads the location information for more than one tablet.
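Here is a sketch of the empty-cache path, with hypothetical stand-ins (ReadRootLocationFromChubby, ReadMetadataRow) for the real Chubby and METADATA reads, each of which would be one network round trip:

#include <iostream>
#include <string>

struct TabletLocation {
  std::string tablet_server;  // where the tablet is currently served
};

// Hypothetical stubs; a real client would issue RPCs here.
TabletLocation ReadRootLocationFromChubby() { return {"root-tablet-server"}; }
TabletLocation ReadMetadataRow(const TabletLocation & /*tablet*/,
                               const std::string &metadata_key) {
  return {"tablet-server-for-" + metadata_key};
}

// Empty-cache path: exactly three round trips, as described above.
TabletLocation LocateTablet(const std::string &table, const std::string &row) {
  TabletLocation root = ReadRootLocationFromChubby();  // round trip 1
  // METADATA row keys encode the table identifier plus an end row key.
  std::string key = table + ":" + row;
  TabletLocation meta = ReadMetadataRow(root, key);    // round trip 2
  return ReadMetadataRow(meta, key);                   // round trip 3
}

int main() {
  std::cout << LocateTablet("webtable", "com.cnn.www").tablet_server << "\n";
}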

We also store secondary information in the METADATA table, including logs of all events related to each tablet (e.g., when a tablet starts serving), which is helpful for debugging and performance analysis.

Tablet Assignment#

Each tablet is assigned to only one tablet server at a time. The master continuously tracks the set of all live tablet servers and the current assignment of all tablets on tablet servers, including those that are unassigned. When a tablet is unassigned and there is enough space on an available tablet server, the master will send a tablet load request to assign the tablet to that tablet server.

Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory (the servers directory). The master monitors this directory to discover tablet servers.

A tablet server stops serving its tablets if it loses its exclusive lock: for example, a network partition may cause the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) As long as its file still exists, a tablet server attempts to reacquire the exclusive lock on it; if the file no longer exists, the tablet server will never be able to serve again, so it kills itself. Whenever a tablet server terminates (for example, because the cluster management system is removing the tablet server's machine from the cluster), it attempts to release its lock so that the master can reassign its tablets to other servers more quickly.

The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible. To do so, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master is unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on that server's file. If the master can acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master deletes the server's file to ensure that the tablet server can never serve again. Once the server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires; as noted above, however, master failures do not change the assignment of tablets to tablet servers.

When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. The master executes the following steps at startup:

1. The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
2. The master scans the servers directory in Chubby to find the live servers.
3. The master communicates with every live tablet server to discover which tablets are already assigned to each server.
4. The master scans the METADATA table to learn the set of tablets; whenever it encounters a tablet that is not already assigned, it adds the tablet to the set of unassigned tablets, making it eligible for assignment. (Note: only tablets in the unassigned set can be assigned.)

One complication is that the METADATA tablets must themselves be assigned before the METADATA table can be scanned. Therefore, before starting the scan in step 4, the master adds the root tablet to the set of unassigned tablets if an assignment for it was not discovered in step 3. This ensures that the root tablet will be assigned. Because the root tablet contains the names of all METADATA tablets, the master knows about all of them once it has scanned the root tablet.

The set of existing tablets changes only when a tablet is created or deleted, two existing tablets are merged into one larger tablet, or an existing tablet is split into two smaller tablets. The master can keep track of these changes because it initiates all of them except the last (creation, deletion, and merging are performed by the master; splitting is not). Tablet splits are special because they are initiated by a tablet server. The tablet server commits the split by recording information for the new tablets in the METADATA table, and once the split has committed, it notifies the master. If the split notification is lost (because either the tablet server or the master died), the master detects the split when it later asks a tablet server to load the tablet that was split: the tablet server discovers that the METADATA entry covers only a portion of the tablet the master asked it to load (as determined by the row range), and it then notifies the master of the split.

Tablet Serving#

(Figure: tablet representation.)

The persistent state of a tablet is stored in GFS, as illustrated in the figure above. Updates are committed to a commit log that stores redo records. Of these updates, the most recently committed ones are held in memory in a sorted buffer called a memtable; older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table: this metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log, with group commit used to improve the throughput of many small mutations. After the write has been committed, its contents are inserted into the memtable. When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are both lexicographically sorted data structures, the merged view can be formed efficiently.
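Below is a sketch of the merged read view under assumed types: the memtable and each SSTable behave as sorted key-value runs, and a lookup consults the memtable first, then the SSTables from newest to oldest. (The real system merges full sorted iterators rather than doing point lookups, so this is only an illustration.)

#include <map>
#include <optional>
#include <string>
#include <vector>

using SortedRun = std::map<std::string, std::string>;  // key -> value

std::optional<std::string> Lookup(
    const SortedRun &memtable,
    const std::vector<SortedRun> &sstables,  // ordered newest first
    const std::string &key) {
  // The memtable holds the most recent commits, so it wins.
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  // Otherwise fall back to the SSTables, newest to oldest.
  for (const SortedRun &sst : sstables)
    if (auto it = sst.find(key); it != sst.end()) return it->second;
  return std::nullopt;  // key is not present in any run
}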

Read and write operations can continue while tablets are splitting or merging.

Compactions#

After write operations are executed, the size of the memtable increases. When the size of the memtable reaches a threshold, the memtable is frozen, and a new memtable is created. The frozen memtable is converted into an SSTable and written to GFS. This minor compaction process has two goals: to reduce the memory footprint of the tablet server and to decrease the number of commit log entries that need to be read during server crash recovery. During the compaction process, read and write operations can continue.
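A sketch of the minor-compaction trigger described above, with WriteSSTableToGFS as a hypothetical stand-in for the memtable-to-SSTable conversion and an assumed 64MB threshold:

#include <cstddef>
#include <map>
#include <memory>
#include <string>

using Memtable = std::map<std::string, std::string>;

// Hypothetical stand-in for converting a frozen memtable to an SSTable.
void WriteSSTableToGFS(const Memtable & /*frozen*/) {}

class Tablet {
 public:
  void Apply(const std::string &key, const std::string &value) {
    bytes_ += key.size() + value.size();
    (*active_)[key] = value;
    if (bytes_ >= kThreshold) MinorCompaction();
  }

 private:
  void MinorCompaction() {
    // Freeze the current memtable and take new writes in a fresh one.
    std::unique_ptr<Memtable> frozen = std::move(active_);
    active_ = std::make_unique<Memtable>();
    bytes_ = 0;
    WriteSSTableToGFS(*frozen);  // frozen memtable becomes a new SSTable
  }

  static constexpr std::size_t kThreshold = 64 << 20;  // assumed 64MB cap
  std::unique_ptr<Memtable> active_ = std::make_unique<Memtable>();
  std::size_t bytes_ = 0;
};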

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data still live in older SSTables. A major compaction, by contrast, produces an SSTable that contains no deletion information or deleted data (all live data resides in that single SSTable, so no deletion-related entries need to be retained). Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim the resources used by deleted data and ensure that deleted data disappears from the system promptly, which is important for services storing sensitive data.

Refinements#

The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability our users demand. This section describes portions of the implementation in more detail to highlight these refinements.

Locality groups#

Clients can group multiple column families into a locality group, with each tablet creating a separate SSTable for each locality group. Separating column families that are not typically accessed together into different locality groups can improve read efficiency. For example, in Webtable, page metadata can be placed in one locality group, while page content can be placed in a different locality group, so an application that needs to read page metadata does not need to read any page content.

In addition, some useful tuning parameters can be specified on a per-locality-group basis. For example, a locality group can be declared to be in-memory. The SSTables of an in-memory locality group are loaded lazily into the memory of the tablet server; once loaded, column families in such a locality group can be read without accessing disk. This feature is useful for small pieces of data that are accessed frequently: internally, we use it for the location column family in the METADATA table.

Compression#

Clients can control whether to compress the SSTables of a locality group, and if so, specify the compression format. The user-specified compression format will be applied to each SSTable block (the size can be controlled through tuning parameters of the locality group). Although compressing each block separately may lose some space, the benefit is that reading a small portion of an SSTable does not require decompressing the entire file. Many clients use a two-step custom compression format, where the first step uses Bentley and McIlroy’s scheme to compress long common strings over a large window, and the second step uses a fast compression algorithm to find duplicate patterns over a 16KB window. Both compression processes are very fast, with encoding speeds of 100–200 MB/s and decoding speeds of 400–1000 MB/s on modern machines.

Although we prioritize speed over space reduction when selecting compression algorithms, this two-step compression scheme has yielded unexpectedly good results. For example, in Webtable, we use this compression scheme to store web page content. In one experiment, we stored a large number of documents in a compressed locality group, and for the purpose of the experiment, we limited each document to only store one version. This compression scheme achieved a space reduction ratio of 10:1. This is much better than the typical Gzip compression ratio of 3:1 or 4:1 on HTML pages, as the row storage method in Webtable places all pages from the same host close together, allowing the Bentley-McIlroy algorithm to identify a large number of shared templates across pages from the same host.

Many applications, not just Webtable, choose row names that can ultimately aggregate similar data together, achieving very high compression ratios. Bigtable achieves even higher compression ratios when storing multi-version data.

Caching for read performance#

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code; the Block Cache is a lower-level cache that caches SSTable blocks read from GFS. The Scan Cache is most useful for applications that read the same data repeatedly, while the Block Cache is most useful for applications that read data close to data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

Bloom filters#

As described earlier, a read operation has to read from all the SSTables that make up the state of a tablet. If those SSTables are not in memory, this may require many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for the SSTables of a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, the small amount of tablet server memory used to store Bloom filters drastically reduces the number of disk seeks required by read operations. Using Bloom filters also means that most lookups for non-existent rows or columns do not need to touch disk at all.
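A minimal Bloom filter sketch for row/column membership, using std::hash-based double hashing; the bit-array size and probe count are arbitrary assumptions, and this is not Google's actual implementation:

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter {
 public:
  void Add(const std::string &row, const std::string &column) {
    std::size_t h1, h2;
    Hashes(row + '\0' + column, &h1, &h2);
    for (int i = 0; i < kProbes; ++i) bits_.set((h1 + i * h2) % kBits);
  }

  // May return a false positive, never a false negative: a "no" means the
  // SSTable definitely holds no data for this row/column pair, so the
  // disk read can be skipped entirely.
  bool MayContain(const std::string &row, const std::string &column) const {
    std::size_t h1, h2;
    Hashes(row + '\0' + column, &h1, &h2);
    for (int i = 0; i < kProbes; ++i)
      if (!bits_.test((h1 + i * h2) % kBits)) return false;
    return true;
  }

 private:
  static void Hashes(const std::string &key, std::size_t *h1, std::size_t *h2) {
    *h1 = std::hash<std::string>{}(key);
    *h2 = (*h1 * 0x9e3779b97f4a7c15ULL) | 1;  // cheap derived second hash
  }
  static constexpr int kBits = 1 << 20, kProbes = 4;
  std::bitset<kBits> bits_;
};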

Commit-log implementation#

If we were to write each tablet's commit log to different log files, there would be a very large number of files being concurrently written to GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a lot of disk seeks to write to different physical log files. Additionally, writing to different log files for each tablet would reduce the efficiency of group commit optimizations, as groups would become very small (Note: different commits written to different files cannot be placed in the same group). To address this issue, we append changes to a single commit log on each tablet server, mixing changes from different tablets into the same physical log file.

Using a single log provides significant performance benefits during normal operation, but it complicates recovery. When a tablet server dies, its tablets are moved to a large number of other tablet servers, each of which typically loads only a small number of the original server's tablets. To recover the state of a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original server. However, the mutations for that tablet are intermingled with mutations for other tablets in the same physical log file. One approach would be for each new tablet server to read the full commit log file and reapply just the entries needed for the tablets it must recover. Under this scheme, however, if the tablets from a failed tablet server were assigned to 100 machines, the log file would be read 100 times (once by each server).

To avoid duplicating these log reads, we first sort the commit log entries in order of the keys <table, row name, log sequence number>. In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64MB segments and sort each segment in parallel on different tablet servers. This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.
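Here is a sketch of that recovery-time sort under an assumed LogEntry type; keying by (table, row, sequence number) makes each tablet's mutations contiguous:

#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

struct LogEntry {
  std::string table, row;
  int64_t sequence;      // assigned at write time; also used to drop
                         // duplicates created by log-thread switching
  std::string mutation;  // the serialized change to replay
};

void SortForRecovery(std::vector<LogEntry> &entries) {
  // After this sort, all mutations for one tablet (one table + row range)
  // sit next to each other and can be replayed with a sequential read.
  std::sort(entries.begin(), entries.end(),
            [](const LogEntry &a, const LogEntry &b) {
              return std::tie(a.table, a.row, a.sequence) <
                     std::tie(b.table, b.row, b.sequence);
            });
}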

Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons: for example, a GFS server machine involved in the write may crash, or the network paths to the particular set of three GFS servers may be congested or heavily loaded. To protect mutations from such GFS latency spikes, each tablet server actually has two log-writing threads, each writing to its own log file, with only one of the two actively in use at a time. If writes to the active log file are performing poorly, the tablet server switches to the other thread, and mutations in the commit log queue are written by the newly active thread. Log entries contain sequence numbers, which allow the recovery process to elide duplicated entries resulting from this log-switching process.

Speeding up tablet recovery#

If the master moves a tablet from one tablet server to another, the source tablet server first performs a minor compaction on that tablet. This compaction reduces recovery time by reducing the amount of uncompacted state in the tablet server's commit log. After finishing this compaction, the tablet server stops serving the tablet. Before it actually unloads the tablet, the tablet server performs another (usually very fast) minor compaction to eliminate any remaining uncompacted state that arrived while the first minor compaction was running. After this second minor compaction completes, the tablet can be loaded on another tablet server without requiring any recovery of log entries.

Exploiting immutability#

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables; as a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write, which allows reads and writes to proceed in parallel.
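A sketch of per-row copy-on-write, assuming a single writer (concurrent writers would additionally need a mutex): readers take a snapshot without blocking the writer, and the writer copies, mutates, and atomically publishes the row.

#include <map>
#include <memory>
#include <string>

using RowData = std::map<std::string, std::string>;  // column key -> value

class CowRow {
 public:
  // Readers take a snapshot without blocking the writer; the old version
  // stays alive for as long as any reader still holds the shared_ptr.
  std::shared_ptr<const RowData> Read() const {
    return std::atomic_load(&current_);
  }

  // The writer copies the row, mutates the copy, and publishes it atomically.
  void Write(const std::string &column, const std::string &value) {
    auto next = std::make_shared<RowData>(*std::atomic_load(&current_));
    (*next)[column] = value;
    std::atomic_store(&current_,
                      std::shared_ptr<const RowData>(std::move(next)));
  }

 private:
  std::shared_ptr<const RowData> current_ = std::make_shared<RowData>();
};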

Since SSTables are immutable, the problem of permanently removing deleted data is transformed into garbage collecting obsolete SSTables. Each tablet's SSTables are registered in the METADATA table. The master removes obsolete SSTables with a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots. Finally, the immutability of SSTables lets us split tablets quickly: instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.

Lessons#

In the process of designing, implementing, maintaining, and supporting Bigtable, we have gained many experiences and learned several interesting lessons.

One lesson we learned is that large distributed systems are vulnerable to many kinds of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we have encountered problems due to all of the following: memory and network corruption, large clock skew, hung (unresponsive) machines, extended and asymmetric network partitions, bugs in other systems we use (such as Chubby), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained experience with these problems, we have changed various protocols to address them. For example, we added checksumming to our RPC mechanism. We also addressed these problems by removing assumptions that one part of the system made about another: for example, we no longer assume that a given Chubby operation can return only a fixed set of errors.

Another lesson we learned is the importance of delaying the addition of new features until it is clear how they will be used. For example, we initially planned to support general-purpose transactions in our API. Because there was no immediate use for them, however, we did not implement them. Now that we have many real applications running on Bigtable, we have been able to examine their actual needs and have found that most applications require only single-row transactions. For distributed transactions, the most important use is maintaining secondary indexes, and we plan to add a special-purpose mechanism to satisfy this need. The new mechanism will be less general than distributed transactions, but it will be more efficient (especially for updates that span hundreds of rows) and will also interact better with our scheme for optimistic replication across data-center replicas.

A practical lesson we learned from supporting Bigtable is the importance of proper system-level monitoring (for example, monitoring not only Bigtable itself but also the clients using Bigtable). For instance, we extended our RPC system so that, for a sample of RPCs, it keeps a detailed trace of the important actions performed on behalf of each of those RPCs. This feature has allowed us to detect and fix many problems, such as lock contention on tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses when METADATA tablets are unavailable. Another useful example of monitoring is that every Bigtable cluster is registered in Chubby, which lets us track all clusters, discover how big they are, see which versions of our software they are running, how much traffic they receive, and whether there are any problems such as unexpectedly large latencies.

The most important lesson we learned is the value of simple designs. Given the size of our system (about 100,000 lines of code, excluding tests) and the fact that the code evolves in unexpected ways, we have found that clarity of code and design is of enormous help in maintenance and debugging. One example is our tablet-server membership protocol. Our first protocol was simple: the master periodically issued leases to tablet servers, and tablet servers killed themselves when their leases expired. Unfortunately, this protocol reduced availability significantly in the presence of network problems and was also very sensitive to master recovery time. We redesigned the protocol several times until we had one that performed well. However, the resulting protocol was too complex and depended on behaviors of Chubby that were rarely exercised by other applications.

We found ourselves spending excessive time debugging obscure corner cases, not only in the Bigtable code but also in the Chubby code. Eventually, we scrapped this protocol and moved to a newer, simpler protocol that depends solely on widely used Chubby features.

Conclusions#

We have described Bigtable, a distributed system used by Google to store structured data. Bigtable has been in production use since April 2005, after nearly seven person-years of design and implementation. By August 2006, over sixty projects were using Bigtable. Our users appreciate the high performance and availability provided by Bigtable, and as resource demands grow over time, they can easily scale the cluster's capacity by simply adding more machines.

Given the somewhat unusual interface to Bigtable, an interesting question is how difficult it has been for our users to adapt to using it. New users are sometimes unsure how to best use the Bigtable interface, particularly if they are accustomed to the general-purpose transactions supported by relational databases. Nevertheless, the fact that many Google products have successfully used Bigtable demonstrates that our design works well in practice.

We are in the process of implementing several additional Bigtable features, such as support for secondary indexes and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas. We have also begun deploying Bigtable as a service to product groups, so that individual groups do not need to maintain their own clusters. As our service clusters scale, we will need to deal with more resource-sharing issues within Bigtable itself.

Finally, we have found that there are significant advantages to building our own storage solution at Google: we gain a substantial amount of flexibility from designing our own data model for Bigtable, and our control over Bigtable's implementation, and over the other Google infrastructure on which it depends, means that we can remove bottlenecks and inefficiencies as they arise.
