A classic paper in the field of distributed systems, translated while reading with the help of ChatGPT and Claude, focusing mainly on the first five chapters. See the original text here.
Overview#
GFS is a scalable distributed file system for large, distributed, data-intensive applications.
Application Scenarios and Design Points of GFS#
- The GFS system runs on large numbers of inexpensive commodity machines, so failures are common: application bugs, operating system bugs, human errors, and disk/memory/driver/network/power failures, among others. The system therefore needs constant monitoring, error detection, fault tolerance, and automatic recovery, treating component failures as the norm rather than the exception.
- The files stored in GFS are usually large, typically 100 MB or larger, and multi-GB files are common. Large files must therefore be managed efficiently; small files must be supported but need not be optimized for.
- Files are mutated mostly by appending new data rather than overwriting existing data; random writes within a file are rare.
- The workload consists mainly of large streaming reads (hundreds of KB to several MB), small random reads (a few KB), and large, sequential append writes.
- GFS files are often used as producer-consumer queues or for multi-way merging, so the system must efficiently support many clients appending to the same file in parallel.
- High sustained bandwidth is more important than low latency.
Interface#
The GFS interface provides the following operations: create, delete, open, close, read, write, snapshot, and record append. Snapshot creates a copy of a file or directory tree at low cost, and record append is an atomic append that allows many clients to append to the same file in parallel while guaranteeing the atomicity of each individual append; it is useful for multi-way merging and producer-consumer queues.
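As a rough sketch only (the paper does not give concrete signatures, and GFS is not POSIX), the client-facing interface could be summarized roughly as follows; all names and types here are hypothetical.

```python
# Hypothetical sketch of the client-facing operations listed above; the real GFS
# client library's signatures are not specified in the paper.
from abc import ABC, abstractmethod


class FileHandle:
    """Opaque handle returned by open(); a placeholder for this sketch."""


class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def delete(self, path: str) -> None: ...
    @abstractmethod
    def open(self, path: str) -> FileHandle: ...
    @abstractmethod
    def close(self, handle: FileHandle) -> None: ...
    @abstractmethod
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...
    @abstractmethod
    def snapshot(self, src_path: str, dst_path: str) -> None: ...
    @abstractmethod
    def record_append(self, handle: FileHandle, data: bytes) -> int:
        """Append data atomically at least once; return the offset GFS chose."""
```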
Architecture#
A GFS cluster has one master node and multiple chunkservers. All files are divided into fixed-size chunks, and each chunk is assigned a globally unique immutable 64-bit chunk handle by the master at creation. Chunks are stored on local disks by chunkservers and are read and written using chunk handles and byte ranges. To ensure reliability, each chunk is replicated on multiple chunkservers, with a default of three replicas.
The master node stores all file metadata, including the namespace, access control information, file-to-chunk mappings, and chunk location information. The master also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. It periodically communicates with chunkservers via HeartBeat messages to issue commands and collect status information.
Clients interact with the master to obtain and modify metadata. All data-bearing communications connect directly to chunkservers. Neither clients nor chunkservers cache files (though clients do cache metadata) for the following reasons:
- The benefits of client caching are minimal. Most requests will stream large files or have large working sets that cannot be cached. Not caching simplifies the client implementation and allows the entire system to avoid concerns about cache invalidation (consistency).
- Chunkservers also do not need an additional layer of caching because chunks themselves are stored as regular files, and Linux's buffer cache can keep frequently accessed files in memory.
Interaction Process#
- The client converts the file name and byte offset into a chunk index based on the fixed chunk size (a sketch of this read path follows the list).
- The client sends the filename and chunk index to the master.
- The master returns the chunk handles and location information for multiple replicas of the corresponding chunk.
- The client caches the information returned by the master, using the file name plus the chunk index as the key.
- The client specifies the chunk handle and byte range and sends a read request to the chunkserver holding one of the replicas (usually the nearest one).
- Subsequent read requests for the same chunk use the local cache, without needing to interact with the master, until the cached information expires or the file is reopened.
- Typically, the client asks for multiple chunks in a single request, and the master may also return information for chunks it expects the client to need next, reducing the overhead of subsequent requests.
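A minimal sketch of this client-side read path, assuming hypothetical RPC stubs for the master and chunkservers (the 64 MB chunk size is GFS's; everything else, including the cache layout, is illustrative):

```python
# Illustrative client-side read path; master_lookup and chunkserver_read are
# hypothetical RPC stubs standing in for the real protocol.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size


class ReadClient:
    def __init__(self, master_lookup, chunkserver_read):
        self.master_lookup = master_lookup        # (filename, chunk_index) -> (handle, replica locations)
        self.chunkserver_read = chunkserver_read  # (location, handle, start, length) -> bytes
        self.cache = {}                           # (filename, chunk_index) -> (handle, locations)

    def read(self, filename: str, offset: int, length: int) -> bytes:
        # For simplicity, assume the requested range lies within a single chunk.
        # 1. Translate the byte offset into a chunk index using the fixed chunk size.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        # 2-4. Ask the master only on a cache miss; cache the reply keyed by (filename, chunk index).
        if key not in self.cache:
            self.cache[key] = self.master_lookup(filename, chunk_index)
        handle, locations = self.cache[key]
        # 5. Send the read (chunk handle + byte range) to one replica, e.g. the closest one.
        start = offset % CHUNK_SIZE
        return self.chunkserver_read(locations[0], handle, start, length)
```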
Chunk Size#
The chunk size chosen by GFS is 64MB, which is larger than the typical disk block size of operating systems. Each chunk replica is stored on the chunkserver as a regular Linux file, with space allocation being lazy to avoid space waste due to internal fragmentation. Choosing a larger chunk size has the following advantages:
- Fewer chunks per file reduce the number of requests for clients to interact with the master to obtain chunkserver location information.
- A client can perform multiple operations on the same chunk with just one persistent TCP connection, reducing the overhead of establishing many connections.
- The total number of chunks in the system is reduced, thereby decreasing the size of metadata that the master needs to store, allowing it to keep metadata in memory.
Typically, small files contain only a few chunks, or even just one. If many clients access the same small file, the chunkservers holding it may become hotspots. In practice, hotspots have not been a major issue because applications mostly read large multi-chunk files sequentially. However, when GFS was first used by a batch-queue system, a hotspot did arise: an executable written to GFS as a single-chunk file was started on hundreds of machines at the same time, and the few chunkservers storing it were overloaded by hundreds of simultaneous requests. We fixed this in two ways: increasing the replication factor for such files, and having applications stagger their start times to avoid simultaneous reads. Another potentially effective solution is to allow clients to read data from other clients in such scenarios.
Metadata#
The master stores three types of metadata: the namespace of files and chunks, the mapping of files to chunks, and the location information of each chunk replica. All metadata resides in the master’s memory, and the master persists the change logs of the first two types of information to local and remote disks. The master can simply update its state through log replay, ensuring reliability and consistency in state recovery after crashes.
The master does not persistently store the location information of chunk replicas but queries chunkservers at startup or when new chunkservers join the cluster.
Memory State#
Since metadata is stored in memory, master operations are fast, and the master can efficiently scan its complete state periodically in the background. Periodic scanning is used for garbage collection, re-replicating chunks whose chunkservers have failed, and migrating chunks to balance load and disk usage across chunkservers.
The issue with keeping everything in memory is that the total capacity of the system is limited by the master's memory. In practice this is usually not a problem: the metadata for a 64MB chunk is generally less than 64 bytes, and most chunks are full because most files span many chunks, with only the last one partially filled. Likewise, the namespace metadata is generally less than 64 bytes per file, since filenames are stored compactly using prefix compression. Finally, if larger file systems need to be supported, adding memory to the master machine is a small price to pay.
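A rough back-of-the-envelope check of why this works (my numbers, not the paper's): even a petabyte of file data needs only about a gigabyte of chunk metadata.

```python
# Back-of-the-envelope: master memory needed for chunk metadata at 1 PB of data.
CHUNK_SIZE = 64 * 2**20        # 64 MB per chunk
METADATA_PER_CHUNK = 64        # less than 64 bytes of metadata per chunk (per the paper)
total_data = 10**15            # 1 PB of stored file data (illustrative)

chunks = total_data // CHUNK_SIZE                      # ~15 million chunks
print(chunks * METADATA_PER_CHUNK / 2**30, "GiB")      # ~0.9 GiB -> fits easily in RAM
```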
Chunk Location#
The master does not keep a persistent record of which chunkserver holds which chunk replicas; it simply polls the chunkservers for this information at startup. The master can keep itself up to date thereafter because it controls all chunk placement and monitors chunkserver status through regular HeartBeat messages.
Polling at startup eliminates many synchronization problems between the master and chunkservers as chunkservers join or leave the cluster, change names, reboot, crash, and so on; in a cluster of any size these events happen frequently. Moreover, many errors on a chunkserver can cause chunks to vanish (e.g., a failed disk being disabled), and an operator may rename a chunkserver. Only the chunkserver itself ultimately knows whether it actually holds a valid chunk, so persisting this information on the master would add little value.
Operation Logs#
The operation log contains the change history of critical metadata and is the only persistent record of that metadata; it also provides a logical timeline that orders all concurrent operations. Each file and chunk, as well as each chunk version, is identified by the logical time at which it was created. Because the operation log is so important, it must be stored reliably, and changes are not made visible to clients until the corresponding log records have been persisted. The log is replicated on remote machines, and the master responds to a client operation only after flushing the corresponding log record to both local and remote disks.
The master recovers by replaying operation logs. To minimize the master startup time, the logs need to be as small as possible. When the log size reaches a certain threshold, the master creates a checkpoint of the current complete state, and upon the next startup, it only needs to replay the logs from the last checkpoint. The checkpoint itself is a compact B-tree structure that can be directly mapped into memory for namespace queries without additional parsing.
Creating a checkpoint takes some time, so the master's internal state is structured so that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and builds the checkpoint in a separate thread; the new checkpoint includes all changes made before the switch. A cluster with a few million files can create a checkpoint in about a minute, and the checkpoint is considered complete only after it has been written to both local and remote disks. When the master recovers, it only needs the latest complete checkpoint and the operation logs written after it; older checkpoints and log files can be deleted. Failures during checkpoint creation do not affect correctness, because the recovery process detects and skips incomplete checkpoints.
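A highly simplified sketch of recovery by replay: load the latest complete checkpoint, then re-apply the log records written after it. The JSON format, names, and mutation types here are invented; the real checkpoint is a B-tree image and the real log is replicated remotely.

```python
# Toy model of master recovery: load the latest complete checkpoint, then replay
# the operation log records written after it. Formats and names are invented.
import json


def save_checkpoint(path: str, state: dict, log_position: int) -> None:
    # A real checkpoint is a compact B-tree image built in a background thread;
    # here it is just JSON plus the log position it covers.
    with open(path, "w") as f:
        json.dump({"log_position": log_position, "state": state}, f)


def apply_mutation(state: dict, record: dict) -> None:
    # e.g. {"op": "create", "path": "/d1/f", "chunks": []}
    if record["op"] == "create":
        state[record["path"]] = {"chunks": record.get("chunks", [])}
    elif record["op"] == "delete":
        state.pop(record["path"], None)


def recover(checkpoint_path: str, log_records: list[dict]) -> dict:
    with open(checkpoint_path) as f:
        snapshot = json.load(f)
    state = snapshot["state"]
    # Replay only the mutations logged after the checkpoint was taken.
    for record in log_records[snapshot["log_position"]:]:
        apply_mutation(state, record)
    return state
```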
Consistency Model#
GFS uses a relatively relaxed consistency model that still admits a simple and efficient implementation. GFS guarantees that namespace mutations (e.g., file creation) are atomic: the master performs them under namespace locking, and the master's operation log defines a global total order for them.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeded or failed, and whether there were concurrent mutations.
A file region refers to the byte range of a file on the storage device. If all clients can read the same data from any replica, then this file region is consistent. If a file's data changes while remaining consistent, and all clients can see the complete content of the change written, then this region is defined (this term is not easy to translate; it can be understood as the content of this file region being determinable by the client).
- Write failure: The result is, of course, inconsistent.
- Non-concurrent (sequential) write success: The result is definitely defined; the content of the corresponding file region is what the client wrote.
- Concurrent write success: The file region is consistent, but the content written to the corresponding file region may be a mix of multiple write segments, so it is undefined.
- Sequential/concurrent append success: Since appends are atomic, the region containing the successfully appended record is defined, though it may be interspersed with inconsistent regions (padding or duplicate fragments) left by earlier or failed operations.
Data mutations can be writes or record appends: a write requires the client to specify an offset, while a record append is applied at an offset GFS chooses, near what the client believes is the end of the file. A record append is executed atomically at least once, even in the presence of concurrent mutations, but the final offset is chosen by GFS; the offset returned to the client marks the beginning of a defined region containing the appended record (regions written by earlier successful or failed operations may precede it).
GFS may insert padding or duplicate records between appended records; these regions are considered inconsistent but are typically dwarfed by the amount of user data. After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation.
GFS achieves this by applying mutations to a chunk in the same order on all its replicas, and by using chunk version numbers to detect stale replicas, which may have missed mutations while their chunkserver was offline. Stale replicas never participate in subsequent mutations and are never returned to clients asking the master for chunk locations; they are garbage collected as early as possible. Because clients cache chunk location information, they may read from a stale replica before the cache entry is refreshed; this window is limited by the cache entry's timeout and the next open of the file. Since most files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data; a client that needs the latest data will fail, retry, and ask the master for chunk information again, at which point it immediately receives the current chunk locations.
Even long after a successful mutation, component failures can still corrupt or destroy data. GFS identifies failed chunkservers through regular handshakes between the master and all chunkservers, and detects data corruption by checksumming. Once a problem surfaces, the data is restored from valid replicas as soon as possible. A chunk is irreversibly lost only if all of its replicas are lost before GFS can react, which typically takes only a few minutes; even then, the data becomes unavailable, not corrupted, and clients receive clear errors rather than corrupt data.
For applications using GFS, adapting to this relaxed consistency means relying on appends rather than random overwrites, checkpointing how much has been successfully written (possibly with application-level checksums), and only reading and verifying data up to the last checkpoint. This handles both consistency and concurrency issues well: after a write error, the application can resume writing incrementally from its last checkpoint; readers can use checksums to discard padding and record fragments; and if occasional duplicates are unacceptable, writers can embed a unique ID in each record so readers can filter them out.
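One way an application might make records self-validating so that padding, fragments, and duplicates can be filtered out, as suggested above; the framing format here is invented, not something GFS specifies.

```python
# Hypothetical self-validating record framing for readers of append-only GFS files:
# each record carries a length, a checksum, and a unique id so the reader can skip
# padding/fragments (checksum mismatch) and drop duplicates (repeated id).
import struct
import zlib

HEADER = struct.Struct("<IIQ")   # payload length, CRC32 of payload, unique record id


def encode_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload


def decode_records(data: bytes):
    seen, pos = set(), 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:      # ignore records appended more than once
                seen.add(record_id)
                yield record_id, payload
            pos += HEADER.size + length
        else:
            pos += 1                        # padding or garbage: resynchronize byte by byte
    # In practice readers would also stop at an application checkpoint of "bytes known good".
```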
System Interaction#
Leases and Change Order#
Mutations are operations that alter data or metadata, and each mutation is applied to all replicas of a chunk. GFS uses a lease mechanism to maintain a consistent mutation order across replicas. Before mutations are executed, the master grants a chunk lease to one of the replicas, which becomes the primary. The primary decides the serial order of all mutations on the chunk, and the other replicas apply mutations in that order.
Thus the global mutation order is defined first by the order in which the master grants leases and, within a lease, by the serial numbers assigned by the primary. The lease mechanism minimizes management overhead on the master. A lease has an initial timeout of 60 seconds, but as long as the chunk is being mutated, the primary can request and receive extensions from the master; these extension requests and grants are piggybacked on the HeartBeat messages. The master may sometimes revoke a lease, for example to disable mutations on a file that is being renamed. Even if the master loses contact with a primary, it can safely grant a new lease to another replica after the old lease expires.
The process for a client to execute a change is as follows:
1. The client asks the master which chunkserver holds the lease for the chunk and for the locations of the other replicas. If no one holds a lease, the master grants one to a replica it chooses.
2. The master replies with the locations of all replicas and identifies which one is the primary. The client caches this information for future use; it contacts the master again only if it cannot reach the primary or if the primary no longer holds the lease.
3. The client pushes the data to all replicas, in any order; each chunkserver stores the data in an internal LRU buffer cache until the data is used or expires. Decoupling the data flow from the control flow lets the data flow be scheduled based on network topology to improve performance.
4. Once all replicas acknowledge receiving the data, the client sends a write request to the primary, identifying the data that was pushed earlier. The primary assigns consecutive serial numbers to all the mutations it receives (possibly from multiple clients) and applies them to its own local state in serial-number order.
5. The primary forwards the write request to all secondaries, which apply the mutations in the same serial-number order assigned by the primary.
6. The secondaries all reply to the primary, indicating that they have completed the operation.
7. The primary replies to the client. If any secondary failed, the primary reports the error to the client; the modified region is now inconsistent, and the client retries the failed mutation, repeating steps 3 through 7 (a sketch of the primary's role follows).
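A condensed sketch of the primary's role in steps 4 through 7: it assigns serial numbers to mutations and fans them out to the secondaries in that order. Leases, the LRU data buffers, and real RPCs are omitted; all names here are invented.

```python
# Toy primary for one chunk: it serializes incoming mutations with serial numbers
# and forwards them to the secondaries in that order.
class Primary:
    def __init__(self, local_replica, secondaries):
        self.local = local_replica        # object with an .apply(serial, mutation) method
        self.secondaries = secondaries    # list of objects with the same method
        self.next_serial = 0

    def handle_write(self, mutation) -> tuple[bool, int]:
        # The serial number assigned here defines the mutation order that every
        # replica of the chunk will follow (step 4).
        serial = self.next_serial
        self.next_serial += 1
        self.local.apply(serial, mutation)          # apply to the primary's local state first
        ok = True
        for secondary in self.secondaries:          # forward in serial-number order (step 5)
            try:
                secondary.apply(serial, mutation)
            except Exception:
                ok = False                          # any failure is reported to the client,
                                                    # which retries steps 3 through 7
        return ok, serial
```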
If a write is very large or spans chunk boundaries, the GFS client code splits it into multiple write operations, each following the process above. If multiple clients write concurrently, however, their operations may be interleaved with or overwrite one another, so the shared file region may end up containing fragments from different clients. Because all replicas apply the operations in the same order, the region is consistent, but it is undefined: it may not reflect what any single client wrote.
Data Flow#
Decoupling control flow and data flow is intended to make network transmission more efficient. Control flow goes from the client to the primary and then to other secondaries, while data flow is pushed linearly along a carefully selected chain of chunkservers in a pipelined manner, aiming to maximize the bandwidth utilization of each machine and minimize the latency of pushing all data. The total outbound bandwidth of each machine is utilized as much as possible to transmit data quickly, rather than splitting data for multiple receivers.
To avoid network bottlenecks and high-latency links, each machine will choose the nearest machine that has not yet received data for forwarding. The internal network topology of GFS is very simple, allowing for precise distance estimation through IP addresses. GFS reduces latency by transmitting data in a pipelined manner; once a chunkserver starts receiving data, it will immediately begin forwarding.
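The original paper quantifies this pipelined push: without network congestion, the ideal elapsed time for transferring B bytes to R replicas is roughly B/T + R·L, where T is the per-link network throughput and L is the latency to transfer bytes between two machines. With 100 Mbps links (about 12.5 MB/s) and latency well under 1 ms, 1 MB can ideally be distributed to all replicas in about 80 ms.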
Atomic Record Append#
GFS provides an atomic append operation called record append. Unlike a regular write, the client specifies only the data, not the offset. GFS guarantees that the data is appended to the file atomically at least once, at an offset of GFS's choosing, and returns that offset to the client. This is similar to opening a file with O_APPEND in Unix, without the race conditions when multiple writers append concurrently; achieving the same effect with regular writes would require something like a distributed lock manager.
Record append is a type of change operation that follows the same control flow, but with slight differences on the primary. The client pushes the append data to all replicas of the last chunk at the end of the file, then sends the request to the primary, which checks whether appending the record to the current chunk would exceed the maximum size (64MB).
- If it would exceed the maximum size, the primary pads the current chunk up to the maximum, instructs all secondaries to do the same, and tells the client to retry the operation on the next chunk (record appends are limited to one quarter of the chunk size to keep worst-case fragmentation acceptable).
- If it fits, the primary appends the record to its local replica, tells all secondaries to write the data at the exact same offset, and finally replies to the client.
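A toy version of this decision on the primary (the 64 MB chunk size and the one-quarter limit are the paper's; the Replica class and names are invented):

```python
# Toy primary-side record append: pad and ask the client to retry on the next chunk
# if the record does not fit, otherwise append at an offset chosen by the primary.
CHUNK_SIZE = 64 * 2**20
MAX_APPEND = CHUNK_SIZE // 4           # record appends are capped at 1/4 of the chunk size


class Replica:
    """Minimal stand-in for one replica of a chunk on a chunkserver."""
    def __init__(self):
        self.size = 0
    def pad(self, n):                   # fill the rest of the chunk with padding
        self.size += n
    def write_at(self, offset, data):   # write the record at a specific offset
        self.size = max(self.size, offset + len(data))


def record_append(primary: Replica, secondaries: list, data: bytes):
    assert len(data) <= MAX_APPEND, "record too large for a single append"
    if primary.size + len(data) > CHUNK_SIZE:
        pad = CHUNK_SIZE - primary.size
        primary.pad(pad)                        # the primary pads its own replica...
        for s in secondaries:
            s.pad(pad)                          # ...and tells the secondaries to do the same
        return "retry_on_next_chunk", None
    offset = primary.size                       # the offset is chosen by the primary
    primary.write_at(offset, data)
    for s in secondaries:
        s.write_at(offset, data)                # secondaries write at the exact same offset
    return "ok", offset
```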
If record append fails on any replica, the client will retry this operation. Ultimately, different replicas of the same chunk may contain different data, including all or part of the same record being duplicated.
GFS does not guarantee that all replicas are byte-for-byte identical; it only guarantees that the data is written atomically at least once. For a successful record append, the data must have been written at the same offset on all replicas of the chunk, and afterwards every replica is at least as long as the end of that record, so any future record is assigned a higher offset or a different chunk. Under the earlier definitions, the region containing a successful record append is defined, and hence consistent.
Snapshot#
The snapshot operation creates a copy of a file or directory tree almost instantly, while minimizing any interruption of ongoing mutations.
GFS uses copy-on-write to implement snapshots. When the master receives a snapshot request, it first revokes all outstanding leases on the chunks of the files involved, ensuring that any subsequent write to these chunks must ask the master for the lease holder. This gives the master the opportunity to create a new copy of the chunk first.
After the lease is revoked/expired, the master writes the snapshot operation log to the local disk, modifies the memory state, and copies the metadata of the source file/directory tree.
The newly created snapshot file still points to the chunks of the source file. After the snapshot, when a client wants to write to one of these chunks (call it C), it asks the master for the current lease holder. The master notices that the reference count for C is greater than one, defers its reply, picks a new chunk handle C', and asks every chunkserver holding a replica of C to create the new chunk C'. Each chunkserver copies the chunk locally, without going over the network (disks in the GFS cluster are roughly three times as fast as the network links). The master then grants a lease on the new chunk C' and replies to the client, which writes to the copied chunk as normal.
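A simplified sketch of this metadata-level copy-on-write: the snapshot duplicates only file-to-chunk mappings and bumps reference counts, and the first write to a shared chunk triggers a lazy copy. All structures and names below are invented.

```python
# Toy copy-on-write snapshot at the metadata level: snapshotting copies only the
# file-to-chunk mapping; a shared chunk is cloned lazily on the first write after it.
import itertools


class Master:
    def __init__(self):
        self.files = {}                 # path -> list of chunk handles
        self.refcount = {}              # chunk handle -> number of files referencing it
        self._handles = itertools.count(1)

    def create(self, path: str) -> None:
        self.files[path] = []

    def add_chunk(self, path: str) -> int:
        handle = next(self._handles)
        self.files[path].append(handle)
        self.refcount[handle] = 1
        return handle

    def snapshot(self, src: str, dst: str) -> None:
        # (After revoking leases and logging the operation.) Duplicate metadata only;
        # both files now reference the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def prepare_write(self, path: str, index: int) -> int:
        # Called when a client asks for the lease holder of a chunk it wants to write.
        handle = self.files[path][index]
        if self.refcount[handle] > 1:            # chunk C is shared with a snapshot
            new_handle = next(self._handles)     # new chunk C'; each chunkserver holding C
            self.refcount[handle] -= 1           # would copy it locally at this point
            self.refcount[new_handle] = 1
            self.files[path][index] = new_handle
            handle = new_handle
        return handle                            # a lease is then granted on this chunk
```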
Master Operations#
The master performs all namespace-related operations, manages all chunk replicas within the system, determines where to create new chunks and their replicas, coordinates various system-level activities, ensures all chunk backups are complete, balances the load of chunkservers, and reclaims unused storage, etc.
Namespace Management and Locks#
Many master operations can take a long time. To keep one slow operation from delaying others, GFS allows multiple operations to be active at the same time and uses locks over regions of the namespace to ensure correct serialization.
Unlike traditional file systems, GFS has no per-directory data structure that lists the files in that directory, and it does not support aliases for files or directories (soft or hard links). Logically, the namespace is a single lookup table mapping full pathnames to metadata, which can be represented efficiently in memory using prefix compression.
Each node in the namespace (the absolute path of a filename or directory name) has an associated read-write lock. The master needs to acquire a certain set of these locks before executing each operation. For example, if an operation is associated with /d1/d2/.../dn/leaf, it needs to acquire read locks for /d1, /d1/d2, ..., /d1/d2/.../dn and a read or write lock for /d1/d2/.../dn/leaf (depending on the needs of the corresponding operation). Here, the leaf node in the path may be a file or a directory in different operations.
Consider a scenario: while /home/user is being snapshotted to /save/user, /home/user/foo is created. The snapshot operation acquires read locks on /home and /save and write locks on /home/user and /save/user, while the file creation acquires read locks on /home and /home/user and a write lock on /home/user/foo. The two operations conflict on /home/user and are therefore serialized correctly. File creation does not need a write lock on /home/user because GFS's "directory" holds no information that needs to be modified; a read lock on the directory is enough to prevent it from being deleted, renamed, or snapshotted.
This locking model can effectively support concurrent change operations in the same directory, such as concurrently creating multiple files in the same directory. Since there can be many nodes in the namespace, read-write lock objects are lazily created and immediately deleted once they are no longer in use.
To avoid deadlocks, all locks are requested in a consistent order: first sorted by hierarchy in the namespace, then sorted lexicographically at the same level.
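A toy illustration of this locking rule: read locks on every ancestor path, a read or write lock on the full pathname, acquired in a consistent order (by level in the tree, then lexicographically within a level) to avoid deadlock.

```python
# Compute the locks an operation must take on the namespace, in deadlock-free order:
# read locks on all ancestors, then a read or write lock on the leaf pathname.
def locks_needed(path: str, leaf_write: bool):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    needed = [(p, "read") for p in ancestors]
    needed.append((path, "write" if leaf_write else "read"))
    # Consistent total order: first by level in the namespace tree, then lexicographically.
    return sorted(needed, key=lambda lock: (lock[0].count("/"), lock[0]))


# The snapshot example above: snapshotting to /save/user takes a write lock on
# /save/user, while creating /home/user/foo takes only read locks on its ancestors.
print(locks_needed("/save/user", leaf_write=True))
print(locks_needed("/home/user/foo", leaf_write=True))
```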
Replica Location#
A GFS cluster may have hundreds of chunkservers distributed across different racks, and these chunkservers are accessed by hundreds of clients from the same or different racks. Communication between machines on different racks may need to pass through one or more switches. The input/output bandwidth of a rack may be smaller than the sum of the bandwidth of all machines in the rack.
The strategy for selecting replica locations serves two goals: maximizing data reliability and availability, and maximizing network bandwidth utilization.
Spreading replicas across machines only guards against disk or machine failures and fully utilizes each machine's network bandwidth. Replicas must also be spread across racks, so that some replicas of a chunk remain available even if an entire rack is lost (e.g., due to a shared resource such as a switch or power circuit failing), and so that reads of a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic then has to flow across multiple racks, a trade-off made willingly.
Replica Creation, Re-replication, and Balancing#
Chunk replicas are created in three situations: chunk creation, re-replication, and rebalancing.
When the master creates a chunk, it selects the storage location for the replicas, considering several factors:
- Store on chunkservers with disk utilization below average, so that over time, the disk utilization of all chunkservers will be relatively balanced.
- Limit the number of recent creations on each chunkserver, because creation is usually followed by heavy write traffic, and that traffic should not be concentrated on a few servers.
- Distribute chunks as much as possible across chunkservers on different racks.
When the number of available replicas for a chunk falls below the user-specified target, such as when chunkservers become unavailable or disk failures lead to replica corruption, or when the user increases the specified number of replicas, the master will promptly initiate re-replication.
Each chunk that needs to be re-replicated will be prioritized based on the following conditions:
- How far it is from the replication target; for example, a chunk that has lost two replicas will have a higher priority than one that has lost one replica.
- Prioritize chunks of live files over chunks belonging to recently deleted files.
- To minimize the impact of failures on running applications, chunks that block client requests will be given higher priority.
The master will select the highest-priority chunk and issue commands to the designated chunkserver to copy data directly from the currently existing valid replicas. The location selection for the new replicas follows the same rules as when chunks are created.
To keep cloning traffic from overwhelming client traffic, the master limits the number of active clone operations both for the whole cluster and for each chunkserver. In addition, each chunkserver limits the bandwidth it spends on cloning by throttling its read requests to the source chunkserver.
The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk-space and load balance. The placement of the new replicas follows the same criteria as above. After creating a new replica, the master must choose an existing replica to remove, and it generally prefers one on a chunkserver with below-average free space, so as to equalize disk usage.
Garbage Collection#
When a file is deleted, GFS does not immediately reclaim the physical space it occupies but performs file and chunk-level reclamation during regular GC periods. This approach makes the entire system simpler and more reliable.
When an application deletes a file, the master logs the deletion immediately and renames the file to a hidden name that includes the deletion timestamp. During its regular scan of the file system namespace, the master removes any such hidden files that have existed for more than three days. Until then, the file can still be read under its special hidden name and can be undeleted by renaming it back to a normal name.
Once a hidden file is removed from the namespace, its in-memory metadata is erased, which severs its links to its chunks. In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (chunks not reachable from any file) and erases their metadata. In the HeartBeat messages exchanged with the master, each chunkserver reports a subset of the chunks it holds, and the master replies with the identities of any chunks that no longer exist in its metadata; the chunkserver is then free to delete those replicas.
The advantages of this GC approach are:
- In a large-scale distributed system, this implementation is simple and reliable. Chunk creation may succeed on only some chunkservers, leaving replicas the master does not know about, and replica-deletion messages may be lost. Garbage collection provides a uniform, dependable way to clean up such replicas without the master having to remember and retry individual deletions.
- All storage reclamation occurs in the background in batches and during the relatively idle periods of the master, allowing the master to respond more quickly to urgent client requests during normal times.
- Delaying the reclamation of space can prevent irreversible deletions caused by accidents.
The main disadvantage of delayed deletion is that it hinders a user's effort to fine-tune space usage when storage is tight (e.g., by deleting some files), and applications that repeatedly create and delete temporary files cannot immediately reuse the storage. GFS addresses this by expediting reclamation when a deleted file is explicitly deleted again. Applications can also use different replication and reclamation policies for different parts of the namespace; for example, a user can specify that all chunks under a certain directory tree are stored without replication, and that any deleted file is removed immediately and irrevocably from the file system.
Checking Outdated Replicas#
Replicas may become outdated and lose changes due to chunkservers going offline. The master maintains a version number for each chunk to distinguish the latest replicas from outdated ones. When the master grants a lease on a chunk, it increments its version number and informs all its current latest replicas, which then persist the new version number. These steps occur before the client writes to the chunk.
If a replica is unavailable while a lease is granted, its chunk version number falls behind. When the chunkserver restarts and reports its chunks and their version numbers, the master detects that it holds stale replicas. If the master ever sees a version number higher than the one in its records, it assumes that it failed while granting the lease (since no writes can have occurred under a lease that was never granted) and adopts the higher version as the current one.
The master removes outdated replicas during regular GC periods and ignores outdated replicas when returning chunk-related information to clients. As an additional guarantee, when the master informs the client which chunkserver holds the lease or instructs a chunkserver to copy data from another chunkserver, it will include the chunk version, and the client and chunkserver will verify the chunk version during operations to ensure they are accessing the latest data.
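A sketch of the version-number bookkeeping described above; the structures and return values are invented.

```python
# Toy version-number bookkeeping: the master bumps the version when granting a lease,
# and any replica reporting an older version is considered stale (garbage-collected later).
class ChunkVersions:
    def __init__(self):
        self.current = {}            # chunk handle -> latest version known to the master

    def grant_lease(self, handle: int) -> int:
        # Increment before any client can write; up-to-date replicas persist this number.
        self.current[handle] = self.current.get(handle, 0) + 1
        return self.current[handle]

    def check_report(self, handle: int, reported: int) -> str:
        latest = self.current.get(handle, 0)
        if reported < latest:
            return "stale"           # excluded from client replies, removed in regular GC
        if reported > latest:
            # Can only happen if the master failed while granting a lease, so the
            # higher version is taken to be the current one.
            self.current[handle] = reported
            return "adopted"
        return "up-to-date"
```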
Fault Tolerance and Diagnosis#
One of the biggest challenges in designing the GFS system is coping with frequent component failures. The quality and quantity of components make failures a norm rather than an exception: machines and disks cannot be fully trusted. Component failures can lead to system unavailability and even data corruption.
High Availability#
A GFS cluster has hundreds of servers, some of which may be unavailable at any given time. GFS keeps the overall system highly available through two simple but effective strategies: fast recovery and replication.
Rapid Recovery#
Both the master and chunkservers are designed to recover their state and start up within seconds, regardless of how they terminate. In fact, GFS does not distinguish between normal and abnormal terminations; servers are simply killed during routine shutdowns. Clients or other services will briefly pause when their unfinished requests time out and then reconnect to the restarted service to retry.
Chunk Replication#
As mentioned earlier, each chunk is replicated on multiple chunkservers; the master clones existing replicas as needed when it detects a chunkserver going offline or a corrupted replica (via checksum verification). The GFS team was also exploring other forms of cross-server redundancy, such as parity or erasure codes, for growing read-only storage needs.
Master Replication#
To ensure reliability, the state of the master is also replicated. Operation logs and checkpoints are replicated across multiple machines, and each state-changing operation is only considered committed after the logs have been flushed to local disks and all master replicas.
To keep things simple, a single master process remains in charge of all mutations and of background activities such as garbage collection. When the master process fails, it can be restarted almost instantly. If the master's machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere using a replica of the operation log. Clients access the master only through a canonical name (e.g., gfs-test), a DNS alias that can be changed when the master is relocated to another machine, never through an IP address or other names that change with the machine.
Moreover, "shadow" master replicas can provide read-only access to the file system even when the primary master is down, though they may lag the primary slightly, typically by fractions of a second. They enhance read availability for files that are not being actively mutated, or for applications that can tolerate slightly stale results. In fact, since file content is read from chunkservers, applications generally cannot observe stale file content; only metadata, such as directory contents or access control information, can become stale within that short window.
Each shadow master reads a replica of the growing operation log and applies the same sequence of changes to its in-memory state exactly as the primary does. Like the primary, it polls all chunkservers at startup to locate chunk replicas and then monitors their status through frequent handshake messages. It depends on the primary master only for replica-location updates resulting from the primary's decisions to create and delete replicas.
Data Integrity#
Each chunkserver uses checksums to check for data corruption. A GFS cluster may have thousands of disks distributed across hundreds of machines, and data corruption and loss during reads or writes are common. In such cases, recovery can be performed using other chunk replicas. Checking for data corruption by comparing different replicas on chunkservers is not very practical. Moreover, in some operations, such as atomic record appends, it cannot be guaranteed that all replicas are completely consistent, but these replicas are still valid. Therefore, each chunkserver must independently verify data integrity by maintaining checksums.
Each chunk is divided into 64 KB blocks, and each block has a 32-bit checksum; checksums are kept in memory and stored persistently with logging, separate from user data. For a read request, the chunkserver verifies the checksums of the blocks that overlap the requested range before returning any data to the requester (a client or another chunkserver). If a checksum does not match, the chunkserver returns an error instead of the corrupted data and reports the mismatch to the master. The requester then reads from another replica, while the master clones the chunk from a valid replica; once a new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its corrupted replica.
Calculating checksums has a minimal impact on read performance, as most read requests span only a few blocks, so only a small portion of additional data is needed for checksums. Additionally, the client code of GFS will attempt to align reads according to block sizes to reduce this overhead. Furthermore, the lookup and verification of checksums incur no I/O overhead, and checksum calculations can typically overlap with I/O operations (calculating checksums while reading files).
Checksum computation is heavily optimized for appends to the end of a chunk, because such writes dominate the workload. We incrementally update the checksum of the last, partially filled block and compute new checksums for any blocks filled entirely by the append. Even if the last partial block is already corrupted and the corruption goes undetected now, the new checksum will not match the stored data, so the corruption is detected the next time that block is read.
In contrast, if a write overwrites an existing range of a chunk, we must first read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums; otherwise the new checksums could hide corruption in the parts of those blocks that are not overwritten.
During idle periods, chunkservers will scan and verify the data of inactive chunks. Once corruption is detected, the master will create new replicas and delete the corrupted replicas. This can prevent inactive chunks from quietly becoming corrupted without the master’s knowledge, leading to insufficient valid replicas.
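A toy model of the per-block checksumming described in this section: 64 KB blocks with a 32-bit checksum each, verification of only the blocks a read touches, and checksum updates on append. CRC32 and the in-memory layout are illustrative choices, not necessarily what GFS uses.

```python
# Toy chunkserver checksumming: one 32-bit checksum per 64 KB block, verified on read
# for just the blocks the request touches, and refreshed on append.
import zlib

BLOCK = 64 * 1024


class ChecksummedChunk:
    def __init__(self):
        self.data = bytearray()
        self.sums = []                                   # one checksum per 64 KB block

    def _block_sum(self, block_index: int) -> int:
        block = self.data[block_index * BLOCK:(block_index + 1) * BLOCK]
        return zlib.crc32(block)

    def append(self, payload: bytes) -> None:
        self.data += payload
        # Recompute the checksum of the last (possibly partial) block and compute
        # checksums for any brand-new blocks filled by this append. (The real system
        # updates the last block's checksum incrementally instead of recomputing it.)
        first = max(len(self.sums) - 1, 0)
        last = (len(self.data) - 1) // BLOCK
        for i in range(first, last + 1):
            if i < len(self.sums):
                self.sums[i] = self._block_sum(i)
            else:
                self.sums.append(self._block_sum(i))

    def read(self, offset: int, length: int) -> bytes:
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        for i in range(first, last + 1):                 # verify only the blocks in range
            if self._block_sum(i) != self.sums[i]:
                raise IOError(f"checksum mismatch in block {i}; read from another replica")
        return bytes(self.data[offset:offset + length])
```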
Diagnostic Tools#
Detailed and comprehensive diagnostic logs provide invaluable help with problem isolation, debugging, and performance analysis. GFS servers generate many diagnostic logs, recording significant events (such as chunkservers going up and down) and all RPC requests and replies. These logs can be deleted at any time without affecting correctness, but we try to keep them for as long as storage space allows.
RPC logs contain the exact requests and responses sent on the wire. By matching requests with replies and collating records on different machines, the entire interaction history can be reconstructed to diagnose a problem. The RPC logs also serve as traces for load testing and performance analysis.
Logs have minimal impact on performance since they are written sequentially and asynchronously. Most recent events are also kept in memory for continuous online monitoring.
Experience#
Production systems are well disciplined and controlled, but users sometimes are not, so more infrastructure is needed to keep users from interfering with one another.
Many of our problems were related to disks and to Linux. Many disks claimed to support a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Because the versions are very similar, the drives mostly worked, but the occasional mismatch caused the drive and the kernel to disagree about the drive's state, silently corrupting data. This problem motivated our use of checksums, and in parallel we modified the kernel to handle these protocol mismatches.
In the early days, when we were using the Linux 2.2 kernel, we discovered some issues where the time taken for fsync was proportional to the overall file size rather than just the portion being written. For large-scale operation logs, this was a problem, especially before implementing checkpoints. We initially used synchronous writes to address this issue and later migrated to Linux 2.4.
Another Linux issue was a single global reader-writer lock that any thread in an address space must hold when paging data in from disk (reader lock) or altering the address space in an mmap() call (writer lock). We saw transient timeouts even under light load and searched hard for resource bottlenecks or sporadic hardware failures. Eventually we found that this single lock was blocking the main network thread from mapping new data into memory while the disk thread was paging in previously mapped data. Since we were mainly limited by the network interface rather than by memory-copy bandwidth, we worked around this by replacing mmap() with pread(), at the cost of an extra copy.
Conclusion#
GFS has successfully supported our storage needs and is widely used within Google as a storage platform for search and business data processing.