usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
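To make the memory argument concrete: 64 bytes per 64 MB chunk works out to roughly 1 GB of chunk metadata per petabyte of file data. The short sketch below runs that arithmetic; the chunk size and per-chunk figure come from the text above, while the 10 PB cluster size is a made-up example.

```python
# Back-of-the-envelope check of the memory-only design, assuming the figures
# quoted above: at most 64 bytes of chunk metadata per 64 MB chunk.
# The 10 PB total below is a hypothetical example, not a figure from the paper.

CHUNK_SIZE = 64 * 2**20        # 64 MB chunks
METADATA_PER_CHUNK = 64        # <= 64 bytes of master metadata per chunk

def chunk_metadata_bytes(total_file_bytes: int) -> int:
    """Upper bound on the chunk metadata the master keeps in memory."""
    num_chunks = -(-total_file_bytes // CHUNK_SIZE)   # ceiling division
    return num_chunks * METADATA_PER_CHUNK

total = 10 * 2**50   # hypothetical 10 PB of stored file data
print(f"{chunk_metadata_bytes(total) / 2**30:.1f} GiB")   # -> 10.0 GiB
```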
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.

2.6.2 Chunk Locations

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.

We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.

Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.

2.6.3 Operation Log

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.

Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.

The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.

                       Write                      Record Append
Serial success         defined                    defined interspersed with inconsistent
Concurrent successes   consistent but undefined   defined interspersed with inconsistent
Failure                inconsistent               inconsistent

Table 1: File Region State After Mutation

Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch.
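The checkpoint and operation-log machinery implies a simple recovery procedure: find the newest complete checkpoint, map it in, and replay only the log files written after it. The sketch below illustrates that flow under hypothetical names of our own; it is not the actual GFS implementation.

```python
# Illustrative sketch (hypothetical names, not GFS code) of the recovery flow
# described in the text: use the newest *complete* checkpoint, skip any
# incomplete ones left by a crash, and replay only the log written after it.

def recover(checkpoints, log_files, state):
    """checkpoints and log_files are assumed ordered oldest -> newest."""
    complete = [c for c in checkpoints if c.is_complete()]
    if complete:
        latest = complete[-1]
        state.load_checkpoint(latest)        # directly mappable, no re-parsing
        # Only log files created after the checkpoint's log switch remain relevant.
        log_files = [f for f in log_files if f.created_after(latest)]
    # Replay the (short) suffix of the operation log to reach the latest state.
    for log_file in log_files:
        for record in log_file.records():
            state.apply(record)
    return state
```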
A new checkpoint can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.

2.7 Consistency Model

GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS's guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 Guarantees by GFS

File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1); the master's operation log defines a global total order of these operations (Section 2.6.3).

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions.
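As a reading aid, the rules of Table 1 and the definitions above can be restated as a small lookup. The enum and function below are purely illustrative; GFS itself exposes no such API.

```python
# Restatement of Table 1 as a lookup, purely as a reading aid.
# The names are ours, not part of any GFS interface.
from enum import Enum

class Mutation(Enum):
    WRITE = "write"
    RECORD_APPEND = "record append"

def region_state(mutation: Mutation, concurrent: bool, success: bool) -> str:
    if not success:
        # A failed mutation leaves the region inconsistent (hence undefined).
        return "inconsistent"
    if mutation is Mutation.RECORD_APPEND:
        # Appended records are defined, but regions between them
        # (e.g., padding or duplicates) may be inconsistent.
        return "defined interspersed with inconsistent"
    # Serial write: all clients see the write in its entirety.
    # Concurrent writes: all clients see the same bytes, but possibly
    # mingled fragments from several writers.
    return "consistent but undefined" if concurrent else "defined"

assert region_state(Mutation.WRITE, concurrent=True, success=True) == "consistent but undefined"
```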