regions. The applications do not need to further distinguish between different kinds of undefined regions.

Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset.
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.

Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.

2.7.2 Implications for Applications

GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.

Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.

In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output.
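To make the first pattern concrete, the sketch below shows a single writer that appends data and periodically checkpoints how much has been durably written, plus a reader that trusts only the checkpointed prefix. It is an illustration only: the client object and its open, append, flush, and read calls are hypothetical stand-ins for a GFS client library, and the line-oriented, checksummed checkpoint format is an assumption, not anything specified by GFS.

```python
# Sketch of the single-writer append-and-checkpoint pattern described above.
# The `client` object and its open/append/flush/read calls are hypothetical
# stand-ins for a GFS client library; the checkpoint format is an assumption.
import json
import zlib


def checkpoint_record(valid_bytes):
    """Encode one checkpoint line with an application-level checksum."""
    body = json.dumps({"valid_bytes": valid_bytes}, sort_keys=True)
    entry = {"body": body, "crc": zlib.crc32(body.encode())}
    return (json.dumps(entry) + "\n").encode()


def write_with_checkpoints(client, data_path, ckpt_path, records, batch=1000):
    """Append records and periodically record how much is durably written."""
    data = client.open(data_path, "append")
    ckpt = client.open(ckpt_path, "append")
    written = 0
    for i, rec in enumerate(records, 1):
        data.append(rec)
        written += len(rec)
        if i % batch == 0:
            data.flush()                      # make the prefix durable first
            ckpt.append(checkpoint_record(written))
    data.flush()
    ckpt.append(checkpoint_record(written))


def readable_prefix(client, ckpt_path):
    """Readers trust only the file region up to the last valid checkpoint."""
    valid = 0
    for line in client.read(ckpt_path).decode().splitlines():
        try:
            entry = json.loads(line)
        except ValueError:
            break                             # torn or partial checkpoint line
        if entry.get("crc") != zlib.crc32(entry.get("body", "").encode()):
            break                             # corrupt entry; stop trusting here
        valid = json.loads(entry["body"])["valid_bytes"]
    return valid
```

A writer that produces the whole file in a single pass can skip checkpointing entirely and instead write under a temporary name, then atomically rename the file to its permanent name once complete, as described above.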
Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.

3. SYSTEM INTERACTIONS

We designed the system to minimize the master’s involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order

A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. Each mutation is performed at all the chunk’s replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.

1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary