stale within short windows is file metadata, like directory contents or access control information.

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary's decisions to create and delete replicas.
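The replay mechanism just described can be sketched in a few lines. This is a minimal illustration under assumed interfaces (an operation-log reader, record types, and a chunkserver listing call); none of these names or signatures come from GFS itself.

```python
class ShadowMaster:
    """Minimal sketch of a shadow master: it serves reads from its own
    metadata copy and stays current by replaying the primary's operation log.
    All interfaces used here (log_replica, record fields, list_chunks) are assumed."""

    def __init__(self, log_replica, chunkservers):
        self.log_replica = log_replica      # local replica of the growing operation log
        self.namespace = {}                 # path -> file metadata (simplified)
        self.chunk_locations = {}           # chunk handle -> set of chunkserver addresses
        self.next_record = 0                # replay position in the log
        # Like the primary, poll chunkservers at startup to learn replica locations;
        # later location changes come only from the primary's create/delete decisions.
        for cs in chunkservers:
            for handle in cs.list_chunks():
                self.chunk_locations.setdefault(handle, set()).add(cs.address)

    def replay_available_records(self):
        """Apply new log records in exactly the order the primary applied them."""
        for record in self.log_replica.read_from(self.next_record):
            if record.op == "CREATE":
                self.namespace[record.path] = record.metadata
            elif record.op == "DELETE":
                self.namespace.pop(record.path, None)
            # ... other mutation types elided ...
            self.next_record += 1
```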
5.2 Data Integrity

Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.

A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Therefore chunkservers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.
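The read-path verification can be made concrete with a small sketch. The 64 KB block size and the 32-bit checksum come from the text above; the choice of CRC-32 and the in-memory layout are assumptions for illustration, not details given in the paper.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as described above

class ChunkChecksumError(Exception):
    pass

def read_and_verify(chunk_data: bytes, checksums: list[int], offset: int, length: int) -> bytes:
    """Return chunk_data[offset:offset+length], first verifying every
    64 KB block that overlaps the requested range."""
    first_block = offset // BLOCK_SIZE
    last_block = (offset + length - 1) // BLOCK_SIZE
    for b in range(first_block, last_block + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # Corruption: return an error to the requestor and report to the master;
            # the requestor then reads from another replica, and the master re-replicates.
            raise ChunkChecksumError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]
```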
Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.

In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
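The append path described two paragraphs above can be sketched by continuing the illustrative CRC-32 scheme from the earlier read example; the real chunkserver code and its checksum function are not published. The key point the sketch captures is that the last partial block's checksum is extended from its in-memory value rather than recomputed from data on disk.

```python
import zlib

BLOCK_SIZE = 64 * 1024

def append_with_checksums(chunk_len: int, checksums: list[int], payload: bytes) -> list[int]:
    """Update per-block CRC-32 checksums for a record append, without
    reading back the existing last partial block from disk.

    chunk_len -- current length of the chunk in bytes
    checksums -- running CRC-32 of each 64 KB block written so far
    payload   -- bytes being appended
    Returns the updated checksum list (the caller also writes payload to disk).
    """
    pos = 0
    offset = chunk_len
    while pos < len(payload):
        block_index = offset // BLOCK_SIZE
        room = BLOCK_SIZE - (offset % BLOCK_SIZE)
        piece = payload[pos:pos + room]
        if offset % BLOCK_SIZE == 0:
            # A brand new block started by the append gets a fresh checksum.
            checksums.append(zlib.crc32(piece))
        else:
            # Last partial block: extend its checksum incrementally.
            # If that block is already corrupted on disk, the new value simply
            # will not match the stored data, and the next read will catch it.
            checksums[block_index] = zlib.crc32(piece, checksums[block_index])
        pos += len(piece)
        offset += len(piece)
    return checksums
```

An overwrite, by contrast, would first run the read-path check from the earlier sketch on the first and last blocks of the target range before modifying them.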
During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.

5.3 Diagnostic Tools

Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep these logs around as far as space permits.

The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis.

The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. The most recent events are also kept in memory and available for continuous online monitoring.
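The request/reply matching described above lends itself to a small offline tool. The following sketch assumes a hypothetical log format in which each line carries a timestamp, machine name, direction, RPC id, and method name; GFS's actual log format is not published.

```python
from collections import defaultdict

def collate_rpcs(log_lines):
    """Pair REQUEST/REPLY records collected from several machines' logs
    by RPC id and return them ordered by request time.

    Each line is assumed to look like (hypothetical format):
        <timestamp> <machine> REQUEST|REPLY <rpc_id> <method>
    """
    calls = defaultdict(dict)
    for line in log_lines:
        ts, machine, direction, rpc_id, method = line.split()
        calls[rpc_id][direction] = (float(ts), machine, method)

    history = []
    for rpc_id, rec in calls.items():
        if "REQUEST" in rec and "REPLY" in rec:
            t0, src, method = rec["REQUEST"]
            t1, dst, _ = rec["REPLY"]
            history.append((t0, rpc_id, method, src, dst, t1 - t0))
    return sorted(history)  # reconstructed interaction history, ordered by request time
```

Run over the merged logs of, say, a client and a chunkserver, this yields the ordered request/reply timeline (with latencies) of the kind used for diagnosis and as load-test traces.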
6. MEASUREMENTS

In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

6.1 Micro-benchmarks

We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunkservers and hundreds of clients.

All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.

6.1.1 Reads

N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.
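The workload sizing above follows from simple arithmetic; the short script below only restates the figures quoted in the text so the 1 GB and 10% numbers are easy to verify.

```python
# Per-client read volume and worst-case buffer-cache hit rate for the
# read micro-benchmark (all inputs are figures quoted in the text).
READ_SIZE_MB = 4           # each read fetches a 4 MB region
READS_PER_CLIENT = 256     # repeated 256 times per client
FILE_SET_GB = 320          # total file set size
TOTAL_MEMORY_GB = 32       # combined memory of the 16 chunkservers

per_client_gb = READ_SIZE_MB * READS_PER_CLIENT / 1024
max_hit_rate = TOTAL_MEMORY_GB / FILE_SET_GB

print(f"data read per client: {per_client_gb:.0f} GB")   # -> 1 GB
print(f"max buffer-cache hit rate: {max_hit_rate:.0%}")  # -> 10%
```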