file names serialize attempts to create a file with the same name twice.

Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.
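To make the acquisition rule concrete, here is a minimal sketch of acquiring a set of namespace locks in the total order described above. The lock table, its lazy allocation, and the use of plain re-entrant locks are illustrative assumptions (the master uses read-write locks over its own namespace structures); only the sort key, tree level first and lexicographic order within a level, comes from the text.

```python
import threading

# Illustrative sketch of the deadlock-avoiding acquisition order described
# above; the lock table and its lazy allocation are assumptions, and plain
# re-entrant locks stand in for the master's read-write locks.
class NamespaceLocks:
    def __init__(self):
        self._table_guard = threading.Lock()
        self._locks = {}                      # path -> lock, allocated lazily

    def _lock_for(self, path):
        with self._table_guard:
            return self._locks.setdefault(path, threading.RLock())

    def acquire_all(self, paths):
        """Acquire all needed locks in one consistent total order:
        first by level in the namespace tree, then lexicographically."""
        ordered = sorted(set(paths), key=lambda p: (p.count("/"), p))
        acquired = [self._lock_for(p) for p in ordered]
        for lock in acquired:
            lock.acquire()
        return acquired                       # caller releases in reverse order

# Example: both of these operations lock /home and /home/user in the same
# order, so neither can hold one lock while waiting on the other.
ns = NamespaceLocks()
held = ns.acquire_all(["/home/user", "/home"])
for lock in reversed(held):
    lock.release()
```

Because every operation sorts its lock set the same way, two operations that need overlapping locks always acquire them in the same relative order and therefore cannot deadlock each other.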
4.2 Replica Placement

A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.

The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine's network bandwidth. We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.
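The paper states the goals of placement but not a concrete algorithm. The following is only a sketch of one possible rack-aware heuristic: it spreads replicas across distinct racks first and breaks ties toward chunkservers with lower disk utilization (a criterion Section 4.3 also lists for chunk creation). The Chunkserver fields and the two-pass selection are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical rack-aware placement heuristic; the paper gives the goals
# (spread across racks, favor under-utilized disks) but not this algorithm.
@dataclass(frozen=True)
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float      # fraction of disk space in use, 0.0-1.0

def place_new_replicas(servers, num_replicas=3):
    """Pass 1 takes at most one server per rack, least-utilized first;
    pass 2 fills any remaining slots from whatever is left."""
    by_util = sorted(servers, key=lambda s: s.disk_utilization)
    chosen, used_racks = [], set()
    for s in by_util:                                  # spread across racks
        if len(chosen) < num_replicas and s.rack not in used_racks:
            chosen.append(s)
            used_racks.add(s.rack)
    for s in by_util:                                  # top up if racks ran out
        if len(chosen) < num_replicas and s not in chosen:
            chosen.append(s)
    return chosen

# Example: with three racks available, the three replicas land on three
# different racks, so the loss of one rack still leaves two replicas.
servers = [Chunkserver("cs1", "r1", 0.20), Chunkserver("cs2", "r1", 0.30),
           Chunkserver("cs3", "r2", 0.50), Chunkserver("cs4", "r3", 0.70)]
assert {s.rack for s in place_new_replicas(servers)} == {"r1", "r2", "r3"}
```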
4.3 Creation, Re-replication, Rebalancing

Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.

When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors. (1) We want to place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of "recent" creations on each chunkserver. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we want to spread replicas of a chunk across racks.

The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This could happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.

The master picks the highest priority chunk and "clones" it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the master limits the number of active clone operations both for the cluster and for each chunkserver. Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.

Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. Also through this process, the master gradually fills up a new chunkserver rather than instantly swamping it with new chunks and the heavy write traffic that comes with them. The placement criteria for the new replica are similar to those discussed above. In addition, the master must also choose which existing replica to remove. In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.
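The re-replication prioritization described above names three factors but gives no formula. The scoring function below is a hypothetical illustration; the weights are invented, and only the factors themselves (distance from the replication goal, live versus recently deleted file, and blocking client progress) come from the text.

```python
# Hypothetical priority score combining the three factors named in Section 4.3;
# the weights are invented for illustration, the paper gives no formula.
def rereplication_priority(replication_goal, live_replicas,
                           file_is_live, blocks_client_progress):
    missing = max(0, replication_goal - live_replicas)
    score = missing * 10                 # farther from the goal -> more urgent
    if file_is_live:
        score += 5                       # prefer chunks of live files
    if blocks_client_progress:
        score += 100                     # boost chunks stalling a client
    return score

# Example: a live chunk missing two of three replicas and blocking a client
# outranks a recently deleted file's chunk that is missing only one replica.
assert rereplication_priority(3, 1, True, True) > rereplication_priority(3, 2, False, False)
```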
4.4 Garbage Collection

After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. We find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism

When a file is deleted by the application, the master logs the deletion immediately just like other changes. However, instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. During the master's regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal. When the hidden file is removed from the namespace, its in-memory metadata is erased. This effectively severs its links to all its chunks.

In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.

4.4.2 Discussion

Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master. We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver. Any such replica not known to the master is "garbage."
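The file-level half of the mechanism in Section 4.4.1 can be sketched as a rename plus a periodic scan. The hidden-name convention and the in-memory table below are assumptions for illustration; only the deletion-timestamp rename and the configurable three-day grace period come from the text.

```python
import time

# Sketch of file-level garbage collection: deletion is a rename to a hidden
# name carrying the deletion timestamp, and a regular namespace scan reclaims
# hidden files older than the grace period. The naming convention and the
# in-memory table are assumptions; the three-day default comes from the text.
DELETION_GRACE_SECONDS = 3 * 24 * 3600    # "more than three days", configurable

class MasterNamespace:
    def __init__(self):
        self.files = {}                    # path -> list of chunk handles

    def delete(self, path):
        hidden = f"{path}.__deleted__.{int(time.time())}"
        self.files[hidden] = self.files.pop(path)
        return hidden                      # still readable; undelete = rename back

    def scan_namespace(self, now=None):
        now = now if now is not None else time.time()
        for name in list(self.files):
            if "__deleted__" in name:
                deleted_at = int(name.rsplit(".", 1)[-1])
                if now - deleted_at > DELETION_GRACE_SECONDS:
                    del self.files[name]   # erase metadata, severing chunk links
```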
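The chunk-level half is essentially a set difference computed during the HeartBeat exchange: the chunkserver reports a subset of its chunk handles, and the master answers with the handles it no longer knows, which the chunkserver may then delete. The function and variable names below are hypothetical.

```python
# Sketch of the chunk-level side of the HeartBeat exchange: the master compares
# the handles a chunkserver reports against its own file-to-chunk mappings and
# replies with the ones it no longer knows about. Names are hypothetical.
def heartbeat_reply(master_chunk_handles, reported_handles):
    """Return reported handles absent from the master's metadata;
    the chunkserver is free to delete its replicas of these."""
    return [h for h in reported_handles if h not in master_chunk_handles]

# Example: the master only knows chunks 1 and 2, so a stale replica of
# chunk 7 still held by the chunkserver is identified as garbage.
assert heartbeat_reply({1, 2}, [2, 7]) == [7]
```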