Figure 2: Write Control and Data Flow. (Diagram: the client, the master, the primary replica, and secondary replicas A and B, connected by the numbered control and data messages of steps 1 through 7; the legend distinguishes control flow from data flow.)

2. … becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. Section 3.2 discusses this further.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary (illustrated in the sketch below).

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.

If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in a consistent but undefined state, as noted in Section 2.7.
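Steps 4 and 5 are where the consistency argument above comes from: replicas end up identical because every replica applies mutations in the serial-number order the primary assigned, regardless of the order in which clients issued them. The Go sketch below is a minimal, single-process illustration of that ordering; the Mutation, Primary, and Secondary types and their methods are hypothetical stand-ins, not GFS code, and the RPCs, leases, and failure handling of the real protocol are omitted.

```go
package main

import "fmt"

// Mutation is a hypothetical stand-in for a client write whose data has
// already been pushed to every replica's LRU buffer cache (step 3).
type Mutation struct {
	ClientID string
	Data     string
}

// Secondary applies mutations strictly in the serial-number order chosen
// by the primary (step 5).
type Secondary struct {
	name    string
	applied []string
}

func (s *Secondary) Apply(serial int, m Mutation) {
	s.applied = append(s.applied, fmt.Sprintf("#%d %s:%q", serial, m.ClientID, m.Data))
}

// Primary assigns consecutive serial numbers to mutations, possibly from
// multiple clients, applies them to its own state (step 4), and forwards
// the same order to every secondary (step 5).
type Primary struct {
	nextSerial  int
	applied     []string
	secondaries []*Secondary
}

func (p *Primary) Write(m Mutation) int {
	serial := p.nextSerial
	p.nextSerial++
	p.applied = append(p.applied, fmt.Sprintf("#%d %s:%q", serial, m.ClientID, m.Data))
	for _, s := range p.secondaries {
		// In the real protocol this is an RPC; a failed or missing reply is
		// what surfaces to the client as an error in step 7.
		s.Apply(serial, m)
	}
	return serial
}

func main() {
	a, b := &Secondary{name: "A"}, &Secondary{name: "B"}
	p := &Primary{secondaries: []*Secondary{a, b}}

	// Two clients race; whichever order the primary picks is the order
	// every replica applies, so all replicas end up identical.
	p.Write(Mutation{ClientID: "client1", Data: "x"})
	p.Write(Mutation{ClientID: "client2", Data: "y"})

	fmt.Println("primary:    ", p.applied)
	fmt.Printf("secondary %s: %v\n", a.name, a.applied)
	fmt.Printf("secondary %s: %v\n", b.name, b.applied)
}
```

Running it prints identical mutation sequences for the primary and both secondaries; an error reported in step 7 corresponds to one of these Apply calls not having happened on some replica.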
3.2 Data Flow

We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.

To fully utilize each machine's network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). Thus, each machine's full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.

To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the "closest" machine in the network topology that has not received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to the chunkserver among S2 through S4 that is closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. Our network topology is simple enough that "distances" can be accurately estimated from IP addresses.

Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.

3.3 Atomic Record Appends

GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS's choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently.
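To make that contract concrete, here is a toy Go sketch of concurrent appenders using such an interface: each caller supplies only its data and learns the offset that was chosen for it, so records from different writers never interleave even though the writers never coordinate. The RecordAppender interface and appendLog type are hypothetical illustrations rather than GFS's client API, and the sketch does not model the at-least-once retries of the real operation.

```go
package main

import (
	"fmt"
	"sync"
)

// RecordAppender captures the contract described above: the caller supplies
// only the data, and the file system picks the offset and returns it. This
// interface is an illustration, not GFS's actual client API.
type RecordAppender interface {
	RecordAppend(data []byte) (offset int64, err error)
}

// appendLog is a toy in-memory stand-in for a GFS file. The mutex plays the
// role the primary plays in GFS: it serializes concurrent appends, so each
// record lands as one contiguous run of bytes at an offset the callers do
// not get to choose.
type appendLog struct {
	mu  sync.Mutex
	buf []byte
}

func (f *appendLog) RecordAppend(data []byte) (int64, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	off := int64(len(f.buf))
	f.buf = append(f.buf, data...)
	return off, nil
}

func main() {
	var f RecordAppender = &appendLog{}
	var wg sync.WaitGroup
	for _, rec := range []string{"alpha", "beta", "gamma"} {
		wg.Add(1)
		go func(rec string) {
			defer wg.Done()
			// Concurrent appenders never coordinate on offsets; each simply
			// learns where its record ended up.
			off, _ := f.RecordAppend([]byte(rec))
			fmt.Printf("%q appended at offset %d\n", rec, off)
		}(rec)
	}
	wg.Wait()
}
```

The design point is the same one the comparison with O_APPEND makes: the offset is chosen by the file system, not negotiated among the competing writers.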
Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. Clients would need additional complicated and expensive synchronization, for example through a distributed lock manager, if they do so with traditional writes. In our workloads, such files often