正在加载图片...
ZHANG ET AL.:EFFICIENT DATA CENTER FLOW SCHEDULING WITHOUT STARVATION USING EXPANSION RATIO 3163 1.2 link 2 H1-H4:short flow 1 0.8 H1-H4:long flow 0.6 H2-H3:long flow 0.4 HI H2 H3 H4 0.2 Fig.3.ECMP limitation. 0 200 400 600 800 1000 immediately detect the finish of the first-flow.Therefore,we Time(ms) suggest that号C≤O≤axC. Fig.4.The utilizations of S3-S1(link 1)and S3-S2(link 2)in the topology shown in Fig.3 under ECMP. 5 ROUTE CONTROL sizes vary significantly in data centers:90 percent of all Nowadays data centers often use multi-path topologies,so flows are less than 100 KB,and they contribute only it is also essential to design an appropriate routing scheme 5 percent to the overall traffic [2],[25]. which can wisely decide the routing path for each flow In summary,the limitations of stateless ECMP mainly among multiple equal-cost paths.We first show the limita- include the following aspects. tions of ECMP(Section 5.1),then give the design rationale of route control in MERP (Section 5.2),lastly,we present Collision may occur.Different flows may be hashed design details(Section 5.3). to the same path.For example,in Fig.3,even if there are only two flows (one is from HI to H4,and the 5.1 Limitations of ECMP other is from H2 to H3),the probability that they are To meet the need for large aggregate bandwidth and fault- routed to the same path is as high as 50 percent. tolerance,data centers are designed to have many equal- ECMP is sensitive to flow size distribution.When cost paths,where ECMP is widely deployed for routing [111, flow sizes vary significantly,long flows may be [20].To avoid out-of-order packets,which may cause obvi- scheduled to the same path.When short flows finish, ous throughput degradation of TCP [21],ECMP ensures link utilizations may be quite imbalanced. the packets belonging to the same TCP flow traverse the ECMP does not take downstream link conditions same path by hashing several TCP/IP fields of each packet into account.When downstream network links have to one of the available equal-cost routes to the destina- congestions or fail,upstream ECMP switches cannot tion [22].The widely used hash algorithm is Hash-Thresh- perceive such changes.Taking Fig.3 for example,if old [23].It first selects a key by performing hash over TCP/ link S2-S4 is congested,and S3 uses ECMP to route IP 5-tuple of a packet.Then a region size is obtained by flows,S3 cannot avoid routing packets to this con- dividing the codomain size of the hash function to the num- gested link. ber of available next hops.Lastly the hash key is divided by the region size to get the output port number.Note that the 5.2 Design Rationale algorithm does not specify the hash function to obtain the Appropriate design rationales are quite important to the key but typically uses Cyclic Redundancy Check [22]. overall performance.Hence,before introducing our specific Although ECMP provides load balancing to some extent, design,we first present three design rationales. it still has several limitations.Consider the example in First,the primary purpose of the route control is to reduce Fig.3,there are one short and two long flows in the simple latency,not to improve utilization.Many existing data center network.Suppose two flows from H1 to H4 are routed to routing protocols focus on load balancing.Such kind of different paths according to ECMP.Now,a new flow from design often puts network utilization in the first place. H2 to H3 arrives.Since ECMP is stateless,this new flow However,increasing utilization is not equivalent to reduc- may be routed to S1 by S3.When the short flow finishes,ing latency.Here is an example:suppose there are 2 equal- two long flows still compete with each other.We also con- cost paths,P and P2,starting from a router R;the capacity ducted experiments to verify our findings.We let HI send of either path is 1.There are 3 flows,say,f,f2,and fs;the 1 GB data to H3 via many short flows,and meanwhile,let sizes of them are 4,1,and 1.Suppose fi is routed to P,and H2 send 1 GB data to H4 via one single long flow.We peri-the others are routed to P2.Now,a new flow fa with a size odically check the utilizations of two links:S3-S1 (link 1),of 1 arrives at R.If we aims to maximize link utilization,f and S3-S2(link 2).The results are shown in Fig.4.We notice will be routed to P:under SRPT,the FCTs of them are 4,1, that,the utilization of link 2 is stable and high,while that of 2,and 3,and the average FCT is 2.5.However,if fa is routed link 1 varies significantly over time. to P,the FCTs of them become 4,1,2,and 1,and the aver- In fact,ECMP performance largely depends on flow size age FCT becomes 2.25.As we mentioned before,reducing distribution and traffic pattern [24],and it achieves the best average and/or tail FCT is very important to improving performance when flow sizes follow a uniform distribution user experiences,thus,the route control in MERP focuses and traffic pattern is all-to-all.However,in reality,flow on reducing latency.immediately detect the finish of the first-flow. Therefore, we suggest that a 2 C Q aC. 5 ROUTE CONTROL Nowadays data centers often use multi-path topologies, so it is also essential to design an appropriate routing scheme which can wisely decide the routing path for each flow among multiple equal-cost paths. We first show the limita￾tions of ECMP (Section 5.1), then give the design rationale of route control in MERP (Section 5.2), lastly, we present design details (Section 5.3). 5.1 Limitations of ECMP To meet the need for large aggregate bandwidth and fault￾tolerance, data centers are designed to have many equal￾cost paths, where ECMP is widely deployed for routing [11], [20]. To avoid out-of-order packets, which may cause obvi￾ous throughput degradation of TCP [21], ECMP ensures the packets belonging to the same TCP flow traverse the same path by hashing several TCP/IP fields of each packet to one of the available equal-cost routes to the destina￾tion [22]. The widely used hash algorithm is Hash-Thresh￾old [23]. It first selects a key by performing hash over TCP/ IP 5-tuple of a packet. Then a region size is obtained by dividing the codomain size of the hash function to the num￾ber of available next hops. Lastly the hash key is divided by the region size to get the output port number. Note that the algorithm does not specify the hash function to obtain the key but typically uses Cyclic Redundancy Check [22]. Although ECMP provides load balancing to some extent, it still has several limitations. Consider the example in Fig. 3, there are one short and two long flows in the simple network. Suppose two flows from H1 to H4 are routed to different paths according to ECMP. Now, a new flow from H2 to H3 arrives. Since ECMP is stateless, this new flow may be routed to S1 by S3. When the short flow finishes, two long flows still compete with each other. We also con￾ducted experiments to verify our findings. We let H1 send 1 GB data to H3 via many short flows, and meanwhile, let H2 send 1 GB data to H4 via one single long flow. We peri￾odically check the utilizations of two links: S3-S1 (link 1), and S3-S2 (link 2). The results are shown in Fig. 4. We notice that, the utilization of link 2 is stable and high, while that of link 1 varies significantly over time. In fact, ECMP performance largely depends on flow size distribution and traffic pattern [24], and it achieves the best performance when flow sizes follow a uniform distribution and traffic pattern is all-to-all. However, in reality, flow sizes vary significantly in data centers: 90 percent of all flows are less than 100 KB, and they contribute only 5 percent to the overall traffic [2], [25]. In summary, the limitations of stateless ECMP mainly include the following aspects.  Collision may occur. Different flows may be hashed to the same path. For example, in Fig. 3, even if there are only two flows (one is from H1 to H4, and the other is from H2 to H3), the probability that they are routed to the same path is as high as 50 percent.  ECMP is sensitive to flow size distribution. When flow sizes vary significantly, long flows may be scheduled to the same path. When short flows finish, link utilizations may be quite imbalanced.  ECMP does not take downstream link conditions into account. When downstream network links have congestions or fail, upstream ECMP switches cannot perceive such changes. Taking Fig. 3 for example, if link S2-S4 is congested, and S3 uses ECMP to route flows, S3 cannot avoid routing packets to this con￾gested link. 5.2 Design Rationale Appropriate design rationales are quite important to the overall performance. Hence, before introducing our specific design, we first present three design rationales. First, the primary purpose of the route control is to reduce latency, not to improve utilization. Many existing data center routing protocols focus on load balancing. Such kind of design often puts network utilization in the first place. However, increasing utilization is not equivalent to reduc￾ing latency. Here is an example: suppose there are 2 equal￾cost paths, P1 and P2, starting from a router R; the capacity of either path is 1. There are 3 flows, say, f1, f2, and f3; the sizes of them are 4, 1, and 1. Suppose f1 is routed to P1, and the others are routed to P2. Now, a new flow f4 with a size of 1 arrives at R. If we aims to maximize link utilization, f4 will be routed to P2: under SRPT, the FCTs of them are 4, 1, 2, and 3, and the average FCT is 2.5. However, if f4 is routed to P1, the FCTs of them become 4, 1, 2, and 1, and the aver￾age FCT becomes 2.25. As we mentioned before, reducing average and/or tail FCT is very important to improving user experiences, thus, the route control in MERP focuses on reducing latency. Fig. 3. ECMP limitation. Fig. 4. The utilizations of S3-S1 (link 1) and S3-S2 (link 2) in the topology shown in Fig. 3 under ECMP. ZHANG ET AL.: EFFICIENT DATA CENTER FLOW SCHEDULING WITHOUT STARVATION USING EXPANSION RATIO 3163
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有