Efficient Data Center Flow Scheduling Without Starvation Using Expansion Ratio

Sheng Zhang, Member, IEEE, Zhuzhong Qian, Member, IEEE, Hao Wu, and Sanglu Lu, Member, IEEE

Abstract—Existing data center transport protocols are usually based on the Processor Sharing (PS) policy and/or the Shortest Remaining Processing Time (SRPT) policy. PS divides link bandwidth equally between competing flows, thus it fails to achieve the optimal average flow completion time (FCT). SRPT prioritizes flows that have the shortest remaining processing time and provides near-optimal average FCT, but it may cause long flows to suffer unfair delays, or even starve them. In fact, these two types of policies represent two directions in the design space: PS prefers fairness (in terms of starvation freedom) while SRPT favors efficiency (in terms of average FCT). In this paper, we propose a novel metric, expansion ratio, which enables us to strike a balance between SRPT and PS. We design MERP, which achieves efficient flow scheduling without starvation. MERP takes care of both average and tail FCTs by minimizing the expansion ratio of competing flows in a lexicographic manner. MERP controls the sending rate of competing flows via synchronized virtual deadlines and routes flows in a downstream-aware manner that reacts quickly to link failures. We evaluate MERP using extensive NS2-based simulations. Results show that, under various traffic loads, MERP reduces the tail FCT significantly with a negligible increase of average FCT compared with pFabric, and MERP reduces the average FCT notably compared with ECMP and CONGA when link failures occur.

Index Terms—Data center, flow completion time, efficiency, starvation freedom, expansion ratio

The authors are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China. E-mail: {sheng, qzz, sanglu}@nju.edu.cn, wuhao@dislab.nju.edu.cn.
Manuscript received 4 Apr. 2016; revised 8 Apr. 2017; accepted 15 May 2017. Date of publication 19 May 2017; date of current version 11 Oct. 2017. (Corresponding author: Sheng Zhang.) Recommended for acceptance by H. Jin. Digital Object Identifier no. 10.1109/TPDS.2017.2706290

1 INTRODUCTION

TODAY'S data centers usually organize online interactive services with the partition-aggregate pattern. The completion time of a large task largely depends on the flow completion times (FCT) of the flows generated in this process. In current data centers, the FCTs of these flows fluctuate dramatically [1]: a flow may experience more than twice the mean FCT of its theoretical minimum [2], [3]. This is unacceptable, since the response to user requests for those interactive services must be quick enough: even fractions of a second could have an important influence on user experience, which in turn impairs the business revenue of cloud service providers [4], [5].

To achieve low latency, many transport protocols have been proposed in recent years, among which two main techniques are Shortest Remaining Processing Time (SRPT) and Processor Sharing (PS).

SRPT-based policies [2], [6] focus on minimizing the average FCT by prioritizing flows that have the shortest remaining processing time, and they achieve near-optimal average FCT. However, applying SRPT-based policies has several problems. First, long flows are completely ignored by SRPT-based protocols [7]. When short flows contribute the majority of the overall traffic, SRPT could even unfairly "starve" long flows, which probably increases the response time of partition-aggregate-based online services. This is because, in such kind of services, there are flows with various sizes, and the overall response time is often dominated by the "lagger". In this sense, some online services may have better performance with a short tail FCT rather than a short average FCT. The second concern with SRPT is that a malicious service could preempt bandwidth resources by simply splitting large flows into many small ones. If all services split their large flows into small ones, none of them can achieve better performance than with no split.

PS-based policies [8], [9], [10] reduce the queueing length at switch ports through novel congestion detection and rate control mechanisms. These approaches usually divide link bandwidth equally among competing flows, and thus short flows are often blocked by long flows even though they could finish earlier. Therefore, the average FCT of PS-based policies is larger than that of SRPT-based policies [2], which implicitly implies that we cannot simultaneously achieve the minimum average and the minimum tail FCTs.

We observe that these two types of policies represent two extremes in the design space: SRPT favors efficiency (in terms of average FCT) while PS prefers fairness (in terms of starvation freedom). To strike a balance between them, in this paper, we propose a novel metric, i.e., expansion ratio, which is the ratio of the actual FCT of a flow to its optimal FCT. Under this definition, if two flows are delayed on the same link by the same amount of time, then the expansion ratio of the shorter flow is larger than that of the other one; and the expansion ratio of a flow increases as its waiting time increases. These properties enable us to design MERP, a distributed protocol that takes care of both average and tail FCTs.

The rate control in MERP decides the transmitting order of concurrent flows by minimizing their maximum
expansion ratio. MERP senders insert some flow states (e.g., flow size and expected sending rate) into each packet header, and these states are updated by intermediate MERP switches. Each MERP switch maintains a flow table that contains the state information (e.g., virtual deadline) of the current active flows that pass through the switch. Switches synchronize their decisions through the expected sending rate in each packet header. The flow with the earliest virtual deadline always monopolizes link resources; the other flows periodically send detecting packets to check whether they can start transmission, and the detecting rate is also effectively controlled by MERP. Extensive NS2-based evaluations show that, compared with pFabric [2], MERP reduces the expansion ratio and tail FCT by more than 15 and 12 percent, respectively, with only a negligible increase of average FCT.

The route control in MERP consists of a novel two-stage routing strategy, which first selects a subset of next-hop ports that minimize the expansion ratio of the newly-coming flow and then chooses one of them to maximize the sending rate of the tail flow of a path. NS2-based evaluations show that, in multi-path data center topologies (e.g., Fat-tree [11]), compared with ECMP and CONGA, MERP reduces the average FCT by up to 31 percent when there are link failures.

We summarize our contributions as follows.

- We propose the expansion ratio metric that enables fast flow completion and starvation freedom.
- We design MERP, which controls the sending rates of competing flows through synchronizing virtual deadlines stored on intermediate switches, and routes flows in a downstream-aware manner that reacts quickly to link failures.
- We evaluate MERP using extensive NS2-based simulations that confirm our claims.

We now continue by motivating our work in Section 2. Section 3 gives an overview of MERP. We present rate control in Section 4 and route control in Section 5. We evaluate MERP in Section 6. Before concluding the paper in Section 8, we present related work in Section 7.

2 MOTIVATION

We start by presenting an example to demonstrate the drawbacks of SRPT and PS (Section 2.1). We then introduce the new metric and the design intuitions of MERP (Section 2.2).

2.1 Drawbacks of SRPT and PS

Consider the example shown in Table 1, where three flows (A, B, and C) share a common link.

TABLE 1
Motivating Example

  Flow ID    Arrival Time    Size
  A          0               3
  B          3               3
  C          0               4

SRPT. SRPT seeks to minimize the average FCT [12]: the flow with the smallest remaining size is always transmitted first. As shown in Fig. 1a, a short flow (e.g., A) will monopolize the link resources until it is completed or preempted by another shorter flow. When flow A finishes, although flow C arrived earlier than flow B, flow B will occupy the link resources since it has a smaller remaining flow size than flow C. The average FCT of SRPT is $(3 + 3 + 10)/3 = 5\frac{1}{3}$.

Although SRPT can achieve near-optimal performance in terms of average FCT [2], it may starve long flows. Consider Fig. 1a: if there are a large number of flows with a size of 3 (i.e., similar to flow A), then with SRPT, flow C would be delayed until all these short flows are completed. The performance of long flows may be hurt because they are scheduled later when competing with short flows. Unfortunately, many online services follow the partition-aggregate pattern, in which constructing the final response requires all intermediate results from smaller tasks. These intermediate results are flows with different sizes. Hence, the user-perceived latency depends on tail FCT as well as average FCT [13]. Therefore, SRPT may starve long flows and thus impair user experience.

PS. PS focuses on achieving fairness between competing flows. If we apply PS to the above example, the switch would divide the outgoing link bandwidth equally among the three flows, as shown in Fig. 1b. PS can achieve fairness between competing flows with no need to know their sizes in advance. However, the average FCT of PS may be far from the minimum. For example, the average FCT of PS in Fig. 1b is $(7.5 + 9.5 + 7)/3 = 8$, which is much larger than that of SRPT. Therefore, PS may hurt flow completion efficiency.

[Fig. 1. Motivating example. (a) SRPT transmits the flow with the smallest remaining size first. (b) PS shares bandwidth between concurrent flows. (c) MERP schedules flows according to their dynamic expansion ratios.]

In summary, current flow scheduling policies can hardly achieve flow completion efficiency (in terms of average FCT) and starvation freedom simultaneously, which are two primary goals in many applications such as web servers [14]. In this paper, we propose to schedule flows using the expansion ratio metric, and we design the MERP protocol.

2.2 Expansion Ratio and Design Intuition

MERP is intended to minimize the average FCT without starving long flows. To realize this, MERP should mimic SRPT while guaranteeing starvation freedom. Therefore, MERP should be able to:

- make short flows transmit data before long flows, since SRPT prioritizes the flow with the smallest remaining size and has near-optimal average FCT;
- guarantee that long flows have a chance to transmit data when they have waited for a sufficiently long time.

MERP uses the expansion ratio to schedule flows, which implies that the expansion ratio of a flow should satisfy the following two conditions: (i) if two flows are delayed on the same link by the same amount of time, then the expansion ratio of the shorter flow is larger than that of the other one; and (ii) the expansion ratio of a flow increases as its waiting time increases. The formal definition of the expansion ratio of a flow is as follows:

Definition 1 (Expansion ratio of a flow). Given a flow scheduling strategy S and a flow f, let fct(S, f) be the actual FCT of f under S, and let fct*(f) be the optimal FCT of f when it monopolizes all link resources immediately after it arrives. The expansion ratio of f under S is defined as

$$ ER(S, f) = \frac{fct(S, f)}{fct^*(f)}. \qquad (1) $$

For example, in Fig. 1a, we have $ER(SRPT, A) = 3/3 = 1$ and $ER(SRPT, C) = 10/4 = 2.5$; in Fig. 1b, we have $ER(PS, A) = 7.5/3 = 2.5$ and $ER(PS, C) = 9.5/4 = 2.375$.

Let us check whether Eq. (1) satisfies the above two conditions. Denote the size of flow f by f.size, and the capacity of the link of interest by C. Suppose flow f is delayed by d time before it can monopolize the link; then we have

$$ ER(S, f) = \frac{fct(S, f)}{fct^*(f)} = \frac{d + f.size/C}{f.size/C} = \frac{C \cdot d}{f.size} + 1. \qquad (2) $$

We see that, given two flows, when C and d are fixed, the flow with a smaller size has a larger expansion ratio; when C and f.size are fixed, the expansion ratio of any flow increases as d increases. Thus, Eq. (1) indeed satisfies the two conditions.

We now show how MERP schedules competing flows based on this metric. We first extend the definition of expansion ratio to a set of flows.

Definition 2 (Expansion ratio of a set of flows). Given a scheduling strategy S and a set of n flows f_1, f_2, ..., f_n, the expansion ratio of these flows is defined as

$$ ER(S, f[1, n]) = \max_{i \in [1, n]} ER(S, f_i). \qquad (3) $$

That is, the expansion ratio of a set of flows is the maximum expansion ratio over the flows in the set. In fact, given a set of n competing flows, MERP tries to minimize $ER(S, f[1, n])$ (rather than the average FCT as in SRPT). Equivalently,

$$ MERP = \arg\min_S \; ER(S, f[1, n]). \qquad (4) $$

Take the scenario in Table 1 for example. If we use SRPT, then the expansion ratios of flows A, B, and C are $3/3 = 1$, $3/3 = 1$, and $10/4 = 2.5$, respectively; the expansion ratio of the set is 2.5. Can we do better? If we use MERP, which tries to minimize the maximum expansion ratio, the optimal scheduling is shown in Fig. 1c, and the expansion ratio of the set is $7/3$. We find that, although flow C has a larger size than flow B, it monopolizes the link bandwidth before flow B does, which cannot happen in SRPT-based policies. We note that the average FCT of MERP is $5\frac{2}{3}$, which is only slightly higher than that achieved by SRPT.

With this example, we see that MERP can gracefully navigate the tradeoff between SRPT and PS. Under MERP, short flows are typically scheduled before long flows as in SRPT, but as time goes on, the expansion ratio of a long flow increases; in order to minimize the maximum expansion ratio, MERP may prefer long flows that have waited for a sufficiently long time over short flows that have just waited for a while. In doing so, MERP can effectively prevent any starvation. Therefore, MERP could achieve efficient flow scheduling without starvation.
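To make the numbers above easy to verify, here is a minimal Python sketch (ours, not part of MERP) that recomputes the average FCT and the maximum expansion ratio for the three schedules of Fig. 1, with finish times read off the figure and a unit-capacity link assumed:

```python
# Flows from Table 1: id -> (arrival time, size); link capacity is 1.
flows = {"A": (0, 3), "B": (3, 3), "C": (0, 4)}

def stats(finish, capacity=1.0):
    """Average FCT and max expansion ratio, ER(S,f) = fct(S,f)/fct*(f)."""
    fct = {f: finish[f] - arr for f, (arr, _) in flows.items()}
    er = {f: fct[f] / (size / capacity) for f, (_, size) in flows.items()}
    return sum(fct.values()) / len(fct), max(er.values())

# Finish times under each policy, read off Fig. 1:
schedules = {
    "SRPT": {"A": 3, "B": 6, "C": 10},      # A, then B, then C
    "PS":   {"A": 7.5, "B": 10, "C": 9.5},  # equal sharing
    "MERP": {"A": 3, "C": 7, "B": 10},      # A, then C, then B
}
for name, finish in schedules.items():
    avg, max_er = stats(finish)
    print(f"{name}: avg FCT = {avg:.2f}, max ER = {max_er:.2f}")
# SRPT: avg FCT = 5.33, max ER = 2.50
# PS:   avg FCT = 8.00, max ER = 2.50
# MERP: avg FCT = 5.67, max ER = 2.33
```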
3 MERP OVERVIEW

In this section, we give an overview of MERP by presenting its packet priorities and framework.

We define 9 packet types in MERP: SYN, DET, DATA, and FIN for MERP senders, and SYNACK, DETACK, BOOST, DATAACK, and FINACK for MERP receivers. SYN/SYNACK packets are used in the handshaking phase to establish a connection. DET/DETACK packets are used by the senders and receivers of postponed flows to detect network status. DATA/DATAACK packets are used to transmit application-layer data. BOOST packets are used by a postponed flow's receiver to quickly notify the corresponding sender that it can transmit DATA packets immediately. FIN/FINACK packets are used to close a connection.

We have 3 different packet priorities, as shown in Table 2.

TABLE 2
MERP Packet Priorities

  Packet type                     Priority
  SYN, SYNACK, BOOST              high
  DATA, DATAACK, FIN, FINACK      middle
  DET, DETACK                     low

To reduce the connection setup delay of each flow and the waiting delay of each postponed flow, SYN, SYNACK, and BOOST packets have the highest priority. DATA, DATAACK, FIN, and FINACK packets have middle priority. To enable postponed flows to quickly perceive network status, MERP allows each postponed flow to send detecting packets at a high rate, which is effectively controlled by MERP. However, this may result in significant bandwidth overhead, especially when there are massive flows on the same link. Therefore, when the buffer of a switch overflows, MERP should discard the least important packets. Which kinds of packets are the least important? Obviously, DET and DETACK packets, so they have the lowest priority.

MERP provides route control as well as rate control. Fig. 2 shows the framework of MERP. Each MERP switch maintains a flow table that contains the state information of the current active flows that pass through it. Different packet types trigger different actions/algorithms on MERP switches. Note that the five types of ACK packets do not trigger actions on switches.

[Fig. 2. MERP framework.]
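As a quick illustration, the priority assignment of Table 2 can be written as a simple lookup; the enum names below are ours for exposition, not part of MERP's wire format:

```python
from enum import IntEnum

class Priority(IntEnum):
    LOW = 0      # dropped first when a switch buffer overflows
    MIDDLE = 1
    HIGH = 2

# Table 2: MERP packet priorities.
PRIORITY = {
    "SYN": Priority.HIGH, "SYNACK": Priority.HIGH, "BOOST": Priority.HIGH,
    "DATA": Priority.MIDDLE, "DATAACK": Priority.MIDDLE,
    "FIN": Priority.MIDDLE, "FINACK": Priority.MIDDLE,
    "DET": Priority.LOW, "DETACK": Priority.LOW,
}
```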
The rate control in MERP controls the sending rate of each flow. MERP decides the transmitting order of competing flows by minimizing their maximum expansion ratio. Since different network links have different contentions and capacities, we must synchronize this heterogeneity across switches, which is achieved by inserting some flow states into packet headers (e.g., pkt.asr and pkt.esr in Fig. 2). MERP also reserves some bandwidth (controlled by Algorithm 3) for postponed flows to detect the current network status, and flows with different virtual deadlines have different detecting rates. The details can be found in Section 4.

The route control in MERP is responsible for selecting a path for each flow in the handshaking phase. When a SYN packet arrives, MERP first selects a subset of next-hops as candidates by minimizing the expansion ratio of this flow, then chooses one of the candidate paths by maximizing the expected sending rate of the tail flow on that path. In multi-path data center topologies, ECMP is the dominating protocol for load balancing. Compared with ECMP, the two-stage route control of MERP has several desirable properties. First, it is immune to hash collision; second, it is not sensitive to the flow size distribution; and lastly, it reacts quickly to link failures. The details can be found in Section 5.

4 RATE CONTROL

In this section, we assume the route path for each flow is known ahead of time and focus on presenting the rate control component. We first introduce the MERP sender (Section 4.1), receiver (Section 4.2), and switch (Section 4.3), and then briefly summarize this section with discussions (Section 4.4). We will present the route control component in the next section.

4.1 MERP Sender

MERP is implemented as a distributed protocol to keep the order in which competing flows are scheduled on different switches consistent. But how to make this happen? MERP utilizes the flow states inserted in packet headers to synchronize decisions across switches.

A MERP sender inserts four flow states into each MERP packet header: flow size (size), actual sending rate (asr), expected sending rate (esr), and detecting rate threshold (Θ). Similar to previous studies [2], [6], [15], [16], [17], [18], we assume flow sizes are known ahead of time at the transport layer. The actual sending rate is updated by intermediate switches and is the rate at which a MERP sender sends DATA or DET packets. It is initialized to the capacity of the link that is adjacent to the sender. The expected sending rate is updated by intermediate switches and used by these switches to synchronize their decisions on flow scheduling. It is initialized to infinity, and it is not used by any MERP sender to adjust its sending rate of DATA or DET packets. We will explain how to update asr and esr shortly in Section 4.3. The detecting rate threshold is used by switches to limit the sending rate of DET packets.

A MERP sender tries to establish a connection by sending SYN packets. If the asr in the SYNACK or DETACK packets is larger than Θ, then the sender transmits DATA packets; otherwise, it transmits DET packets. Both are transmitted at a rate of asr. When receiving BOOST packets, the sender immediately starts to transmit DATA packets. The connection is terminated by FIN packets. Note that MERP is preemptive, i.e., a flow may be postponed before it completes its transmission.

Retransmission is controlled by a retransmission timeout (RTO). We adopt a similar approach to estimate the round-trip time (RTT) and RTO as that in TCP [19]: the exponentially weighted moving average over sampled RTTs is used as the current RTT, and the RTO is set to the sum of the current RTT and four times the RTT deviation.
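A minimal sketch of such an estimator is given below; the gains 1/8 and 1/4 are the classic TCP values and are our assumption, since the paper only states the general scheme:

```python
class RtoEstimator:
    """EWMA-based RTT/RTO estimation in the style of TCP [19]."""

    def __init__(self, gain_rtt=1/8, gain_var=1/4):
        self.gain_rtt, self.gain_var = gain_rtt, gain_var
        self.rtt = None      # smoothed (current) RTT estimate
        self.rtt_var = None  # RTT deviation

    def on_sample(self, sample):
        if self.rtt is None:                     # first measurement
            self.rtt, self.rtt_var = sample, sample / 2
        else:
            self.rtt_var = ((1 - self.gain_var) * self.rtt_var
                            + self.gain_var * abs(self.rtt - sample))
            self.rtt = (1 - self.gain_rtt) * self.rtt + self.gain_rtt * sample

    @property
    def rto(self):
        # RTO = current RTT + 4 x RTT deviation
        return self.rtt + 4 * self.rtt_var
```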
4.2 MERP Receiver

When receiving a SYN/DATA/FIN packet, the receiver copies asr and esr from the received packet into the corresponding ACK packet and sends the ACK packet. When receiving a DET packet, if the asr in the DET packet is larger than Θ, a BOOST packet is sent back instead of a DETACK packet; otherwise, the receiver does the same as for a SYN/DATA/FIN packet.

4.3 MERP Switch

To minimize the maximum expansion ratio of competing flows, MERP must consider how to synchronize rate control decisions among MERP switches in a distributed way. To achieve this, we design three main components in each MERP switch:

- Distributed rate controller (DRC): this component decides the actions a MERP switch takes when a packet arrives at the switch.
- Virtual deadline generator (VDG): whenever a MERP switch receives a SYN packet (i.e., a new flow arrives), VDG is invoked to update the virtual deadlines of all flows.
- Rate adjustment (RateAdj): whenever a SYN, DATA, or DET packet leaves a MERP switch, RateAdj is invoked to update the actual sending rate in the packet header.

4.3.1 Distributed Rate Controller

Each MERP switch maintains a flow state table T that stores the ID (id), flow size (size), size of the rest of the flow (rs), arrival time (aT), virtual deadline (vdl), and expected sending rate (esr) for each flow f. When a packet arrives at a switch, if the buffer (a priority queue) is not full, MERP adds the new packet into the queue; otherwise, MERP discards the packet with the lowest priority in the queue and then adds the new packet into the queue. As for dequeue, MERP always takes out the packet with the highest priority.

Algorithm 1 gives the details of the Distributed Rate Controller. When a switch receives a packet, if the type of this packet is FIN, then we delete the corresponding flow states from the state table T. If it is a SYN packet, we need to update the virtual deadlines of all flows using Algorithm 2.
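The buffer behavior just described can be sketched as a bounded multi-level queue; this is an illustrative model of the enqueue/dequeue rules, not the paper's implementation:

```python
from collections import deque

class SwitchBuffer:
    """Bounded buffer: drop the lowest-priority packet on overflow,
    dequeue the highest-priority packet first."""

    LEVELS = ("high", "middle", "low")

    def __init__(self, capacity):
        self.capacity = capacity
        self.queues = {p: deque() for p in self.LEVELS}

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def enqueue(self, pkt, priority):
        if len(self) >= self.capacity:
            # Buffer overflow: discard the least important packet,
            # scanning from the lowest priority level upward.
            for p in reversed(self.LEVELS):
                if self.queues[p]:
                    self.queues[p].pop()
                    break
        self.queues[priority].append(pkt)

    def dequeue(self):
        # Always take out the packet with the highest priority.
        for p in self.LEVELS:
            if self.queues[p]:
                return self.queues[p].popleft()
        return None
```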
Algorithm 1. Distributed Rate Controller, DRC(pkt ∈ f)

Input: flow state table T
 1: if pkt.type = FIN then
 2:     T.delete(f)
 3:     return
 4: end if
 5: if pkt.type = SYN then
 6:     VDG(pkt)                              ▹ Algorithm 2
 7: end if
 8: if pkt.type = SYN/DATA/DET then
 9:     pkt.esr ← min{pkt.esr, T.f.esr}
10:     T.f.esr ← pkt.esr
11:     T.f.vdl ← T.f.aT + T.f.size / T.f.esr
12:     RateAdj(pkt)                          ▹ Algorithm 3
13: end if
14: return

However, different network links along the path between a pair of sender and receiver may have heterogeneous capacities; hence, a flow may have different virtual deadlines on different switches. We leverage esr to make these virtual deadlines consistent. Lines 8-13 show the trick. Whenever there is a SYN, DATA, or DET packet, we update the esr in the packet header and the esr in the flow state table T to the minimum of the two, recalculate the virtual deadline vdl, and update the actual sending rate asr using Algorithm 3.

4.3.2 Virtual Deadline Generator

Each MERP switch updates the flow state table using the virtual deadline generator whenever there is a new flow (indicated by SYN packets).

The objective of VDG is to minimize the maximum expansion ratio of a set of flows. Let us first analyse what the expansion ratio of flow f_i looks like. Without loss of generality, we assume there are n flows passing through a switch of interest, F = {f_1, f_2, ..., f_n}. Note that VDG does not need to know this set of flows ahead of time. Denote by B_i the set of flows that are scheduled before f_i, and by A_i the set of flows that are scheduled after f_i. Then, we can simply use {B_i, f_i, A_i} to represent a scheduling order. Let us further denote the link capacity by C, and the current time by cT. According to Eq. (2), the expansion ratio of f_i with respect to a scheduling order {B_i, f_i, A_i} can be represented as

$$ ER(\{B_i, f_i, A_i\}, f_i) = \frac{(cT - f_i.aT) + \sum_{f_j \in B_i} f_j.rs / C + f_i.rs / C}{f_i.size / C} = \frac{(cT - f_i.aT) \cdot C + \sum_{f_j \in B_i \cup \{f_i\}} f_j.rs}{f_i.size}. \qquad (5) $$

It is not hard to see that $ER(\{B_i, f_i, A_i\}, f_i)$ is maximized when $B_i = F \setminus \{f_i\}$ and $A_i = \emptyset$. That is, when the set of flows is fixed, the expansion ratio of a flow is maximized if it is scheduled as the last flow, which also means the maximum expansion ratio of these n flows is bounded by

$$ \max_{i \in \{1, 2, \ldots, n\}} ER(\{F \setminus \{f_i\}, f_i, \emptyset\}, f_i). \qquad (6) $$

We then have an optimal algorithm, as shown in Algorithm 2, to determine the scheduling order of a set of flows that leads to the minimum maximum expansion ratio. The basic idea is to find the flow that, if scheduled as the last flow, has the smallest expansion ratio. Specifically, for each flow, we compute its expansion ratio by assuming it is the last flow; the flow with the smallest such expansion ratio is set as the last flow. By doing so, we reduce the problem size by one, and we repeat this process on the remaining (n − 1) flows.

Algorithm 2. Virtual Deadline Generator, VDG(pkt ∈ f)

Input: flow state table T, link capacity C, and current time cT
Output: new flow state table T′
 1: f.aT ← cT
 2: T.insert(f)
 3: tRestSize ← Σ_{f_i ∈ T} f_i.rs
 4: T′ ← ∅
 5: while T ≠ ∅ do
 6:     minExpRatio ← ∞
 7:     for each flow f_i ∈ T do
 8:         expRatio ← ((cT − f_i.aT) · C + tRestSize) / f_i.size
 9:         if expRatio < minExpRatio then
10:             minExpRatio ← expRatio
11:             lastFlow ← f_i
12:         end if
13:     end for
14:     T.delete(lastFlow)
15:     lastFlow.vdl ← cT + tRestSize / C
16:     lastFlow.esr ← lastFlow.size / (lastFlow.vdl − lastFlow.aT)
17:     T′.insert(lastFlow)
18:     tRestSize ← tRestSize − lastFlow.rs
19: end while
20: return T′
Here are some brief notes on Algorithm 2. Line 3 computes the total remaining size of all flows; lines 6-13 determine the last flow, i.e., the one that incurs the smallest expansion ratio; line 8 is due to Eq. (5); line 15 updates the virtual deadline of the last flow, and this information will be used to determine the actual sending rate of each flow (see Algorithm 3); line 16 updates the expected sending rate, and this information is used to synchronize virtual deadlines among different switches (see lines 8-13 in Algorithm 1). Denote by t(n) the running time of Algorithm 2 with respect to the number of flows n. It is not hard to see that $t(n) = t(n-1) + O(n)$, i.e., the time complexity of VDG is $O(n^2)$.
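For concreteness, a direct Python sketch of Algorithm 2 follows; the Flow record and the driver at the end are ours, and the usage mirrors the Table 1 example at time 0 (when only flows A and C are present):

```python
from dataclasses import dataclass

@dataclass
class Flow:
    fid: str
    size: float        # total flow size
    rs: float          # remaining size
    aT: float          # arrival time
    vdl: float = 0.0   # virtual deadline (output)
    esr: float = 0.0   # expected sending rate (output)

def vdg(flows, C, cT):
    """Algorithm 2: assign virtual deadlines that minimize the
    maximum expansion ratio of the given flows."""
    pending = list(flows)
    rest = sum(f.rs for f in pending)                # line 3
    while pending:
        # Lines 6-13: the flow whose ER-if-scheduled-last is smallest.
        last = min(pending,
                   key=lambda f: ((cT - f.aT) * C + rest) / f.size)
        pending.remove(last)                         # line 14
        last.vdl = cT + rest / C                     # line 15
        last.esr = last.size / (last.vdl - last.aT)  # line 16
        rest -= last.rs                              # line 18

# Table 1 at cT = 0 (flow B has not arrived yet), unit capacity:
flows = [Flow("A", 3, 3, 0), Flow("C", 4, 4, 0)]
vdg(flows, C=1.0, cT=0.0)
print([(f.fid, f.vdl) for f in sorted(flows, key=lambda f: f.vdl)])
# [('A', 3.0), ('C', 7.0)] -- A is transmitted first, as in Fig. 1c
```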
flows n. It is not hard to see that t(n) = t(n−1) + O(n), i.e., the time complexity of VDG is O(n²).

Today's high-end data center switches are usually equipped with fast CPUs; e.g., the Cisco Nexus 9300 switch has dual-core 2.5 GHz x86 CPUs. According to the measurement in [3], the number of active flows is no more than 3,000 on more than 70 percent of the switches in 4 representative data centers. For flows at this scale, VDG is efficient and fast.

4.3.3 Rate Adjustment

Whenever a SYN, DATA, or DET packet leaves a MERP switch, the switch needs to update the actual sending rate in the packet header. MERP relies on DET/DETACK/BOOST packets to inform MERP senders of the current network status. To make senders perceive network status changes in time, these detecting packets must be sent frequently. We therefore reserve a portion of network resources for them. Denote by α the maximum proportion of the actual sending rate to the link capacity, by n the number of concurrent flows, and by RTT the round-trip time.

Algorithm 3 gives the rate adjustment (RateAdj) algorithm. As we mentioned before, MERP uses virtual deadlines to indicate the scheduling order of competing flows. If a flow has the smallest virtual deadline (hereafter we call this flow the first-flow), that is, it is now sending DATA packets, its actual sending rate is updated as

pkt.asr ← min{α·C, pkt.asr}.    (7)

For other flows, i.e., postponed flows, pkt.asr will be updated for the senders to transmit DET packets. For detecting packets, traditional protocols usually send one detecting packet per RTT; however, the RTT in today's data centers is quite small, so when there are massive numbers of concurrent flows, this method would incur unnecessary bandwidth waste. Therefore, RateAdj proactively controls detecting rates according to the virtual deadlines of postponed flows. To enable the flow with the second smallest virtual deadline (hereafter we call this flow the second-flow) to perceive network status changes more frequently than the other flows, the actual sending rate of the second-flow is updated as

pkt.asr ← min{(n−1)·(1−α)·C, Θ, pkt.asr},    (8)

and the rate for the other flows is updated as

pkt.asr ← min{(1−α)·C, Θ, pkt.asr}.    (9)

Note that DET/DETACK packets have the lowest priority. When the buffer at a switch is full, they will be discarded, if any. Therefore, the "aggressive" sending manner of the second-flow would not block DATA packets from the first-flow.

We observe that, even with this kind of strategy, the second-flow still needs to wait one RTT to transmit DATA packets after the first-flow finishes. To make matters worse, if the first-flow changes very frequently, the utilization of the network would decrease. Inspired by PDQ [6], MERP introduces an early start mechanism (lines 5-8 in Algorithm 3). If the first-flow will finish within an RTT, MERP updates the actual sending rate of the second-flow to min{α·C, pkt.asr}. When the receiver of the second-flow finds that the asr is larger than Θ, it sends a BOOST packet to the sender, and the sender starts transmitting DATA packets immediately after it receives the BOOST packet.

Imagine the scenario without BOOST packets. When the receiver of the second-flow finds that the asr is larger than Θ, it would send a DETACK packet to the sender. Remember that DET and DETACK packets have the lowest priorities and may be discarded whenever the buffer overflows. Therefore, a DETACK packet would not take a shorter time to arrive at the sender than a BOOST packet, which has the highest priority.
Therefore, the BOOST packet lets the sender of the second-flow quickly start transmitting DATA packets, and it may save one RTT of latency.

4.4 Discussions

Flow Table Size. Each MERP switch maintains a flow state table T that stores 6 states for each flow (see Section 4.3.1). Under a pessimistic scenario where each of these states takes 2 Bytes to store, a flow needs 12 Bytes at a switch. Modern switches usually have 4 to 16 MBytes of shared high-speed memory, i.e., a switch can store states for at least 4 MB / 12 B ≈ 349,525 flows, which is large enough for normal production data centers.¹ In our route control simulations, the maximum number of active flows at each switch never exceeds 1,000. Still, suppose there are more flows than a switch can handle. In this case, we can run another flow scheduler that does not store flow states at switches alongside MERP. We inform this scheduler that the maximum link capacity is the amount of capacity not used by MERP. Packets from flows handled by this scheduler are associated with a priority between low and middle (see Table 2). Therefore, these packets would neither be blocked by DET packets nor block any DATA packets.

How to Set α and Θ? On the one hand, the detecting rate of all flows except the first-flow is n(1−α)C. This rate should not exceed the data sending rate of the first-flow, which is αC. So we have

n(1−α)C ≤ αC  ⟹  α ≥ n/(n+1).

On the other hand, α cannot be too large; otherwise there would be too little bandwidth for detecting packets. We want to make full utilization of the bandwidth; thus, the detecting rate of the second-flow should not be too small. Without loss of generality, we assume that the detecting latency of the second-flow should not be larger than d. Then we have

pkt.size / ((n−1)(1−α)C) ≤ d  ⟹  α ≤ 1 − pkt.size / ((n−1)·d·C).

Θ is mainly used to limit the detecting rates. First, it should not be larger than αC; otherwise, the detecting rate of some flow may exceed αC. Second, it should not be too small; otherwise, the second-flow could not immediately detect the finish of the first-flow. Therefore, we suggest that (α/2)·C ≤ Θ ≤ α·C.

1. As Greenberg et al. [20] and Hong et al. [6] demonstrated, in a production data center of a large-scale cloud service, the average number of active flows at each switch is around 12,000.
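To make Eqs. (7)-(9), the early start rule, and the parameter bounds above concrete, here is a minimal sketch. The function and field names (rank, first_flow_remaining, and so on) are ours; the early-start test, remaining work no more than one RTT at rate αC, is our reading of Algorithm 3; and switch-side details such as header rewriting and priority queues are omitted.

```python
def adjust_asr(pkt_asr, rank, n, C, alpha, theta,
               first_flow_remaining=None, rtt=None):
    """Per-packet rate update at a MERP switch, following Eqs. (7)-(9).

    rank: 0 for the first-flow (smallest virtual deadline), 1 for the
    second-flow, and >= 2 for the other postponed flows."""
    if rank == 0:
        return min(alpha * C, pkt_asr)                 # Eq. (7): DATA rate
    if rank == 1:
        # Early start (lines 5-8 of Algorithm 3, as we read them): if the
        # first-flow will finish within one RTT, let the second-flow ramp up
        # to the DATA rate; its receiver then replies with a BOOST packet.
        if (first_flow_remaining is not None and rtt is not None
                and first_flow_remaining <= alpha * C * rtt):
            return min(alpha * C, pkt_asr)
        return min((n - 1) * (1 - alpha) * C, theta, pkt_asr)  # Eq. (8)
    return min((1 - alpha) * C, theta, pkt_asr)                # Eq. (9)

def suggested_alpha_theta(n, pkt_size, d, C):
    """Feasible range for alpha (Section 4.4) and the suggested Theta interval."""
    lo, hi = n / (n + 1), 1 - pkt_size / ((n - 1) * d * C)  # alpha bounds
    alpha = (lo + hi) / 2                   # any value in [lo, hi] works
    return alpha, (alpha / 2 * C, alpha * C)  # Theta in [(a/2)C, aC]
```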
5 ROUTE CONTROL

Today's data centers often use multi-path topologies, so it is also essential to design an appropriate routing scheme that can wisely decide the routing path for each flow among multiple equal-cost paths. We first show the limitations of ECMP (Section 5.1), then give the design rationale of the route control in MERP (Section 5.2), and lastly we present the design details (Section 5.3).

5.1 Limitations of ECMP

To meet the need for large aggregate bandwidth and fault tolerance, data centers are designed to have many equal-cost paths, where ECMP is widely deployed for routing [11], [20]. To avoid out-of-order packets, which may cause obvious throughput degradation of TCP [21], ECMP ensures that the packets belonging to the same TCP flow traverse the same path by hashing several TCP/IP fields of each packet to one of the available equal-cost routes to the destination [22]. The widely used hash algorithm is Hash-Threshold [23]. It first selects a key by hashing the TCP/IP 5-tuple of a packet. Then a region size is obtained by dividing the codomain size of the hash function by the number of available next hops. Lastly, the hash key is divided by the region size to get the output port number (a small sketch of this selection appears at the end of this subsection). Note that the algorithm does not specify the hash function used to obtain the key, but Cyclic Redundancy Check is typically used [22].

Although ECMP provides load balancing to some extent, it still has several limitations. Consider the example in Fig. 3, where there are one short flow and two long flows in a simple network. Suppose the two flows from H1 to H4 are routed to different paths according to ECMP. Now, a new flow from H2 to H3 arrives. Since ECMP is stateless, this new flow may be routed to S1 by S3. When the short flow finishes, the two long flows still compete with each other. We also conducted experiments to verify our findings. We let H1 send 1 GB of data to H3 via many short flows, and meanwhile let H2 send 1 GB of data to H4 via one single long flow. We periodically check the utilizations of two links: S3-S1 (link 1) and S3-S2 (link 2). The results are shown in Fig. 4. We notice that the utilization of link 2 is stable and high, while that of link 1 varies significantly over time.

In fact, ECMP performance largely depends on the flow size distribution and traffic pattern [24], and it achieves the best performance when flow sizes follow a uniform distribution and the traffic pattern is all-to-all. However, in reality, flow sizes vary significantly in data centers: 90 percent of all flows are less than 100 KB, and they contribute only 5 percent to the overall traffic [2], [25].

In summary, the limitations of stateless ECMP mainly include the following aspects.

Collision may occur. Different flows may be hashed to the same path. For example, in Fig. 3, even if there are only two flows (one is from H1 to H4, and the other is from H2 to H3), the probability that they are routed to the same path is as high as 50 percent.

ECMP is sensitive to flow size distribution. When flow sizes vary significantly, long flows may be scheduled to the same path. When short flows finish, link utilizations may be quite imbalanced.

ECMP does not take downstream link conditions into account. When downstream network links are congested or fail, upstream ECMP switches cannot perceive such changes. Taking Fig. 3 for example, if link S2-S4 is congested, and S3 uses ECMP to route flows, S3 cannot avoid routing packets to this congested link.
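The sketch below makes the Hash-Threshold selection concrete. The 5-tuple encoding and the CRC-based key are illustrative assumptions (RFC 2992 leaves the hash function unspecified); the algorithm only needs a key that is uniform over a known codomain.

```python
import zlib

def hash_threshold_port(src_ip, dst_ip, src_port, dst_port, proto,
                        num_next_hops, key_space=2**16):
    """Pick an output port for a flow, Hash-Threshold style."""
    # 1) Hash the TCP/IP 5-tuple to a key in [0, key_space).
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    key = zlib.crc32(five_tuple) % key_space
    # 2) Region size = codomain size / number of available next hops.
    region = key_space // num_next_hops
    # 3) The output port is the region the key falls into.
    return min(key // region, num_next_hops - 1)  # clamp the last region

# Two flows, two equal-cost paths: they collide with probability about 1/2.
print(hash_threshold_port("10.0.0.1", "10.0.0.4", 5001, 80, 6, 2))
print(hash_threshold_port("10.0.0.2", "10.0.0.3", 5002, 80, 6, 2))
```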
Fig. 3. ECMP limitation.
Fig. 4. The utilizations of S3-S1 (link 1) and S3-S2 (link 2) in the topology shown in Fig. 3 under ECMP.

5.2 Design Rationale

Appropriate design rationales are quite important to the overall performance. Hence, before introducing our specific design, we first present three design rationales.

First, the primary purpose of the route control is to reduce latency, not to improve utilization. Many existing data center routing protocols focus on load balancing. Such a design often puts network utilization in the first place. However, increasing utilization is not equivalent to reducing latency. Here is an example (checked by the sketch below): suppose there are 2 equal-cost paths, P1 and P2, starting from a router R; the capacity of either path is 1. There are 3 flows, say, f1, f2, and f3, whose sizes are 4, 1, and 1, respectively. Suppose f1 is routed to P1, and the others are routed to P2. Now, a new flow f4 with a size of 1 arrives at R. If we aim to maximize link utilization, f4 will be routed to P2: under SRPT, the FCTs of the four flows are 4, 1, 2, and 3, and the average FCT is 2.5. However, if f4 is routed to P1, their FCTs become 5, 1, 2, and 1, and the average FCT becomes 2.25. As we mentioned before, reducing the average and/or tail FCT is very important to improving user experiences; thus, the route control in MERP focuses on reducing latency.
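To double-check the arithmetic in this example, here is a tiny sketch of ours (not part of MERP): each path has capacity 1, all four flows are treated as present simultaneously, and SRPT runs on each path independently.

```python
def srpt_fcts(sizes):
    """FCTs on one unit-capacity path when all flows are present at time 0
    and are served in SRPT order (shortest first, run to completion)."""
    finish, t = {}, 0.0
    for size, idx in sorted((s, i) for i, s in enumerate(sizes)):
        t += size               # capacity 1, so service time equals size
        finish[idx] = t
    return [finish[i] for i in range(len(sizes))]

# f1 (size 4) is on P1; f2 and f3 (size 1 each) are on P2; f4 has size 1.
utilization_choice = srpt_fcts([4]) + srpt_fcts([1, 1, 1])  # f4 -> P2
latency_choice     = srpt_fcts([4, 1]) + srpt_fcts([1, 1])  # f4 -> P1

# Lists are ordered (f1, f2, f3, f4) and (f1, f4, f2, f3), respectively.
print(utilization_choice, sum(utilization_choice) / 4)  # [4, 1, 2, 3] 2.5
print(latency_choice, sum(latency_choice) / 4)          # [5, 1, 1, 2] 2.25
```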
Second, the route control should be per-flow, not per-packet. Per-packet load balancing can fully utilize the multi-path nature of data centers. However, most current transport protocols schedule flows based on flow sizes; if we split flows into multiple sub-flows in the network layer, it may hurt flow scheduling efficiency in the transport layer. Moreover, per-packet strategies probably incur out-of-order packets, which may, again, affect the performance of transport layer protocols.

Third, the route control should have the ability to perceive downstream link conditions. The bursty data center traffic makes centralized route control inefficient at perceiving network status changes, so the route control in MERP must be distributed. However, if a switch makes route decisions based only on its local information, it cannot quickly react to downstream condition changes.

5.3 Design Details

We design the route control in MERP following these rationales. It is named Minimize Expansion ratio and Maximize tail Rate (MEMR) and consists of two stages, as shown in Algorithm 4 (a sketch of both stages follows at the end of this section).

Algorithm 4. Minimize Expansion Ratio and Maximize Tail Rate, MEMR(pkt ∈ f)
Input: path set S
Output: a path Pf for flow f
1: m ← min_{Pi ∈ S} GPE(Pi, pkt)    ▷ Algorithm 5
2: C ← {P | P ∈ S, and if f is routed to P, the expansion ratio of f is m}
3: Pf ← arg max_{Pi ∈ C} GMER(Pi, pkt)    ▷ Algorithm 6
4: return Pf

The first stage (lines 1-2) checks each possible path and calculates the expansion ratio of the current flow on it; we then select the subset C of paths that minimize the expansion ratio of the current flow. Specifically, Algorithm 5 shows GPE, which calculates the expansion ratio of flow f on a path P. In each iteration, we first assume f is the last flow (lines 3-4); then, for each flow fi in T[P], we compute its expansion ratio by taking it as the last flow (line 6). If the expansion ratio of fi is smaller than minExpRatio, we update lastFlow (lines 7-10). The algorithm terminates when we find the expansion ratio of f on P (lines 12-13).

Algorithm 5. Get Path Expansion Ratio, GPE(P, pkt ∈ f)
Input: flow state table T[P] of P, link capacity C, and current time cT
Output: the expansion ratio of f on P
1: tRestSize ← Σ_{fi ∈ T[P]} fi.rs
2: while T[P] ≠ ∅ do
3:     minExpRatio ← (tRestSize + f.size) / f.size    ▷ due to Eq. (1)
4:     lastFlow ← f    ▷ assuming f is the last flow
5:     for each flow fi ∈ T[P] do
6:         expRatio ← ((cT − fi.aT)·C + tRestSize + f.size) / fi.size
7:         if expRatio < minExpRatio then
8:             minExpRatio ← expRatio
9:             lastFlow ← fi    ▷ updating lastFlow
10:        end if
11:    end for
12:    if lastFlow = f then
13:        return minExpRatio
14:    else
15:        T[P].delete(lastFlow)
16:        tRestSize ← tRestSize − lastFlow.rs
17:    end if
18: end while

The second stage (line 3) selects one path from the candidate path set C. GMER is shown in Algorithm 6; it enables MERP to perceive downstream link conditions. Taking Fig. 3 for example, when link S2-S4 becomes congested, if S3 does not know this, S3 may route upcoming flows to S2, which may impair the overall routing performance. But how can MERP quickly learn downstream conditions?

Algorithm 6. Get Min Expected Rate, GMER(P, pkt ∈ f)
Input: flow state table T[P] of P, link capacity C, and current time cT
Output: the minimum expected rate on P
1: tRestSize ← Σ_{fi ∈ T[P]} fi.rs
2: minExpRatio ← (tRestSize + f.size) / f.size    ▷ due to Eq. (1)
3: for each flow fi ∈ T[P] do
4:     expRatio ← ((cT − fi.aT)·C + tRestSize + f.size) / fi.size
5:     if expRatio < minExpRatio then
6:         minExpRatio ← expRatio
7:         lastFlow ← fi
8:     end if
9: end for
10: minExpectedRate ← minExpRatio    ▷ tail rate
11: for each flow fi ∈ T[P] do
12:     if minExpectedRate > fi.esr and f.size < fi.size and f.dst = fi.dst then
13:         minExpectedRate ← fi.esr
14:     end if
15: end for
16: return minExpectedRate

Remember that the esr of each packet is updated by each switch and is used to synchronize the virtual deadline of each flow, which means that this kind of information captures the link conditions along each path. Therefore, we first compute the tail rate of path P (lines 2-10); then we additionally look at the expected sending rate (esr) of each flow that has the same destination as f, and let the expected sending rate of f be the minimum among them (lines 11-15). In doing so, we can conservatively decrease the expected sending rate of f when there is a downstream link failure.
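For concreteness, here is the sketch of the two MEMR stages referenced above. The Flow fields mirror the paper's flow table (rs, aT, size, esr, dst); gpe follows Algorithm 5 closely, but the tail-rate step of GMER is simplified to the same-destination esr scan of lines 11-15, so treat this as an illustration under our assumptions rather than an exact implementation of Algorithm 6.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    size: float  # original flow size
    rs: float    # remaining size
    aT: float    # arrival time
    esr: float   # expected sending rate
    dst: str     # destination host

def gpe(table, f, C, cT):
    """Algorithm 5 (GPE): expansion ratio of new flow f on one path,
    given that path's flow state table, link capacity C, and time cT."""
    flows, rest = list(table), sum(fi.rs for fi in table)
    while True:
        # Assume f itself would finish last on this path (Eq. (1)).
        min_ratio, last = (rest + f.size) / f.size, f
        for fi in flows:
            ratio = ((cT - fi.aT) * C + rest + f.size) / fi.size
            if ratio < min_ratio:
                min_ratio, last = ratio, fi
        if last is f:
            return min_ratio    # f's expansion ratio on this path
        flows.remove(last)      # peel lastFlow off and retry with the rest
        rest -= last.rs

def gmer(table, f, C):
    """Simplified stand-in for Algorithm 6 (GMER): the smallest expected
    sending rate among flows sharing f's destination, so a congested or
    failed downstream link (low esr) makes the path less attractive."""
    rates = [fi.esr for fi in table if fi.dst == f.dst]
    return min(rates, default=C)

def memr(paths, f, C, cT):
    """Algorithm 4 (MEMR): minimize the expansion ratio, then break ties
    by the maximum tail rate. `paths` maps a path id to its flow table."""
    ratios = {p: gpe(t, f, C, cT) for p, t in paths.items()}
    m = min(ratios.values())
    candidates = [p for p, r in ratios.items() if r == m]
    return max(candidates, key=lambda p: gmer(paths[p], f, C))
```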
6 PERFORMANCE EVALUATION

In this section, we evaluate the performance of MERP using NS2-based simulations. We first evaluate the rate control of MERP on top of a single-path data center (Section 6.1), then focus on the route control component using a multi-path data center topology (Section 6.2), and finally evaluate how quickly MERP can perform flow switching and whether it is robust to Incast (Section 6.3).
6.1 Rate Control Evaluation

To evaluate the rate control component of MERP, we use a single tree topology including 9 servers arranged across three racks: each Top-of-Rack (ToR) switch connects three servers, and the three ToR switches are connected by a root switch. This topology uses 1 Gbps links, and the buffer at each port of each switch holds 100 packets. By default, we generate flows with sizes following the Pareto distribution, which gives a good fit to realistic data center workloads, and flow arrivals follow a Poisson process (a generation sketch appears in Section 6.1.2 below). The size of each packet is 1,500 bytes, including a packet header of 40 bytes. The per-hop link delay is set to 10 μs. The maximum sending rate of DATA packets is set to 99.9 percent of the link capacity, i.e., α = 0.999. The detecting rate threshold is set to 80 percent of the link capacity, i.e., Θ = 0.8 Gbps.

MERP is compared with TCP-DropTail [26], in which we use TCP NewReno with DropTail queues, and pFabric [2], which is a state-of-the-art approach for minimizing flow completion times: each packet carries a single priority number that depends on flow sizes, and each switch schedules and drops packets based on their priorities. Performance metrics include the expansion ratio and the tail FCT (i.e., the 99.9th percentile FCT) [10].

6.1.1 Impact of Load

We generate flows with sizes following the Pareto distribution with a mean size of 1,500 KB (including the packet header) for this set of experiments. Fig. 5 shows that the long-tailed expansion ratio is significantly reduced by MERP. When the load is 80 percent, i.e., the total size of all flows is about 80 percent of the network capacity, the expansion ratio is reduced by about 48 percent compared with pFabric, indicating that MERP could provide a more stable flow completion time for each flow; besides, MERP provides a more than 12 percent reduction of the 99.9th percentile FCT compared with pFabric. In the worst case, a flow is completed within no more than 10 times its optimal FCT. Furthermore, the 99.9th percentile completion times are also reduced significantly by MERP, especially under high loads.

However, we must mention that MERP targets a balance between efficiency and fairness. It reduces the tail FCT and expansion ratio at the price of a small increase in average FCT. Fig. 7 shows how the average FCT changes under different loads. We find that the average FCT of MERP is almost the same as that of pFabric, while the average FCT of TCP-DropTail is much larger than either of them. The gap between MERP and pFabric increases as the load increases, and it is about 8 percent for 80 percent load, which is almost negligible, especially compared to the performance of TCP-DropTail. It is worth noting again that MERP tries to balance fairness and efficiency; some long flows could be scheduled earlier than short flows to avoid starvation, so it is inevitable for MERP to have a slightly longer average FCT than pFabric.

6.1.2 Impact of Flow Size

We are also interested in evaluating the impact of flow sizes on the performance of MERP, pFabric, and TCP-DropTail. To this end, we generate flows with sizes following the Pareto distribution with a mean size of 150 KB (including the packet header); the other settings are the same as those in Fig. 5. The results are shown in Fig. 6.
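As an aside, the workload just described is easy to reproduce. The sketch below is ours: the Pareto shape (1.1) is an assumption, since the paper fixes only the mean, and the arrival rate is chosen so that the offered load matches the load parameter.

```python
import random

def gen_workload(n_flows, mean_size_kb=150.0, load=0.6,
                 capacity_gbps=1.0, shape=1.1, seed=1):
    """Pareto flow sizes plus Poisson arrivals, as in the Section 6.1 setup."""
    rng = random.Random(seed)
    xm = mean_size_kb * (shape - 1) / shape        # Pareto scale for the mean
    rate = load * capacity_gbps * 1e9 / 8 / (mean_size_kb * 1e3)  # flows/s
    t, flows = 0.0, []
    for _ in range(n_flows):
        t += rng.expovariate(rate)                 # Poisson process arrivals
        u = 1.0 - rng.random()                     # uniform in (0, 1]
        flows.append((t, xm / u ** (1.0 / shape)))  # inverse Pareto CDF
    return flows                                   # (arrival time s, size KB)

sizes = [s for _, s in gen_workload(100000)]
print(sum(sizes) / len(sizes))  # sample mean; the heavy tail converges slowly
```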
Comparing Fig. 5 with Fig. 6, we find that the performance gap between MERP and pFabric is larger when flow sizes vary more significantly, because pFabric may starve long flows while MERP treats flows of different sizes fairly. We note that MERP achieves better relative performance when small flows contribute the majority of the overall traffic.

Fig. 5. MERP versus pFabric under varying loads (the mean of flow sizes is 1,500 KB): (a) 99th percentile expansion ratio, (b) maximum expansion ratio, (c) 99.9th percentile FCT.
Fig. 6. MERP versus pFabric under varying loads (the mean of flow sizes is 150 KB): (a) 99th percentile expansion ratio, (b) maximum expansion ratio, (c) 99.9th percentile FCT.
When the load is 80 percent, the maximum expansion ratio is reduced by more than 50 percent, to about 15, i.e., even the worst-case flow completes within 15 times its optimal FCT; the 99.9th percentile FCT is also reduced by more than 30 percent compared with pFabric.

6.1.3 Impact of α and Θ

MERP sends DATA packets using at most αC bandwidth and sends DET packets using at most Θ bandwidth. MERP uses these two parameters to effectively control the sending rates of these packets; therefore, we are also interested in evaluating their effect on the performance of MERP. Figs. 8 and 9 show the results, where the mean of flow sizes is 150 KB. In Fig. 8, when the load is light, the 4 values of α do not result in different tail FCTs in MERP; when the load increases, a large α yields a smaller tail FCT than a small α. This is reasonable, since a large α implies MERP can use more link capacity. However, in our simulations, we observed that switch buffer overflows happen very frequently when α is large, e.g., 0.9999. Similarly, in Fig. 9, MERP with different Θ's has the same performance when the load is light. However, when Θ is small, postponed flows in MERP may not be able to quickly perceive network status changes, and thus cannot immediately start transmission after the first-flow finishes, which implicitly wastes bandwidth resources. Therefore, when we increase the load, MERP with a small Θ has longer tail FCTs than MERP with a large Θ.

We note, however, that the performance difference due to α and Θ is very small. As long as we set them as discussed in Section 4.4, MERP has reasonably good performance.

6.2 Route Control Evaluation

We now focus on evaluating the route control component of MERP by examining whether MERP can effectively overcome the limitations of ECMP (i.e., hash collision, sensitivity to flow size distribution, and lack of feedback about downstream conditions). In these experiments, we use a fat-tree as the data center topology. There are, in all, 20 switches, and each of them has 4 ports. Four switches are used as root switches, and there are a total of 16 servers. Every network link has a bandwidth of 1 Gbps. The buffer size of each port of each switch is 150 KB, and the single-hop delay is set to 10 μs. Similar to the settings in the rate control evaluation, flow sizes follow the Pareto distribution, and flow arrivals follow a Poisson process (these topology parameters match a standard k = 4 fat-tree, as the sketch below checks).

MERP is compared with ECMP and CONGA [27], which is a state-of-the-art approach to load balancing using sub-flows. Since ECMP does not provide rate control mechanisms, in our experiments ECMP also uses the MERP rate control component as the transport layer protocol. Performance metrics include the expansion ratio and FCT.
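As a side check of ours (not from the paper), the stated counts follow from the standard k-ary fat-tree construction:

```python
def fat_tree_params(k):
    """Element counts for a standard k-ary fat-tree (k even): (k/2)^2 core
    switches, k pods with k/2 aggregation and k/2 edge switches each, and
    k^3/4 hosts, all built from k-port switches."""
    core = (k // 2) ** 2
    agg = edge = k * (k // 2)
    hosts = k ** 3 // 4
    return {"ports": k, "core": core, "agg": agg, "edge": edge,
            "switches": core + agg + edge, "hosts": hosts}

# k = 4: 4 core (root) switches, 20 switches in all, and 16 servers,
# exactly the topology used in the route control experiments.
print(fat_tree_params(4))
```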
6.2.1 Impact of Load

MERP strikes a good balance between average and tail FCTs. Fig. 11 shows the cumulative distribution function (CDF) of the FCTs of MERP, ECMP, and CONGA when the load is 60 percent. In general, the tail FCT of MERP is significantly smaller than that of ECMP or CONGA. For example, more than 99 percent of all flows finish within 100 ms under MERP, while the FCTs of more than 1 percent of all flows exceed 200 ms under ECMP or CONGA. Considering the significant impact of the tail FCT on the user experience of online services (due to the partition-aggregate pattern), MERP is more suitable than ECMP or CONGA for most online interactive services deployed in data centers.

We are also interested in how these three protocols perform in terms of average FCT and average expansion ratio. Fig. 10 shows the comparison results. Under varying loads, MERP can reduce the average FCT by at least 20 percent compared to ECMP or CONGA, and, not surprisingly, it can reduce the average expansion ratio by about 75 percent compared to ECMP or CONGA. We note that CONGA has worse performance than ECMP or MERP because CONGA focuses on load balancing; besides, CONGA relies on sub-flow level routing, so it cannot take advantage of upper layer flow scheduling policies. These results show that balancing traffic load is not equivalent to minimizing FCT.

6.2.2 Impact of Link Failure

When network conditions change, especially when link failures occur, a good flow scheduling protocol should be able

Fig. 7. Average FCT.
Fig. 8. Impact of α.
Fig. 9. Impact of Θ.