Algorithm 2 A flow-based algorithm for HTA. 5.1.2 Bal-assign 1:input:an HMR-system (T,S,p,unoc,wrem()) 2:define A,B as assignments Definition 13.Let B and B'be partial assignments,t a 3:define o as a partial assignment 4:a(t)=L(task t is unassigned)for all t task and s a server.We say that BB'is a step that 5:for r=1 to m do assigns t to s if t is unassigned in B and B'=Bu{(t,s)} 6: a-max-cover(Gp,T,a) We say B'is a step,if B for some t and s. B-bal-assign(T,S,a,Wloc,Wrem()) 8:end for A sequence of steps a=ao一a1一.一Qu is a 9:set A equal to a B with least maximum load trace if for each i∈L,叫,ifa-lsa,is a step,then 10:output:assignment A Vaa≤Vt for all s'≠s. Given two partial assignments a:-1 and ai in a trace such The core of the max-cover phase is an augmenting path that ai,it follows that algorithm by Ford and Fulkerson 10.The Ford-Fulkerson algorithm takes as input a network with edge capacities and V4,a≤V-1a+m an existing network flow,and outputs a maximum flow that Vaa=Vt-1 for all s'≠s respects the capacity constraints.A fact about this algo- rithm is well-known 6,10. The following lemma is immediate. FACT 5.1.Given a flow network with integral capacities LEMMA 5.3.Let u=u and a"ao s ai .. and an initial integral s-t flow f,the Ford-Fulkerson algo- be a sequence of partial assignments generated by bal- rithm computes an integral marimum s-t flow f'in time assign at iteration T.This sequence is a trace that ends in O(E.(f'-lf)),where El is the number of edges in the a complete assignment B=au. network and f is the value of the flow f,i.e.,the amount of flow passing from the source to the sink. 5.2 Main Result During the max-cover phase at iteration T,the input It is obvious that Algorithm 2 is optimal for n=1 since placement graph Ge is first converted to a corresponding only one assignment is possible.Now we show that for n>2. flow network G.Ge includes all nodes in Ge and an extra Algorithm 2 computes,in polynomial time,assignments that source u and an extra sink v.In G,there is an edge (u,t)for are optimal to within an additive constant.The result is all t E T and an edge (s,v)for all s E S.All of the original formally stated as Theorem 5.4. edges (t,s)in Ge remain in G.The edge capacity is defined as follows:edge (s,v)has capacity T for all s E S,while all THEOREM 5.4.Let n 2.Given an HMR-system with m the other edges have capacity 1.Therefore,for any pair of tasks and n servers,Algorithm 2 computes an assignment A (t,s),if there is a flow through the path ut-s-v,the in time O(m2n))such that LA≤Lo+(1-点)u2m value of this flow is no greater than 1.Then the input partial assignment a is converted into a network flow fo as follows: LEMMA 5.5.Algorithm 2 runs in time O(m2n). if task t is assigned to server s in the partial assignment a, assign one unit of flow through the path u→t÷s÷u. PROOF.By Fact 5.1.we know that the Ford-Fulkerson The Ford-Fulkerson algorithm is then run on graph Ge algorithm takes time O(E Af)to augment the network with flow fa to find a maximum flow fo.From Fact 5.1, flow by A.At iteration T 1,max-cover takes time we know that the Ford-Fulkerson algorithm takes time O(IE·lfi),where f≤n.Then at iteration r=2, O(E(fa-f))in this iteration.This output flow fa at max-cover takes time O(E.(If2l-f1l)),where lf2l <2n. iteration r will act as the input flow to the Ford-Fulkerson The same process is repeated until Ifml m.The total algorithm at iteration r+1.The flow network at iteration running time of max-cover for all iterations thus adds up r+1 is the same as the one at iteration r except that each to O(E·(fl+lf2l-lfl+lf3l-lf2l+·+lfml)= edge (s,v)has capacity r+1 for all s E S.This incremen- O(El.Ifml)=O(El.m)=O(m2n). tal use of Ford-Fulkerson algorithm in successive iterations We implement the greedy algorithm in the bal-assign helps reduce the time complexity of the whole algorithm. phase with a priority queue.Since there are n servers,each At the end of the max-cover phase,the augmented flow fa operation of the priority queue takes O(log n)time.During is converted back into a partial assignment a'.If there is one the bal-assign phase at each iteration,at most m tasks need unit of flow through the path u→t→s→v in fa,we assign to be assigned.This takes time O(m log n).The total run- task t is to server s in a'.This conversion from a network ning time of bal-assign for all iterations is thus O(m-log n). flow to a partial assignment can always be done,because the Combining the running time of the two phases for all it- flow is integral and all edges between tasks and servers have erations gives time complexity O(m'n). capacity 1.Therefore,there is a one-to-one correspondence between a unit flow through the path u→t一s一vand Lemma 5.5 suggests the max-cover phase is the main con- the assignment of task t to its local server s.It follows that tributor to the time complexity of Algorithm 2.However fl=e.By Fact 5.1,the Ford-Fulkerson algorithm com- in a typical Hadoop system,the number of replicas for each putes a maximum flow that respects the capacity constraint data block is a small constant,say 2 or 3.Then the degree r.Thus,the following lemma is immediate of each t EG is bounded by this constant.In this case.the placement graph G is sparse and E=O(m +n).As a LEMMA 5.2.Let a'be the partial assignment computed result,max-cover runs in time O(m(m+n)).Therefore the by mar-cover at iterationT;and B be any partial assignment bal-assign phase might become the main contributor to the such that k≤r.Then time complexity.Algorithm 2 A flow-based algorithm for HTA. 1: input: an HMR-system (T, S, ρ, wloc, wrem(·)) 2: define A, B as assignments 3: define α as a partial assignment 4: α(t) = ⊥ (task t is unassigned) for all t 5: for τ = 1 to m do 6: α ← max-cover(Gρ, τ, α) 7: B ← bal-assign(T, S, α, wloc, wrem(·)) 8: end for 9: set A equal to a B with least maximum load 10: output: assignment A The core of the max-cover phase is an augmenting path algorithm by Ford and Fulkerson [10]. The Ford-Fulkerson algorithm takes as input a network with edge capacities and an existing network flow, and outputs a maximum flow that respects the capacity constraints. A fact about this algorithm is well-known [6, 10]. Fact 5.1. Given a flow network with integral capacities and an initial integral s-t flow f, the Ford-Fulkerson algorithm computes an integral maximum s-t flow f 0 in time O(|E| · (|f 0 | − |f|)), where |E| is the number of edges in the network and |f| is the value of the flow f, i.e., the amount of flow passing from the source to the sink. During the max-cover phase at iteration τ , the input placement graph Gρ is first converted to a corresponding flow network G 0 ρ. G 0 ρ includes all nodes in Gρ and an extra source u and an extra sink v. In G 0 ρ, there is an edge (u, t) for all t ∈ T and an edge (s, v) for all s ∈ S. All of the original edges (t, s) in Gρ remain in G 0 ρ. The edge capacity is defined as follows: edge (s, v) has capacity τ for all s ∈ S, while all the other edges have capacity 1. Therefore, for any pair of (t, s), if there is a flow through the path u → t → s → v, the value of this flow is no greater than 1. Then the input partial assignment α is converted into a network flow fα as follows: if task t is assigned to server s in the partial assignment α, assign one unit of flow through the path u → t → s → v. The Ford-Fulkerson algorithm is then run on graph G 0 ρ with flow fα to find a maximum flow f 0 α. From Fact 5.1, we know that the Ford-Fulkerson algorithm takes time O(|E|·(|f 0 α|−|fα|)) in this iteration. This output flow f 0 α at iteration τ will act as the input flow to the Ford-Fulkerson algorithm at iteration τ + 1. The flow network at iteration τ + 1 is the same as the one at iteration τ except that each edge (s, v) has capacity τ + 1 for all s ∈ S. This incremental use of Ford-Fulkerson algorithm in successive iterations helps reduce the time complexity of the whole algorithm. At the end of the max-cover phase, the augmented flow f 0 α is converted back into a partial assignment α 0 . If there is one unit of flow through the path u → t → s → v in f 0 α, we assign task t is to server s in α 0 . This conversion from a network flow to a partial assignment can always be done, because the flow is integral and all edges between tasks and servers have capacity 1. Therefore, there is a one-to-one correspondence between a unit flow through the path u → t → s → v and the assignment of task t to its local server s. It follows that |f 0 α| = ` α 0 . By Fact 5.1, the Ford-Fulkerson algorithm computes a maximum flow that respects the capacity constraint τ . Thus, the following lemma is immediate. Lemma 5.2. Let α τ be the partial assignment computed by max-cover at iteration τ , and β be any partial assignment such that k β ≤ τ . Then ` α τ ≥ ` β . 5.1.2 Bal-assign Definition 13. Let β and β 0 be partial assignments, t a task and s a server. We say that β t:s −→ β 0 is a step that assigns t to s if t is unassigned in β and β 0 = β ∪ {(t, s)}. We say β → β 0 is a step, if β t:s −→ β 0 for some t and s. A sequence of steps α = α0 → α1 → . . . → αu is a trace if for each i ∈ [1, u], if αi−1 t:s −→ αi is a step, then V αi,α s ≤ V αi,α s0 for all s 0 6= s. Given two partial assignments αi−1 and αi in a trace such that αi−1 t:s −→ αi, it follows that V αi,α s ≤ V αi−1,α s + w α rem V αi,α s = V αi−1,α s0 for all s 0 6= s The following lemma is immediate. Lemma 5.3. Let u = u α τ and α τ = α τ 0 ⊆ α τ 1 ⊆ · · · ⊆ α τ u be a sequence of partial assignments generated by balassign at iteration τ . This sequence is a trace that ends in a complete assignment B τ = α τ u. 5.2 Main Result It is obvious that Algorithm 2 is optimal for n = 1 since only one assignment is possible. Now we show that for n ≥ 2, Algorithm 2 computes, in polynomial time, assignments that are optimal to within an additive constant. The result is formally stated as Theorem 5.4. Theorem 5.4. Let n ≥ 2. Given an HMR-system with m tasks and n servers, Algorithm 2 computes an assignment A in time O(m2n) such that L A ≤ L O + “ 1 − 1 n−1 ” · w O rem. Lemma 5.5. Algorithm 2 runs in time O(m2n). Proof. By Fact 5.1, we know that the Ford-Fulkerson algorithm takes time O(|E| · |∆f |) to augment the network flow by |∆f |. At iteration τ = 1, max-cover takes time O(|E| · |f1|), where |f1| ≤ n. Then at iteration τ = 2, max-cover takes time O(|E| · (|f2| − |f1|)), where |f2| ≤ 2n. The same process is repeated until |fm| = m. The total running time of max-cover for all iterations thus adds up to O(|E| · (|f1| + |f2| − |f1| + |f3| − |f2| + · · · + |fm|)) = O(|E| · |fm|) = O(|E| · m) = O(m2n). We implement the greedy algorithm in the bal-assign phase with a priority queue. Since there are n servers, each operation of the priority queue takes O(log n) time. During the bal-assign phase at each iteration, at most m tasks need to be assigned. This takes time O(m log n). The total running time of bal-assign for all iterations is thus O(m2 log n). Combining the running time of the two phases for all iterations gives time complexity O(m2n). Lemma 5.5 suggests the max-cover phase is the main contributor to the time complexity of Algorithm 2. However, in a typical Hadoop system, the number of replicas for each data block is a small constant, say 2 or 3. Then the degree of each t ∈ G is bounded by this constant. In this case, the placement graph G is sparse and |E| = O(m + n). As a result, max-cover runs in time O(m(m + n)). Therefore the bal-assign phase might become the main contributor to the time complexity