Assigning Tasks for Efficiency in Hadoop
[Extended Abstract]

Michael J. Fischer, Computer Science, Yale University, P.O. Box 208285, New Haven, CT, USA, michael.fischer@yale.edu
Xueyuan Su*, Computer Science, Yale University, P.O. Box 208285, New Haven, CT, USA, xueyuan.su@yale.edu
Yitong Yin†, State Key Laboratory for Novel Software Technology, Nanjing University, China, yinyt@nju.edu.cn

ABSTRACT

In recent years Google's MapReduce has emerged as a leading large-scale data processing architecture. Adopted by companies such as Amazon, Facebook, Google, IBM and Yahoo! in daily use, and more recently put to use by several universities, it allows parallel processing of huge volumes of data over clusters of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split into blocks that are replicated and spread over all servers in a network. Each job is also split into many small pieces called tasks. Several tasks are processed on a single server, and a job is not completed until all of its assigned tasks are finished. A crucial factor that affects the completion time of a job is the particular assignment of tasks to servers. Given a placement of the input data over the servers, one wishes to find the assignment that minimizes the completion time. In this paper, an idealized Hadoop model is proposed to investigate the Hadoop task assignment problem. It is shown that there is no feasible algorithm to find the optimal Hadoop task assignment unless P = NP. Assignments computed by the round robin algorithm inspired by the current Hadoop scheduler are shown to deviate from optimum by a multiplicative factor in the worst case. A flow-based algorithm is presented that computes assignments that are optimal to within an additive constant.

Categories and Subject Descriptors

D.3.2 [Programming Languages]: Language Classifications—concurrent, distributed, and parallel languages; F.1.2 [Computation by Abstract Devices]: Modes of Computation—parallelism and concurrency; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes—reducibility and completeness; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—sequencing and scheduling

General Terms

Algorithms, Performance, Theory

Keywords

task assignment, load balancing, NP-completeness, approximation algorithm, MapReduce, Hadoop

* Supported by the Kempner Fellowship from the Department of Computer Science at Yale University.
† Supported by the National Science Foundation of China under Grant No. 60721002. This work was done when Yitong Yin was at Yale University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SPAA'10, June 13–15, 2010, Thira, Santorini, Greece.
Copyright 2010 ACM 978-1-4503-0079-7/10/06 ...$10.00.

1. INTRODUCTION

1.1 Background

The cloud computing paradigm has recently received significant attention in the media. The cloud is a metaphor for the Internet, which is an abstraction for the complex infrastructure it conceals. Cloud computing refers to both the applications delivered as services over the Internet and the hardware and software that provide such services.
It envisions shifting data storage and computing power away from local servers, across the network cloud, and into large clusters of machines hosted by companies such as Amazon, Google, IBM, Microsoft, and Yahoo!.

Google's MapReduce [8, 9, 16] parallel computing architecture, for example, splits workload over large clusters of commodity PCs and enables automatic parallelization. By exploiting parallel processing, it provides a software platform that lets one easily write and run applications that process vast amounts of data.

Apache Hadoop [4] is a free Java implementation of MapReduce in the open source software community. It was originally designed to process large volumes of data efficiently by parallel processing over commodity computers in local networks. In academia, researchers have adapted Hadoop to several different architectures. For example, Ranger et al. [18] evaluate MapReduce in multi-core and multi-processor systems, Kruijf et al. [7] implement MapReduce on the Cell B.E. processor architecture, and He et al. [14] propose a MapReduce framework on graphics processors. Many related applications using Hadoop have also been developed to solve various practical problems.
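To fix intuition for the programming model just described, here is a minimal word-count sketch in plain Python. It is not the Hadoop API: the names map_fn, reduce_fn and run_job are our own, and an in-memory dictionary stands in for the framework's sort-and-shuffle machinery.

    from collections import defaultdict

    def map_fn(_key, value):
        # User-defined map function: emit one intermediate pair per word.
        for word in value.split():
            yield word, 1

    def reduce_fn(key, values):
        # User-defined reduce function: combine all values sharing a key.
        yield key, sum(values)

    def run_job(records):
        # Map phase, then a sequential stand-in for the shuffle/sort phase.
        groups = defaultdict(list)
        for key, value in records:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        # Reduce phase over each key group.
        output = {}
        for k, vs in sorted(groups.items()):
            for out_key, out_val in reduce_fn(k, vs):
                output[out_key] = out_val
        return output

    print(run_job(enumerate(["the quick brown fox", "the lazy dog"])))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}

In real MapReduce the map calls run in parallel on the servers holding the input blocks and the shuffle moves intermediate data across the network, but the data flow is the one above.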
1.2 The MapReduce Framework

A Hadoop system runs on top of a distributed file system, called the Hadoop Distributed File System (HDFS). HDFS usually runs on networked commodity PCs, where data are replicated and locally stored on the hard disks of the machines. To store and process huge volumes of data, HDFS typically uses a block size of 64 MB. Therefore, moving computation close to the data is a design goal in the MapReduce framework.

In the MapReduce framework, an application is specified by jobs. A MapReduce job splits the input data into independent blocks, which are processed by the map tasks in parallel. Each map task processes a single block¹ consisting of some number of records. Each record in turn consists of a key/value pair. A map task applies the user-defined map function to each input key/value pair and produces intermediate key/value pairs. The framework then sorts the intermediate data and forwards them to the reduce tasks via the interconnection network. After receiving all intermediate key/value pairs with the same key, a reduce task executes the user-defined reduce function and produces the output data. Finally, these output data are written back to HDFS.

¹ Strictly speaking, a map task in Hadoop sometimes processes data that comes from two successive file blocks. This occurs because file blocks do not respect logical record boundaries, so the last logical record processed by a map task might lie partly in the current data block and partly in the succeeding block, requiring the map task to access the succeeding block in order to fetch the tail end of its last logical record.

In such a framework, there is a single server, called the master, that keeps track of all jobs in the whole distributed system. The master runs a special process, called the jobtracker, that is responsible for task assignment and scheduling for the whole system. Each of the remaining servers, called slaves, runs a process called the tasktracker, which schedules the tasks assigned to its server much as an ordinary operating system schedules processes.

Map task assignment is a vital factor in the completion time of the whole job. First, a reduce task cannot begin until it receives the required intermediate data from all finished map tasks. Second, the assignment determines the location of intermediate data and the pattern of the communication traffic. Therefore, algorithms should be in place to optimize the task assignment.

1.3 Related Work

Since Kuhn [15] proposed the first method for the classic assignment problem in 1955, variations of the assignment problem have been under extensive study in many areas [5]. In the classic assignment problem, there are equal numbers of jobs and persons. An assignment is a one-to-one mapping from jobs to persons. Each job introduces a cost when it is assigned to a person. An optimal assignment minimizes the total cost over all persons.

In the area of parallel and distributed computing, when jobs are processed in parallel over several machines, one is interested in minimizing the maximum processing time of any machine. This problem is sometimes called the minimum makespan scheduling problem and is in general known to be NP-complete [11]. Under the identical-machine model, there are well-known approximation algorithms. For example, Graham [12] proposed a (2 − 1/n)-approximation algorithm in 1966, where n is the total number of machines, and Graham [13] proposed a 4/3-approximation algorithm in 1969. However, under the unrelated-machine model, this problem is known to be APX-hard, both in terms of its offline [17] and online [1, 2] approximability.
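Graham's (2 − 1/n) guarantee is for greedy list scheduling: each job, in arrival order, goes to a currently least-loaded machine. A minimal sketch (the job costs and machine count in the example are illustrative):

    import heapq

    def list_schedule(job_costs, n_machines):
        # Greedy list scheduling on identical machines; the makespan
        # is at most (2 - 1/n) times optimal.
        heap = [(0.0, m) for m in range(n_machines)]  # (load, machine)
        assignment = {}
        for job, cost in enumerate(job_costs):
            load, m = heapq.heappop(heap)             # least-loaded machine
            assignment[job] = m
            heapq.heappush(heap, (load + cost, m))
        makespan = max(load for load, _ in heap)
        return assignment, makespan

    print(list_schedule([5, 3, 3, 2, 2, 2], 3))
    # greedy makespan 7.0 here, versus the optimal makespan 6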
As some researchers [3, 4] pointed out, the scheduling mechanisms and policies that assign tasks to servers within the MapReduce framework can have a profound effect on efficiency. An early version of Hadoop uses a simple heuristic algorithm that greedily exploits data locality. Zaharia, Konwinski and Joseph [19] proposed some heuristic refinements based on experimental results.

1.4 Our Contributions

We investigate task assignment in Hadoop. In Section 2, we propose an idealized Hadoop model to evaluate the cost of task assignments. Based on this model, we show in Section 3 that there is no feasible algorithm to find the optimal assignment unless P = NP. In Section 4, we show that task assignments computed by a simple greedy round-robin algorithm might deviate from the optimum by a multiplicative factor. In Section 5, we present an algorithm that employs maximum flow and increasing threshold techniques to compute task assignments that are optimal to within an additive constant.

2. PROBLEM FORMALIZATION

Definition 1. A Map-Reduce schema (MR-schema) is a pair (T, S), where T is a set of tasks and S is a set of servers. Let m = |T| and n = |S|. A task assignment is a function A: T → S that assigns each task t to a server A(t).² Let A = {T → S} be the set of all possible task assignments. An MR-system is a triple (T, S, w), where (T, S) is an MR-schema and w: T × A → Q⁺ is a cost function.

² In an MR-schema, it is common that |T| ≥ |S|. Therefore in this paper, unlike the classic assignment problem, where an assignment refers to a one-to-one mapping or a permutation [5, 15], we use the notion of a many-to-one mapping.

Intuitively, w(t, A) is the time to perform task t on server A(t) in the context of the complete assignment A. The motivation for this level of generality is that the time to execute a task t in Hadoop depends not only on the task and the server speed, but also on possible network congestion, which in turn is influenced by the other tasks running on the cluster.

Definition 2. The load of server s under assignment A is defined as L^A_s = Σ_{t:A(t)=s} w(t, A). The maximum load under assignment A is defined as L^A = max_s L^A_s. The total load under assignment A is defined as H^A = Σ_s L^A_s.

An MR-system models a cloud computer where all servers work in parallel. Tasks assigned to the same server are processed sequentially, whereas tasks assigned to different servers run in parallel. Thus, the total completion time of the cloud under task assignment A is given by the maximum load L^A.
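Definition 2 translates directly into code. The following sketch (our own helper, with a toy instance) computes the per-server loads, the maximum load L^A, and the total load H^A:

    def loads(tasks, servers, A, w):
        # L_s: sum of w(t, A) over tasks t with A(t) = s (Definition 2).
        L = {s: 0.0 for s in servers}
        for t in tasks:
            L[A[t]] += w(t, A)
        return L, max(L.values()), sum(L.values())  # L_s, L^A, H^A

    # Toy instance: unit cost for every task.
    A = {"t1": "s1", "t2": "s1", "t3": "s2"}
    print(loads(["t1", "t2", "t3"], ["s1", "s2"], A, lambda t, A: 1.0))
    # ({'s1': 2.0, 's2': 1.0}, 2.0, 3.0)

The maximum load, not the total load, is the job completion time, since servers run in parallel.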
Our notion of an MR-system is very general and admits arbitrary cost functions. To usefully model Hadoop as an MR-system, we need a realistic but simplified cost model. In Hadoop, the cost of a map task depends largely on the location of its data. If the data is on the server's local disk, then the cost (execution time) is considerably lower than if the data is located remotely and must be fetched across the network before being processed.

We make several simplifying assumptions. We assume that all tasks and all servers are identical, so that for any particular assignment of tasks to servers, all tasks whose data is locally available take the same amount of time w_loc, and all tasks whose data is remote take the same amount of time w_rem. However, we do not assume that w_rem is constant over all assignments. Rather, we let it grow with the total number of tasks whose data is remote. This reflects the increased data fetch time due to overall network congestion. Thus, w_rem(r) is the cost of each remote task in every assignment with exactly r remote tasks. We assume that w_rem(r) ≥ w_loc for all r and that w_rem(r) is (weakly) monotone increasing in r.

We formalize these concepts below. In each of the following, (T, S) is an MR-schema.

Definition 3. A data placement is a relation ρ ⊆ T × S such that for every task t ∈ T, there exists at least one server s ∈ S such that ρ(t, s) holds.

The placement relation describes where the input data blocks are placed. If ρ(t, s) holds, then server s locally stores a replica of the data block that task t needs.

Definition 4. We represent the placement relation ρ by an unweighted bipartite graph, called the placement graph. In the placement graph G_ρ = ((T, S), E), T consists of m task nodes and S consists of n server nodes. There is an edge (t, s) ∈ E iff ρ(t, s) holds.

Definition 5. A partial assignment α is a partial function from T to S. We regard a partial assignment as a set of ordered pairs with pairwise distinct first elements, so for partial assignments β and α, β ⊇ α means β extends α. If s ∈ S, the restriction of α to s is the partial assignment α|_s = α ∩ (T × {s}). Thus, α|_s agrees with α for those tasks that α assigns to s, but all other tasks are unassigned in α|_s.

Definition 6. Let ρ be a data placement and β be a partial assignment. A task t ∈ T is local in β if β(t) is defined and ρ(t, β(t)). A task t ∈ T is remote in β if β(t) is defined and ¬ρ(t, β(t)). Otherwise t is unassigned in β. Let ℓ^β, r^β and u^β be the number of local tasks, remote tasks, and unassigned tasks in β, respectively. For any s ∈ S, let ℓ^β_s be the number of local tasks assigned to s by β. Let k^β = max_{s∈S} ℓ^β_s.

Definition 7. Let ρ be a data placement, β be a partial assignment, w_loc ∈ Q⁺, and w_rem: N → Q⁺ such that w_loc ≤ w_rem(0) ≤ w_rem(1) ≤ w_rem(2) ≤ ···. Let w^β_rem = w_rem(r^β + u^β). The Hadoop cost function with parameters ρ, w_loc, and w_rem(·) is the function w defined by

w(t, β) = w_loc if t is local in β, and w(t, β) = w^β_rem otherwise.

We call ρ the placement of w, and w_loc and w_rem(·) the local and remote costs of w, respectively. Let K^β = k^β · w_loc.

The definition of remote cost under a partial assignment β is pessimistic. It assumes that tasks not assigned by β will eventually become remote, and that each remote task will eventually have cost w_rem(r^β + u^β). This definition agrees with the definition of remote cost under a complete assignment A, because u^A = 0 and thus w^A_rem = w_rem(r^A + u^A) = w_rem(r^A).
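Definition 7, including its pessimistic treatment of unassigned tasks, can be spelled out in a few lines. A sketch under the paper's definitions (the representation of ρ as a set of pairs and the function names are our choices):

    def hadoop_cost(t, beta, tasks, rho, w_loc, w_rem):
        # w(t, beta) per Definition 7. beta: dict mapping assigned
        # tasks to servers; rho: set of (task, server) pairs;
        # w_rem: the nondecreasing function w_rem(.).
        def is_local(x):
            return x in beta and (x, beta[x]) in rho

        if is_local(t):
            return w_loc
        r = sum(1 for x in tasks if x in beta and not is_local(x))  # r^beta
        u = sum(1 for x in tasks if x not in beta)                  # u^beta
        return w_rem(r + u)  # pessimistic remote cost w^beta_rem

    rho = {("t1", "s1"), ("t2", "s2")}
    beta = {"t1": "s1"}                    # t2 still unassigned
    print(hadoop_cost("t2", beta, ["t1", "t2"], rho, 1, lambda r: 1 + r))
    # 2, i.e. w_rem(0 + 1): the unassigned task is charged as remote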
Since ρ is encoded by mn bits, w_loc is encoded by one rational number, and w_rem(·) is encoded by m + 1 rational numbers, the Hadoop cost function w(ρ, w_loc, w_rem(·)) is encoded by mn bits plus m + 2 rational numbers.

Definition 8. A Hadoop MR-system (HMR-system) is the MR-system (T, S, w), where w is the Hadoop cost function with parameters ρ, w_loc, and w_rem(·). An HMR-system is thus specified by (T, S, ρ, w_loc, w_rem(·)).

Problem 1 Hadoop Task Assignment Problem (HTA)
1. Instance: An HMR-system (T, S, ρ, w_loc, w_rem(·)).
2. Objective: Find an assignment A that minimizes L^A.

Sometimes the cost of running a task on a server depends only on the placement relation and the task's data locality, not on the assignment of the other tasks.

Definition 9. A Hadoop cost function w is called uniform if w_rem(r) = c for some constant c and all r ∈ N. A uniform HMR-system (UHMR-system) is an HMR-system (T, S, ρ, w_loc, w_rem(·)) where w is uniform.

Problem 2 Uniform Hadoop Task Assignment Problem (UHTA)
1. Instance: A UHMR-system (T, S, ρ, w_loc, w_rem(·)).
2. Objective: Find an assignment A that minimizes L^A.

The number of replicas of each data block may be bounded, often by a small number such as 2 or 3.

Definition 10. Call a placement graph G = ((T, S), E) j-replica-bounded if the degree of t is at most j for all t ∈ T. A j-replica-bounded UHMR-system (j-UHMR-system) is a UHMR-system (T, S, ρ, w_loc, w_rem(·)) where G_ρ is j-replica-bounded.

Problem 3 j-Uniform Hadoop Task Assignment Problem (j-UHTA)
1. Instance: A j-UHMR-system (T, S, ρ, w_loc, w_rem(·)).
2. Objective: Find an assignment A that minimizes L^A.
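The next section shows that computing optimal HTA assignments is intractable in general, but on tiny instances exhaustive search over all n^m assignments is still possible, and it is handy as a ground-truth check for heuristics. A brute-force sketch (the uniform toy instance is illustrative):

    from itertools import product

    def optimal_assignment(tasks, servers, w):
        # Enumerate all |S|^|T| assignments; return one minimizing the
        # maximum load L^A. Exponential in |T|: for illustration only.
        best, best_load = None, float("inf")
        for choice in product(servers, repeat=len(tasks)):
            A = dict(zip(tasks, choice))
            load = {s: 0.0 for s in servers}
            for t in tasks:
                load[A[t]] += w(t, A)
            if max(load.values()) < best_load:
                best, best_load = A, max(load.values())
        return best, best_load

    # Uniform toy instance with w_loc = 1, w_rem = 3.
    rho = {("t1", "s1"), ("t2", "s2"), ("t3", "s2")}
    w = lambda t, A: 1 if (t, A[t]) in rho else 3
    print(optimal_assignment(["t1", "t2", "t3"], ["s1", "s2"], w))
    # ({'t1': 's1', 't2': 's2', 't3': 's2'}, 2.0)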
3. HARDNESS OF TASK ASSIGNMENT

In this section, we analyze the hardness of the various HTA optimization problems by showing the corresponding decision problems to be NP-complete.

3.1 Task Assignment Decision Problems

Definition 11. Given a server capacity k, a task assignment A is k-feasible if L^A ≤ k. An HMR-system is k-admissible if there exists a k-feasible task assignment.

The decision problem corresponding to a class of HMR-systems and capacity k asks whether a given HMR-system in the class is k-admissible. Thus, the k-HTA problem asks about arbitrary HMR-systems, the k-UHTA problem asks about arbitrary UHMR-systems, and the k-j-UHTA problem (which we write (j, k)-UHTA) asks about arbitrary j-UHMR-systems.

3.2 NP-completeness of (2,3)-UHTA

The (2,3)-UHTA problem is a very restricted subclass of the general k-admissibility problem for HMR-systems. In this section, we restrict even further by taking w_loc = 1 and w_rem = 3. This problem represents a simple scenario where the cost function assumes only the two possible values 1 and 3, each data block has at most 2 replicas, and each server has capacity 3. Despite its obvious simplicity, we show that (2,3)-UHTA is NP-complete. It follows that all of the less restrictive decision problems are also NP-complete, and the corresponding optimization problems do not have feasible solutions unless P = NP.

Theorem 3.1. (2,3)-UHTA with costs w_loc = 1 and w_rem = 3 is NP-complete.

The proof method is to construct a polynomial-time reduction from 3SAT to (2,3)-UHTA. Let G be the set of all 2-replica-bounded placement graphs. Given G_ρ ∈ G, we define the HMR-system M_G = (T, S, ρ, w_loc, w_rem(·)), where w_loc = 1 and w_rem(r) = 3 for all r. We say that G is 3-admissible if M_G is 3-admissible. We construct a polynomial-time computable mapping f: 3CNF → G, and show that a 3CNF formula φ is satisfiable iff f(φ) is 3-admissible. We shorten "3-admissible" to "admissible" in the following discussion.

We first describe the construction of f. Let φ = C_1 ∧ C_2 ∧ ··· ∧ C_α be a 3CNF formula, where each C_u = (l_u1 ∨ l_u2 ∨ l_u3) is a clause and each l_uv is a literal. Let x_1, ..., x_β be the variables that appear in φ. Therefore, φ contains exactly 3α instances of literals, each of which is either x_i or ¬x_i, where i ∈ [1, β].³ Let ω be the maximum number of occurrences of any literal in φ. Table 1 summarizes the parameters of φ.

Table 1: Parameters of the 3CNF φ
  clauses (C_u): α
  variables (x_i): β
  literals (l_uv): 3α
  maximum occurrences of any literal: ω

For example, in φ = (x_1 ∨ x_2 ∨ x_3) ∧ (x_1 ∨ ¬x_4 ∨ x_5) ∧ (¬x_1 ∨ x_4 ∨ ¬x_6), we have α = 3, β = 6, and ω = 2, since x_1 occurs twice.

³ The notation [a, b] in our discussion represents the set of integers {a, a + 1, ..., b − 1, b}.

Given φ, we construct the corresponding placement graph G, which comprises several disjoint copies of the three types of gadget described below, connected together with additional edges.

The first type of gadget is called a clause gadget. Each clause gadget u contains a clause server C_u, three literal tasks l_u1, l_u2, l_u3 and an auxiliary task a_u. There is an edge between each of these tasks and the clause server. Since φ contains α clauses, G contains α clause gadgets. Thus, G contains α clause servers, 3α literal tasks and α auxiliary tasks. Figure 1 describes the structure of the u-th clause gadget. We use circles and boxes to represent tasks and servers, respectively.

Figure 1: The structure of the u-th clause gadget.

The second type of gadget is called a variable gadget. Each variable gadget contains 2ω ring servers placed around a circle. Let R^(i)_j denote the server at position j ∈ [1, 2ω] in ring i. Define the set T_i to be the servers in odd-numbered positions.
Similarly, define the set F_i to be the servers in even-numbered positions. Between each pair of ring servers R^(i)_j and R^(i)_{j+1}, we place a ring task r^(i)_j connected to its two neighboring servers. To complete the circle, r^(i)_{2ω} is connected to R^(i)_{2ω} and R^(i)_1. There are also ω variable tasks v^(i)_j : j ∈ [1, ω] in ring i, but they do not connect to any ring server. Since φ contains β variables, G contains β variable gadgets. Thus, G contains 2βω ring servers, 2βω ring tasks and βω variable tasks. Figure 2 describes the structure of the i-th variable gadget.

Figure 2: The structure of the i-th variable gadget.

The third type of gadget is called a sink gadget. The sink gadget contains a sink server P and three sink tasks p_1, p_2, p_3. Each sink task is connected to the sink server. G contains only one sink gadget. Figure 3 describes the structure of the sink gadget.

Figure 3: The structure of the sink gadget.

There are also some inter-gadget edges in G. We connect each variable task v^(i)_j to the sink server P. We also connect each literal task l_uv to a unique ring server R^(i)_j. To be more precise, if literal l_uv is the j-th occurrence of x_i in φ, we connect the literal task l_uv to ring server R^(i)_{2j−1} ∈ T_i; if literal l_uv is the j-th occurrence of ¬x_i in φ, we connect the literal task l_uv to ring server R^(i)_{2j} ∈ F_i. These inter-gadget edges complete the graph G. Table 2 summarizes the parameters of G.

Table 2: Parameters of the HMR-graph G
  clause servers C_u: α
  literal tasks l_uv: 3α
  auxiliary tasks a_u: α
  ring servers R^(i)_j: 2βω
  ring tasks r^(i)_j: 2βω
  variable tasks v^(i)_j: βω
  sink server P: 1
  sink tasks p_j: 3
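The construction of f(φ) is entirely mechanical, as the following sketch makes explicit. It builds the edge set of the placement graph from a clause list in DIMACS-style integer notation (i for x_i, −i for ¬x_i); the node-naming scheme is our own.

    from collections import Counter, defaultdict

    def build_placement(clauses):
        # Edge set of G = f(phi). clauses: list of 3-tuples of nonzero
        # ints. Node names mirror the paper: C_u, l_uv, a_u, R^(i)_j,
        # r^(i)_j, v^(i)_j, P, p_j.
        counts = Counter(lit for cl in clauses for lit in cl)
        omega = max(counts.values())               # max literal occurrences
        beta = max(abs(lit) for cl in clauses for lit in cl)
        edges, seen = set(), defaultdict(int)

        for u, cl in enumerate(clauses, 1):        # clause gadgets
            edges.add(("a_%d" % u, "C_%d" % u))    # auxiliary task
            for v, lit in enumerate(cl, 1):
                edges.add(("l_%d%d" % (u, v), "C_%d" % u))
                seen[lit] += 1                     # j-th occurrence of lit
                i, j = abs(lit), seen[lit]
                pos = 2 * j - 1 if lit > 0 else 2 * j  # T_i odd, F_i even
                edges.add(("l_%d%d" % (u, v), "R_%d_%d" % (i, pos)))

        for i in range(1, beta + 1):               # variable gadgets
            for j in range(1, 2 * omega + 1):
                nxt = j % (2 * omega) + 1          # wraps around the ring
                edges.add(("r_%d_%d" % (i, j), "R_%d_%d" % (i, j)))
                edges.add(("r_%d_%d" % (i, j), "R_%d_%d" % (i, nxt)))
            for j in range(1, omega + 1):
                edges.add(("v_%d_%d" % (i, j), "P"))  # variable task -> sink

        for j in (1, 2, 3):                        # sink gadget
            edges.add(("p_%d" % j, "P"))
        return edges

    # The example formula from Section 3.2:
    G = build_placement([(1, 2, 3), (1, -4, 5), (-1, 4, -6)])
    print(len(G))  # 84 edges; every task has degree at most 2 (Lemma 3.2)

The sketch assumes, as in the paper, that the variables appearing in φ are numbered consecutively from 1.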
Lemma 3.2. For any φ ∈ 3CNF, the graph f(φ) is 2-replica-bounded.

Proof. We count the number of edges from each task node in f(φ). Each literal task has 2 edges, each auxiliary task has 1 edge, each ring task has 2 edges, each variable task has 1 edge, and each sink task has 1 edge. Therefore, f(φ) is 2-replica-bounded. □

The following lemma is immediate.

Lemma 3.3. The mapping f: 3CNF → G is polynomial-time computable.

Lemma 3.4. If φ is satisfiable, then G = f(φ) is admissible.

Proof. Let σ be a satisfying truth assignment for φ; we construct a feasible assignment A in G = f(φ). First, assign each sink task to the sink server, i.e., let A(p_i) = P for all i ∈ [1, 3]. Then assign each auxiliary task a_u to the clause server C_u, i.e., let A(a_u) = C_u for all u ∈ [1, α]. If σ(x_i) = true, then assign ring tasks r^(i)_j : j ∈ [1, 2ω] to ring servers in T_i and variable tasks v^(i)_j : j ∈ [1, ω] to ring servers in F_i. If σ(x_i) = false, then assign ring tasks r^(i)_j : j ∈ [1, 2ω] to ring servers in F_i and variable tasks v^(i)_j : j ∈ [1, ω] to ring servers in T_i. If literal l_uv = x_i and σ(x_i) = true, then assign task l_uv to its local ring server in T_i. If literal l_uv = ¬x_i and σ(x_i) = false, then assign task l_uv to its local ring server in F_i. Otherwise, assign task l_uv to its local clause server C_u.

We now check that this task assignment is feasible. Each ring server is assigned either at most three local tasks (two ring tasks and one literal task) or one remote variable task. In either case, the load does not exceed the capacity 3. The number of tasks assigned to each clause server C_u is exactly the number of false literals in C_u under σ plus one (the auxiliary task), and each such task is local to C_u. Thus, the load is at most 3. The sink server is assigned three local sink tasks, and its load is exactly 3. Therefore, all constraints are satisfied and A is feasible. This completes the proof of Lemma 3.4. □

The proof of the converse of Lemma 3.4 is more involved. The method is as follows: given a feasible assignment A in G = f(φ), we first construct a feasible assignment B in G such that B(t) ≠ P for all t ∈ T − {p_1, p_2, p_3}. Then we remove the sink tasks and the sink server from further consideration and consider the resulting graph G′. After that, we partition G′ into two subgraphs and construct a feasible assignment B′ such that no tasks from one partition are remotely assigned to servers in the other partition. This step involves a case analysis. Finally, a natural way of constructing the satisfying truth assignment for φ follows.

Lemma 3.5. Let A be a feasible task assignment. Then there exists a feasible task assignment B such that B(t) ≠ P for all t ∈ T − {p_1, p_2, p_3}.

Proof. If A already satisfies A(t) ≠ P for all t ∈ T − {p_1, p_2, p_3}, let B = A.
Otherwise, assume there exists a task t′ such that A(t′) = P and t′ ∈ T − {p_1, p_2, p_3}. Since the capacity of P is 3, at least one sink task, say p_1, is not assigned to P. Let A(p_1) = Q. Since ρ(p_1, Q) does not hold, Q has been assigned only p_1, and L^A_Q = 3. Let B(p_1) = P and B(t′) = Q. Repeat the same process for all tasks other than p_1, p_2, p_3 that are assigned to P in A. Then let B(t) = A(t) for the remaining tasks t ∈ T. To see that B is feasible, note that L^B_s ≤ L^A_s ≤ 3 for all servers s ∈ S. □

Let G′ be the subgraph induced by (T − {p_1, p_2, p_3}, S − {P}) = (T′, S′). We have the following lemma.

Lemma 3.6. Let A be a feasible task assignment in G. Then there exists a feasible task assignment A′ in G′.

Proof. Given A, Lemma 3.5 tells us that there exists another feasible assignment B in G such that B(t) ≠ P for all t ∈ T′. Let A′(t) = B(t) for all t ∈ T′. Then A′ is an assignment in G′, since A′(t) ∈ S − {P} for all t ∈ T′. To see that A′ is feasible, note that L^{A′}_s ≤ L^B_s ≤ 3 for all servers s ∈ S′. □

We further partition G′ into two subgraphs G_C and G_R. G_C is induced by the nodes {C_u : u ∈ [1, α]} ∪ {a_u : u ∈ [1, α]} ∪ {l_uv : u ∈ [1, α], v ∈ [1, 3]}, and G_R is induced by the nodes {R^(i)_j : i ∈ [1, β], j ∈ [1, 2ω]} ∪ {r^(i)_j : i ∈ [1, β], j ∈ [1, 2ω]} ∪ {v^(i)_j : i ∈ [1, β], j ∈ [1, ω]}. In other words, G_C consists of all clause gadgets, while G_R consists of all variable gadgets. If a task in one partition is remotely assigned to a server in the other partition, we call this task a cross-boundary task. Let n^A_c be the number of cross-boundary tasks that are in G_C and assigned to servers in G_R by A, and let n^A_r be the number of cross-boundary tasks that are in G_R and assigned to servers in G_C by A. We have the following lemmas.

Lemma 3.7. Let A be a feasible assignment in G′ such that n^A_c > 0 and n^A_r > 0. Then there exists a feasible assignment B in G′ such that one of n^B_c and n^B_r equals |n^A_c − n^A_r| and the other equals 0.
Proof. Assume t_i ∈ G_C, s_i ∈ G_R with A(t_i) = s_i, and t′_i ∈ G_R, s′_i ∈ G_C with A(t′_i) = s′_i. Then each of s_i and s′_i is assigned one remote task. Let B(t_i) = s′_i and B(t′_i) = s_i; then L^B_{s_i} ≤ L^A_{s_i} = 3 and L^B_{s′_i} ≤ L^A_{s′_i} = 3. This process decreases n_c and n_r each by one, and the resulting assignment is also feasible. Repeat the same process until the smaller of n_c and n_r becomes 0. Then let B(t) = A(t) for all the remaining tasks t ∈ T′. It is obvious that B is feasible, and that one of n^B_c and n^B_r equals |n^A_c − n^A_r| while the other equals 0. □

Lemma 3.8. Let A be a feasible assignment in G′ such that n^A_c = 0. Then n^A_r = 0.

Proof. For the sake of contradiction, assume t_i ∈ G_R, s_i ∈ G_C and A(t_i) = s_i. For each server s_j ∈ G_C, there is one auxiliary task a_u : u ∈ [1, α] such that ρ(a_u, s_j) holds. Since w_loc = 1 and w_rem = 3, if A is feasible then A(a_u) ≠ A(a_v) for u ≠ v. Since there are α auxiliary tasks and α servers in G_C, each server is assigned exactly one auxiliary task. Since A(t_i) = s_i, we have L^A_{s_i} ≥ 1 + 3 > 3, contradicting the fact that A is feasible. Therefore, there is no t_i ∈ G_R and s_i ∈ G_C such that A(t_i) = s_i. Thus, n^A_r = 0. □

Lemma 3.9. Let A be a feasible assignment in G′ such that n^A_r = 0. Then n^A_c = 0.

Proof. For the sake of contradiction, assume t_i ∈ G_C, s_i ∈ G_R and A(t_i) = s_i. Let k_0, k_1, k_2, k_3 denote the number of ring servers filled to load 0, 1, 2, 3, respectively. From the total number of servers in G_R, we have

k_0 + k_1 + k_2 + k_3 = 2βω.    (1)

Similarly, from the total number of tasks in G_R, we have

0·k_0 + 1·k_1 + 2·k_2 + 1·k_3 = 3βω.    (2)

Subtracting (1) from (2) gives k_2 = βω + k_0. Assigning both neighboring ring tasks to the same ring server fills it to load 2. Since there are only 2βω ring servers, we have k_2 ≤ βω. Hence, k_0 = 0 and k_2 = βω. This implies that all ring tasks are assigned to ring servers in alternating positions in each ring. There are βω remaining ring servers and βω variable tasks. Therefore, a variable task is remotely assigned to one of the remaining ring servers by A. Now consider the server s_i that has been remotely assigned t_i ∈ G_C. If it is assigned two ring tasks, its load is L^A_{s_i} = 2 + 3 > 3. If it is assigned one variable task, its load is L^A_{s_i} = 3 + 3 > 3. A is not feasible in either case. Therefore, there is no t_i ∈ G_C and s_i ∈ G_R such that A(t_i) = s_i. Thus, n^A_c = 0. □

Now we prove the following lemma.

Lemma 3.10. If G = f(φ) is admissible, then φ is satisfiable.

Proof. Given a feasible task assignment A in G = f(φ), we construct a satisfying truth assignment σ for φ. From Lemmas 3.6, 3.7, 3.8 and 3.9, we construct a feasible assignment B in G′ such that n^B_c = n^B_r = 0 and, in each variable gadget i, either the servers in T_i or the servers in F_i are saturated by variable tasks. If the ring servers in F_i are saturated by variable tasks, let σ(x_i) = true. If the ring servers in T_i are saturated by variable tasks, let σ(x_i) = false.

To check that this truth assignment is a satisfying assignment, note that of the three literal tasks l_u1, l_u2, l_u3, at most two are assigned to the clause server C_u. There must be one literal task, say l_uv, that is locally assigned to a ring server. In this case, σ(l_uv) = true, and thus the clause is satisfied: σ(C_u) = true. This fact holds for all clauses and thus indicates that σ(φ) = σ(∧_u C_u) = true. This completes the proof of Lemma 3.10. □
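The NP-membership half of the theorem below rests on the fact that feasibility of a guessed assignment is checkable in time O(mn). A sketch of that verifier for the (2,3)-UHTA setting (representation choices are ours):

    def is_3_feasible(A, rho, w_loc=1, w_rem=3):
        # Check L^A <= 3 for a complete assignment A (task -> server),
        # with the placement rho given as a set of (task, server) pairs.
        load = {}
        for t, s in A.items():
            load[s] = load.get(s, 0) + (w_loc if (t, s) in rho else w_rem)
        return max(load.values()) <= 3

    rho = {("t1", "s1"), ("t2", "s1"), ("t3", "s2")}
    print(is_3_feasible({"t1": "s1", "t2": "s1", "t3": "s2"}, rho))  # True
    print(is_3_feasible({"t1": "s2", "t2": "s1", "t3": "s2"}, rho))  # False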
4. A ROUND ROBIN ALGORITHM

In this section, we analyze a simple round robin algorithm for the UHTA problem. Algorithm 1 is inspired by the Hadoop scheduler algorithm. It scans over the servers in round robin fashion. When assigning a new task to a server, Algorithm 1 tries heuristically to exploit data locality. Since we have not specified the order in which tasks are assigned, Algorithm 1 may produce many possible outputs (assignments).

Algorithm 1 The round robin algorithm exploring locality.
1: input: a set of unassigned tasks T, a list of servers {s_1, s_2, . . . , s_n}, a placement relation ρ
2: define i ← 1 as an index variable
3: define A as an assignment
4: A(t) = ⊥ (task t is unassigned) for all t
5: while there exists an unassigned task do
6:   if there exists an unassigned task t such that ρ(t, s_i) holds then
7:     update A by assigning A(t) = s_i
8:   else
9:     pick any unassigned task t′ and update A by assigning A(t′) = s_i
10:  end if
11:  i ← (i mod n) + 1
12: end while
13: output: assignment A

Algorithm 1 is analogous to the Hadoop scheduler algorithm up to core version 0.19. There are three differences, though. First, the Hadoop algorithm assumes three kinds of placement: data-local, rack-local and rack-remote, whereas Algorithm 1 assumes only two: local and remote. Second, the Hadoop scheduler works incrementally rather than assigning all tasks at once. Last, the Hadoop algorithm is deterministic, whereas Algorithm 1 is nondeterministic.
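For reference, here is a minimal executable rendering of Algorithm 1 in Python (ours, not from the paper). The nondeterminism of lines 6 and 9 of the listing is resolved by taking the first qualifying task in iteration order:

# A sketch of Algorithm 1: round robin over servers, preferring
# tasks whose data is local to the current server.
def round_robin(tasks, servers, rho):
    unassigned = list(tasks)
    A = {}
    i = 0                                # current position in the server list
    while unassigned:
        s = servers[i]
        # Line 6: look for an unassigned task local to s.
        t = next((t for t in unassigned if (t, s) in rho), None)
        if t is None:
            t = unassigned[0]            # line 9: pick any unassigned task
        A[t] = s
        unassigned.remove(t)
        i = (i + 1) % len(servers)       # line 11
    return A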
Theorem 4.1. If w_rem > w_loc, increasing the number of data block replicas may increase the maximum load of the assignment computed by Algorithm 1.

Proof. The number of edges in the placement graph equals the number of data block replicas, so adding a new edge to the placement graph is equivalent to adding a new replica to the system. Consider the simple placement graph G where m = n and there is an edge between task t_i and server s_i for all 1 ≤ i ≤ n. Running Algorithm 1 gives an assignment A in which task t_i is assigned to s_i for all 1 ≤ i ≤ n, and thus L^A = w_loc. Now we add one edge between task t_n and server s_1. We run Algorithm 1 on this new placement graph G′ to get assignment A′. It might assign task t_n to server s_1 in the first step. Following that, it assigns t_i to s_i for 2 ≤ i ≤ n − 1, and it finally assigns t_1 to s_n. Since t_1 is remote to s_n, this gives L^{A′} = w_rem. Therefore L^{A′} > L^A. □

Theorem 4.1 indicates that increasing the number of data block replicas is not always beneficial for Algorithm 1. In the remainder of this section, we show that the assignments computed by Algorithm 1 may deviate from the optimum by a multiplicative factor. In the following, let O be an assignment that minimizes L^O.

Theorem 4.2. Let A be an assignment computed by Algorithm 1. Then L^A ≤ (w_rem/w_loc) · L^O.

Proof. On the one hand, the pigeonhole principle says some server is assigned at least ⌈m/n⌉ tasks. Since the cost of each task is at least w_loc, the load of that server is at least ⌈m/n⌉ · w_loc. Thus, L^O ≥ ⌈m/n⌉ · w_loc. On the other hand, Algorithm 1 runs in a round robin fashion where one task is assigned at a time. Therefore, the number of tasks assigned to each server is at most ⌈m/n⌉. Since the cost of each task is at most w_rem, the load of a server is at most ⌈m/n⌉ · w_rem. Thus, L^A ≤ ⌈m/n⌉ · w_rem. Combining the two, we have L^A ≤ (w_rem/w_loc) · L^O. □

Theorem 4.3. Let T and S be such that m ≤ n(n − 2). There exist a placement ρ and an assignment A such that A is a possible output of Algorithm 1, L^A ≥ ⌊m/n⌋ · w_rem, and L^O = ⌈m/n⌉ · w_loc.

Proof. We prove the theorem by constructing a placement graph G_ρ. Partition the set T of tasks into n disjoint subsets T_i, 1 ≤ i ≤ n, such that ⌈m/n⌉ ≥ |T_i| ≥ |T_j| ≥ ⌊m/n⌋ for all 1 ≤ i ≤ j ≤ n. Now in the placement graph G_ρ, connect the tasks in T_i to server s_i, for all 1 ≤ i ≤ n. This set of edges guarantees that L^O = ⌈m/n⌉ · w_loc. We then connect each task in T_n to a different server in the subset S′ = {s_1, s_2, . . . , s_{n−1}}. Since m ≤ n(n − 2), we have ⌈m/n⌉ ≤ m/n + 1 ≤ n − 1, which guarantees |S′| ≥ |T_n|. This completes the placement graph G_ρ. Now run Algorithm 1 on G_ρ. There is a possible output A in which the tasks in T_n are assigned to servers in S′. In that case, all tasks that are local to server s_n are assigned elsewhere, and thus s_n is assigned only remote tasks. Since s_n is assigned at least ⌊m/n⌋ tasks, this gives L^A ≥ ⌊m/n⌋ · w_rem. □

When n | m, the lower bound in Theorem 4.3 matches the upper bound in Theorem 4.2.
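For a concrete sense of this gap (our own numbers, not from the paper): take n = 4 servers, m = 8 tasks (so m ≤ n(n − 2) = 8), w_loc = 1 and w_rem = 3, with each T_i of size 2. The optimal assignment places every task on its local server, giving L^O = ⌈8/4⌉ · 1 = 2, while the worst-case run of Algorithm 1 from the proof of Theorem 4.3 leaves s_4 with ⌊8/4⌋ = 2 remote tasks, giving L^A = 2 · 3 = 6, exactly the multiplicative factor w_rem/w_loc = 3 of Theorem 4.2.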
5. A FLOW-BASED ALGORITHM

Theorem 3.1 shows that the problem of computing an optimal task assignment for the HTA problem is NP-complete. Nevertheless, it is feasible to find task assignments whose load is at most an additive constant greater than the optimal load. We present such an algorithm in this section.

For two partial assignments α and β such that β ⊇ α, we define the notion of virtual load from α as follows.

Definition 12. For any task t and partial assignment β that extends α, let

v^α(t, β) = w_loc if t is local in β, and v^α(t, β) = w_rem^α otherwise.

The virtual load of server s under β from α is V_s^{β,α} = Σ_{t: β(t)=s} v^α(t, β). The maximum virtual load under β from α is V^{β,α} = max_{s∈S} V_s^{β,α}.

Thus, v pessimistically assumes that every task not assigned by β will eventually become remote, and that each remote task will eventually have cost w_rem^α. When α is clear from context, we omit α and write v(t, β), V_s^β and V^β, respectively. Note that v^α(t, α) = w(t, α) as in Definition 7.
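A direct transcription of Definition 12 as a Python sketch (ours; w_rem_alpha stands for the number w_rem^α fixed by the current iteration, and rho is a set of (task, server) pairs):

# Virtual loads V_s^{beta,alpha}: tasks local under beta cost w_loc,
# every other assigned task is charged the pessimistic cost w_rem_alpha.
def virtual_loads(beta, rho, servers, w_loc, w_rem_alpha):
    V = {s: 0 for s in servers}
    for t, s in beta.items():
        V[s] += w_loc if (t, s) in rho else w_rem_alpha
    return V          # max(V.values()) is the maximum virtual load V^beta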
Algorithm 2 works iteratively to produce a sequence of assignments and then outputs the best one, i.e., the one of least maximum server load. The iteration is controlled by an integer variable τ, which is initialized to 1 and incremented on each iteration. Each iteration consists of two phases, max-cover and bal-assign:

• Max-cover: Given as input a placement graph G_ρ, an integer value τ, and a partial assignment α, max-cover returns a partial assignment α′ of a subset T′ of the tasks such that α′ assigns no server more than τ tasks, every task in T′ is local in α′, and |T′| is maximized over all such assignments. Thus, α′ makes as many tasks local as possible without assigning more than τ tasks to any one server. The name "max-cover" follows the intuition that we are trying to "cover" as many tasks as possible by their local servers, subject to the constraint that no server is assigned more than τ tasks.

• Bal-assign: Given as input a set of tasks T, a set of servers S, a partial assignment α computed by max-cover, and a cost function w, bal-assign uses a simple greedy algorithm to extend α to a complete assignment B by repeatedly choosing a server with minimal virtual load and assigning some unassigned task to it. This continues until all tasks are assigned. It thus generates a sequence of partial assignments α = α_0 ⊆ α_1 ⊆ · · · ⊆ α_u = B, where u = u^α. Every task t assigned in bal-assign contributes v^α(t, B) ≤ w_rem^α to the virtual load of the server it is assigned to. At the end, w_rem^B ≤ w_rem^α, and equality holds only when r^B = r^α + u^α.

The astute reader might find it intellectually attractive to use real server load as the criterion for choosing servers in bal-assign, because it embeds more accurate information. We do not know whether this change ever results in a better assignment. We do know that it may require more computation. Whenever a local task is assigned, r + u decreases by 1, so the remote cost w_rem(r + u) may also decrease. If it does, the loads of all servers that have been assigned remote tasks must be recomputed. In the current version of the algorithm, we do not need to update virtual loads when a local task is assigned, because the virtual cost of remote tasks never changes in the course of bal-assign.

5.1 Algorithm Description

We describe Algorithm 2 in greater detail here.

Algorithm 2 A flow-based algorithm for HTA.
1: input: an HMR-system (T, S, ρ, w_loc, w_rem(·))
2: define A, B as assignments
3: define α as a partial assignment
4: α(t) = ⊥ (task t is unassigned) for all t
5: for τ = 1 to m do
6:   α ← max-cover(G_ρ, τ, α)
7:   B ← bal-assign(T, S, α, w_loc, w_rem(·))
8: end for
9: set A equal to a B with least maximum load
10: output: assignment A

5.1.1 Max-cover

Max-cover (line 6 of Algorithm 2) augments the partial assignment α^{τ−1} computed by the previous iteration to produce α^τ. (We define α^0 to be the empty partial assignment.) Thus, α^τ ⊇ α^{τ−1}, and α^τ maximizes the total number of local tasks assigned subject to the constraint that no server is assigned more than τ tasks in all.

The core of the max-cover phase is an augmenting path algorithm by Ford and Fulkerson [10]. The Ford-Fulkerson algorithm takes as input a network with edge capacities and an existing network flow, and outputs a maximum flow that respects the capacity constraints. A fact about this algorithm is well known [6, 10].

Fact 5.1. Given a flow network with integral capacities and an initial integral s-t flow f, the Ford-Fulkerson algorithm computes an integral maximum s-t flow f′ in time O(|E| · (|f′| − |f|)), where |E| is the number of edges in the network and |f| is the value of the flow f, i.e., the amount of flow passing from the source to the sink.

During the max-cover phase at iteration τ, the input placement graph G_ρ is first converted to a corresponding flow network G′_ρ. G′_ρ includes all nodes of G_ρ plus an extra source u and an extra sink v. In G′_ρ, there is an edge (u, t) for every t ∈ T and an edge (s, v) for every s ∈ S. All of the original edges (t, s) in G_ρ remain in G′_ρ. The edge capacities are defined as follows: edge (s, v) has capacity τ for all s ∈ S, while all other edges have capacity 1. Therefore, for any pair (t, s), if there is flow through the path u → t → s → v, the value of this flow is no greater than 1. The input partial assignment α is then converted into a network flow f_α as follows: if task t is assigned to server s in α, route one unit of flow through the path u → t → s → v.

The Ford-Fulkerson algorithm is then run on graph G′_ρ with flow f_α to find a maximum flow f′_α. From Fact 5.1, the Ford-Fulkerson algorithm takes time O(|E| · (|f′_α| − |f_α|)) in this iteration. The output flow f′_α at iteration τ acts as the input flow to the Ford-Fulkerson algorithm at iteration τ + 1. The flow network at iteration τ + 1 is the same as the one at iteration τ except that each edge (s, v) has capacity τ + 1 for all s ∈ S. This incremental use of the Ford-Fulkerson algorithm in successive iterations helps reduce the time complexity of the whole algorithm.

At the end of the max-cover phase, the augmented flow f′_α is converted back into a partial assignment α′: if there is one unit of flow through the path u → t → s → v in f′_α, we assign task t to server s in α′. This conversion from a network flow to a partial assignment can always be done, because the flow is integral and all edges between tasks and servers have capacity 1. Therefore, there is a one-to-one correspondence between a unit of flow through the path u → t → s → v and the assignment of task t to its local server s. It follows that |f′_α| = ℓ^{α′}. By Fact 5.1, the Ford-Fulkerson algorithm computes a maximum flow that respects the capacity constraint τ. Thus, the following lemma is immediate.

Lemma 5.2. Let α^τ be the partial assignment computed by max-cover at iteration τ, and let β be any partial assignment such that k^β ≤ τ. Then ℓ^{α^τ} ≥ ℓ^β.
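Since all task-side capacities are 1, the maximum flow in G′_ρ is a capacitated bipartite matching, and the augmenting-path search can be written directly on the placement graph. The sketch below is ours, not the authors': it rediscovers augmenting paths Kuhn-style instead of literally maintaining the incremental flow of the paper, but it returns a partial assignment with the properties required by Lemma 5.2:

# Max-cover: locally assign as many tasks as possible, with no server
# receiving more than tau tasks, by augmenting the previous assignment.
def max_cover(tasks, servers, rho, tau, alpha):
    match = dict(alpha)                       # task -> local server
    count = {s: 0 for s in servers}
    for s in match.values():
        count[s] += 1
    local = {t: [s for s in servers if (t, s) in rho] for t in tasks}

    def augment(t, visited):
        # Try to place task t on some local server, rerouting if full.
        for s in local[t]:
            if s in visited:
                continue
            visited.add(s)
            if count[s] < tau:
                count[s] += 1
                match[t] = s
                return True
            for t2 in [x for x, y in match.items() if y == s]:
                if augment(t2, visited):      # t2 vacated its slot on s
                    match[t] = s
                    return True
        return False

    for t in tasks:
        if t not in match:
            augment(t, set())
    return match

Re-running this with τ incremented reproduces the chain α^{τ−1} ⊆ α^τ, though without the incremental Ford-Fulkerson savings that the analysis of Lemma 5.5 exploits.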
5.1.2 Bal-assign

Definition 13. Let β and β′ be partial assignments, t a task and s a server. We say that β −t:s→ β′ is a step that assigns t to s if t is unassigned in β and β′ = β ∪ {(t, s)}. We say β → β′ is a step if β −t:s→ β′ for some t and s. A sequence of steps α = α_0 → α_1 → · · · → α_u is a trace if for each i ∈ [1, u], whenever α_{i−1} −t:s→ α_i is a step, V_s^{α_{i−1},α} ≤ V_{s′}^{α_{i−1},α} for all s′ ≠ s.

Given two partial assignments α_{i−1} and α_i in a trace such that α_{i−1} −t:s→ α_i, it follows that

V_s^{α_i,α} ≤ V_s^{α_{i−1},α} + w_rem^α, and
V_{s′}^{α_i,α} = V_{s′}^{α_{i−1},α} for all s′ ≠ s.

The following lemma is immediate.

Lemma 5.3. Let u = u^{α^τ} and let α^τ = α_0^τ ⊆ α_1^τ ⊆ · · · ⊆ α_u^τ be the sequence of partial assignments generated by bal-assign at iteration τ. This sequence is a trace that ends in a complete assignment B^τ = α_u^τ.

5.2 Main Result

It is obvious that Algorithm 2 is optimal for n = 1, since only one assignment is possible. We now show that for n ≥ 2, Algorithm 2 computes, in polynomial time, assignments that are optimal to within an additive constant. The result is formally stated as Theorem 5.4.

Theorem 5.4. Let n ≥ 2. Given an HMR-system with m tasks and n servers, Algorithm 2 computes an assignment A in time O(m²n) such that L^A ≤ L^O + (1 − 1/(n − 1)) · w_rem^O.

Lemma 5.5. Algorithm 2 runs in time O(m²n).

Proof. By Fact 5.1, the Ford-Fulkerson algorithm takes time O(|E| · |Δf|) to augment the network flow by |Δf|. At iteration τ = 1, max-cover takes time O(|E| · |f_1|), where |f_1| ≤ n. At iteration τ = 2, max-cover takes time O(|E| · (|f_2| − |f_1|)), where |f_2| ≤ 2n. The same process is repeated until |f_m| = m. The total running time of max-cover over all iterations thus telescopes to O(|E| · (|f_1| + |f_2| − |f_1| + |f_3| − |f_2| + · · · + |f_m| − |f_{m−1}|)) = O(|E| · |f_m|) = O(|E| · m) = O(m²n), since |E| ≤ mn + m + n.

We implement the greedy algorithm in the bal-assign phase with a priority queue. Since there are n servers, each priority queue operation takes O(log n) time. During the bal-assign phase of each iteration, at most m tasks need to be assigned, which takes time O(m log n). The total running time of bal-assign over all iterations is thus O(m² log n).

Combining the running times of the two phases over all iterations gives time complexity O(m²n). □

Lemma 5.5 suggests that the max-cover phase is the main contributor to the time complexity of Algorithm 2. However, in a typical Hadoop system, the number of replicas of each data block is a small constant, say 2 or 3. Then the degree of each task node t in G_ρ is bounded by this constant. In this case, the placement graph G_ρ is sparse and |E| = O(m + n). As a result, max-cover runs in time O(m(m + n)), and the bal-assign phase might become the main contributor to the time complexity.
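The priority-queue implementation mentioned in the proof of Lemma 5.5 might look as follows. This is a sketch with Python's heapq (our rendering; w_rem_alpha is the constant virtual remote cost w_rem^α for the current iteration, and rho is a set of (task, server) pairs):

import heapq

# Bal-assign: extend alpha to a complete assignment by repeatedly
# giving an unassigned task to a server of minimal virtual load.
def bal_assign(tasks, servers, rho, alpha, w_loc, w_rem_alpha):
    B = dict(alpha)
    V = {s: 0 for s in servers}
    for t, s in alpha.items():
        V[s] += w_loc if (t, s) in rho else w_rem_alpha
    heap = [(V[s], s) for s in servers]
    heapq.heapify(heap)
    for t in tasks:
        if t in B:
            continue                      # already assigned by max-cover
        load, s = heapq.heappop(heap)     # server with minimal virtual load
        B[t] = s
        load += w_loc if (t, s) in rho else w_rem_alpha
        heapq.heappush(heap, (load, s))
    return B

Because the virtual cost of remote tasks never changes within one call, each task triggers exactly one pop and one push, matching the O(m log n) count in the proof.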
5.2.1 Properties of optimal assignments

In order to prove the approximation bound, we first establish some properties of optimal assignments.

Definition 14. Given an HMR-system, let O be the set of all optimal assignments, i.e., those that minimize the maximum load. Let r_min = min{r^A | A ∈ O} and let O_1 = {O ∈ O | r^O = r_min}.

Lemma 5.6. Let O ∈ O_1. If ℓ_s^O = k^O, then r_s^O = 0 and L_s^O = K^O.

Proof. Let ℓ_s^O = k^O for some server s. Assume to the contrary that r_s^O ≥ 1. Then L_s^O ≥ K^O + w_rem^O. Let t be a remote task assigned to s by O. By Definition 3, ρ(t, s′) holds for at least one server s′ ≠ s.

Case 1: s′ has at least one remote task t′. Then move t′ to s and t to s′. This results in another assignment B. B is still optimal because L_s^B ≤ L_s^O, L_{s′}^B ≤ L_{s′}^O, and L_{s″}^B = L_{s″}^O for any other server s″ ∈ S − {s, s′}.

Case 2: s′ has only local tasks. By the definition of k^O, s′ has at most k^O local tasks assigned by O. Then move t to s′. This results in another assignment B. B is still optimal because L_s^B < L_s^O, L_{s′}^B ≤ K^O + w_loc ≤ K^O + w_rem^O ≤ L^O, and L_{s″}^B = L_{s″}^O for any other server s″ ∈ S − {s, s′}.

In either case, we have shown that the new assignment is in O. However, since t becomes local in the new assignment, fewer remote tasks are assigned than in O. This contradicts O ∈ O_1. Thus, s is assigned no remote tasks, so L_s^O = k^O · w_loc = K^O. □

Definition 15. Let O ∈ O. Define M^O = (H^O − K^O)/(n − 1).

Lemma 5.7. L^O ≥ M^O.

Proof. Let s_1 be a server of maximal local load in O, so ℓ_{s_1}^O = k^O. Let S_2 = S − {s_1}. By Lemma 5.6, L_{s_1}^O = K^O. The total load on S_2 is Σ_{s∈S_2} L_s^O = H^O − K^O, so the average load on S_2 is M^O. Hence, L^O ≥ max_{s∈S_2} L_s^O ≥ avg_{s∈S_2} L_s^O = M^O. □

5.2.2 Analyzing the algorithm

Assume throughout this section that O ∈ O_1 and that α = α_0 → α_1 → · · · → α_u = B is a trace generated by iteration τ = k^O of the algorithm. Virtual loads are all based on α, so we generally omit explicit mention of α in the superscripts of v and V.

Lemma 5.8. w_loc ≤ w_rem^B ≤ w_rem^α ≤ w_rem^O.

Proof. w_loc ≤ w_rem^B follows from the definition of an HMR-system and the monotonicity of w_rem(·). It follows by definition of the w_rem notation that w_rem^B ≤ w_rem^α ≤ w_rem^O. □

Lemma 5.9. k^α = k^O.

Proof. By the constraint of max-cover, k^α ≤ τ = k^O. Suppose to the contrary that k^α < k^O. Then u^α > 0, because otherwise α = B would be a complete assignment with L^B = k^α · w_loc < K^O ≤ L^O, violating the optimality of O. Let t be an unassigned task in α. By definition, ρ(t, s) holds for some server s. Assign t to s in α to obtain a new partial assignment β. We have k^β ≤ k^α + 1 ≤ k^O = τ. By Lemma 5.2, ℓ^α ≥ ℓ^β, contradicting the fact that ℓ^β = ℓ^α + 1. We conclude that k^α = k^O. □

Lemma 5.10. L^B ≤ V^B.

Proof. By definition, L_s^B = Σ_{t: B(t)=s} w(t, B) and V_s^B = Σ_{t: B(t)=s} v(t, B). By Lemma 5.8, w_rem^B ≤ w_rem^α, and thus w(t, B) ≤ v(t, B). It follows that for all s ∈ S, L_s^B ≤ V_s^B. Therefore L^B ≤ V^B, because L^B = max_s L_s^B and V^B = V^{B,α} = max_s V_s^B. □

For the remainder of this section, let s_1 be a server such that ℓ_{s_1}^α = k^O. Such a server exists by Lemma 5.9. Let S_2 = S − {s_1} be the set of remaining servers. For a partial assignment β ⊇ α, define N^β to be the average virtual load under β of the servers in S_2. Formally,

N^β = (Σ_{s∈S_2} V_s^β) / |S_2| = (ℓ^β · w_loc + r^β · w_rem^α − V_{s_1}^β) / (n − 1).

To obtain the approximation bound, we compare N^β with the analogous quantity M^O for the optimal assignment. For convenience, we let δ = w_rem^O / (n − 1).

Lemma 5.11. Let β = α_{i−1} −t:s→ α_i = β′ be a step of the trace. Then V_s^β ≤ N^β ≤ M^O − δ.

Proof. The proof is by a counting argument. By Lemma 5.9, we have k^α = k^O, so ℓ_{s_1}^β ≥ ℓ_{s_1}^α = k^O. Hence, V_{s_1}^β ≥ K^O. By Lemma 5.2, we have ℓ^α ≥ ℓ^O. Let d = ℓ^β − ℓ^O; then d ≥ 0 because ℓ^β ≥ ℓ^α ≥ ℓ^O. Because ℓ^β + r^β + u^β = m = ℓ^O + r^O, we have r^β + u^β + d = r^O. Also, u^β ≥ 1 since t is unassigned in β. Then by Lemma 5.8,

(n − 1) · N^β = ℓ^β w_loc + r^β w_rem^α − V_{s_1}^β
             = (ℓ^O + d) w_loc + (r^O − u^β − d) w_rem^α − V_{s_1}^β
             ≤ ℓ^O w_loc + (r^O − u^β) w_rem^O − K^O
             ≤ (n − 1) M^O − w_rem^O.

(The first inequality uses d ≥ 0, w_loc ≤ w_rem^α ≤ w_rem^O, and V_{s_1}^β ≥ K^O; the second uses u^β ≥ 1 and H^O = ℓ^O w_loc + r^O w_rem^O.) Hence, N^β ≤ M^O − δ.

Now, since β −t:s→ β′ is a step of a trace, we have V_s^β ≤ V_{s′}^β for all s′ ≠ s. In particular, V_s^β ≤ N^β, since N^β is the average virtual load of the servers in S_2. We conclude that V_s^β ≤ N^β ≤ M^O − δ. □

Proof of Theorem 5.4. Lemma 5.5 shows that the time complexity of Algorithm 2 is O(m²n). We now finish the proof of the approximation bound. Let s be a server of maximum virtual load in B, so V^B = V_s^B. Let i be the smallest integer such that α_i|s = B|s, that is, no more tasks are assigned to s in the subtrace beginning with α_i.

Case 1: i = 0. Then B|s = α|s, so every task assigned to s by B is local and was assigned by max-cover. Hence V^B = V_s^α = ℓ_s^α · w_loc ≤ k^α · w_loc = K^O ≤ L^O ≤ L^O + w_rem^O − δ, where the last inequality uses n ≥ 2, so that δ ≤ w_rem^O.

Case 2: i > 0. Then β = α_{i−1} −t:s→ α_i = β′ for some task t. By Lemma 5.11, V_s^β ≤ M^O − δ, so using Lemma 5.8, V_s^{β′} ≤ V_s^β + w_rem^α ≤ M^O − δ + w_rem^O. Then by Lemma 5.7, V^B = V_s^B = V_s^{β′} ≤ M^O + w_rem^O − δ ≤ L^O + w_rem^O − δ.

Both cases imply that V^B ≤ L^O + w_rem^O − δ. By Lemma 5.10, we have L^B ≤ V^B. Because the algorithm chooses an assignment with least maximum load as the output A, we have L^A ≤ L^B. Hence,

L^A ≤ L^O + w_rem^O − δ = L^O + (1 − 1/(n − 1)) · w_rem^O. □
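To see the scale of the additive term (our own numbers, not from the paper): with n = 4 servers, the bound gives L^A ≤ L^O + (1 − 1/3) · w_rem^O = L^O + (2/3) · w_rem^O, so if w_rem^O = 30 time units, Algorithm 2 is within 20 units of optimal regardless of m. The additive term never exceeds w_rem^O, and for n = 2 it vanishes entirely, since 1 − 1/(n − 1) = 0.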
6. CONCLUSION

In this paper, we present an algorithmic study of the task assignment problem in the Hadoop MapReduce framework and propose a mathematical model to evaluate the cost of task assignments. Based on this model, we show that it is infeasible to find the optimal assignment unless P = NP. Theorem 3.1 shows that the task assignment problem in Hadoop remains hard even if all servers have an equal capacity of 3, the cost function takes only 2 values, and each data block has at most 2 replicas.

Second, we analyze the simple round robin algorithm for the UHTA problem. Theorem 4.1 refutes the intuition that increasing the number of replicas always helps load balancing: under round robin task assignment, adding more replicas to the system can sometimes result in a worse maximum load. Theorems 4.2 and 4.3 show that there can be a multiplicative gap in maximum load between the optimal assignment and the assignment computed by Algorithm 1.

Third, we present Algorithm 2 for the general HTA problem. This algorithm employs maximum flow and increasing threshold techniques. Theorem 5.4 shows that the assignments computed by Algorithm 2 are optimal to within an additive constant that depends only on the number of servers and the remote cost function.

There are many interesting directions for future work. We have sketched a proof of a matching lower bound to Theorem 5.4 for a class of Hadoop cost functions, which we plan to present in followup work. Sharing a MapReduce cluster among multiple users is becoming popular and has led to the recent development of multi-user multi-job schedulers such as the fair scheduler and the capacity scheduler. We plan to analyze the performance of such schedulers and see whether the optimization techniques from this paper can be applied to improve them.

7. ACKNOWLEDGMENTS

We would like to thank Avi Silberschatz, Daniel Abadi, Kamil Bajda-Pawlikowski, and Azza Abouzeid for their inspiring discussions. We are also grateful to the anonymous referees for providing many useful suggestions that significantly improved the quality of our presentation.

8. REFERENCES

[1] J. Aspnes, Y. Azar, A. Fiat, S. Plotkin, and O. Waarts. On-line routing of virtual circuits with applications to load balancing and machine scheduling. Journal of the ACM, 44(3):486–504, 1997.
[2] Y. Azar, J. S. Naor, and R. Rom. The competitiveness of on-line assignments. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 203–210. SIAM, Philadelphia, PA, USA, 1992.
[3] K. Birman, G. Chockler, and R. van Renesse. Towards a cloud computing research agenda. SIGACT News, 40(2):68–80, 2009.
[4] E. Bortnikov. Open-source grid technologies for web-scale computing. SIGACT News, 40(2):87–93, 2009.
[5] R. E. Burkard. Assignment problems: Recent solution methods and applications. In System Modelling and Optimization: Proceedings of the 12th IFIP Conference, Budapest, Hungary, September 2-6, 1985, pages 153–169. Springer, 1986.
[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, 2nd ed. MIT Press, Cambridge, MA, 2001.
[7] M. de Kruijf and K. Sankaralingam. MapReduce for the Cell B.E. architecture. University of Wisconsin Computer Sciences Technical Report CS-TR-2007-1625, 2007.
[8] J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM, New York, NY, USA, 2006.
[9] J. Dean and S. Ghemawat.
MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, pages 137–150, 2004.
[10] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3):399–404, 1956.
[11] M. R. Garey, D. S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, 1979.
[12] R. L. Graham. Bounds for certain multiprocessing anomalies. Bell System Technical Journal, 45(9):1563–1581, 1966.
[13] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, pages 416–429, 1969.
[14] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260–269. ACM, New York, NY, USA, 2008.
[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1), 2005. Originally appeared in Naval Research Logistics Quarterly, 2, 1955, 83–97.
[16] R. Lämmel. Google's MapReduce programming model—Revisited. Science of Computer Programming, 68(3):208–237, 2007.
[17] J. K. Lenstra, D. B. Shmoys, and E. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46(1):259–271, 1990.
[18] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13–24. IEEE Computer Society, Washington, DC, USA, 2007.
[19] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation, San Diego, CA, 2008.