Assigning Tasks for Efficiency in Hadoop
[Extended Abstract]

Michael J. Fischer
Computer Science
Yale University
P.O. Box 208285
New Haven, CT, USA
michael.fischer@yale.edu

Xueyuan Su ∗
Computer Science
Yale University
P.O. Box 208285
New Haven, CT, USA
xueyuan.su@yale.edu

Yitong Yin †
State Key Laboratory for Novel Software Technology
Nanjing University, China
yinyt@nju.edu.cn

ABSTRACT
In recent years Google's MapReduce has emerged as a leading large-scale data processing architecture. Adopted for daily use by companies such as Amazon, Facebook, Google, IBM, and Yahoo!, and more recently put to use by several universities, it allows parallel processing of huge volumes of data over clusters of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split into blocks, which are replicated and spread over all servers in a network. Each job is also split into many small pieces called tasks. Several tasks are processed on a single server, and a job is not completed until all of its assigned tasks are finished. A crucial factor that affects the completion time of a job is the particular assignment of tasks to servers. Given a placement of the input data over the servers, one wishes to find the assignment that minimizes the completion time. In this paper, an idealized Hadoop model is proposed to investigate the Hadoop task assignment problem. It is shown that there is no feasible algorithm to find the optimal Hadoop task assignment unless P = NP. Assignments computed by the round robin algorithm inspired by the current Hadoop scheduler are shown to deviate from optimum by a multiplicative factor in the worst case. A flow-based algorithm is presented that computes assignments that are optimal to within an additive constant.

Categories and Subject Descriptors
D.3.2 [Programming Languages]: Language Classifications—concurrent, distributed, and parallel languages; F.1.2 [Computation by Abstract Devices]: Modes of Computation—parallelism and concurrency; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes—reducibility and completeness; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—sequencing and scheduling

∗ Supported by the Kempner Fellowship from the Department of Computer Science at Yale University.
† Supported by the National Science Foundation of China under Grant No. 60721002. This work was done while Yitong Yin was at Yale University.
General Terms
Algorithms, Performance, Theory

Keywords
task assignment, load balancing, NP-completeness, approximation algorithm, MapReduce, Hadoop

1. INTRODUCTION

1.1 Background
The cloud computing paradigm has recently received significant attention in the media. The cloud is a metaphor for the Internet, which is an abstraction for the complex infrastructure it conceals. Cloud computing refers to both the applications delivered as services over the Internet and the hardware and software that provide those services. It envisions shifting data storage and computing power away from local servers, across the network cloud, and into large clusters of machines hosted by companies such as Amazon, Google, IBM, Microsoft, and Yahoo!.

Google's MapReduce [8, 9, 16] parallel computing architecture, for example, splits the workload over large clusters of commodity PCs and enables automatic parallelization. By exploiting parallel processing, it provides a software platform that lets one easily write and run applications that process vast amounts of data.

Apache Hadoop [4] is a free Java implementation of MapReduce developed in the open source software community. It was originally designed to process large volumes of data efficiently by parallel processing over commodity computers in local networks. In academia, researchers have adapted Hadoop to several different architectures. For example, Ranger et al. [18] evaluate MapReduce on multi-core and multi-processor systems, Kruijf et al. [7] implement MapReduce on the Cell B.E. processor architecture, and He et al. [14] propose a MapReduce framework on graphics processors. Many related applications using Hadoop have also been developed to solve various practical problems.
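To make the task assignment problem from the abstract concrete, the sketch below simulates a toy version of it in Java: tasks are assigned to servers round robin, a task costs less when a replica of its input block resides on the chosen server, and the job's completion time is the load of the busiest server. This is an illustrative model only, not the paper's algorithm or Hadoop's scheduler; the class name, the cost constants, and the random block placement are all invented here for the example.

import java.util.*;

// Toy model of Hadoop task assignment: round-robin placement with
// a locality-dependent cost per task. Illustration only.
public class RoundRobinToy {
    static final double LOCAL_COST = 1.0;   // assumed cost of a data-local task
    static final double REMOTE_COST = 3.0;  // assumed cost when the input block is remote

    public static void main(String[] args) {
        int servers = 4, tasks = 12, replication = 2;
        Random rng = new Random(42);

        // placement.get(t) = set of servers holding a replica of task t's input block
        List<Set<Integer>> placement = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            Set<Integer> replicas = new HashSet<>();
            while (replicas.size() < replication)
                replicas.add(rng.nextInt(servers));
            placement.add(replicas);
        }

        // Round robin: task t goes to server t mod servers, ignoring locality.
        double[] load = new double[servers];
        for (int t = 0; t < tasks; t++) {
            int s = t % servers;
            load[s] += placement.get(t).contains(s) ? LOCAL_COST : REMOTE_COST;
        }

        // The job finishes when the most loaded server finishes.
        System.out.println("Server loads: " + Arrays.toString(load));
        System.out.println("Completion time: " + Arrays.stream(load).max().orElse(0));
    }
}

In this toy setting, an assignment that prefers servers already holding a task's block generally lowers the completion time; quantifying how far round robin can fall short of such an assignment, and computing a near-optimal one, is what the rest of the paper studies.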