MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004
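The model described above can be illustrated with a minimal single-machine sketch. This is not the paper's distributed implementation; it only mirrors the data flow: a map function turns each input record into intermediate (key, value) pairs, the library groups values by key, and a reduce function merges each key's values. The function names (`map_reduce`, `word_map`, `word_reduce`) are illustrative, not from the paper:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to each logical record, collecting
    # the intermediate (key, value) pairs it emits.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: merge all values associated with the same key.
    return {key: reduce_fn(key, values)
            for key, values in intermediate.items()}

# Word count, the canonical example: map emits (word, 1) for each
# word in a document; reduce sums the counts for each word.
def word_map(document):
    for word in document.split():
        yield (word, 1)

def word_reduce(word, counts):
    return sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(docs, word_map, word_reduce)
# counts["the"] == 3, counts["fox"] == 2
```

In the real system each phase runs in parallel across many machines, with the grouping step implemented by partitioning the intermediate keys; the user-visible interface, however, is just the two functions shown here.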