ing system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents that have been retrieved by our crawling system, stored as a set of GFS files. The raw contents for these documents are more than 20 terabytes of data. The indexing process runs as a sequence of five to ten MapReduce operations; a sketch of this chained-pass structure appears after the list below. Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits:

• The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce.

• The performance of the MapReduce library is good enough that we can keep conceptually unrelated computations separate, instead of mixing them together to avoid extra passes over the data. This makes it easy to change the indexing process. For example, one change that took a few months to make in our old indexing system took only a few days to implement in the new system.

• The indexing process has become much easier to operate, because most of the problems caused by machine failures, slow machines, and networking hiccups are dealt with automatically by the MapReduce library without operator intervention. Furthermore, it is easy to improve the performance of the indexing process by adding new machines to the indexing cluster.
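To make the chained-pass structure concrete, here is a minimal hypothetical driver sketch in which each MapReduce operation consumes the GFS files written by the previous one. The PassSpec type, the RunPass entry point, the file patterns, and the phase names are all illustrative assumptions, not the actual MapReduce library API or the real indexing phases.

#include <cstdio>
#include <string>
#include <vector>

struct PassSpec {
  std::string input_pattern;   // GFS file pattern this pass reads
  std::string output_base;     // GFS prefix for this pass's output
  std::string mapper_class;    // user-supplied Map function
  std::string reducer_class;   // user-supplied Reduce function
  int num_reduce_tasks;        // R: number of output partitions
};

// Stub standing in for the MapReduce library call: the real library
// schedules map and reduce tasks across the cluster and handles
// machine failures and stragglers transparently.
bool RunPass(const PassSpec& spec) {
  std::printf("pass: %s -> %s (%s / %s, R=%d)\n",
              spec.input_pattern.c_str(), spec.output_base.c_str(),
              spec.mapper_class.c_str(), spec.reducer_class.c_str(),
              spec.num_reduce_tasks);
  return true;
}

int main() {
  // The real indexing process is a sequence of five to ten passes;
  // three hypothetical ones are shown here, each consuming the
  // previous pass's GFS output files.
  std::vector<PassSpec> pipeline = {
    {"/gfs/crawl/docs-*",   "/gfs/index/parsed", "ParseDocs",    "DedupDocs",  500},
    {"/gfs/index/parsed-*", "/gfs/index/links",  "ExtractLinks", "MergeLinks", 500},
    {"/gfs/index/links-*",  "/gfs/index/final",  "BuildIndex",   "WriteIndex", 500},
  };
  for (const PassSpec& pass : pipeline)
    if (!RunPass(pass)) return 1;  // a failed pass aborts the pipeline
  return 0;
}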
7 Related Work

Many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. For example, an associative function can be computed over all prefixes of an N element array in log N time on N processors using parallel prefix computations [6, 9, 13]. MapReduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations. More significantly, we provide a fault-tolerant implementation that scales to thousands of processors. In contrast, most of the parallel processing systems have only been implemented on smaller scales and leave the details of handling machine failures to the programmer.

Bulk Synchronous Programming [17] and some MPI primitives [11] provide higher-level abstractions that make it easier for programmers to write parallel programs. A key difference between these systems and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance.

Our locality optimization draws its inspiration from techniques such as active disks [12, 15], where computation is pushed into processing elements that are close to local disks, to reduce the amount of data sent across I/O subsystems or the network. We run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar.

Our backup task mechanism is similar to the eager scheduling mechanism employed in the Charlotte System [3]. One of the shortcomings of simple eager scheduling is that if a given task causes repeated failures, the entire computation fails to complete. We fix some instances of this problem with our mechanism for skipping bad records.

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to other systems such as Condor [16].

The sorting facility that is a part of the MapReduce library is similar in operation to NOW-Sort [1]. Source machines (map workers) partition the data to be sorted and send it to one of R reduce workers. Each reduce worker sorts its data locally (in memory if possible). Of course NOW-Sort does not have the user-definable Map and Reduce functions that make our library widely applicable.

River [2] provides a programming model where processes communicate with each other by sending data over distributed queues. Like MapReduce, the River system tries to provide good average case performance even in the presence of non-uniformities introduced by heterogeneous hardware or system perturbations. River achieves this by careful scheduling of disk and network transfers to achieve balanced completion times. MapReduce has a different approach. By restricting the programming model, the MapReduce framework is able to partition the problem into a large number of fine-grained tasks. These tasks are dynamically scheduled on available workers so that faster workers process more tasks. The restricted programming model also allows us to schedule redundant executions of tasks near the end of the job which greatly reduces completion time in the presence of non-uniformities (such as slow or stuck workers).
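The dynamic scheduling just described can be illustrated with a small self-contained sketch (not the MapReduce implementation itself): workers repeatedly claim the next task from a shared queue, so a worker that finishes quickly simply claims more of the fine-grained tasks. Worker speeds are simulated here with sleeps; all names and constants are illustrative.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int kNumTasks = 200;           // many fine-grained tasks
  std::atomic<int> next_task{0};       // shared queue: next unclaimed task
  std::vector<int> done_count(4, 0);   // tasks completed per worker

  auto worker = [&](int id, int task_ms) {
    for (;;) {
      int t = next_task.fetch_add(1);  // claim the next task
      if (t >= kNumTasks) return;      // queue drained
      // Simulate non-uniform hardware: each worker has its own speed.
      std::this_thread::sleep_for(std::chrono::milliseconds(task_ms));
      ++done_count[id];
    }
  };

  std::vector<std::thread> pool;
  int speeds_ms[4] = {1, 2, 4, 8};     // heterogeneous workers
  for (int i = 0; i < 4; ++i) pool.emplace_back(worker, i, speeds_ms[i]);
  for (auto& th : pool) th.join();

  // Faster workers end up having processed proportionally more tasks.
  for (int i = 0; i < 4; ++i)
    std::printf("worker %d completed %d tasks\n", i, done_count[i]);
}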
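Returning to the parallel prefix computation cited at the start of this section: the log N bound follows from a schedule in which round k combines every element with the element 2^k positions to its left; all N combines within a round are independent, so on N processors each round takes constant time. The sketch below is a generic textbook formulation (written sequentially for clarity, using addition as the associative operator), not code from [6, 9, 13].

#include <cstdio>
#include <vector>

// Inclusive prefix sums via the doubling schedule: ceil(log2 N) rounds;
// within a round every position can be updated in parallel, giving
// log N time on N processors for any associative operator.
std::vector<int> PrefixSums(std::vector<int> a) {
  const size_t n = a.size();
  for (size_t stride = 1; stride < n; stride *= 2) {
    std::vector<int> next = a;            // double-buffer so every read in
    for (size_t i = stride; i < n; ++i)   // this round sees the previous
      next[i] = a[i] + a[i - stride];     // round only; independent across i
    a = std::move(next);
  }
  return a;
}

int main() {
  std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
  for (int x : PrefixSums(v)) std::printf("%d ", x);  // 3 4 8 9 14 23 25 31
  std::printf("\n");
}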
BAD-FS [5] has a very different programming model from MapReduce, and unlike MapReduce, is targeted to