To appear in OSDI 2004

the execution of jobs across a wide-area network. However, there are two fundamental similarities. (1) Both systems use redundant execution to recover from data loss caused by failures. (2) Both use locality-aware scheduling to reduce the amount of data sent across congested network links.

TACC [7] is a system designed to simplify construction of highly-available networked services. Like MapReduce, it relies on re-execution as a mechanism for implementing fault-tolerance.

8 Conclusions

The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations.
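To make this expressibility concrete, the canonical word-count computation can be sketched in plain Python. This is a toy, single-process illustration of the programming model, not Google's distributed C++ implementation; the sort-and-group step merely stands in for the distributed shuffle between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map phase: emit an intermediate (word, 1) pair per occurrence.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return (word, sum(counts))

def mapreduce(documents, map_fn, reduce_fn):
    # Sorting the intermediate pairs and grouping by key emulates
    # the shuffle that the real system performs across machines.
    pairs = sorted(kv for doc in documents for kv in map_fn(doc))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn))
# → [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Only `map_fn` and `reduce_fn` are user-supplied; everything the `mapreduce` driver does here (partitioning, shuffling, grouping) is what the real runtime hides from the programmer, which is precisely the point of the restricted model.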
For example, MapReduce is used for the generation of data for Google’s production web search service, for sorting, for data mining, for machine learning, and many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters of machines comprising thousands of machines. The implementation makes efficient use of these machine resources and therefore is suitable for use on many of the large computational problems encountered at Google. We have learned several things from this work. First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Second, network bandwidth is a scarce resource. A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network: the locality optimization allows us to read data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth. Third, redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss. Acknowledgements Josh Levenberg has been instrumental in revising and extending the user-level MapReduce API with a number of new features based on his experience with using MapReduce and other people’s suggestions for enhancements. MapReduce reads its input from and writes its output to the Google File System [8]. We would like to thank Mohit Aron, Howard Gobioff, Markus Gutschke, David Kramer, Shun-Tak Leung, and Josh Redstone for their work in developing GFS. We would also like to thank Percy Liang and Olcan Sercinoglu for their work in developing the cluster management system used by MapReduce. Mike Burrows, Wilson Hsieh, Josh Levenberg, Sharon Perl, Rob Pike, and Debby Wallach provided helpful comments on earlier drafts of this paper. 
The anonymous OSDI reviewers, and our shepherd, Eric Brewer, provided many useful suggestions of areas where the paper could be improved. Finally, we thank all the users of MapReduce within Google's engineering organization for providing helpful feedback, suggestions, and bug reports.

References

[1] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.

[2] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS '99), pages 10–22, Atlanta, Georgia, May 1999.

[3] Arash Baratloo, Mehmet Karaul, Zvi Kedem, and Peter Wyckoff. Charlotte: Metacomputing on the web. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems, 1996.

[4] Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, April 2003.

[5] John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004.

[6] Guy E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, C-38(11), November 1989.

[7] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 78–91, Saint-Malo, France, 1997.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29–43, Lake George, New York, 2003.