for a rewrite of our production indexing system.Sec-2.2 Types tion 7 discusses related and future work. Even though the previous pseudo-code is written in terms 2 Programming Model of string inputs and outputs,conceptually the map and reduce functions supplied by the user have associated types: The computation takes a set of input key/value pairs,and produces a set of output key/value pairs.The user of map (k1,v1) →1ist(k2,v2) the MapReduce library expresses the computation as two reduce (k2,list (v2))-list(v2) functions:Map and Reduce. I.e.,the input keys and values are drawn from a different Map,written by the user,takes an input pair and pro- domain than the output keys and values.Furthermore. duces a set of intermediate key/value pairs.The MapRe- the intermediate keys and values are from the same do- duce library groups together all intermediate values asso- main as the output keys and values. ciated with the same intermediate key I and passes them Our C++implementation passes strings to and from to the Reduce function. the user-defined functions and leaves it to the user code The Reduce function,also written by the user,accepts to convert between strings and appropriate types. an intermediate key I and a set of values for that key.It merges together these values to form a possibly smaller set of values.Typically just zero or one output value is 2.3 More Examples produced per Reduce invocation.The intermediate val- ues are supplied to the user's reduce function via an iter- Here are a few simple examples of interesting programs ator.This allows us to handle lists of values that are too that can be easily expressed as MapReduce computa- large to fit in memory. tions. 2.1 Example Distributed Grep:The map function emits a line if it Consider the problem of counting the number of oc- matches a supplied pattern.The reduce function is an identity function that just copies the supplied intermedi- currences of each word in a large collection of docu- ate data to the output. ments.The user would write code similar to the follow- ing pseudo-code: Count of URL Access Frequency:The map func- map(String key,String value): /key:document name tion processes logs of web page requests and outputs /value:document contents (URL,1).The reduce function adds together all values for each word w in value: for the same URL and emits a (URL,total count) EmitIntermediate(w,"1"); pair. reduce(String key,Iterator values): /key:a word Reverse Web-Link Graph:The map function outputs /values:a list of counts (target,source)pairs for each link to a target int result =0; URL found in a page named source.The reduce for each v in values: function concatenates the list of all source URLs as- result +ParseInt(v); sociated with a given target URL and emits the pair: Emit (AsString(result)); (target,list(source)) The map function emits each word plus an associated count of occurrences (just'1'in this simple example) Term-Vector per Host:A term vector summarizes the The reduce function sums together all counts emitted most important words that occur in a document or a set for a particular word. of documents as a list of (word,frequency)pairs.The In addition,the user writes code to fill in a mapreduce map function emits a (hostname,term vector) specification object with the names of the input and out- pair for each input document (where the hostname is put files,and optional tuning parameters.The user then extracted from the URL of the document).The re- invokes the MapReduce function,passing it the specifi- duce function is passed all per-document term vectors cation object.The user's code is linked together with the for a given host.It adds these term vectors together. MapReduce library (implemented in C++).Appendix A throwing away infrequent terms,and then emits a final contains the full program text for this example. (hostname,term vector)pair. To appear in OSDI 2004 2for a rewrite of our production indexing system. Section 7 discusses related and future work. 2 Programming Model The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory. 2.1 Example Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code: map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specifi- cation object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example. 2.2 Types Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types: map (k1,v1) → list(k2,v2) reduce (k2,list(v2)) → list(v2) I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values. Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types. 2.3 More Examples Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations. Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. Count of URL Access Frequency: The map function processes logs of web page requests and outputs hURL, 1i. The reduce function adds together all values for the same URL and emits a hURL, total counti pair. Reverse Web-Link Graph: The map function outputs htarget, sourcei pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: htarget, list(source)i Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of hword, frequencyi pairs. The map function emits a hhostname, term vectori pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final hhostname, term vectori pair. To appear in OSDI 2004 2