正在加载图片...
for a rewrite of our production indexing system.Sec-2.2 Types tion 7 discusses related and future work. Even though the previous pseudo-code is written in terms 2 Programming Model of string inputs and outputs,conceptually the map and reduce functions supplied by the user have associated types: The computation takes a set of input key/value pairs,and produces a set of output key/value pairs.The user of map (k1,v1) →1ist(k2,v2) the MapReduce library expresses the computation as two reduce (k2,list (v2))-list(v2) functions:Map and Reduce. I.e.,the input keys and values are drawn from a different Map,written by the user,takes an input pair and pro- domain than the output keys and values.Furthermore. duces a set of intermediate key/value pairs.The MapRe- the intermediate keys and values are from the same do- duce library groups together all intermediate values asso- main as the output keys and values. ciated with the same intermediate key I and passes them Our C++implementation passes strings to and from to the Reduce function. the user-defined functions and leaves it to the user code The Reduce function,also written by the user,accepts to convert between strings and appropriate types. an intermediate key I and a set of values for that key.It merges together these values to form a possibly smaller set of values.Typically just zero or one output value is 2.3 More Examples produced per Reduce invocation.The intermediate val- ues are supplied to the user's reduce function via an iter- Here are a few simple examples of interesting programs ator.This allows us to handle lists of values that are too that can be easily expressed as MapReduce computa- large to fit in memory. tions. 2.1 Example Distributed Grep:The map function emits a line if it Consider the problem of counting the number of oc- matches a supplied pattern.The reduce function is an identity function that just copies the supplied intermedi- currences of each word in a large collection of docu- ate data to the output. ments.The user would write code similar to the follow- ing pseudo-code: Count of URL Access Frequency:The map func- map(String key,String value): /key:document name tion processes logs of web page requests and outputs /value:document contents (URL,1).The reduce function adds together all values for each word w in value: for the same URL and emits a (URL,total count) EmitIntermediate(w,"1"); pair. reduce(String key,Iterator values): /key:a word Reverse Web-Link Graph:The map function outputs /values:a list of counts (target,source)pairs for each link to a target int result =0; URL found in a page named source.The reduce for each v in values: function concatenates the list of all source URLs as- result +ParseInt(v); sociated with a given target URL and emits the pair: Emit (AsString(result)); (target,list(source)) The map function emits each word plus an associated count of occurrences (just'1'in this simple example) Term-Vector per Host:A term vector summarizes the The reduce function sums together all counts emitted most important words that occur in a document or a set for a particular word. of documents as a list of (word,frequency)pairs.The In addition,the user writes code to fill in a mapreduce map function emits a (hostname,term vector) specification object with the names of the input and out- pair for each input document (where the hostname is put files,and optional tuning parameters.The user then extracted from the URL of the document).The re- invokes the MapReduce function,passing it the specifi- duce function is passed all per-document term vectors cation object.The user's code is linked together with the for a given host.It adds these term vectors together. MapReduce library (implemented in C++).Appendix A throwing away infrequent terms,and then emits a final contains the full program text for this example. (hostname,term vector)pair. To appear in OSDI 2004 2for a rewrite of our production indexing system. Sec￾tion 7 discusses related and future work. 2 Programming Model The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and pro￾duces a set of intermediate key/value pairs. The MapRe￾duce library groups together all intermediate values asso￾ciated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate val￾ues are supplied to the user’s reduce function via an iter￾ator. This allows us to handle lists of values that are too large to fit in memory. 2.1 Example Consider the problem of counting the number of oc￾currences of each word in a large collection of docu￾ments. The user would write code similar to the follow￾ing pseudo-code: map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a mapreduce specification object with the names of the input and out￾put files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specifi- cation object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example. 2.2 Types Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types: map (k1,v1) → list(k2,v2) reduce (k2,list(v2)) → list(v2) I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same do￾main as the output keys and values. Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types. 2.3 More Examples Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computa￾tions. Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermedi￾ate data to the output. Count of URL Access Frequency: The map func￾tion processes logs of web page requests and outputs hURL, 1i. The reduce function adds together all values for the same URL and emits a hURL, total counti pair. Reverse Web-Link Graph: The map function outputs htarget, sourcei pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs as￾sociated with a given target URL and emits the pair: htarget, list(source)i Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of hword, frequencyi pairs. The map function emits a hhostname, term vectori pair for each input document (where the hostname is extracted from the URL of the document). The re￾duce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final hhostname, term vectori pair. To appear in OSDI 2004 2
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有