
MapReduce: Simplified Data Processing on Large Clusters (reading material for the course "Mass Data Processing: Cloud Computing")


To appear in OSDI 2004

mode input treats each line as a key/value pair: the key is the offset in the file and the value is the contents of the line. Another common supported format stores a sequence of key/value pairs sorted by key. Each input type implementation knows how to split itself into meaningful ranges for processing as separate map tasks (e.g. text mode's range splitting ensures that range splits occur only at line boundaries).
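The text-mode behavior described above (key = offset in the file, value = line contents, splits only at line boundaries) can be illustrated with a minimal Python sketch. It works on an in-memory byte string rather than a real file, and the names `split_at_line_boundaries` and `read_text_records` are hypothetical, chosen for illustration; they are not part of the MapReduce library.

```python
from typing import Iterator, List, Tuple


def split_at_line_boundaries(data: bytes, target_size: int) -> List[Tuple[int, int]]:
    """Cut data into [start, end) ranges of roughly target_size bytes.

    Each range is extended so that it ends just after a newline
    (or at the end of the data), so no line straddles two ranges.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    ranges = []
    start = 0
    while start < len(data):
        end = min(start + target_size, len(data))
        # Extend to the end of the line containing the last byte of the range.
        newline = data.find(b"\n", end - 1)
        end = len(data) if newline == -1 else newline + 1
        ranges.append((start, end))
        start = end
    return ranges


def read_text_records(data: bytes, start: int, end: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, line) pairs for one range: key = byte offset, value = line."""
    offset = start
    while offset < end:
        newline = data.find(b"\n", offset, end)
        line_end = end if newline == -1 else newline
        yield offset, data[offset:line_end]
        offset = line_end + 1
```

Each range returned by `split_at_line_boundaries` could be handed to a separate map task, which would then iterate over `read_text_records(data, start, end)` to obtain its key/value pairs.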
Users can add support for a new input type by providing an implementation of a simple reader interface, though most users just use one of a small number of predefined input types.

A reader does not necessarily need to provide data read from a file. For example, it is easy to define a reader that reads records from a database, or from data structures mapped in memory.

In a similar fashion, we support a set of output types for producing data in different formats and it is easy for user code to add support for new output types.

4.5 Side-effects

In some cases, users of MapReduce have found it convenient to produce auxiliary files as additional outputs from their map and/or reduce operators. We rely on the application writer to make such side-effects atomic and idempotent. Typically the application writes to a temporary file and atomically renames this file once it has been fully generated.

We do not provide support for atomic two-phase commits of multiple output files produced by a single task. Therefore, tasks that produce multiple output files with cross-file consistency requirements should be deterministic. This restriction has never been an issue in practice.

4.6 Skipping Bad Records

Sometimes there are bugs in user code that cause the Map or Reduce functions to crash deterministically on certain records. Such bugs prevent a MapReduce operation from completing. The usual course of action is to fix the bug, but sometimes this is not feasible; perhaps the bug is in a third-party library for which source code is unavailable. Also, sometimes it is acceptable to ignore a few records, for example when doing statistical analysis on a large data set. We provide an optional mode of execution where the MapReduce library detects which records cause deterministic crashes and skips these records in order to make forward progress.

Each worker process installs a signal handler that catches segmentation violations and bus errors.
Before invoking a user Map or Reduce operation, the MapReduce library stores the sequence number of the argument in a global variable. If the user code generates a signal, the signal handler sends a “last gasp” UDP packet that contains the sequence number to the MapReduce master. When the master has seen more than one failure on a particular record, it indicates that the record should be skipped when it issues the next re-execution of the corresponding Map or Reduce task.

4.7 Local Execution

Debugging problems in Map or Reduce functions can be tricky, since the actual computation happens in a distributed system, often on several thousand machines, with work assignment decisions made dynamically by the master. To help facilitate debugging, profiling, and small-scale testing, we have developed an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine. Controls are provided to the user so that the computation can be limited to particular map tasks. Users invoke their program with a special flag and can then easily use any debugging or testing tools they find useful (e.g. gdb).

4.8 Status Information

The master runs an internal HTTP server and exports a set of status pages for human consumption. The status pages show the progress of the computation, such as how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, etc. The pages also contain links to the standard error and standard output files generated by each task. The user can use this data to predict how long the computation will take, and whether or not more resources should be added to the computation. These pages can also be used to figure out when the computation is much slower than expected.

In addition, the top-level status page shows which workers have failed, and which map and reduce tasks they were processing when they failed.
This information is useful when attempting to diagnose bugs in the user code.

4.9 Counters

The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count the total number of words processed or the number of German documents indexed, etc.

To use this facility, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function. For example:
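The example the text introduces is not part of this excerpt. As a stand-in, the following minimal Python sketch illustrates the usage pattern just described: user code creates a named counter object and increments it inside the Map function. It is not the library's actual API; `NamedCounter`, `get_counter`, and `map_fn` are hypothetical names, and the sketch keeps counts in a single process rather than aggregating them across workers.

```python
from typing import Dict, Iterator, Tuple


class NamedCounter:
    """A counter identified by name (hypothetical stand-in, single process only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0

    def increment(self, amount: int = 1) -> None:
        self.value += amount


# Registry so that the same name always maps to the same counter object.
_counters: Dict[str, NamedCounter] = {}


def get_counter(name: str) -> NamedCounter:
    """Return the counter registered under name, creating it on first use."""
    if name not in _counters:
        _counters[name] = NamedCounter(name)
    return _counters[name]


# User code creates the named counter once...
words_processed = get_counter("words_processed")


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Word-count style map function that also counts the words it processes."""
    for word in value.split():
        # ...and increments it inside the Map function.
        words_processed.increment()
        yield word, 1
```

After the map function has been run over some input, `words_processed.value` holds the total number of words seen by this process.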
