上声定通大字 SHANGHAI JLAO TONG UNIVERSITY CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof Li Jiang 201411/19
CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1
O What is MapReduce Origin from Google, [OSDI 04 A Simple programming mode Functional model For large-scale data processing Exploits large set of commodity computers Executes process in distributed manner Offers high availability
What is MapReduce 2 • Origin from Google, [OSDI’04] • A simple programming model • Functional model • For large-scale data processing – Exploits large set of commodity computers – Executes process in distributed manner – Offers high availability
③ Motivation Large-scale data processing Want to use 1000s of Cpus But don t want hassle of managing things Mapreduce provides Automatic parallelization e distribution fault tolerance Monitoring &t status updates
Motivation 3 • Large-Scale Data Processing – Want to use 1000s of CPUs – But don’t want hassle of managing things • MapReduce provides – Automatic parallelization & distribution – Fault tolerance – I/O scheduling – Monitoring & status updates
o)Benefit of MapReduce Map/reduce Programming model from Lisp (and other functional languages) Many problems can be phrased this way easy to distribute across nodes Nice retry/failure semantics
Benefit of MapReduce 4 • Map/Reduce – Programming model from Lisp – (and other functional languages) • Many problems can be phrased this way • Easy to distribute across nodes • Nice retry/failure semantics
G) Distributed Word Count Split data→→ count→ count Very Split datal→→ count count merged big split datal→→ count count +merde count data Split data→→ count→ count」
Distributed Word Count 5 Very big data Split data Split data Split data Split data count count count count count count count count merge merged count
O Distributed Grep Split datal→→grep matches Very sp| t data→grep matches big-sp| t data→grep→ matches→cat→A∥ matches data Split data-+ grep matches
Distributed Grep 6 Very big data Split data Split data Split data Split data grep grep grep grep matches matches matches matches cat All matches
②Map+ Reduce Very MAP Partitioning big REDUcE Result Function data Map Reduce Accepts input key / value Accepts intermediate pair ★ ey/value pair -Emits intermediate Emits output key/value Rey value pair
Map+Reduce 7 • Map – Accepts input key/value pair – Emits intermediate key/value pair • Reduce – Accepts intermediate key/value* pair – Emits output key/value pair Very big data Result M A P R E D U C E Partitioning Function
②Map+ Reduce map(key val) is run on each item in set emits new-Rey/ new-val pairs reduce(key, vals) is run for each unique key emitted by mapo emits final output
8 • map(key, val) is run on each item in set – emits new-key / new-val pairs • reduce(key, vals) is run for each unique key emitted by map() – emits final output Map+Reduce
G)Square Sum (map f list lista listg'Unary operator ( map square“(1234) -14916 o reduce (14916) 30
Square Sum 9 • (map f list [list2 list3 …]) • (map square ‘(1 2 3 4)) – (1 4 9 16) • (reduce + ‘(1 4 9 16)) – (+ 16 (+ 9 (+ 4 1) ) ) – 30
G)Word Count Input consists of(url, contents) pairs map key=url, val=contents) For each word w in contents,emit(W,“1”) reduce key-word, values=unig- counts Sum all“1” s in values list Emit result "(word, sum
Word Count 10 – Input consists of (url, contents) pairs – map(key=url, val=contents): • For each word w in contents, emit (w, “1”) – reduce(key=word, values=uniq_counts): • Sum all “1”s in values list • Emit result “(word, sum)