Outline Last Course Review Reading Check Document Retrieval &Inverted Index ■PageRank Algorithm
Outline ◼ Last Course Review & Reading Check ◼ Document Retrieval &Inverted Index ◼ PageRank Algorithm
NC&IS Last Course Review
Last Course Review
ki vi k2 v2 ka va ka va ks Vs ke Ve map map map map a 1b 2 c3c6 a 5c 2 b 7 c 8 Shuffle and Sort:aggregate values by keys a 15 b27 c2368 reduce reduce reduce 12 S2 T3 S3
map map map map Shuffle and Sort: aggregate values by keys reduce reduce reduce k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 a 1 5 b 2 7 c 2 3 6 8 r1 s1 r2 s2 r3 s3
MapReduce "Runtime" a Handles scheduling Assigns workers to map and reduce tasks Handles "data distribution" Moves processes to data Handles synchronization Gathers,sorts,and shuffles intermediate data Handles errors and faults Detects worker failures and restarts -Everything happens on top of a distributed FS (later)
MapReduce “Runtime” ◼ Handles scheduling ◼ Assigns workers to map and reduce tasks ◼ Handles “data distribution” ◼ Moves processes to data ◼ Handles synchronization ◼ Gathers, sorts, and shuffles intermediate data ◼ Handles errors and faults ◼ Detects worker failures and restarts ◼ Everything happens on top of a distributed FS (later)
Ki vi k2 v2 ka ve ka va ks vs ke ve map map map map 1b2 a c 3 c 6 a 5c 2 b 7 c 8 combine combine combine combine 1b2 a c 9 a 5c 2 partition partition partition partition Shuffle and Sort:aggregate values by keys a15 b27 c2968 reduce reduce reduce S1 2S2 3s3
combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition map map map map k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys reduce reduce reduce a 1 5 b 2 7 c 2 9 8 r1 s1 r2 s2 r3 s3 c 2 3 6 8
Word Count:Version 1 1:class MAPPER 2: method MAP(docid a,doc d) 3: H-new ASSOCIATIVEARRAY 4: for all term t∈doc d do 5: H{t}←-H{t}+1 Tally counts for entire document 6: for all term t∈Hdo 7: EMIT(term t,count H{t) Are combiners still needed?
Word Count: Version 1 Are combiners still needed?
Computing the Mean:Version 2 1:class MAPPER 2 method MAP(string t,integer r) 3: EMIT(string t,integer r) 1:class COMBINER 2: method CoMBINE(string t,integers [r1,r2,...) 3: sum←-0 4: cmt←-0 5: for all integer r E integers [r1,r2,...]do 6: sum←-sum+T 7: cnt←-cmt+1 8: EMIT(string t,pair (sum,cnt)) Separate sum and count 1:class REDUCER 2: method REDUCE(string t,pairs [(s1,c1),(s2,c2)...]) 3: sum←-0 4: cmt←-0 5: for all pair (s,c)E pairs [(s1,c1),(s2,c2)...]do 6: sum←-sum+s 7: cnt←-cnt+c 8: Tavg←-sum/cnt 9: EMIT(string t,integer ravg) Why doesn't this work?
Computing the Mean: Version 2 Why doesn’t this work?
Another Try:“Stripes” Idea:group together pairs into an associative array (a,b)→1 (a,c→2 (a,d)→5 a→{b:1,c:2,d:5,e:3,f:2} (a,e)3 (a,f)→2 Each mapper takes a sentence: Generate all co-occurring term pairs For each term,emit a{b:countp,c:countc,d:countd... Reducers perform element-wise sum of associative arrays a→{b:1,d:5,e:3} Key:cleverly-constructed data structure a→{b:1,c:2,d:2,f:2} brings together partial results a→{b:2,c:2,d:7,e:3,f:2}
◼ Idea: group together pairs into an associative array ◼ Each mapper takes a sentence: ◼ Generate all co-occurring term pairs ◼ For each term, emit a → { b: countb , c: countc , d: countd … } ◼ Reducers perform element-wise sum of associative arrays Another Try: “Stripes” (a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2 a → { b: 1, c: 2, d: 5, e: 3, f: 2 } a → { b: 1, d: 5, e: 3 } a → { b: 1, c: 2, d: 2, f: 2 } a → { b: 2, c: 2, d: 7, e: 3, f: 2 } +
f(BlA):“Pairs” (a,*)→32 Reducer holds this value in memory (a,b)→3 (a,b)→3/32 (a,b2)→12 (a,b2)→12/32 (a,b3)→7 (a,b3)→7/32 (a,b4)→1 (a,b4)→1/32 For this to work: Must emit extra(a,*for every b in mapper Must make sure all a's get sent to same reducer (use partitioner) Must make sure (a,*comes first(define sort order) Must hold state in reducer across different key-value pairs
f(B|A): “Pairs” ◼ For this to work: ◼ Must emit extra (a, *) for every bn in mapper ◼ Must make sure all a’s get sent to same reducer (use partitioner) ◼ Must make sure (a, *) comes first (define sort order) ◼ Must hold state in reducer across different key-value pairs (a, b1 ) → 3 (a, b2 ) → 12 (a, b3 ) → 7 (a, b4 ) → 1 … (a, *) → 32 (a, b1 ) → 3 / 32 (a, b2 ) → 12 / 32 (a, b3 ) → 7 / 32 (a, b4 ) → 1 / 32 … Reducer holds this value in memory
NC&IS Homework Reading
Homework Reading