Lecture 7 Hadoop/Spark
What is Hadoop
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model: any data will fit
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
• Large number of cheap, low-end machines working in parallel to solve a computing problem
• This is in contrast to parallel DBs
• Small number of high-end, expensive machines
Divide and Conquer "Work" Partition W1 W2 W3 ”h0tkr worker workeE "Result Combine
Divide and Conquer “worker ” “worker ” “worker ” Partition Combine
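The pattern in the figure can be made concrete with a short Java sketch (illustrative only, not lecture code): the "work" is a large array to be summed, each worker sums one partition, and the partial results are combined into the final result.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndConquerSum {
  public static void main(String[] args) throws Exception {
    // The "work": a large array whose elements we want to sum.
    long[] work = new long[1_000_000];
    for (int i = 0; i < work.length; i++) work[i] = i;

    int numWorkers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
    List<Future<Long>> partials = new ArrayList<>();

    // Partition: each worker gets one contiguous slice (w1, w2, w3, ...).
    int chunk = work.length / numWorkers;
    for (int w = 0; w < numWorkers; w++) {
      int lo = w * chunk;
      int hi = (w == numWorkers - 1) ? work.length : lo + chunk;
      partials.add(pool.submit(() -> {
        long sum = 0;
        for (int i = lo; i < hi; i++) sum += work[i];
        return sum; // one partial result (r1, r2, r3, ...)
      }));
    }

    // Combine: merge the partial results into the final "result".
    long result = 0;
    for (Future<Long> partial : partials) result += partial.get();
    pool.shutdown();
    System.out.println("result = " + result);
  }
}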
It’s a bit more complex…
[Figure: two programming models — message passing (processes P1–P5 exchanging messages) and shared memory (processes P1–P5 reading and writing a common memory).]
• Different programming models: message passing vs. shared memory
• Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …
• Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
• Common problems: livelock, deadlock, data starvation, priority inversion…; dining philosophers, sleeping barbers, cigarette smokers, …
• Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence
• The reality: the programmer shoulders the burden of managing concurrency…
Source: Ricardo Guimarães Herrmann
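One of the "common problems" above, deadlock, is easy to reproduce. A hypothetical Java sketch (not from the lecture): two threads acquire the same two locks in opposite order, so each ends up waiting forever for the lock the other holds.

public class DeadlockSketch {
  private static final Object lockA = new Object();
  private static final Object lockB = new Object();

  public static void main(String[] args) {
    Thread t1 = new Thread(() -> {
      synchronized (lockA) {
        pause(100); // widen the race window
        synchronized (lockB) { System.out.println("t1 got both locks"); }
      }
    });
    Thread t2 = new Thread(() -> {
      synchronized (lockB) {         // opposite acquisition order to t1
        pause(100);
        synchronized (lockA) { System.out.println("t2 got both locks"); }
      }
    });
    t1.start();
    t2.start(); // with high probability, neither message ever prints
  }

  private static void pause(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}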
Design Principles of Hadoop
• Automatic parallelization & distribution
• Hidden from the end-user
• Fault tolerance and automatic recovery
• Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
• Users only provide two functions, “map” and “reduce” (see the sketch below)
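As an illustration of that abstraction, here is the classic word count, closely following the standard WordCount example from the Hadoop MapReduce tutorial (input and output paths come from the command line): map emits a (word, 1) pair per token, and reduce sums the counts for each word. The framework handles partitioning, shuffling, scheduling, and recovery.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}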
Distributed File System
• Don’t move data to workers… move workers to the data!
• Store data on the local disks of nodes in the cluster
• Start up the workers on the nodes that hold the data locally
• Why?
• Not enough RAM to hold all the data in memory
• Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
• GFS (Google File System)
• HDFS for Hadoop (= GFS clone)
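For a feel of the client side, here is a minimal, hypothetical sketch that reads a file through the HDFS Java API (the NameNode address hdfs://namenode:9000 and the path /data/input.txt are made-up examples). The client sees one ordinary stream while HDFS serves the underlying blocks from the nodes that store them.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    // Open a file stored in HDFS and read it line by line.
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}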
Hadoop: How it Works
Hadoop
[Figure: the Hadoop stack — Pig, Hive, Chukwa, and HBase layered over MapReduce and HDFS, with Core, Avro, and ZooKeeper as supporting components.]
Hadoop’s stated mission (Doug Cutting interview): Commoditize infrastructure for web-scale, data-intensive applications