1 The Basics of Apache Hadoop
CS 6021 Adv. Computer Architecture Dr. Clincy Jinsik Kim

2 Apache Hadoop Created by Doug Cutting at Yahoo!
Based on Google's MapReduce and Google File System (GFS) papers. A framework for distributed processing of Big Data on clusters of commodity machines. Core components: MapReduce and HDFS. 2009 – sorted 500 GB in 59 seconds.

3 Distributed and Parallel Computing
Distributed computing: a network of computers that communicate with each other to achieve a common goal. Cluster architecture: commodity computers connected by high-speed network interconnects and managed by a framework such as Hadoop.

4 Distributed and Parallel Computing
Parallel computing: processing a job by splitting it into subtasks that execute simultaneously. The subtasks must have no data interdependencies. Speedup comes from running the subtasks in parallel across cores or machines.
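A minimal Python sketch of the idea (the summation task and function names are illustrative, not part of Hadoop): the job is split into chunks with no data interdependencies, each chunk is handled by an independent subtask, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # Each subtask touches only its own chunk: no data interdependencies.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the job into non-overlapping chunks, one per subtask.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Threads keep the sketch simple; a real cluster would run the
    # subtasks on separate machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(subtask, chunks))

print(parallel_sum(list(range(1, 101))))  # 5050
```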

5 Cluster Architecture

6 MapReduce Jobtracker: assigns map and reduce tasks to tasktrackers, which run the parallel subtasks
Tasktracker: runs its assigned tasks and reports progress back to the jobtracker

7 Preprocessing Data

8 Map, Max temperature of year
Map input key, value: (1, …) (2, …) … (n, …) Map output key, value: (1950, 32.3) (1950, 38.1) (1950, 23.1) (1950, 21.8)
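A sketch of the map step in Python (the `year,temperature` record format is a hypothetical stand-in for the weather data; the real records in White's example are fixed-width lines):

```python
def map_max_temp(record):
    # Map: parse one weather record and emit a (year, temperature) pair.
    # The 'year,temperature' line format is assumed for illustration.
    year, temp = record.split(",")
    return (int(year), float(temp))

records = ["1950,32.3", "1950,38.1", "1950,23.1", "1950,21.8"]
print([map_max_temp(r) for r in records])
# [(1950, 32.3), (1950, 38.1), (1950, 23.1), (1950, 21.8)]
```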

9 Reduce, Max temperature of year
Intermediate map outputs are sorted and grouped by key before being passed to the reducer
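The grouping and the reduce step can be sketched as follows (a simulation of what the framework does between phases, not Hadoop API code):

```python
from collections import defaultdict

def shuffle(map_outputs):
    # Group intermediate (key, value) pairs by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)
    return dict(groups)

def reduce_max(year, temps):
    # Reduce: the maximum temperature observed for one year.
    return (year, max(temps))

map_outputs = [(1950, 32.3), (1950, 38.1), (1951, 23.1), (1951, 21.8)]
grouped = shuffle(map_outputs)
print(sorted(reduce_max(k, v) for k, v in grouped.items()))
# [(1950, 38.1), (1951, 23.1)]
```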

10 MapReduce Once the maximum temperature for each year is found, the reduce output is stored in HDFS. Three replicas are made: the 1st is stored locally on the reduce node; the 2nd and 3rd are stored on separate nodes on a different rack.

11 Multiple Reducers Optimize by running multiple reducers in parallel
Each map task must partition its output by key into one partition per reducer, so that all values for a given key reach the same reducer

12 Combiner Function Map outputs may be partially reduced on the map node before being sent to the reducer. This decreases the amount of data sent across the network, which is critical at Big Data scale.
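For the max-temperature job the combiner can simply be the reduce logic applied locally, which is valid because `max` is commutative and associative. A sketch (simulation, not Hadoop code):

```python
def combine_max(local_map_outputs):
    # Combiner: take the local max per year on the map node before any
    # data crosses the network; the final reduce still sees the true max.
    best = {}
    for year, temp in local_map_outputs:
        best[year] = max(best.get(year, float("-inf")), temp)
    return sorted(best.items())

local = [(1950, 32.3), (1950, 38.1), (1950, 23.1)]
print(combine_max(local))  # [(1950, 38.1)] -- one pair sent instead of three
```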

13 Combiner Function

14 Hadoop Distributed Filesystem
Namenode Master node. Persistently stores two files on disk: the image (a checkpoint of the filesystem tree) and the edit log (a record of namespace changes made since the last checkpoint). The two are periodically merged to produce an updated image. Keeps references to all file blocks in memory and directs clients to the datanodes holding those blocks.
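A toy sketch of the checkpoint idea (not HDFS code; paths and structures are invented for illustration): changes accumulate in the edit log, and merging the log into the image yields a fresh checkpoint with an empty log.

```python
# The namenode persists a checkpoint image of the namespace plus an
# edit log of changes made since that checkpoint.
image = {"/data"}                    # last checkpointed paths
edit_log = ["/logs", "/logs/a.txt"]  # paths created since the checkpoint

def checkpoint(image, edit_log):
    # Replay the edit log onto the image to produce a new checkpoint,
    # then start over with an empty edit log.
    return image | set(edit_log), []

image, edit_log = checkpoint(image, edit_log)
print(sorted(image))  # ['/data', '/logs', '/logs/a.txt']
```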

15 Hadoop Distributed Filesystem
Datanode Worker nodes that store and retrieve file blocks. The scheduler prefers to run a task on a datanode that holds the needed file block locally, which is fast because it avoids a network transfer; this is the data locality optimization.

16 Hadoop Distributed Filesystem
A high ratio of seek time to data-read time lowers throughput. HDFS blocks are therefore very large (64 MB by default), so seeks are rare, data is processed at the disk transfer rate, and throughput stays very high.
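A back-of-envelope calculation makes the point (the 10 ms seek and 100 MB/s transfer figures are typical assumptions, not measurements from the slides):

```python
# Assumed disk characteristics for the estimate:
SEEK_MS = 10.0             # one disk seek, ~10 ms
TRANSFER_MB_PER_S = 100.0  # sustained disk transfer rate

def seek_overhead(block_mb):
    # Fraction of the total read time spent seeking for one block.
    transfer_ms = block_mb / TRANSFER_MB_PER_S * 1000.0
    return SEEK_MS / (SEEK_MS + transfer_ms)

print(f"64 MB block: {seek_overhead(64):.1%} seek overhead")     # ~1.5%
print(f"4 KB block:  {seek_overhead(0.004):.1%} seek overhead")  # ~99.6%
```

With 64 MB blocks the seek cost nearly disappears, so reads proceed at close to the raw transfer rate.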

17 Hadoop Distributed Filesystem

18 Hadoop Distributed Filesystem
Client File Write

19 Hadoop Distributed Filesystem
Client File Read

20 Replicas, backup, and error detection
With so many components, individual failures are likely, so reduce outputs and file blocks are stored as replicas. The namenode is a single point of failure and requires a backup namenode. Large amounts of data movement raise the chance of data corruption, so data is verified with CRC-32 (Cyclic Redundancy Check) checksums.
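CRC-32 checking can be demonstrated with Python's standard library (the block contents here are made up; this shows the checksum mechanism, not HDFS's storage format):

```python
import zlib

block = b"example HDFS block contents"
checksum = zlib.crc32(block)  # checksum stored alongside the block

# On a later read, recompute and compare: a single flipped byte is caught.
corrupted = b"exbmple HDFS block contents"
print(zlib.crc32(block) == checksum)      # True  -- block is intact
print(zlib.crc32(corrupted) == checksum)  # False -- corruption detected
```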

21 Sources: White, T. (2012). Hadoop: The Definitive Guide, 3rd ed. Raicu, I. (2011). Introduction to Distributed Systems [slides]. Illinois Institute of Technology.

