1 The Basics of Apache Hadoop
CS 6021 Adv. Computer Architecture Dr. Clincy Jinsik Kim

2 Apache Hadoop Created by Doug Cutting at Yahoo!
Based on Google's MapReduce and Google File System (GFS) papers. A framework for distributed processing of Big Data on clusters of commodity machines. Core components: MapReduce and HDFS. 2009 – sorted 500 GB in 59 seconds.

3 Distributed and Parallel Computing
Distributed computing: a network of computers that communicate with each other to achieve a common goal. Cluster architecture: commodity computers connected by high-speed network interconnects and managed by a framework such as Hadoop.

4 Distributed and Parallel Computing
Parallel computing: processing a job by splitting it into subtasks that execute simultaneously. The subtasks must have no data interdependencies. Speedup comes from running the subtasks in parallel across cores or machines.
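A minimal Python sketch of the idea (the summation task and function names are illustrative, not part of Hadoop): the job is split into chunks with no data interdependencies, each chunk is handled by an independent subtask, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # Each subtask touches only its own chunk: no data interdependencies.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the job into non-overlapping chunks, one per subtask.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Threads keep the sketch simple; a real cluster would run the
    # subtasks on separate machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(subtask, chunks))

print(parallel_sum(list(range(1, 101))))  # 5050
```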

5 Cluster Architecture

6 MapReduce Jobtracker: assigns map and reduce tasks to tasktrackers, which run the parallel subtasks
Tasktracker: runs its assigned tasks and reports progress back to the jobtracker

7 Preprocessing Data

8 Map, Max temperature of year
Map input key, value: (1, …) (2, …) … (n, …) Map output key, value: (1950, 32.3) (1950, 38.1) (1950, 23.1) (1950, 21.8)
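A sketch of the map step in Python (the `year,temperature` record format is a hypothetical stand-in for the weather data; the real records in White's example are fixed-width lines):

```python
def map_max_temp(record):
    # Map: parse one weather record and emit a (year, temperature) pair.
    # The 'year,temperature' line format is assumed for illustration.
    year, temp = record.split(",")
    return (int(year), float(temp))

records = ["1950,32.3", "1950,38.1", "1950,23.1", "1950,21.8"]
print([map_max_temp(r) for r in records])
# [(1950, 32.3), (1950, 38.1), (1950, 23.1), (1950, 21.8)]
```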

9 Reduce, Max temperature of year
Intermediate map outputs are sorted and grouped by key before being passed to the reducer
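The grouping and the reduce step can be sketched as follows (a simulation of what the framework does between phases, not Hadoop API code):

```python
from collections import defaultdict

def shuffle(map_outputs):
    # Group intermediate (key, value) pairs by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)
    return dict(groups)

def reduce_max(year, temps):
    # Reduce: the maximum temperature observed for one year.
    return (year, max(temps))

map_outputs = [(1950, 32.3), (1950, 38.1), (1951, 23.1), (1951, 21.8)]
grouped = shuffle(map_outputs)
print(sorted(reduce_max(k, v) for k, v in grouped.items()))
# [(1950, 38.1), (1951, 23.1)]
```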

10 MapReduce Once the maximum temperature for each year is found, the reduce output is stored in HDFS. Three replicas are made: the 1st is stored locally on the reduce node; the 2nd and 3rd are stored on separate nodes on a different rack.

11 Multiple Reducers Optimize by running multiple reducers in parallel
Each map task must partition its output by key into one partition per reducer, so that all values for a given key reach the same reducer

12 Combiner Function Map outputs may be partially reduced on the map node before being sent to the reducer. This decreases the amount of data sent across the network, which is critical at Big Data scale.
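For the max-temperature job the combiner can simply be the reduce logic applied locally, which is valid because `max` is commutative and associative. A sketch (simulation, not Hadoop code):

```python
def combine_max(local_map_outputs):
    # Combiner: take the local max per year on the map node before any
    # data crosses the network; the final reduce still sees the true max.
    best = {}
    for year, temp in local_map_outputs:
        best[year] = max(best.get(year, float("-inf")), temp)
    return sorted(best.items())

local = [(1950, 32.3), (1950, 38.1), (1950, 23.1)]
print(combine_max(local))  # [(1950, 38.1)] -- one pair sent instead of three
```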

13 Combiner Function

14 Hadoop Distributed Filesystem
Namenode Master node. Persistently stores two files on disk: the image (a checkpoint of the filesystem tree) and the edit log (a record of namespace changes made since the last checkpoint). The two are periodically merged to produce an updated image. Keeps references to all file blocks in memory and directs clients to the datanodes holding those blocks.
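A toy sketch of the checkpoint idea (not HDFS code; paths and structures are invented for illustration): changes accumulate in the edit log, and merging the log into the image yields a fresh checkpoint with an empty log.

```python
# The namenode persists a checkpoint image of the namespace plus an
# edit log of changes made since that checkpoint.
image = {"/data"}                    # last checkpointed paths
edit_log = ["/logs", "/logs/a.txt"]  # paths created since the checkpoint

def checkpoint(image, edit_log):
    # Replay the edit log onto the image to produce a new checkpoint,
    # then start over with an empty edit log.
    return image | set(edit_log), []

image, edit_log = checkpoint(image, edit_log)
print(sorted(image))  # ['/data', '/logs', '/logs/a.txt']
```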

15 Hadoop Distributed Filesystem
Datanode Worker nodes that store and retrieve file blocks. The scheduler prefers to run a task on a datanode that holds the needed file block locally, which is fast because it avoids a network transfer; this is the data locality optimization.

16 Hadoop Distributed Filesystem
A high ratio of seek time to data-read time lowers throughput. HDFS blocks are therefore very large (64 MB by default), so seeks are rare, data is processed at the disk transfer rate, and throughput stays very high.
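A back-of-envelope calculation makes the point (the 10 ms seek and 100 MB/s transfer figures are typical assumptions, not measurements from the slides):

```python
# Assumed disk characteristics for the estimate:
SEEK_MS = 10.0             # one disk seek, ~10 ms
TRANSFER_MB_PER_S = 100.0  # sustained disk transfer rate

def seek_overhead(block_mb):
    # Fraction of the total read time spent seeking for one block.
    transfer_ms = block_mb / TRANSFER_MB_PER_S * 1000.0
    return SEEK_MS / (SEEK_MS + transfer_ms)

print(f"64 MB block: {seek_overhead(64):.1%} seek overhead")     # ~1.5%
print(f"4 KB block:  {seek_overhead(0.004):.1%} seek overhead")  # ~99.6%
```

With 64 MB blocks the seek cost nearly disappears, so reads proceed at close to the raw transfer rate.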

17 Hadoop Distributed Filesystem

18 Hadoop Distributed Filesystem
Client File Write

19 Hadoop Distributed Filesystem
Client File Read

20 Replicas, backup, and error detection
With so many components, individual failures are likely, so reduce outputs and file blocks are stored as replicas. The namenode is a single point of failure and requires a backup namenode. Large amounts of data movement raise the chance of data corruption, so data is verified with CRC-32 (Cyclic Redundancy Check) checksums.
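CRC-32 checking can be demonstrated with Python's standard library (the block contents here are made up; this shows the checksum mechanism, not HDFS's storage format):

```python
import zlib

block = b"example HDFS block contents"
checksum = zlib.crc32(block)  # checksum stored alongside the block

# On a later read, recompute and compare: a single flipped byte is caught.
corrupted = b"exbmple HDFS block contents"
print(zlib.crc32(block) == checksum)      # True  -- block is intact
print(zlib.crc32(corrupted) == checksum)  # False -- corruption detected
```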

21 Sources: White, T. (2012). Hadoop: The Definitive Guide, 3rd ed. Raicu, I. (2011). Introduction to Distributed Systems [slides]. Illinois Institute of Technology.

