1
The Basics of Apache Hadoop
CS 6021 Adv. Computer Architecture
Dr. Clincy
Jinsik Kim
2
Apache Hadoop
Created by Doug Cutting and developed largely at Yahoo!
Modeled on Google's MapReduce and the Google File System (GFS)
A framework for distributed processing of Big Data
Core components: MapReduce and HDFS
Runs on a cluster architecture
2009: 500 GB sorted in 59 seconds
3
Distributed and Parallel Computing
Distributed computing: a network of computers that communicate with each other to achieve a common goal.
Cluster architecture: commodity computers with high-speed network interconnects, managed by a framework like Hadoop.
4
Distributed and Parallel Computing
Parallel computing: processing a job by splitting it into subtasks that are executed in parallel
Subtasks must not have data interdependencies
Speedup comes from executing the subtasks at the same time
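A minimal sketch of the idea, assuming a job that sums a list of numbers: the input is split into chunks with no data interdependencies, each chunk is summed as its own subtask, and the partial results are combined. The function names `split` and `parallel_sum` are illustrative only. Python threads are used here to show the structure; they do not by themselves guarantee a speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split data into n_parts nearly equal, independent chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, n_workers=4):
    """Sum each chunk as a separate subtask, then combine the partial sums."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


numbers = list(range(1, 101))
total = parallel_sum(numbers)
print(total)
```

The subtasks can run in any order because no chunk depends on the result of another.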
5
Cluster Architecture
6
MapReduce
Jobtracker: assigns the parallel subtasks of a job to tasktrackers
Tasktracker: runs its assigned tasks and reports progress to the jobtracker
7
Preprocessing Data
8
Map, Max temperature of year
Map input (key, value): (1, …) (2, …) … (n …) (n …)
Map output (key, value): (1950, 32.3) (1950, 38.1) (1950, 23.1) (1950, 21.8)
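A minimal sketch of a map function for this example. The slide elides the input records, so the record format used here ("year,temperature") is a hypothetical stand-in, as is the function name `map_max_temp`. Only the output pairs are taken from the slide.

```python
def map_max_temp(key, value):
    """Map one input record to a (year, temperature) pair.

    key   -- position of the record in the input (unused here)
    value -- hypothetical record of the form "year,temperature"
    """
    year, temp = value.split(",")
    yield int(year), float(temp)


# Hypothetical input records, chosen to reproduce the map output on the slide.
records = [(1, "1950,32.3"), (2, "1950,38.1"), (3, "1950,23.1"), (4, "1950,21.8")]

map_output = [pair for key, value in records for pair in map_max_temp(key, value)]
print(map_output)
```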
9
Reduce, Max temperature of year
Intermediate map outputs are sorted and grouped by key before they reach the reducer
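The sort-and-group step followed by the reduce can be sketched as below. This is a single-process illustration, not how Hadoop implements the shuffle. The 1950 values come from the map output in this deck; the 1949 values are made up so that more than one key appears.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(map_output):
    """Sort pairs by key and collect the values of each key into one list."""
    ordered = sorted(map_output, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_max_temp(year, temps):
    """Reduce all temperatures of one year to their maximum."""
    return year, max(temps)


map_output = [(1950, 32.3), (1949, 11.1), (1950, 38.1),
              (1949, 7.8), (1950, 23.1), (1950, 21.8)]

grouped = shuffle(map_output)
reduce_output = [reduce_max_temp(year, temps) for year, temps in grouped]
print(reduce_output)
```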
10
MapReduce
Once the maximum temperature for each year is found, the Reduce output is stored
Three replicas are made:
  1st stored locally on the Reduce node
  2nd and 3rd stored on nodes on a different rack
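A toy sketch of this placement rule, assuming the cluster is described as a mapping from rack name to node names. The function name, the cluster layout, and the deterministic choice of the remote rack and nodes are all simplifications for illustration; they are not Hadoop's actual selection logic.

```python
def place_replicas(cluster, writer_node):
    """Return three nodes: the writer itself plus two nodes on one other rack."""
    local_rack = next(rack for rack, nodes in cluster.items()
                      if writer_node in nodes)
    # Candidate racks: any rack other than the writer's with room for two replicas.
    remote_racks = sorted(rack for rack in cluster
                          if rack != local_rack and len(cluster[rack]) >= 2)
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    second, third = sorted(cluster[remote_racks[0]])[:2]
    return [writer_node, second, third]


# Hypothetical two-rack cluster.
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(cluster, "n1")
print(replicas)
```

Keeping one copy local makes the write cheap, while the off-rack copies survive the loss of the whole local rack.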
11
Multiple Reducers
Reduce work can be spread across multiple reducers
Each map task must partition its output, one partition per reducer, so that all records with the same key go to the same reducer
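A minimal sketch of partitioning by key, assuming integer year keys and a simple hash-modulo rule. The function names and the 1949 and 1951 sample values are illustrative. Integer keys are used because Python's `hash` of a small integer is the integer itself, which keeps the example deterministic.

```python
def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return hash(key) % num_reducers


def partition_map_output(map_output, num_reducers):
    """Split one map task's output into one list of pairs per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions


map_output = [(1949, 11.1), (1950, 32.3), (1951, 25.0), (1950, 38.1)]
partitions = partition_map_output(map_output, 2)
print(partitions)
```

Because both 1950 records land in the same partition, a single reducer sees every temperature for that year.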
12
Combiner Function
Map outputs may be reduced locally before being sent to the Reducer
Decreases the amount of data sent across the network, which is critical in Big Data
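A sketch of a combiner for the maximum-temperature example. The split of the four slide values across two map tasks is an assumption made for illustration, and `combine_max` is a hypothetical name. Each map task sends only its local maximum per year, and the reducer then takes the maximum of those.

```python
def combine_max(pairs):
    """Keep only the maximum value per key from a list of (key, value) pairs."""
    best = {}
    for year, temp in pairs:
        if year not in best or temp > best[year]:
            best[year] = temp
    return sorted(best.items())


# Assumed split of the map output across two map tasks.
map_task_1 = [(1950, 32.3), (1950, 38.1)]
map_task_2 = [(1950, 23.1), (1950, 21.8)]

# Each task combines locally, so two pairs cross the network instead of four.
sent_to_reducer = combine_max(map_task_1) + combine_max(map_task_2)
final = combine_max(sent_to_reducer)
print(sent_to_reducer, final)
```

This works because taking a maximum of maxima gives the overall maximum. A function such as the mean does not have that property, so it cannot be used as a combiner in the same way.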
13
Combiner Function
14
Hadoop Distributed Filesystem
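Namenode
Master node
Persistently stores two files on disk:
  Image: the filesystem tree structure as of the last checkpoint
  Edit log: a log of changes made to the filesystem since that checkpoint
The edit log is periodically merged into the image to produce a new checkpoint
Keeps references to all file blocks in memory
Directs clients to the datanodes that hold the file blocks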
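A toy model of combining the two files into a new image. It assumes the image is a mapping from file path to block IDs and models the edit log as a list of namespace operations. The operation names (`create`, `add_block`, `delete`), the paths, and the block IDs are all hypothetical and do not reflect Hadoop's on-disk formats.

```python
def apply_edit(image, edit):
    """Apply one logged operation to the in-memory image."""
    op, path, *args = edit
    if op == "create":
        image[path] = []
    elif op == "add_block":
        image[path].append(args[0])
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Replay the edit log on a copy of the image; return new image and empty log."""
    new_image = {path: list(blocks) for path, blocks in image.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


image = {"/data/1949.txt": ["blk_1"]}
edit_log = [
    ("create", "/data/1950.txt"),
    ("add_block", "/data/1950.txt", "blk_2"),
    ("delete", "/data/1949.txt"),
]
new_image, new_log = checkpoint(image, edit_log)
print(new_image, new_log)
```

Replaying a short log onto the last image is cheaper than rewriting the whole image on every change, which is the usual motivation for keeping the two files separate.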
15
Hadoop Distributed Filesystem
Datanode
Worker nodes that store and retrieve file blocks
Data locality optimization: work is preferably placed on datanodes that already store the needed file blocks locally, which is fast
16
Hadoop Distributed Filesystem
A high ratio of seek time to data read time results in lower throughput
HDFS blocks are very large: 64 MB by default
Few seeks per amount of data read
Data is processed at close to the disk transfer rate
Very high data throughput
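The effect of block size can be checked with a little arithmetic. The sketch below assumes one seek per block, a 10 ms seek time, and a 100 MB/s transfer rate; these are illustrative round numbers, not measurements of any particular disk.

```python
def transfer_fraction(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the total read time spent transferring data for one block."""
    transfer_s = block_mb / transfer_mb_per_s
    seek_s = seek_ms / 1000.0
    return transfer_s / (seek_s + transfer_s)


def effective_throughput(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Average MB/s when every block read costs one seek plus the transfer."""
    fraction = transfer_fraction(block_mb, seek_ms, transfer_mb_per_s)
    return transfer_mb_per_s * fraction


# Compare a 4 KB block, a 1 MB block, and a 64 MB block.
for block_mb in (0.004, 1, 64):
    print(f"{block_mb} MB block: {effective_throughput(block_mb):.2f} MB/s")
```

Under these assumptions a 64 MB block spends well over 90% of its read time transferring data, while a 4 KB block spends almost all of it seeking.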
17
Hadoop Distributed Filesystem
18
Hadoop Distributed Filesystem
Client File Write
19
Hadoop Distributed Filesystem
Client File Read
20
Replicas, backup, and error detection
High chance of failure due to the large number of parts
Reduce outputs and file blocks are stored as replicas
The namenode is a single point of failure
  Requires a backup namenode
Large amounts of data movement mean a high chance of data corruption
  Detected with CRC-32 (Cyclic Redundancy Check) checksums
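A short sketch of CRC-32 error detection using Python's standard `zlib.crc32`. A checksum is stored for each fixed-size chunk of data and recomputed on read; a mismatch marks that chunk as corrupt. The 512-byte chunk size and the function names are assumptions made for this example.

```python
import zlib


def checksums(data, chunk_size=512):
    """Compute one CRC-32 checksum per chunk_size bytes of data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data, expected, chunk_size=512):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 8      # 2048 bytes, i.e. four 512-byte chunks
stored = checksums(block)          # checksums saved when the data was written

damaged = bytearray(block)
damaged[700] ^= 0x01               # flip a single bit inside chunk 1

corrupt = find_corrupt_chunks(bytes(damaged), stored)
print(corrupt)
```

Once a chunk is known to be corrupt, the data can be read from another replica instead.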
21
Sources:
White, T. (2012). Hadoop: The Definitive Guide, 3rd ed.
Raicu, I. (2011). Introduction to Distributed Systems [slides]. Illinois Institute of Technology.