15-440, Hadoop Distributed File System Allison Naaktgeboren  Wut u mean? I iz loadin a HA-doop fileh  Ur doin' it rong kitteh.

1 15-440, Hadoop Distributed File System Allison Naaktgeboren  Wut u mean? I iz loadin a HA-doop fileh  Ur doin' it rong kitteh

2 Announcements Go Vote! Interpretive Dances happen only after Lecture Office Hour Change  Mon: 6:30-9:30  Tues: 6-7:30 Exams are graded

3 Hadoop Core at 30,000 ft

4 Back to the Map Reduce Model Recall that
–map (in_key, in_value) -> (inter_key, inter_value) list
–combine (inter_key, inter_value) -> (inter_key, inter_value)
–reduce (inter_key, inter_value list) -> (out_key, out_value)
What resource are we most constrained by?  “Oceans of Data, Skinny pipes” How many types of data will the file system care about? How long will we need each kind? What is the common case for each?
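The three signatures above can be made concrete with a word count. This is a minimal single-process sketch, not Hadoop's API: the names map_fn, combine_fn, reduce_fn, and run_job are hypothetical, and the combiner here takes one mapper's whole output list so it can pre-aggregate before anything would cross the network.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map (in_key, in_value) -> list of (inter_key, inter_value)."""
    return [(word, 1) for word in in_value.split()]

def combine_fn(pairs):
    """Pre-aggregate one mapper's output locally to save bandwidth."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_fn(inter_key, inter_values):
    """reduce (inter_key, inter_value list) -> (out_key, out_value)."""
    return (inter_key, sum(inter_values))

def run_job(documents):
    """Run map, combine, and reduce over a dict of name -> text."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for key, value in combine_fn(map_fn(name, text)):
            grouped[key].append(value)  # the "shuffle": group by key
    return dict(reduce_fn(key, values) for key, values in grouped.items())
```

The combiner is where the "skinny pipes" concern shows up: it shrinks each mapper's output before the grouping step, which in a real cluster is the part that moves data between machines.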


6 What would an MR Filesystem need? General Use case: large files  Mostly append to end, long sequential reads, few deletes  Appends might be concurrent Scalability  Adding (or losing) machines should be relatively painless Nodes work on nearby data  Minimize moving data between machines Bandwidth is our limiting resource Remember how much data Failure (handling) is Common  Yea, yea we know, we took 213, we know hardware sucks No, really failure (handling) is common (constant)  Disks, processors, whole nodes, racks, and datacenters

7 Addressing Those Concerns Sequential Reads, appends need to be fast  Deletes can be painful “Hot plug” machines  Add or lose machines while system is running jobs  System should auto-detect the change HDFS should distribute data somewhat evenly  So that all workers have a reasonable amount of data to chew on  And coordinate with the Jobtracker (job master) Data Replication  Should be spread out. Why?  What type of problems could arise?

8 Moving into the Details Nodes in HDFS  NameNode (master) (like GFS Master)  DataNodes (slaves) (like GFS chunkservers) NB – Hadoop and HDFS closely paired  “careful use of jargon defines the true expert”  “worker node A” and “data node 1” are frequently the same machine Two types of Masters  Jobtracker (Hadoop Job Master)  NameNode (file system Master): what I mean by 'master' for the rest of the lecture

9 Your Data goes in... Files are divided into Chunks  64 MB each The mapping between filename and chunks goes to the Master Each chunk is replicated and sent off to DataNodes  By default, 3 replicas  The Master determines which DataNodes
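As a rough sketch of the arithmetic on this slide: a file is cut into 64 MB chunks (the last one may be shorter), and each chunk is assigned to 3 DataNodes. The slide does not say how the Master picks DataNodes, so choose_datanodes below uses a hypothetical round-robin rule purely for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide
REPLICATION = 3                # default replica count, as on the slide

def split_into_chunks(file_size):
    """Return one (offset, length) pair per chunk of a file of file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(CHUNK_SIZE, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

def choose_datanodes(chunk_index, datanodes, replication=REPLICATION):
    """Hypothetical placement: round-robin over the known DataNodes.

    Assumption for illustration only; not the real Master's policy.
    """
    count = min(replication, len(datanodes))
    return [datanodes[(chunk_index + i) % len(datanodes)] for i in range(count)]
```

For example, a 200 MB file yields three full 64 MB chunks plus one 8 MB chunk, and with four DataNodes each chunk lands on three distinct machines.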

10 What the Clients Do Where the data starts On file creation, creates a separate file w/checksum When data is fetched back from a dataNode, the checksum is computed again Cache file data  Avoid bothering the Master too often When a Client has 1 chunk's worth of data  Contacts the Master,  Master sends names of dataNodes to send it to  ONLY sends it to the 1st dataNode
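The checksum round trip described above might look like the following sketch. It assumes CRC32 from Python's zlib module as the checksum function, which is a choice made here for illustration; the function names are hypothetical.

```python
import zlib

def write_with_checksum(data: bytes):
    """On file creation: return the data plus a checksum to store separately."""
    return data, zlib.crc32(data)

def verify_on_read(data: bytes, stored_checksum: int) -> bytes:
    """On fetch from a dataNode: recompute the checksum and compare."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: chunk is corrupt, try another replica")
    return data
```

Because the checksum lives in a separate file, a corrupted chunk on one dataNode can be detected by the client without trusting that dataNode.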

11 What the DataNodes Do Heartbeat to the Master Opens, closes, or replicates a chunk if requested by the Master During replication, sends data to next dataNode in chain
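The replication chain can be sketched as each dataNode storing a chunk and then handing it to the next node in the list. This is an in-memory toy under stated assumptions: the DataNode class and its receive method are hypothetical names, and a real system would stream over the network rather than call a method.

```python
class DataNode:
    """Toy dataNode that stores chunks in a dict (assumption for illustration)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def receive(self, chunk_id, data, downstream):
        """Store the chunk, then forward it to the next dataNode in the chain."""
        self.chunks[chunk_id] = data
        if downstream:
            downstream[0].receive(chunk_id, data, downstream[1:])

def client_write(chunk_id, data, pipeline):
    """The client sends the chunk only to the first dataNode in the pipeline."""
    pipeline[0].receive(chunk_id, data, pipeline[1:])
```

The design point is that the client's outbound link carries the chunk once, and the dataNodes share the cost of producing the remaining replicas.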

12 What the Namespace Node Does System metadata!  Holds Name->ID mapping  Chunk replica locations  Transaction Logs: EditLog, FSImage It is responsible for coherency  Uses the logs atomically  Addresses the concurrent writes issue It is checkpointed  Similar to AFS volume snapshots  Will pull last consistent log upon restart
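One way to picture the EditLog and FSImage working together is a snapshot plus a replayable list of operations. The sketch below is a simplified in-memory model, not HDFS's on-disk format; the class and method names are hypothetical.

```python
class NameNodeMetadata:
    """Toy model: FSImage is the last checkpoint, EditLog holds later operations."""

    def __init__(self):
        self.fsimage = {}    # checkpointed name -> chunk IDs
        self.edit_log = []   # operations recorded since the last checkpoint
        self.namespace = {}  # live in-memory view

    def create(self, name, chunk_ids):
        # Log first, then apply, so a restart can replay the operation.
        self.edit_log.append(("create", name, list(chunk_ids)))
        self.namespace[name] = list(chunk_ids)

    def delete(self, name):
        self.edit_log.append(("delete", name, None))
        self.namespace.pop(name, None)

    def checkpoint(self):
        """Fold the current state into the image and start a fresh log."""
        self.fsimage = {k: list(v) for k, v in self.namespace.items()}
        self.edit_log = []

    def restart(self):
        """Rebuild the live view from the image plus the logged operations."""
        rebuilt = {k: list(v) for k, v in self.fsimage.items()}
        for op, name, chunk_ids in self.edit_log:
            if op == "create":
                rebuilt[name] = list(chunk_ids)
            elif op == "delete":
                rebuilt.pop(name, None)
        self.namespace = rebuilt
```

Checkpointing keeps the log short, so a restart replays only the operations recorded since the last snapshot.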

13 What the Namespace Node Does Listens for Heartbeats Listens for Client Requests If no heartbeat  marks a node as dead  Its data is deregistered It selects dataNodes  Which nodes get which chunks  Signals creating, opening, closing Deletes  Orders move to /trash  Starts delete timer
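The heartbeat rule above (no heartbeat, mark the node dead, deregister its data) can be sketched as a periodic sweep. The timeout value and all names here are assumptions for illustration, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Toy master-side bookkeeping for dataNode liveness."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}        # dataNode name -> time of last heartbeat
        self.chunk_locations = {}  # chunk ID -> set of dataNode names

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def register_chunk(self, chunk_id, node):
        self.chunk_locations.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead and deregister their replicas."""
        dead = [n for n, seen in self.last_seen.items() if now - seen > self.timeout]
        for node in dead:
            del self.last_seen[node]
            for holders in self.chunk_locations.values():
                holders.discard(node)
        return dead
```

After a sweep, any chunk whose holder set has shrunk below the replica target is a candidate for re-replication on the surviving dataNodes.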

14 All together Now!

15 Additional Resources Hadoop wiki Youtube → “Hadoop” → Google developer videos (1-3 will be helpful) Google University  Includes UW course, the other UW course, a couple others  Use at your own risk “The Google File System” paper is rather readable as research papers go
