Lei Xu

Brief Introduction
• Hadoop
  • An Apache project for data-intensive applications
  • Typical application: MapReduce (OSDI’04), a distributed algorithm for massive-data computation
    • Crawl and index web pages (Yahoo!)
    • Analyze popular topics and trends (Twitter)
  • Led by Yahoo!/Facebook/Cloudera

Brief Introduction (cont’d)
• Hadoop Distributed File System (HDFS)
  • A scalable distributed file system that serves Hadoop MapReduce applications
  • Borrows the essential ideas from the Google File System
    • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03)
  • Shares the same design assumptions

Google File System
• A scalable distributed file system designed for:
  • Data-intensive applications (mainly MapReduce)
    • Web page indexing
  • It has since spread to other applications
    • E.g., Gmail, BigTable, App Engine
  • Fault tolerance
  • Low-cost hardware
  • High throughput

Google File System (cont’d)
• Departs from the assumptions of other file systems
  • Runs on top of commodity hardware
    • Component failures are common
  • Files are huge
    • Basic block size of 64~128 MB
    • Compared with 1~64 KB in traditional file systems (Ext3, NTFS, etc.)
  • Massive-data/data-intensive processing
    • Large streaming reads and small random reads
    • Large, sequential writes
    • No (or hardly any) random writes

Hadoop DFS Assumptions
• Beyond the assumptions of the Google File System, HDFS assumes:
  • Simple coherency model
    • Write-once-read-many
    • Once a file has been created, written, and closed, it cannot be changed
  • Moving computation is cheaper than moving data
    • “Semi-location-aware” computation
    • Tries its best to assign computations close to the related data
  • Portability across heterogeneous hardware and software platforms
    • Written in Java, with multi-platform support
      • The Google File System was written in C++ and runs on Linux
    • Stores data on top of existing file systems (NTFS/Ext4/Btrfs…)

HDFS Architecture
• Master/Slave architecture
• NameNode
  • Metadata server
    • File locations (file name -> DataNodes)
    • File attributes (atime/ctime/mtime, size, number of replicas, etc.)
• DataNode
  • Manages the storage attached to the node that it runs on
• Client
  • Producers and consumers of data

HDFS Architecture (cont’d)

NameNode
• Metadata server
  • Only one NameNode per cluster
    • Single point of failure
    • Potential performance bottleneck
• Manages the file system namespace
  • Traditional hierarchical namespace
  • Keeps all file metadata in memory for fast access
    • The memory size of the NameNode determines how many files can be supported
• Executes file system namespace operations:
  • Open/close/rename/create/unlink…
• Returns the locations of data blocks

NameNode (cont’d)
• Maintains system-wide activities
  • E.g., creating new replicas of file data, garbage collection, load balancing, etc.
• Periodically communicates with DataNodes to collect their status
  • Is the DataNode alive?
  • Is the DataNode overloaded?

DataNode
• Storage server
  • Stores fixed-size data blocks on local file systems (ext4/zfs/btrfs)
  • Serves read/write operations from clients
  • Creates, deletes, and replicates data blocks upon instruction from the NameNode
  • Block size = 64 MB

Client
• Application-level implementation
  • Does not provide a POSIX API
  • Hadoop has a FUSE interface
    • FUSE: Filesystem in Userspace
    • Has limited functionality (e.g., no random-write support)
• Queries the NameNode for file locations and metadata
• Contacts the corresponding DataNodes for file I/O

Data Replication
• Files are stored as a sequence of blocks
  • The blocks (typically 64 MB) are replicated for fault tolerance
  • The replication factor is configurable per file
    • It can be specified at creation time and changed later
• The NameNode decides how to replicate blocks. It periodically receives:
  • A Heartbeat, which implies that the DataNode is alive
  • A Blockreport, which contains a list of all blocks on a DataNode
• When a DataNode is down, the NameNode re-replicates all blocks on that DataNode to other active DataNodes to restore the required number of replicas
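The re-replication decision described above can be sketched in a few lines. This is a toy model, not NameNode code: the function name `plan_re_replication`, the dictionary-based block reports, and the alphabetical choice of target nodes are all assumptions made for illustration.

```python
from collections import defaultdict


def plan_re_replication(block_reports, live_nodes, replication_factor=3):
    """Return {block_id: [target DataNodes]} for under-replicated blocks.

    block_reports: {datanode: set of block ids} from the last Blockreports.
    live_nodes:    set of DataNodes whose heartbeats are still arriving.
    (Hypothetical data model, chosen only to illustrate the idea.)
    """
    holders = defaultdict(set)  # block id -> live DataNodes that hold it
    all_blocks = set()
    for node, blocks in block_reports.items():
        all_blocks |= set(blocks)
        if node in live_nodes:
            for block in blocks:
                holders[block].add(node)

    plan = {}
    for block in sorted(all_blocks):
        missing = replication_factor - len(holders[block])
        if missing <= 0 or not holders[block]:
            # Either enough replicas exist, or no live copy is left to copy from.
            continue
        candidates = sorted(live_nodes - holders[block])
        plan[block] = candidates[:missing]
    return plan
```

For example, if `dn3` stops sending heartbeats, every block it held is checked against the replication factor and new targets are chosen among the remaining live DataNodes.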

Data Replication (cont’d)

Data Replication (cont’d)
• Rack Awareness
  • A Hadoop instance runs on a cluster of computers spread across many racks:
    • Nodes in the same rack are connected by one switch
    • Communication between two nodes in different racks goes through switches
      • Slower than communication between nodes in the same rack
    • A whole rack may fail due to network/power issues
  • Improves data reliability, availability, and network bandwidth utilization

Data Replication (cont’d)
• Rack Awareness (cont’d)
  • In the common case, the replication factor is three
    • Two replicas are placed on two different nodes in the same rack
    • The third replica is placed on a node in a remote rack
  • Improves write performance
    • 2/3 of the writes stay within the same rack, which is faster
  • Without compromising data reliability
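A minimal sketch of the three-replica placement rule on this slide. The function `place_replicas`, the `{rack: [nodes]}` topology dictionary, and the random choice among eligible nodes are assumptions for illustration; the actual HDFS placement policy considers more factors.

```python
import random


def place_replicas(topology, local_rack, rng=random):
    """Pick three (rack, node) targets: two in local_rack, one in a remote rack.

    topology: {rack name: [node names]} (hypothetical cluster description).
    rng:      the random module or a random.Random instance.
    """
    local_nodes = topology[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes in the local rack")
    remote_racks = [r for r in topology if r != local_rack and topology[r]]
    if not remote_racks:
        raise ValueError("need at least one non-empty remote rack")

    # Two different nodes in the same rack: cheap intra-rack writes.
    first, second = rng.sample(local_nodes, 2)
    # One node in another rack: survives the loss of the whole local rack.
    remote_rack = rng.choice(remote_racks)
    third = rng.choice(topology[remote_rack])
    return [(local_rack, first), (local_rack, second), (remote_rack, third)]
```

Two of the three writes stay behind one switch, while the remote copy keeps the block available if the local rack loses power or network connectivity.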

Replica Selection
• For READ operations:
  • Minimize bandwidth consumption and latency
  • Prefer the nearest node:
    • If there is a replica on the same node, it is preferred
    • The cluster may span multiple data centers; replicas in the same data center are preferred
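One way to express "prefer the nearest node" is a small distance function over node locations. The `(data_center, rack, node)` tuples and the numeric distance levels are assumptions for this sketch; the slide itself only names the same-node and same-data-center cases, and the rack level is added here by analogy with rack awareness.

```python
def distance(reader, replica):
    """Coarse network distance between two (data_center, rack, node) locations."""
    if reader == replica:
        return 0  # replica on the reader's own node
    if reader[:2] == replica[:2]:
        return 1  # same rack (assumed level, not stated on the slide)
    if reader[0] == replica[0]:
        return 2  # same data center, different rack
    return 3      # different data center


def choose_replica(reader, replicas):
    """Return the replica location closest to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))
```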

Filesystem Metadata
• HDFS stores all file metadata on the NameNode
• An EditLog
  • Records every change that occurs to filesystem metadata
    • For failure recovery
  • Similar to the journal in journaling file systems (Ext3/NTFS)
• An FSImage
  • Stores the mapping of blocks to files and the file attributes
• The EditLog and FSImage are stored locally on the NameNode

Filesystem Metadata (cont’d)
• A DataNode has no knowledge about HDFS files
  • It only stores data blocks as regular files on local file systems
    • With a checksum for data integrity
  • It periodically sends the NameNode a Blockreport that lists all blocks stored on this DataNode
    • Only the DataNode has knowledge about the availability of a block replica

Filesystem Metadata (cont’d)
• When the NameNode starts up
  • It loads the FSImage and EditLog from the local file system
  • It updates the FSImage with the latest EditLog entries
  • It creates a new FSImage as the latest checkpoint and stores it permanently on the local file system
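The startup sequence above amounts to replaying a journal on top of a snapshot. The sketch below models the FSImage as a `{path: [block ids]}` dictionary and the EditLog as a list of tuples; these formats, the three operations, and the function names are invented for illustration and do not reflect the on-disk formats HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged namespace change to the in-memory namespace."""
    op = edit[0]
    if op == "create":
        _, path, blocks = edit
        namespace[path] = list(blocks)
    elif op == "delete":
        namespace.pop(edit[1], None)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    else:
        raise ValueError(f"unknown edit operation: {op!r}")


def start_namenode(fsimage, edit_log):
    """Replay edit_log on a copy of fsimage; return (new checkpoint, empty log)."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    # The merged namespace becomes the new checkpoint, so the replayed
    # edits are no longer needed and the log starts out empty.
    return namespace, []
```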

Communication Protocol
• A Hadoop-specific RPC on top of TCP/IP
• The NameNode is simply a server that only responds to requests issued by DataNodes or clients
  • ClientProtocol.java – client protocol
  • DatanodeProtocol.java – DataNode protocol

Robustness
• Primary objective of HDFS:
  • Stay reliable despite component failures
  • In a typical large cluster (>1K nodes), component failures are common
• Three common types of failures:
  • NameNode failures
  • DataNode failures
  • Network failures

Robustness (cont’d)
• Heartbeats
  • Each DataNode periodically sends heartbeats to the NameNode
    • System status and block reports
  • The NameNode marks DataNodes without recent heartbeats as dead
    • It does not forward I/O to them
    • It marks all data blocks on these DataNodes as unavailable
    • It re-replicates these blocks if necessary (according to the replication factor)
  • Can detect both network failures and DataNode failures
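Heartbeat-based failure detection can be sketched as a table of last-seen timestamps. The class `HeartbeatMonitor`, its method names, and the timeout value used in the example are hypothetical; the slide does not state the interval or timeout HDFS uses.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per DataNode and report nodes that went silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # assumed threshold, in seconds
        self.last_seen = {}         # DataNode -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat received from node at time now (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A node can appear in `dead_nodes` either because it crashed or because the network between it and the NameNode failed; from the NameNode's side the two cases look the same.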

Robustness (cont’d)
• Re-Balancing
  • Automatically moves data from one DataNode to another
    • If the free space falls below a threshold
• Data Integrity
  • A block of data may be corrupted
    • Disk faults, network faults, buggy software
  • The client computes checksums for each block and stores them in a separate hidden file in the HDFS namespace
  • The data is verified before it is read
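The data-integrity check can be illustrated with a checksum computed at write time and verified at read time. CRC32 from Python's `zlib` module is used here as an assumption; the slide does not say which checksum algorithm HDFS applies, and the function and exception names are invented for this sketch.

```python
import zlib


class ChecksumError(Exception):
    """Raised when a block no longer matches its stored checksum."""


def write_block(data: bytes):
    """Return the block together with the checksum to store alongside it."""
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_checksum: int) -> bytes:
    """Verify the block against its stored checksum before returning it."""
    if zlib.crc32(data) != stored_checksum:
        raise ChecksumError("block is corrupted; try another replica")
    return data
```

On a mismatch, a reader would fall back to another replica of the same block rather than return corrupted data.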

Robustness (cont’d)
• Metadata failures
  • The FSImage and EditLog are the central data structures
    • Once they are corrupted, HDFS cannot build the namespace or access data
  • The NameNode can be configured to maintain multiple copies of the FSImage and EditLog
    • E.g., one FSImage/EditLog on the local machine and another on a mounted remote NFS server
    • This reduces update performance
  • Once the NameNode is down, the cluster must be restarted manually

Data Organization
• Data Blocks
  • HDFS is designed to support very large files and streaming I/O
  • A file is chopped up into 64 MB blocks
    • Reduces the number of connection establishments and speeds up TCP transmission
  • If possible, each block of a file resides on a different DataNode
    • For future parallel I/O and computation (MapReduce)
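A short sketch of how a file maps onto 64 MB blocks and how those blocks could be spread over DataNodes. The round-robin assignment in `spread_blocks` is an assumption that stands in for "a different DataNode if possible"; both function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def spread_blocks(blocks, datanodes):
    """Assign block i to DataNode i modulo the number of DataNodes."""
    return {
        index: datanodes[index % len(datanodes)]
        for index in range(len(blocks))
    }
```

A 200 MB file, for instance, becomes three full 64 MB blocks plus one 8 MB block; with three DataNodes, the fourth block wraps around to the first node.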

Data Organization (cont’d)
• Staging
  • When writing a new file
    • The client first caches the file data in a temporary local file until the cached data reaches the HDFS block size
    • Then the client contacts the NameNode, which assigns a DataNode
    • The client flushes the cached data to the chosen DataNode
  • Fully utilizes the bandwidth

Data Organization (cont’d)
• Replication Pipeline
  • A client obtains a list of DataNodes to flush one block to
  • The client first flushes the data to the first DataNode
  • The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and immediately transfers it to the next DataNode in the list
  • The second DataNode acts in the same way as the first one
  • The total transfer time for one block (64 MB) is:
    • T(64 MB) + T(4 KB) * 2, with the pipeline
    • 3 * T(64 MB), without the pipeline
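The two transfer-time formulas can be checked with a small calculation. The model assumes that T(x) is simply size divided by link bandwidth, that every hop has the same bandwidth, and that latency and disk time are ignored; the 64 MB/s figure in the example is an arbitrary value chosen to make the numbers easy to read.

```python
def transfer_time(size_bytes, bandwidth_bytes_per_s):
    """T(size): time to push size_bytes over one link, ignoring latency."""
    return size_bytes / bandwidth_bytes_per_s


def pipelined_time(block_bytes, portion_bytes, replicas, bandwidth):
    """T(block) plus one portion delay per extra DataNode in the pipeline."""
    return (transfer_time(block_bytes, bandwidth)
            + (replicas - 1) * transfer_time(portion_bytes, bandwidth))


def sequential_time(block_bytes, replicas, bandwidth):
    """replicas * T(block): each copy is sent only after the previous one ends."""
    return replicas * transfer_time(block_bytes, bandwidth)


if __name__ == "__main__":
    MB = 1024 * 1024
    bandwidth = 64 * MB  # assumed link speed in bytes per second
    print(pipelined_time(64 * MB, 4 * 1024, 3, bandwidth))
    print(sequential_time(64 * MB, 3, bandwidth))
```

Under these assumptions the pipelined write of three replicas takes barely longer than writing a single copy, while the non-pipelined write takes three times as long.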

Replication Pipeline
• The client asks the NameNode where to put the data
• The client pushes data to the DataNodes linearly to fully utilize network bandwidth
• The secondary replicas reply to the primary; the primary then replies to the client to report success
* This figure was in “The Google File System” paper

See also
• HBase – a BigTable implementation on Hadoop
  • Key-value storage
• Pig – a high-level language for running data analysis on Hadoop
• ZooKeeper
  • “ZooKeeper: Wait-free Coordination for Internet-scale Systems”, ATC’10, Best Paper
• CloudStore (KFS, previously Kosmosfs)
  • A C++ implementation of the Google File System
  • Parallels the Hadoop project

Google vs. Yahoo!/Facebook/Amazon...

Google               Hadoop
Google File System   Hadoop DFS
MapReduce            Hadoop MapReduce
BigTable             HBase

Known Issues and Research Interests
• The NameNode is a single point of failure
  • It also limits the total number of files supported in HDFS
    • RAM limitation
• Google has changed its one-master architecture to a multi-master cluster
  • However, the details have not been revealed

Known Issues and Research Interests (cont’d)
• Uses replication to provide data reliability
  • Same problems as RAID-1?
• Apply RAID technologies to HDFS?
  • “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW’09

Known Issues and Research Interests (cont’d)
• Energy Efficiency
  • DataNodes must stay alive for data availability
  • However, there may be no MapReduce computations running on them
    • A waste of energy

Conclusion
• The Hadoop Distributed File System is designed to serve MapReduce computations
  • Provides highly reliable storage
  • Supports massive amounts of data
  • Optimizes data placement policies based on the topology of data centers
• Large companies build their core businesses on top of these infrastructures
  • Google: GFS/MapReduce/BigTable
  • Yahoo!/Facebook/Amazon/Twitter/NY Times: Hadoop/HBase/Pig

Reference
• HDFS Architecture Guide: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html

Thank you! Questions?
