Hadoop Namenode High Availability August 2008 Requirements and Procedures.


1 Hadoop Namenode High Availability August 2008 Requirements and Procedures

2 Requirements
- Two nodes to satisfy availability requirements.
- High availability for internal components of each node.
  o Disk redundancy
  o Network redundancy
- Redundant network architecture.
- Heartbeat mechanism between the two nodes.
- Replication of namenode metadata.
- Automatic fail over with no human action required.

3 Internal Components
- Disks
  o 2x 300 GB 15k RPM SAS.
  o Hardware RAID 1 mirroring.
  o SMART monitoring.
- Network
  o Dual 1Gbps on-board NICs.
  o Linux bonding with LACP.

4 Redundant Network Architecture
- Linux bonding
  o See bonding.txt from Linux kernel docs.
  o LACP, aka 802.3ad, aka mode=4. (http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol)
  o Must be supported by your switches.
  o Throughput advantage: observed at 1.76Gb/s.
  o Allows for failure of either NIC instead of a single heartbeat connection via crossover.
- Switching infrastructure and physical segregation.
  o See diagram.

5 Network Diagram

6 Heartbeat Between Nodes
- Provided by "heartbeat" package. (http://www.linux-ha.org/)
- Manage multiple resources:
  o Virtual IP address
  o DRBD disk
  o Hadoop processes
- /etc/ha.d/haresources example:
    cw-grid101.contextweb.prod IPaddr::
    cw-grid101.contextweb.prod drbddisk::r0
    cw-grid101.contextweb.prod Filesystem::/dev/drbd0::/hadoop::ext3::defaults
    cw-grid101.contextweb.prod hadoop
- Heartbeat uses bond0 network interface. (* Not approved)
- 3 second timeout for "deadtime".
- Created LSB compliant hadoop init script.
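The haresources entries above can be read as an ordered list of resource scripts and their "::"-separated arguments. The sketch below is illustrative only and is not part of heartbeat: the function name parse_haresources and the sample address 192.0.2.10 are hypothetical (the slide does not show the virtual IP). It assumes resources are acquired in the order listed and released in reverse.

```python
from typing import List, NamedTuple, Tuple


class Resource(NamedTuple):
    script: str       # resource script name, e.g. "drbddisk"
    args: List[str]   # arguments that followed the script, split on "::"


# Sample mirrors the slide; the IP address is a placeholder, not the real one.
SAMPLE = """\
cw-grid101.contextweb.prod IPaddr::192.0.2.10
cw-grid101.contextweb.prod drbddisk::r0
cw-grid101.contextweb.prod Filesystem::/dev/drbd0::/hadoop::ext3::defaults
cw-grid101.contextweb.prod hadoop
"""


def parse_haresources(text: str) -> List[Tuple[str, Resource]]:
    """Return (preferred node, resource) pairs in file order."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        node, *resources = line.split()
        for item in resources:
            script, *args = item.split("::")
            entries.append((node, Resource(script, args)))
    return entries


entries = parse_haresources(SAMPLE)
start_order = [res.script for _, res in entries]
stop_order = list(reversed(start_order))       # assumed: release in reverse
print("start:", start_order)
print("stop: ", stop_order)
```

Read this way, the virtual IP comes up first and the hadoop init script last, which matches the fail over sequence described later in the deck.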

7 Replication of Namenode Metadata
- DRBD replication. (http://www.drbd.org/)

8 /etc/drbd.conf example:
    global { usage-count no; }
    resource r0 {
      protocol C;
      syncer { rate 110M; }  # approximately 50% of total available
      startup { wfc-timeout 0; degr-wfc-timeout 120; }
      on cw-grid101.contextweb.prod {
        device /dev/drbd0;
        disk /dev/sda4;
        address :7788;
        meta-disk internal;
      }
      on cw-grid102.contextweb.prod {
        device /dev/drbd0;
        disk /dev/sda4;
        address :7788;
        meta-disk internal;
      }
    }
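One plausible reading of the "approximately 50% of total available" comment is that the syncer rate was set to half of the 1.76Gb/s bonded throughput reported on the bonding slide. The slides do not state this explicitly, so the arithmetic below is a sketch under that assumption. It also treats DRBD's "M" as decimal megabytes per second for simplicity, and the function names are hypothetical.

```python
OBSERVED_GBPS = 1.76   # bonded throughput observed on the bonding slide
SYNC_FRACTION = 0.5    # "approximately 50% of total available" (assumed meaning)


def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to (decimal) megabytes per second."""
    return gbps * 1000 / 8


def syncer_rate(gbps: float, fraction: float) -> float:
    """Share of the link, in MB/s, reserved for DRBD resynchronisation."""
    return gbps_to_mbytes_per_s(gbps) * fraction


total = gbps_to_mbytes_per_s(OBSERVED_GBPS)
rate = syncer_rate(OBSERVED_GBPS, SYNC_FRACTION)
print(f"link: {total:.0f} MB/s, syncer rate: {rate:.0f}M")
```

Under these assumptions the result lines up with the "rate 110M" setting, leaving the other half of the link for normal replication and client traffic during a resync.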

9 Fail Over Order of Events
1. Virtual IP fails over.
2. DRBD system switches primary node. (/proc/drbd status)
3. File system fsck and mount at /hadoop.
4. Hadoop started via LSB compliant init script.
- End to end fail over time approximately 15 seconds.
- Optionally, original master is rebooted to help avoid split-brain.

10 DRBD Status
- Updating
    # cat /proc/drbd
    version: (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by :04:55
     0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
        ns: nr:0 dw: dr: al:11746 bm:12767 lo:14 pe:12 ua:246 ap:1 oos:
        [==> ] sync'ed: 18.0% (82459/100465)M
        finish: 0:14:31 speed: 96,904 (77,472) K/sec
- Synchronized
    # cat /proc/drbd
    version: (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by :04:55
     0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
        ns: nr:0 dw: dr: al:11781 bm:17923 lo:0 pe:0 ua:0 ap:0 oos:0
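The two status dumps differ mainly in the cs (connection state) and ds (disk state) fields, so a monitoring check could key on those. The sketch below is a minimal illustration, not a DRBD tool: the helper names are hypothetical, the sample strings are trimmed copies of the output above, and the ETA calculation assumes the "M" and "K/sec" figures are in units of 1024.

```python
import re
from typing import Dict, Optional

SYNCING = (
    "0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---\n"
    "[==> ] sync'ed: 18.0% (82459/100465)M\n"
    "finish: 0:14:31 speed: 96,904 (77,472) K/sec\n"
)
SYNCED = "0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---\n"


def parse_state(text: str) -> Dict[str, str]:
    """Pull the cs, st and ds fields out of a /proc/drbd resource line."""
    return dict(re.findall(r"\b(cs|st|ds):(\S+)", text))


def is_fully_synced(state: Dict[str, str]) -> bool:
    """True when the peers are connected and both disks are up to date."""
    return state.get("cs") == "Connected" and state.get("ds") == "UpToDate/UpToDate"


def sync_eta_seconds(text: str) -> Optional[int]:
    """Estimate remaining resync time from the progress lines, if present."""
    remaining = re.search(r"\((\d+)/\d+\)M", text)
    speed = re.search(r"speed:\s*([\d,]+)", text)
    if not (remaining and speed):
        return None
    remaining_k = int(remaining.group(1)) * 1024       # M -> K (assumed 1024)
    k_per_sec = int(speed.group(1).replace(",", ""))   # current speed, K/sec
    return remaining_k // k_per_sec


def format_hms(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(is_fully_synced(parse_state(SYNCING)))   # False while resyncing
print(is_fully_synced(parse_state(SYNCED)))    # True once both sides are UpToDate
print(format_hms(sync_eta_seconds(SYNCING)))   # 0:14:31
```

The estimate computed from the remaining megabytes and the current speed agrees with the "finish: 0:14:31" value DRBD reports in the dump.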

