Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation
Intermediate Presentation (05/04/15)
Taura Lab., Master 2nd year, 46432 Yuuki Horita

Background
- Large-scale computation runs in parallel on a great number of nodes in distributed environments (Grid) over a long period of time
  - High failure rate
    - Node / process failures
    - Network failures
  - Fault tolerance is becoming more important

Fault tolerant computing
[Figure: cycle of fault-tolerant computing, with the labels Computing, Failures, Failure Detection, Recovery, Resuming, …, The end]

Failure Detection
- Heartbeat strategy
  ① A process Y sends a message, called a heartbeat, to another process X at a regular time interval T_hb
  ② After Y dies, X receives no more heartbeats from Y
  ③ X suspects Y once a period of T_hb + T_to has passed since the last heartbeat was received
[Figure: timeline of heartbeat messages sent from Y to X every T_hb; after the timeout T_to, X concludes "Y is probably dead"]
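
The three heartbeat steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: `HeartbeatMonitor`, `on_heartbeat`, and `suspects` are hypothetical and do not come from the presented system, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Runs on X (hypothetical class): tracks the last heartbeat from each peer."""
    t_hb: float  # heartbeat interval T_hb, in seconds
    t_to: float  # additional timeout T_to, in seconds
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, sender: str, now: float) -> None:
        # Step 1: record the arrival time of a heartbeat from `sender`.
        self.last_seen[sender] = now

    def suspects(self, now: float) -> List[str]:
        # Step 3: suspect every peer silent for longer than T_hb + T_to.
        limit = self.t_hb + self.t_to
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor(t_hb=1.0, t_to=0.5)
monitor.on_heartbeat("Y", now=0.0)
monitor.on_heartbeat("Y", now=1.0)   # last heartbeat before Y dies (step 2)
print(monitor.suspects(now=2.0))     # [] : only 1.0 s of silence so far
print(monitor.suspects(now=2.6))     # ['Y'] : 1.6 s > T_hb + T_to = 1.5 s
```

A larger T_to lowers the chance of false positives at the cost of a longer detection latency.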

Objective
- To design and implement a failure detection service for supporting fault-tolerant parallel computation

Contributions
- Propose a new failure detection approach for fault-tolerant parallel computation
  - High autonomy
    - addresses join/leave of processes
    - supports Grid environments with less manual configuration
  - High consistency
    - all processes obtain consistent failure information
  - High efficiency
    - more efficient than other autonomous approaches (with 313 processes and a heartbeat interval of 0.1 s, the overhead was at most about 2%)

Agenda
- Background
- Demands / Related Works
- Our Approach
- Experiments
- Summary

Demands for Failure Detection
- System demand (Autonomy)
  - Adaptability / fault tolerance: handle join/leave of processes
  - Accessibility: require little manual configuration
- Information demand (Consistency)
  - Consistency: must provide consistent information
- Performance demand (Efficiency)
  - Low overhead: do not degrade application performance
  - Low detection latency: report failure events as soon as possible
  - Accuracy: few false positives

Hierarchical style
- MDS (Globus Project)
- NWS [R. Wolski '97, N. T. Spring '99]
- A single point of failure may lead to system failure
- Manual configuration may be cumbersome: Autonomy problem

Gossip style [R. van Renesse '98]
- Uses the mechanism of rumor spreading
  - Each process periodically sends a gossip message (similar to a heartbeat) to a randomly selected process
  - A gossip message includes a {node, heartbeat} entry for every process
    - node: a process identifier
    - heartbeat: the latest time at which any process received a heartbeat from node

Gossip style (continued)
- Heartbeats are propagated to all processes automatically within a certain amount of time
- Each process judges process failure independently: Consistency problem
- It takes longer to detect failures: Efficiency problem
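
The gossip table described above can be sketched as follows. This is a minimal illustration, not the cited implementation: the names `merge`, `gossip_round`, and `suspects`, and the failure threshold `t_fail`, are assumptions introduced here. Each process keeps a table of {node: heartbeat} and, on receiving a gossip message, keeps the newer heartbeat for every node.

```python
import random
from typing import Dict, List

Table = Dict[str, float]  # node -> latest heartbeat time known for that node


def merge(local: Table, received: Table) -> None:
    """Keep, per node, the most recent heartbeat time from either table."""
    for node, hb in received.items():
        if hb > local.get(node, float("-inf")):
            local[node] = hb


def gossip_round(tables: Dict[str, Table], now: float, rng: random.Random) -> None:
    """Every process refreshes its own entry and gossips to one random peer."""
    for sender in sorted(tables):
        tables[sender][sender] = now
        peers = [p for p in sorted(tables) if p != sender]
        merge(tables[rng.choice(peers)], tables[sender])


def suspects(table: Table, now: float, t_fail: float) -> List[str]:
    """Each process judges failures independently, from its own table only."""
    return sorted(n for n, hb in table.items() if now - hb > t_fail)


a = {"A": 5.0, "B": 1.0}
b = {"B": 3.0, "C": 2.0}
merge(a, b)
print(a)  # {'A': 5.0, 'B': 3.0, 'C': 2.0}
```

Because `suspects` only looks at the local table, two processes can disagree about who has failed at a given moment, which is the consistency problem noted on the slide.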

Basic Design
- Separation of failure detection and information propagation
  - Each process is monitored by some processes (failure-detection phase)
  - If a process detects process failures, it broadcasts the information (information-propagation phase)
  - The overhead under normal conditions will be low (Efficiency)
  - The failure information will be shared (Consistency)

Failure Detection
- Each process acts autonomously so that it is always monitored by some processes
- Each process
  - requests k randomly selected neighbor processes to monitor it (neighbor: a process it can connect to directly)
  - sends heartbeats to them at a regular time interval T_hb
  - issues new requests in the same way if a monitoring process has failed (self-repairing)
[Figure: example monitoring network with k = 2; an arrow A → B means that A sends heartbeats to B (B monitors A)]
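
The selection and self-repair steps above could look roughly like this. It is a sketch under assumptions: `choose_monitors` and `repair` are hypothetical helper names, and the real service presumably exchanges request messages rather than editing a set locally.

```python
import random
from typing import Iterable, Set


def choose_monitors(neighbors: Iterable[str], current: Set[str], k: int,
                    rng: random.Random) -> Set[str]:
    """Top up `current` to k monitors with randomly chosen unused neighbors."""
    candidates = sorted(set(neighbors) - current)
    missing = max(0, k - len(current))
    picked = rng.sample(candidates, min(missing, len(candidates)))
    return current | set(picked)


def repair(monitors: Set[str], failed: Set[str], neighbors: Iterable[str],
           k: int, rng: random.Random) -> Set[str]:
    """Self-repair: drop failed monitors, then request replacements."""
    alive_neighbors = set(neighbors) - failed
    return choose_monitors(alive_neighbors, monitors - failed, k, rng)


rng = random.Random(0)
neighbors = ["B", "C", "D", "E", "F"]
monitors = choose_monitors(neighbors, set(), k=2, rng=rng)
failed = {sorted(monitors)[0]}                    # one monitor dies
monitors = repair(monitors, failed, neighbors, k=2, rng=rng)
print(len(monitors), failed.isdisjoint(monitors))  # 2 True
```

If fewer than k live neighbors remain, the sketch simply returns as many monitors as are available.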

Information Propagation
- Flood along the monitoring network
  - no need for extra connections
  - redundant paths for broadcast (fault-tolerant)
  - at most 2k messages per process (scalable)
- Can we guarantee that the monitoring network is connected?
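
Flooding along the monitoring network can be sketched as a breadth-first traversal. The sketch assumes that a monitoring connection can carry failure notices in both directions (the slides do not state this explicitly), and the names `undirected` and `flood` are hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple


def undirected(monitors: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Turn 'p is monitored by m' relations into bidirectional links."""
    adj: Dict[str, Set[str]] = {p: set() for p in monitors}
    for p, ms in monitors.items():
        for m in ms:
            adj[p].add(m)
            adj.setdefault(m, set()).add(p)
    return adj


def flood(adj: Dict[str, Set[str]], origin: str) -> Tuple[Set[str], int]:
    """Each process forwards a new notice once on every link it has."""
    informed = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        p = queue.popleft()
        for q in sorted(adj[p]):
            messages += 1
            if q not in informed:  # duplicates are received but not re-forwarded
                informed.add(q)
                queue.append(q)
    return informed, messages


# Ring with k = 1: A is monitored by B, B by C, C by D, D by A.
adj = undirected({"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": {"A"}})
informed, messages = flood(adj, "A")
print(sorted(informed), messages)  # ['A', 'B', 'C', 'D'] 8
```

In this ring every process has two links, so each sends two messages per notice, and the notice reaches every process along two independent paths.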

Connectivity of Monitoring Network
- We calculated the probability that the monitoring network becomes disconnected
  - This probability is negligible if k ≥ 3

Support Grid Environments
- The connectivity between different networks is often limited (e.g., by NAT or a firewall)
[Figure: Cluster A and Cluster B joined only through a gateway; the monitoring network between the two clusters is disconnected]

Support Grid Environments
- Rule for issuing monitoring requests: for each process, every one of its neighbor processes should be either
  - monitoring it directly, or
  - adjacent to k of its monitoring processes
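
The rule above can be checked mechanically. The sketch below is one possible reading of it, not the presented algorithm: "adjacent" is taken to mean "directly connectable", and `symmetric`, `unsatisfied`, and the greedy loop in `extend_monitors` are assumptions made for illustration.

```python
import random
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]


def symmetric(links: Dict[str, Set[str]]) -> Graph:
    """Build a symmetric 'directly connectable' relation from one-sided input."""
    g: Graph = {}
    for a, bs in links.items():
        g.setdefault(a, set())
        for b in bs:
            g[a].add(b)
            g.setdefault(b, set()).add(a)
    return g


def unsatisfied(p: str, neighbors: Graph, monitors: Set[str], k: int) -> List[str]:
    """Neighbors of p that neither monitor p nor touch k of p's monitors."""
    return [n for n in sorted(neighbors[p])
            if n not in monitors and len(neighbors[n] & monitors) < k]


def extend_monitors(p: str, neighbors: Graph, monitors: Set[str], k: int,
                    rng: random.Random) -> Set[str]:
    """Greedy assumption: ask unsatisfied neighbors to monitor p until none remain."""
    monitors = set(monitors)
    bad = unsatisfied(p, neighbors, monitors, k)
    while bad:
        monitors.add(rng.choice(bad))
        bad = unsatisfied(p, neighbors, monitors, k)
    return monitors


# Cluster A = {a1, a2, a3}, cluster B = {b1, b2}; only gateway g reaches both.
neighbors = symmetric({"a1": {"a2", "a3", "g"}, "a2": {"a3", "g"},
                       "a3": {"g"}, "g": {"b1", "b2"}, "b1": {"b2"}})
print(unsatisfied("g", neighbors, {"a1", "a2"}, k=2))  # ['b1', 'b2']
print(sorted(extend_monitors("g", neighbors, {"a1", "a2"}, 2, random.Random(0))))
# ['a1', 'a2', 'b1', 'b2']
```

In the example the gateway ends up monitored from both clusters, so the monitoring network spans the cluster boundary.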

Support Grid Environments
[Figure: two-step worked example of the rule with k = 2 on a graph of nine processes (numbered 1 to 9). For a given process, each neighbor process is marked as either "monitoring it directly" or "adjacent to k of its monitoring processes", with a per-neighbor count of adjacent monitoring processes. The individual values are not recoverable from the transcript.]

Experiment Environment
- ISTBS Cluster (112 nodes × 2 CPUs), located at Hongo
  - Xeon 2.4 GHz × 70 + Xeon 2.8 GHz × 42
  - 105 nodes (7 nodes down)
- SHEEP Cluster (65 nodes × 2 CPUs), located at Kashiwa
  - Xeon 2.4 GHz × 65
  - 65 nodes
[Figure: the ISTBS cluster in Hongo and the SHEEP cluster in Kashiwa, connected through the Internet]

Demonstration (Java Applet)
- Many processes die concurrently, three times over (they turn black and disappear)
- The surviving processes detect all of the failures (they change color)
- Processes repair the broken monitoring relations (new edges are added)
[Legend: a node is a process; an edge is a monitoring relation]

Connectivity under failures
- Simulated the connectivity of the monitoring network under failures
  - checked whether the monitoring network remains connected when F failures happen concurrently
  - 1.8 × 10^9 trials in each case
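
A small Monte Carlo version of this simulation might look like the following. It is a sketch under stated assumptions: every process can connect to every other one, each picks k monitors uniformly at random, links are treated as undirected, and no repair happens between the failures and the check. The function names are hypothetical, and the trial count in the demo is far below the 1.8 × 10^9 trials reported on the slide.

```python
import random
from typing import Iterable, Set, FrozenSet

Edge = FrozenSet[int]


def random_monitoring_network(n: int, k: int, rng: random.Random) -> Set[Edge]:
    """Each of n processes asks k random other processes to monitor it."""
    edges: Set[Edge] = set()
    for p in range(n):
        others = [q for q in range(n) if q != p]
        for m in rng.sample(others, k):
            edges.add(frozenset((p, m)))
    return edges


def is_connected(nodes: Iterable[int], edges: Set[Edge]) -> bool:
    """Depth-first search restricted to the surviving nodes."""
    alive = set(nodes)
    if not alive:
        return True
    adj = {v: set() for v in alive}
    for e in edges:
        a, b = tuple(e)
        if a in alive and b in alive:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == alive


def disconnection_rate(n: int, k: int, f: int, trials: int,
                       rng: random.Random) -> float:
    """Fraction of trials in which f concurrent failures split the survivors."""
    bad = 0
    for _ in range(trials):
        edges = random_monitoring_network(n, k, rng)
        survivors = set(range(n)) - set(rng.sample(range(n), f))
        bad += not is_connected(survivors, edges)
    return bad / trials


print(disconnection_rate(n=20, k=3, f=4, trials=2000, rng=random.Random(0)))
```

Estimating probabilities as small as p = 0.0001 reliably needs many more trials than this demo uses, which is consistent with the large trial count on the slide.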

Connectivity under failures
- Calculated the maximum number of failures for which the probability of disconnection is less than p

# of procs.       10   20   40   80  160
k=3, p=0.01        3    4    5    8   13
k=3, p=0.0001      2    2    3    3    4
k=4, p=0.01        4    6    9   14   24
k=4, p=0.0001      3    4    4    6   10

Efficiency
- Measured the execution time of a Fibonacci program under each of the following autonomous failure detection services
  - all-to-all
  - Gossip
  - ours
- Parameters
  - number of processes: 2 to 313
  - k = 3
  - T_hb = 0.1, 1.0 [s]

Results (Efficiency)
[Figure: overhead plots, with the annotations "10% overhead (N = 127)", "over 5% overhead (N = 153)", and "The overhead is at most around 2%"]

Summary
- Proposed a new failure detection technique for fault-tolerant parallel computation
- Showed that
  - our system can be constructed autonomously in Grid environments
  - our system has high fault tolerance
  - it is more efficient than other autonomous approaches

Future Work
- Handling network partitioning
- Sharing the load when processes join dynamically
- Showing its practicality by implementing a fault-tolerant parallel application that uses it

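Publications
- Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Communication Library for Supporting Fault-Tolerant Parallel Computation in Distributed Environments. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2004). May 2004. (Poster paper, in Japanese)
- Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Failure Detection Mechanism in the Phoenix Programming Model. Summer Workshop on Parallel / Distributed / Cooperative Processing (SWoPP2004). July 2004. (In Japanese)
- Yuuki Horita, Kenjiro Taura, Takashi Chikayama. An Autonomous Failure Detection Mechanism for Supporting Fault-Tolerant Parallel Computation. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2005). May 2005. (To be presented, in Japanese)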
