Fault-Tolerant Dynamic Task Graph Scheduling
Mehmet Can Kurt, The Ohio State University
Sriram Krishnamoorthy, Pacific Northwest National Laboratory
Kunal Agrawal, Washington University in St. Louis
Gagan Agrawal, The Ohio State University

Motivation
- Significant transformation in hardware: many-core processors and coprocessors (Intel Xeon Phi)
- Trend towards asynchronous execution: the task graph model (Cilk, TBB, CnC)
- Resilience is more important than ever: soft errors, decreasing MTBF
- Goal: fault tolerance for the task graph model

Background: Task Graph Execution
- Representation as a DAG: vertices (tasks), edges (dependences)
- Main scheduling rule: a task can execute only after all of its predecessors have completed
- Improved scalability: asynchronous execution, load balancing via work stealing
(Figure: example DAG with tasks A-E)
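As a concrete picture of the model, here is a minimal C++ sketch of the slide's example graph as a DAG of dependences. The edge set is an assumption (only the vertex names A-E survive in the transcript), and a fuller task descriptor is sketched a few slides later.

```cpp
#include <map>
#include <string>
#include <vector>

int main() {
    // Vertices = tasks, edges = dependences: deps[Y] lists the tasks Y waits on.
    // The edges are assumptions; only the vertex names A-E appear on the slide.
    std::map<std::string, std::vector<std::string>> deps = {
        {"A", {}},             // A has no predecessors
        {"B", {"A"}},          // assumed: B depends on A
        {"C", {"A"}},          // assumed: C depends on A
        {"D", {"B", "C"}},     // assumed: D depends on B and C
        {"E", {"C"}},          // assumed: E depends on C
    };
    // Scheduling rule: a task becomes ready once every task in deps[...] has
    // completed; ready tasks run asynchronously, balanced via work stealing.
    (void)deps;
    return 0;
}
```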

Failure Model
- Task graph scheduling in the presence of detectable soft errors
- Recover corrupted data blocks and task descriptors
- Assumptions:
  1. Existence of an error detector (ECC, symptom-based detectors, application-level assertions); recovery is triggered upon observation of an error
  2. The logic for task graph creation is resilient (provided through user-supplied functions)
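One way to picture assumption 1 is an application-level assertion used as the detector: a checksum kept next to a data block is re-validated before the block is consumed. This is only a sketch; the DataBlock layout, the hash choice and the looks_corrupted interface are illustrative and not taken from the paper.

```cpp
#include <cstdint>
#include <vector>

// Illustrative application-level detector: the producer task stores a
// checksum next to its output, and the consumer re-validates it.
struct DataBlock {
    std::vector<std::uint8_t> bytes;
    std::uint64_t             checksum = 0;   // maintained by the producer task
};

std::uint64_t fnv1a(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t h = 14695981039346656037ull;          // FNV-1a, purely illustrative
    for (std::uint8_t b : bytes) { h ^= b; h *= 1099511628211ull; }
    return h;
}

// "Recovery upon observation": the scheduler checks a block before use and,
// on a mismatch, initiates recovery of the producer task (not shown here).
bool looks_corrupted(const DataBlock& db) {
    return fnv1a(db.bytes) != db.checksum;
}
```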

Recovery Challenges
- D fails right after its computation: re-compute D (only once) and restart B and C
- Further complications if data blocks are reused: C overwrites E, so re-compute E (only once)
- Recovery should have minimal effect on normal scheduling
(Figure: example DAG A-E with task states Waiting, Completed, Executing, Failed)

Fault-Tolerant Scheduling
- Developed on NABBIT*, a task graph scheduler using work stealing: augmented with additional routines; optimality properties maintained
- Recovery from an arbitrary number of soft failures: no redundant execution or checkpoint/restart; selective task re-execution; negligible overheads for a small constant number of faults
* IPDPS’10

Scheduling Without Failures
- Traverse predecessors:
  - A.status is “Computed”: decrement C.join
  - B.status is “Visited”: enqueue C to B.notifyArray
- Successors are enqueued in notifyArray
- Compute a task when its join reaches 0
- Notify the successors in notifyArray
(Figure: C's task descriptor with fields join — number of outstanding predecessors, status — execution status at the moment, notifyArray — successors to notify, db — pointer to the output; its values evolve during the example, e.g. join 2 → 1 → 0, notifyArray {} → {D}, status Visited → Computed, db null → data)
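The failure-free protocol on this slide can be summarized in a short, sequential C++ sketch. The field names (join, status, notifyArray, db) follow the slide; the Task type, the std::function body and the absence of concurrency are simplifications — NABBIT runs this protocol under work stealing with atomic updates.

```cpp
#include <functional>
#include <vector>

enum class Status { Visited, Computed };

struct Task {
    long                       key = 0;       // task identifier in the graph (illustrative)
    std::vector<Task*>         predecessors;  // known from task-graph creation
    int                        join = 0;      // number of outstanding predecessors
    Status                     status = Status::Visited;
    std::vector<Task*>         notifyArray;   // successors to notify once computed
    void*                      db = nullptr;  // pointer to the output data block
    std::function<void(Task&)> body;          // user-provided task body; fills db
};

void compute_and_notify(Task* t);

// "Traverse predecessors": computed ones decrement join, pending ones
// remember t so they can notify it later; compute when join reaches 0.
void visit(Task* t) {
    t->join = static_cast<int>(t->predecessors.size());
    for (Task* p : t->predecessors) {
        if (p->status == Status::Computed)
            --t->join;
        else
            p->notifyArray.push_back(t);
    }
    if (t->join == 0)
        compute_and_notify(t);
}

void compute_and_notify(Task* t) {
    t->body(*t);                        // produces t->db
    t->status = Status::Computed;
    for (Task* s : t->notifyArray)      // "Notify successors in notifyArray"
        if (--s->join == 0)
            compute_and_notify(s);
}
```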

Fault-Tolerant Scheduling: Properties
- Non-collective recovery: without interfering with other threads
- Selective recovery: re-execute only the impacted portion of the task graph
(Figure: example DAG A-E distributed across threads 1-3)

Fault-Tolerant Scheduling: Recovery
- Failures can be handled at any stage of execution by enclosing each stage in try-catch blocks
- No recovery is attempted for failures that are not observed
- A failure may be observed during traversal (predecessor failure: recover B), during computation (self failure: recover C), or during notification (successor failure: recover E)
(Figure: the three scenarios illustrated on small DAGs over A, B, C and E)
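A sketch of the try-catch enclosure, continuing the Task sketch above. FailedTask, recover() and the *_checked wrappers are placeholder names (the transcript does not spell out the real detector or recovery entry points), and the driver that chains the three stages together is omitted.

```cpp
// Thrown when a corrupted descriptor or data block is observed (assumed form).
struct FailedTask { Task* task; };

void recover(Task* failed);          // defined in the next sketch

// Predecessor failure: observed while traversing inputs ("recover B").
void visit_checked(Task* t) {
    try { visit(t); }
    catch (const FailedTask& f) { recover(f.task); }
}

// Self failure: observed while running the task body ("recover C").
void compute_checked(Task* t) {
    try { t->body(*t); t->status = Status::Computed; }
    catch (const FailedTask& f) { recover(f.task); }
}

// Successor failure: observed while notifying waiters ("recover E").
void notify_checked(Task* t) {
    for (Task* s : t->notifyArray) {
        try { if (--s->join == 0) compute_checked(s); }
        catch (const FailedTask& f) { recover(f.task); }
    }
}
```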

Fault-Tolerant Scheduling: Recovery
- The meta-data of a failed task is correctly recovered
- Treat the failed task as new (no backup & restore): replace the failed task descriptor
- The recovering task traverses its predecessors, computes and notifies
(Figure: B's failed descriptor — join: 0, notifyArray: {C, D}, status: Visited, db: null — is replaced by a fresh descriptor B' — join: 1, notifyArray: {}, status: Visited, db: null)
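A sketch of "treat the failed task as new", continuing the example: build a fresh descriptor and re-run the normal visit protocol on it. In the real scheme the fields are re-derived from the resilient, user-provided task-creation functions and the fresh descriptor replaces the old one in the scheduler's bookkeeping; both steps are only gestured at here.

```cpp
// Rebuild a clean descriptor for the failed task. Copying fields from the
// corrupted descriptor is a stand-in: in reality they would be re-created
// through the user-provided, resilient task-graph creation logic.
Task* make_fresh_descriptor(const Task& failed) {
    Task* fresh = new Task;
    fresh->key          = failed.key;
    fresh->predecessors = failed.predecessors;  // really: re-derived via user functions
    fresh->body         = failed.body;          // really: re-derived via user functions
    return fresh;                               // join/notifyArray/status/db start clean
}

void recover(Task* failed) {
    Task* fresh = make_fresh_descriptor(*failed);
    // ... swap `fresh` for `failed` wherever the scheduler tracks descriptors ...
    visit_checked(fresh);   // traverses predecessors, computes, then notifies
}
```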

Fault-Tolerant Scheduling: Key Guarantees
- Guarantee 1: the join of a task descriptor is decremented exactly once per predecessor
- Problem: B recovers and notifies D again, so D executes prematurely
- Solution: keep track of notifications
(Figure: D with join: 1 — already notified by B, still waiting for C)
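One possible way to "keep track of notifications", sketched on top of the Task type above: remember, per task, which predecessors have already decremented join — by key rather than by pointer, so a recovered descriptor with the same key counts as the same predecessor. The notifiedBy set is an assumption; the paper may record this differently.

```cpp
#include <set>

struct TrackedTask : Task {
    std::set<long> notifiedBy;            // keys of predecessors that already notified us
};

// Called by a predecessor pred when it (re)notifies successor s.
void notify_once(TrackedTask* s, const Task& pred) {
    if (s->notifiedBy.insert(pred.key).second)
        --s->join;                        // decrement exactly once per predecessor
    // else: pred re-executed after recovery and notified again; ignoring the
    // duplicate keeps s from executing before its remaining predecessors finish.
}
```

In the running sketch, this check would replace the bare `--s->join` in the notification loop.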

Fault-Tolerant Scheduling: Key Guarantees
- Guarantee 2: every task waiting on a predecessor is notified
- A hung execution state results if enqueued tasks are never notified!
- Solution: re-construct the notifyArray
(Figure: B's notifyArray {C, D}, with C at join: 1 and D at join: 2, must be restored on B'; a fresh B' otherwise starts with notifyArray {})
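A sketch of one way to re-construct the notifyArray of a fresh descriptor, assuming the successor set can be re-derived from the resilient task-graph creation logic: re-enqueue every successor that is still waiting. How the paper actually rebuilds the array is not spelled out in the transcript.

```cpp
#include <vector>

// Re-enqueue the successors that were waiting on the failed task so that
// B'.notifyArray again contains, e.g., {C, D}. The successor list is assumed
// to be recomputable from the user-provided task-creation functions.
void reconstruct_notify_array(Task* fresh, const std::vector<Task*>& successors) {
    for (Task* s : successors) {
        if (s->status != Status::Computed && s->join > 0)   // s is still waiting
            fresh->notifyArray.push_back(s);
    }
}
```

In the recovery sketch, this would run right after make_fresh_descriptor so no waiter is left un-notified.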

Fault-Tolerant Scheduling: Key Guarantees
- Guarantee 3: each failure is recovered at most once
- Problem: both C and D observe the same failure, so each observer starts a separate recovery
- Solution: keep track of initiated recoveries
(Figure: example DAG A-E)
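A sketch of "keep track of initiated recoveries": whichever observer claims the failed task's key first performs the recovery, so concurrent observers such as C and D do not both re-execute it. The RecoveryRegistry is a hypothetical mechanism, not the paper's data structure.

```cpp
#include <mutex>
#include <set>

void recover(Task* failed);     // from the earlier recovery sketch

class RecoveryRegistry {
    std::mutex     m;
    std::set<long> initiated;   // keys of tasks whose recovery has started
public:
    // Returns true only for the first observer of this failure.
    bool try_claim(long failed_key) {
        std::lock_guard<std::mutex> g(m);
        return initiated.insert(failed_key).second;
    }
};

void observe_failure(RecoveryRegistry& reg, Task* failed) {
    if (reg.try_claim(failed->key))
        recover(failed);        // single recovery, even if C and D both observe it
    // otherwise: another observer is already recovering this task
}
```

Any of the three observation points from the earlier recovery sketch would call observe_failure instead of recover directly.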

Fault-Tolerant Scheduling: Key Guarantees
- Guarantee 4: overwritten data blocks are distinguished and handled correctly
- Did D start overwriting C's data block? If not, only re-compute D; otherwise, re-compute C, B and A as well
- Solution: treat overwritten data blocks as failed
(Figure: tasks A, B, C, D reusing a data block across versions v=0, v=1, v=2)
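A sketch of one plausible reading of "treat overwritten data blocks as failed": tag each reusable block with the version of the producer that last started writing it, so recovery can tell whether the version a task produced is still intact. Names and fields are assumptions.

```cpp
#include <atomic>

struct VersionedBlock {
    std::atomic<int> version{0};   // v=0, v=1, v=2, ... as in the figure
    // ... payload ...
};

// Called by a task before it starts overwriting the block in place.
void begin_overwrite(VersionedBlock& b, int new_version) {
    b.version.store(new_version);  // from now on, the older contents are gone
}

// During recovery: if the version a task produced is still the one in the
// block, only that task needs re-execution; otherwise the block counts as
// failed and the producers of the older versions (e.g. C, B, A) must be
// re-executed as well.
bool old_version_still_intact(const VersionedBlock& b, int v) {
    return b.version.load() == v;
}
```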

Theoretical Analysis of Performance
- NABBIT is provably time efficient: asymptotically optimal running time
- The optimality property is maintained for normal (failure-free) execution
- Optimal fault-tolerant execution: no additional increase to the critical-path length; the cost depends on the number of failures
- Experiments support the analysis
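For reference, "asymptotically optimal running time" is meant in the usual work-span sense; the block below states that standard notion (work T1, critical path T∞, P workers), not the paper's exact fault-tolerant bound, which additionally depends on the number of failures.

```latex
% Standard work-span notation, not the paper's exact theorem: with work $T_1$,
% critical-path length (span) $T_\infty$, and $P$ workers, any scheduler needs
% $T_P \ge \max(T_1/P,\, T_\infty)$, so a running time of the form below is
% asymptotically optimal. The slide claims this is preserved without failures,
% that recovery adds nothing to the critical-path term, and that the extra
% cost depends on the number of failures.
\[
  T_P \;=\; O\!\left(\frac{T_1}{P} + T_\infty\right)
\]
```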

Experiments
- Platform: four 12-core AMD Opteron 2.3 GHz processors with 256 GB memory; only 44 of the 48 cores are used
- Reported numbers: arithmetic mean (with standard deviation) of 10 runs
- Benchmarks: LCS, Smith-Waterman, Floyd-Warshall, LU and Cholesky

Overheads without Failures
(Figure: results for LCS and SW)

Overheads without Failures
(Figure: results for FW; 10-15% overhead at 44 cores)

Overheads with Failures
- Amount of work lost: a constant number of tasks (512), or a percentage of the total work (2%, 5%)
- Failure time: before compute or after compute
- Task type: tasks which produce a data block's 0th (v=0), last (v=last), or a random (v=rand) version

Overheads with Failures (512 re-executions)
- Negligible/small overheads in the “before compute”/“after compute” scenarios
- No overhead with 1, 8 and 64 task re-executions

Overheads with Failures (2% and 5%)
- Overheads are proportional to the amount of work lost
(Figure: overheads of 3.6% and 8.2% highlighted)

Scalability Analysis (5% task re-executions, varying number of cores)
- Re-execution chains can lead to a lack of concurrency
- Overheads do not exceed 6.5% in most cases

Conclusion
- A fault-tolerant dynamic task graph scheduler with non-collective and selective recovery
- Optimality properties still hold
- Recovery overheads: negligible for a small number of failures; proportional to the work lost for larger failures

Thank you! Questions?

Benchmarks

After Notify Scenario