R-BATCH: Task Partitioning for Fault-tolerant Multiprocessor Real-Time Systems. Junsung Kim, Karthik Lakshmanan and Raj Rajkumar, Electrical and Computer Engineering, Carnegie Mellon University.


Carnegie Mellon R-BATCH: Task Partitioning for Fault-tolerant Multiprocessor Real-Time Systems Junsung Kim, Karthik Lakshmanan and Raj Rajkumar Electrical and Computer Engineering Carnegie Mellon University

Outline
- Motivation
- Goals and System Models
- R-BATCH: Task Allocations with Replicas
- Performance Evaluation
- Conclusion

Autonomous Vehicles: Background
GM Chevy Tahoe named "Boss" won the 2007 DARPA Urban Challenge.

Autonomous Vehicles: Background
Boss:
- Senses the environment
- Fuses sensor data to form a model of the real world
- Plans navigation paths
- Actuates the steering wheel, brake, and accelerator
Boss requires:
- Safety-critical operations
- Timing guarantees
- Robustness to harsh environments

Autonomous Vehicles: Architecture
- 0.5 million lines of code for autonomous driving support
- 10 dual-core processors + 50 embedded processors

Autonomous Vehicles: Capabilities
- 0.5 million lines of code for autonomous driving support
- 10 dual-core processors + 50 embedded processors
Requires high computational capabilities with timeliness guarantees:
- Adding more processors
- Using high-performance processors

Processor Reliability Trend
[Figure: failure rate versus log time (years in service), showing an infant-mortality region (random, extrinsic failures) and a wear-out region (intrinsic failures) for processors at the 800nm (1989), 130nm (2000), and 32nm nodes.]

Outline
- Motivation
- Goals and System Models
- R-BATCH: Task Allocations with Replicas
- Performance Evaluation
- Conclusion

Goals for Fault-Tolerance
- Handle permanent processor failures: tolerate a given number of processor failures.
- Avoid losing functionality by adding more resources in an affordable way: hardware replication, software replication, re-execution of failed jobs, or lowered quality of service of tasks.
- Deal with the unpredictable nature of failures: must we consider all possible scenarios?

System Model (1 of 2)

System Model (2 of 2)
Task classifications:
- Hard recovery task: cannot miss its deadline even if a failure occurs (e.g., automotive engine control).
- Soft recovery task: can be recovered in the next period (e.g., navigation, chassis unit control).
- Best-effort recovery task: can be recovered if there is enough room after a failure.

Hard Recovery Task
[Timeline over Processors 1 and 2, starting at time 0: a failure occurs and the task is recovered without missing its deadline.]

Soft Recovery Task
[Timeline over Processors 1 and 2, starting at time 0: a failure occurs and the task is recovered in the next period.]

Task Replication
Hot Standby vs. Cold Standby:
- Hot Standby: can recover a Hard Recovery task; runs multiple copies of a feature; no timing penalty; incurs utilization loss.
- Cold Standby: can recover a Soft Recovery task; dormant until activated; delayed recovery time; no utilization loss without failures.
Observations:
- Hot Standby: the primary and the backups run at the same time.
- Cold Standby: one Cold Standby can recover several tasks on different processors.
- Shared system state is available on all processors (by using a network bus architecture).
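The pairing between recovery classes and standby types can be captured in a small model. This is a sketch with names of my choosing; mapping Best-effort recovery to no standby at all (re-execution in leftover slack) is an assumption the slides only imply:

```python
from enum import Enum

class Recovery(Enum):
    HARD = "hard"              # must meet its deadline even under failure
    SOFT = "soft"              # may be recovered in the next period
    BEST_EFFORT = "best"       # recovered only if slack remains

class Standby(Enum):
    HOT = "hot"                # runs alongside the primary: no timing penalty,
                               # but a permanent utilization loss
    COLD = "cold"              # dormant until activated: delayed recovery,
                               # no utilization loss without failures
    NONE = "none"              # no standby; re-execute in leftover slack

# Pairing from the slides: Hot Standby covers Hard Recovery tasks,
# Cold Standby covers Soft Recovery tasks.
STANDBY_FOR = {
    Recovery.HARD: Standby.HOT,
    Recovery.SOFT: Standby.COLD,
    Recovery.BEST_EFFORT: Standby.NONE,   # assumption, not stated on the slide
}
```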

Hard Recovery Task with Hot Standby
[Timeline over Processors 1 and 2, starting at time 0: a failure occurs and the task is recovered via its Hot Standby with no timing penalty.]

Soft Recovery Task with Cold Standby
[Timeline over Processors 1 and 2, starting at time 0: a failure occurs and the task is recovered via a Cold Standby after a recovery delay.]

Example Scenarios
With 5 tasks and 4 processors:
- nP: primary of task n
- nH: Hot Standby of task n
- nC: Cold Standby of task n
[Figure: allocations of the replicas of tasks 1-5 onto processors P1-P4: the initial allocation, the allocation after P3 failed, and the allocation after P1 failed.]

Outline
- Motivation
- Goals and System Models
- R-BATCH: Task Allocations with Replicas
- Performance Evaluation
- Conclusion

R-BATCH
Reliable Bin-packing Algorithm for Tasks with Cold standby and Hot standby:
- Reliable task allocation
- Allocates Hot Standbys
- Allocates Cold Standbys

Uniprocessor Schedulability
[Table contrasting schedulability tests: one is more complex and misbehaves at higher utilization; the other admits lower utilization but is practical.]
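The "lower utilization, but practical" option is typically the Liu and Layland utilization-bound test for rate-monotonic scheduling; assuming that is the test the slide contrasts, a minimal sketch:

```python
def rm_bound_schedulable(utils):
    """Liu & Layland sufficient test for rate-monotonic scheduling:
    n tasks are schedulable if total utilization <= n * (2^(1/n) - 1).
    Simple and practical, but pessimistic: the bound tends to ~69%
    utilization as n grows, which is the 'lower utilization' trade-off."""
    n = len(utils)
    return sum(utils) <= n * (2 ** (1 / n) - 1)

print(rm_bound_schedulable([0.6, 0.2]))       # 0.8 <= 2*(sqrt(2)-1) ~ 0.828: True
print(rm_bound_schedulable([0.6, 0.3, 0.2]))  # 1.1 exceeds any bound: False
```

Failing this test does not mean the task set is unschedulable, only that the cheap sufficient check cannot prove it; an exact (and more expensive) analysis may still succeed.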

Bin-packing Problem
Definition: the problem of packing a set of items into the fewest number of bins such that the total size in each bin does not exceed the bin capacity.
- Items: the utilizations of each task
- Bins: processors
Then, given a set of tasks, how many bins (processors) do we need?
[Figure: tasks Ti, Tj, Tk, Tm stacked onto a processor P.]

The Classical Approach: Bin-packing
Bin packing is used to allocate tasks to multiprocessor platforms.
Best-fit Decreasing (BFD) algorithm:
- Step 1: Sort the objects in descending order of size.
- Step 2: Sort the bins in descending order of consumed space.
- Step 3: Fit the next object into the first sorted bin that fits; if no bin fits, add a new bin.
- Step 4: If objects remain, go to Step 2.
- Step 5: Done.
[Figure: the task set {0.6, 0.3, 0.2} allocated onto processors P1-P4 by BFD.]
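The steps above can be sketched as follows, assuming a unit utilization capacity per processor (e.g., an EDF-style bound; the slides do not fix a particular test):

```python
def best_fit_decreasing(utils, capacity=1.0):
    """Pack task utilizations onto the fewest processors via Best-Fit Decreasing."""
    bins = []                                  # each bin is one processor's task list
    for u in sorted(utils, reverse=True):      # Step 1: largest tasks first
        # Steps 2-3: among bins that can still hold u, pick the fullest one
        feasible = [b for b in bins if sum(b) + u <= capacity]
        if feasible:
            max(feasible, key=sum).append(u)
        else:
            bins.append([u])                   # no bin fits: open a new processor
    return bins

print(best_fit_decreasing([0.6, 0.3, 0.2]))    # [[0.6, 0.3], [0.2]]: two processors
```

On the slide's task set, 0.6 and 0.3 share one processor (0.9 total) and 0.2 opens a second, since 1.1 would exceed the capacity.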

BFD with Placement Constraints
We also have to deal with replicated tasks. Under the placement constraint (BFD-P), no two replicas of the same task can be on the same processor; otherwise, a processor failure would take down both replicas.
[Figure: primaries 1P (0.6), 2P (0.3), 3P (0.2) and Hot Standbys 1H, 2H, 3H allocated onto processors P1-P4 under the constraint.]
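The placement constraint folds into best-fit as one extra feasibility check. A sketch, with illustrative replica labels and the same assumed unit capacity:

```python
def bfd_p(replicas, capacity=1.0):
    """BFD-P: best-fit decreasing where no two replicas of a task share a bin.

    replicas: list of (task_id, label, utilization), e.g. ("1", "P", 0.6).
    """
    bins = []
    for tid, label, u in sorted(replicas, key=lambda r: -r[2]):
        feasible = [b for b in bins
                    if sum(r[2] for r in b) + u <= capacity      # capacity check
                    and all(r[0] != tid for r in b)]             # placement constraint
        if feasible:
            # best fit: the fullest bin that can still take this replica
            max(feasible, key=lambda b: sum(r[2] for r in b)).append((tid, label, u))
        else:
            bins.append([(tid, label, u)])
    return bins

taskset = [("1", "P", 0.6), ("2", "P", 0.3), ("3", "P", 0.2),
           ("1", "H", 0.6), ("2", "H", 0.3), ("3", "H", 0.2)]
print(len(bfd_p(taskset)))  # 4 processors for this set with one Hot Standby each
```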

Can BFD-P Be Improved?
Given the task set {0.6, 0.3, 0.2} with 2 replicas each, BFD with the placement constraint uses four processors. We can, however, reduce the number of bins to three.
[Figure: side-by-side allocations onto P1-P4: the BFD-P result and the improved packing.]

Reliable BFD (R-BFD)
R-BFD Algorithm:
- Step 1: Sort tasks in decreasing order of utilization.
- Step 2: Allocate each primary task to the bin that will be left with the smallest remaining space.
- Step 3: Set i = 1.
- Step 4: Allocate the i-th replica of each task to the bin that will be left with the smallest remaining space.
- Step 5: Increment i and repeat Step 4 until all replicas are allocated.
[Figure: the R-BFD allocation of the example task set onto processors P1-P4.]
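The steps above can be sketched as follows (unit capacity is again an assumption). On the slide's {0.6, 0.3, 0.2} with one Hot Standby each, placing all primaries first and then the replicas wave by wave gets down to three processors where BFD-P needed four:

```python
def r_bfd(primaries, num_replicas, capacity=1.0):
    """R-BFD sketch: best-fit the primaries first (Step 2), then place the
    replicas wave by wave (Steps 3-5), never co-locating copies of a task.

    primaries: dict task_id -> utilization; every task gets num_replicas standbys.
    """
    bins = []                                     # each bin: list of (task_id, label)

    def place(tid, label):
        load = lambda b: sum(primaries[t] for t, _ in b)
        feasible = [b for b in bins
                    if load(b) + primaries[tid] <= capacity
                    and all(t != tid for t, _ in b)]
        if feasible:
            max(feasible, key=load).append((tid, label))   # smallest remaining space
        else:
            bins.append([(tid, label)])

    order = sorted(primaries, key=primaries.get, reverse=True)   # Step 1
    for tid in order:
        place(tid, "P")                           # Step 2: primaries
    for i in range(1, num_replicas + 1):          # Steps 3-5: replica waves
        for tid in order:
            place(tid, f"H{i}")
    return bins

print(len(r_bfd({"1": 0.6, "2": 0.3, "3": 0.2}, num_replicas=1)))  # 3
```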

Save More Processors with Cold Standby
Given the task set {0.6, 0.3, 0.2} with 3 replicas each, to tolerate 2 processor failures: instead of using two more processors for a second set of Hot Standbys, add one "empty" processor to hold a "virtual task".
[Figure: two allocations onto P1-P5: one with two full sets of Hot Standbys, and one where the second set is replaced by Cold Standbys (1C, 2C, 3C) backed by the extra processor P5.]

Cold Standby with Virtual Task
Virtual task: a guaranteed utilization reservation of slack for recovering from failures via Cold Standby.
Generating Virtual Tasks:
- Step 1: Create a new virtual task by selecting the task with the highest utilization across all processors that is not yet covered by a virtual task.
- Step 2: Compare the size of the virtual task with tasks on other processors, and check whether those tasks can be recovered using the virtual task.
- Step 3: Go to Step 1 if there are remaining tasks.
[Figure: the generated virtual task 1C (0.6) covers tasks 1, 2, and 3, replacing the individual Cold Standbys 1C, 2C, 3C.]
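One greedy reading of the three steps above, as a sketch; the covering rule used here, that a virtual task can stand in for any smaller standby hosted on a different processor, is my interpretation of Step 2, not an exact restatement of the paper's algorithm:

```python
def generate_virtual_tasks(standbys):
    """Greedily fold Cold Standbys into shared virtual tasks.

    standbys: list of (task_id, processor, utilization) per Cold Standby.
    Returns (virtual_utilization, covered_task_ids) pairs.
    """
    remaining = sorted(standbys, key=lambda s: -s[2])
    virtual_tasks = []
    while remaining:                              # Step 3: loop while tasks remain
        tid, proc, u = remaining[0]               # Step 1: largest uncovered standby
        covered, rest = [tid], []
        for t, p, v in remaining[1:]:             # Step 2: a smaller standby on a
            if v <= u and p != proc:              # different processor fits the slack
                covered.append(t)
            else:
                rest.append((t, p, v))
        virtual_tasks.append((u, covered))
        remaining = rest
    return virtual_tasks

# The slide's example: one virtual task of size 0.6 covers tasks 1, 2, and 3.
print(generate_virtual_tasks([("1", "P1", 0.6), ("2", "P2", 0.3), ("3", "P3", 0.2)]))
```

This works because at most one of the covered tasks needs the slack at a time under single-processor failures; the reserved 0.6 can host whichever of the three tasks loses its processor.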

R-BATCH
Reliable Bin-packing Algorithm for Tasks with Cold Standby and Hot Standby:
- Step 1: Perform R-BFD with the primaries and Hot Standbys.
- Step 2: Generate virtual tasks.
- Step 3: Perform R-BFD with the virtual tasks.
[Figure: the resulting allocation of primaries, Hot Standbys, and the virtual task onto processors P1-P4.]

Outline
- Motivation
- Goals and System Models
- R-BATCH: Task Allocations with Replicas
- Performance Evaluation
- Conclusion

Evaluation Environment

Performance Evaluation (R-BFD)
[Graph: ratio of saved processors (normalized to BFD-P) versus number of tasks; R-BFD saves up to 18%.]

Performance Evaluation (R-BATCH)
[Graph: ratio of saved processors (normalized to BFD-P) versus number of tasks; R-BATCH saves up to 49%.]

Performance Evaluation
[Graphs: ratios of saved processors (normalized to BFD-P) versus number of tasks.]
- For smaller task set sizes, R-BFD is more beneficial.
- For larger task set sizes, R-BATCH is more beneficial.

Back to Boss
20 periodic tasks for autonomous driving support. Using R-BATCH:
- Can tolerate 5 failures with 10 dual-core processors
- 35% saving compared to BFD-P
- Configuration: the primary, 1 Hot Standby per task, and 4 Cold Standbys per task

Conclusion
Many safety-critical real-time systems must also support redundancy for tolerating faults.
We defined recovery task models:
- Hard Recovery Task
- Soft Recovery Task
- Best-effort Recovery Task
We used two types of recovery schemes:
- Hot Standby (for Hard Recovery Tasks)
- Cold Standby (for Soft Recovery Tasks)
We can tolerate a fixed number of (fail-stop) failures:
- R-BFD: 18% fewer processors with Hot Standby
- R-BATCH: 49% fewer processors with Hot Standby and Cold Standby; utilizes slack for additional tasks