Investigating Lightweight Fault Tolerance Strategies for Enterprise Distributed Real-time Embedded Systems Tech-X Corporation Boulder, Colorado Vanderbilt.

Slides:

Advertisements

Similar presentations

Computer Systems & Architecture Lesson 2 4. Achieving Qualities.

Advertisements

Distributed Systems Major Design Issues Presented by: Christopher Hector CS8320 – Advanced Operating Systems Spring 2007 – Section 2.6 Presentation Dr.

Reliability on Web Services Presented by Pat Chan 17/10/2005.

CS 795 – Spring  “Software Systems are increasingly Situated in dynamic, mission critical settings ◦ Operational profile is dynamic, and depends.

CiCUTS: Combining System Execution Modeling Tools with Continuous Integration Environments James H. Hill, Douglas Schmidt Vanderbilt University Nashville,

High-confidence Software for Cyber Physical Systems Drexel University Philadephia, PA Vanderbilt University Nashville, Tennessee Aniruddha Gokhale *, Sherif.

Effective Coordination of Multiple Intelligent Agents for Command and Control The Robotics Institute Carnegie Mellon University PI: Katia Sycara

Component Patterns – Architecture and Applications with EJB copyright © 2001, MATHEMA AG Component Patterns Architecture and Applications with EJB JavaForum.

Variability Oriented Programming – A programming abstraction for adaptive service orientation Prof. Umesh Bellur Dept. of Computer Science & Engg, IIT.

A Dependable Auction System: Architecture and an Implementation Framework

CS599 Software Engineering for Embedded Systems1 Software Engineering for Real-Time: A Roadmap Presentation by: Mandar Samant Raghbir Singh Banwait.

1 12/10/03CCM Workshop QoS Engineering and Qoskets George Heineman Praveen Sharma Joe Loyall Richard Schantz BBN Technologies Distributed Systems Department.

1 Quality Objects: Advanced Middleware for Wide Area Distributed Applications Rick Schantz Quality Objects: Advanced Middleware for Large Scale Wide Area.

Software Engineering and Middleware: a Roadmap by Wolfgang Emmerich Ebru Dincel Sahitya Gupta.

Ensuring Non-Functional Properties. What Is an NFP?  A software system’s non-functional property (NFP) is a constraint on the manner in which the system.

WPDRTS ’05 1 Workshop on Parallel and Distributed Real-Time Systems 2005 April 4th and 5th, 2005, Denver, Colorado Challenge Problem Session Detection.

Chapter 13 Embedded Systems

FTMP: A Fault-Tolerant Multicast Protocol Louise E. Moser Department of Electrical and Computer Engineering University of California, Santa Barbara.

23 September 2004 Evaluating Adaptive Middleware Load Balancing Strategies for Middleware Systems Department of Electrical Engineering & Computer Science.

QoS-enabled middleware by Saltanat Mashirova. Distributed applications Distributed applications have distinctly different characteristics than conventional.

Computer System Architectures Computer System Software

26 Sep 2003 Transparent Adaptive Resource Management for Distributed Systems Department of Electrical Engineering and Computer Science Vanderbilt University,

BBN Technologies Craig Rodrigues Gary Duzan QoS Enabled Middleware: Adding QoS Management Capabilities to the CORBA Component Model Real-time CCM Meeting.

Presented at University of Alabama CIS, Birmingham Monday, April 9, 2001 Patterns-based Fault Tolerant CORBA Implementation for Predictable Performance.

Computer Science Open Research Questions Adversary models –Define/Formalize adversary models Need to incorporate characteristics of new technologies and.

Wireless Access and Terminal Mobility in CORBA Dimple Kaul, Arundhati Kogekar, Stoyan Paunov.

SAMANVITHA RAMAYANAM 18 TH FEBRUARY 2010 CPE 691 LAYERED APPLICATION.

October 8, 2015 Research Sponsored by NASA Applying Reflective Middleware Techniques to Optimize a QoS-enabled CORBA Component Model Implementation Nanbor.

Dependable Systems (CSE 890), Thursday, 27 th 2003 IRL Interoperable Replication Logic: A three-tier approach to FT-CORBA Infrastructures Authors: R. Baldoni,

RepoMan A Component Repository Manager for Enterprise DRE Systems Stoyan Paunov & Douglas C. Schmidt Vanderbilt University Institute for Software Integrated.

Progress SOA Reference Model Explained Mike Ormerod Applied Architect 9/8/2008.

HPEC’02 Workshop September 24-26, 2002, MIT Lincoln Labs Applying Model-Integrated Computing & DRE Middleware to High- Performance Embedded Computing Applications.

Consistent and Efficient Database Replication based on Group Communication Bettina Kemme School of Computer Science McGill University, Montreal.

Real-Time CORBA By Christopher Bolduc. What is Real-Time? Real-time computing is the study of hardware and software systems that are subject to a “real-

Sunday, October 15, 2000 JINI Pattern Language Workshop ACM OOPSLA 2000 Minneapolis, MN, USA Fault Tolerant CORBA Extensions for JINI Pattern Language.

Refining middleware functions for verification purpose Jérôme Hugues Laurent Pautet Fabrice Kordon

Modeling Component-based Software Systems with UML 2.0 George T. Edwards Jaiganesh Balasubramanian Arvind S. Krishna Vanderbilt University Nashville, TN.

DataReader 2 Enhancing Security in Ultra-Large Scale (ULS) Systems using Domain- specific Modeling Joe Hoffert, Akshay Dabholkar, Aniruddha Gokhale, and.

Decision-Theoretic Planning with (Re)Deployment of Components in Distributed Real-time & Embedded Systems Douglas C. Schmidt

Investigating Survivability Strategies for Ultra-Large Scale (ULS) Systems Vanderbilt University Nashville, Tennessee Institute for Software Integrated.

1 ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING (Autonomic and Trusted Computing 2006) Giray Kömürcü.

Distribution and components. 2 What is the problem? Enterprise computing is Large scale & complex: It supports large scale and complex organisations Spanning.

FT-ERF Fault-Tolerance in an Event Rule Framework for Distributed Systems Hillary Caituiro-Monge, Graduate Student. Advisor: Javier Arroyo-Figueroa, Ph.D.

University of Toronto at Scarborough © Kersti Wain-Bantin CSCC40 system architecture 1 after designing to meet functional requirements, design the system.

SOFTWARE DESIGN AND ARCHITECTURE LECTURE 13. Review Shared Data Software Architectures – Black board Style architecture.

MDDPro: Model-Driven Dependability Provisioning in Enterprise Distributed Real-time and Embedded Systems Sumant Tambe* Jaiganesh Balasubramanian Aniruddha.

NetQoPE: A Middleware-based Netowork QoS Provisioning Engine for Distributed Real-time and Embedded Systems Jaiganesh Balasubramanian

A QoS Policy Modeling Language for Publish/Subscribe Middleware Platforms A QoS Policy Modeling Language for Publish/Subscribe Middleware Platforms Joe.

Adaptive Resource Management Architecture for DRE Systems Nishanth Shankaran

1 BBN Technologies Quality Objects (QuO): Adaptive Management and Control Middleware for End-to-End QoS Craig Rodrigues, Joseph P. Loyall, Richard E. Schantz.

Topic 2: The Role of Open Standards, Open-Source Development, & Different Development Models & Processes (on Industrializing Software) ARO Workshop Outbrief,

Component Patterns – Architecture and Applications with EJB copyright © 2001, MATHEMA AG Component Patterns Architecture and Applications with EJB Markus.

Towards A QoS Modeling and Modularization Framework for Component-based Systems Sumant Tambe* Akshay Dabholkar Aniruddha Gokhale Amogh Kavimandan (Presenter)

August 20, 2002 Applying RT-Policies in CORBA Component Model Nanbor Wang Department of Computer Science Washington University in St. Louis

Model-Driven Optimizations of Component Systems Vanderbilt University Nashville, Tennessee Institute for Software Integrated Systems OMG Real-time Workshop.

©Ian Sommerville 2000, Tom Dietterich 2001 Slide 1 Distributed Systems Architectures l Architectural design for software that executes on more than one.

Fault-tolerance for Component-based Systems – An Automated Middleware Specialization Approach Sumant Tambe* Akshay Dabholkar Aniruddha Gokhale Abhishek.

FLARe: a Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time and Embedded Systems Dr. Aniruddha S. Gokhale

Resource Optimization for Publisher/Subscriber-based Avionics Systems Institute for Software Integrated Systems Vanderbilt University Nashville, Tennessee.

The Role of Reflection in Next Generation Middleware

Sumant Tambe* Akshay Dabholkar Aniruddha Gokhale

International Service Availability Symposium (ISAS) 2007

Real-time Software Design

QoS-Enabled Middleware

Transparent Adaptive Resource Management for Middleware Systems

Tools for Composing and Deploying Grid Middleware Web Services

Software Architecture Lecture 20

International Service Availability Symposium (ISAS) 2007

Transparent Adaptive Resource Management for Middleware Systems

Presentation transcript:

Investigating Lightweight Fault Tolerance Strategies for Enterprise Distributed Real-time Embedded Systems Tech-X Corporation Boulder, Colorado Vanderbilt University Nashville, Tennessee Jaiganesh Balasubramanian Dr. Aniruddha Gokhale Dr. Douglas C. Schmidt Dr. Nanbor Wang This research is sponsored by Lockheed Martin, BBN Technologies, and Raytheon under contracts to Vanderbilt University and by NSWCDD under contract to Tech-X Corporation (N M-0029)

2 Network-centric, dynamic, very large-scale “systems of systems” Stringent simultaneous QoS demands, e.g., “never die,” time- critical, etc. Highly diverse, complex, & increasingly integrated & autonomous application domains Distributed Object Computing middleware used to design and develop enterprise DRE systems support for highly available systems e.g., FT-CORBA end-to-end predictable behavior for requests e.g., RT-CORBA Enterprise DRE Systems Characteristics Goal is to provide real-time fault tolerance for enterprise DRE systems

3 Challenges in Developing RT-FT Enterprise DRE Systems Meeting real-time deadlines even in the presence of failures bounded times for fault detection and fault recovery bounded times for dynamic reconfigurations because of hardware and software failures Satisfying QoS demands in the face of fluctuating and/or insufficient resources degraded QoS for certain applications in the presence of hardware and software failures dependable upgrades, downgrades & updates

4 Problems in FT-CORBA Solutions for Enterprise DRE Systems Traditional FT-CORBA systems provide a one-size-fits-all solution excessive overhead for systems with real-time requirements and resource constraints overly complex, difficult to integrate Unpredictable and expensive replication and fault detection strategies no bound on TCP connection failure detections excessive overhead with active replication on totally ordered reliable group communication as well as state synchronization Semantic incompatibility between FT-CORBA and RT-CORBA solutions unacceptable latency and performance overheads inherent in vanilla FT-CORBA solutions adversely impacts predictability

5 Possible Lightweight Fault Tolerance Approaches Decoupling of different FT-specific functionalities from the middleware, so that the middleware can be integrated easily with other systems allows integrating well known fault tolerance techniques into the system move away from point solutions Integration of the desired fault tolerance infrastructure elements with RT-CORBA mechanisms to improve predictability of fault tolerance functionalities Semantic configuration of fault tolerance infrastructure based on understanding of application needs Investigation of interoperable protocols for interactions with external systems Microkernel architecture applied to fault tolerant systems Fault Recovery Fault Analysis Reliable Multicast Consistency Management Connection Management Fault Analysis Consistency Management Reliable Multicast Fault Recovery FT Middleware Replication Management Causal FIFO Ordered

6 OMG RFP for Lightweight RT Fault Tolerance (1/2) A new RFP issued by OMG in the OMG Technical Meeting at Boston, June 2006 requires RT FT solutions Separation of functionalities from the middleware that vary depending upon the operating scenarios, such as consistency management fault analysis, containment and recovery Middleware works collaboratively with external fault tolerance mechanisms providing fault detection, replication management and transparent failover.

7 OMG RFP for Lightweight RT Fault Tolerance (2/2) Support for both active and passive replications explore weak consistency models for active replication, thereby relaxing determinism requirements Support for providing a RT-CORBA mapping for the FT infrastructure elements as well as the application enforce priorities on invocations during failure and failure-free situations avoid priority inversions in recovery operations

8 Approaches to Lightweight RT FT (1/2) Identification of fault tolerance functionalities that can be better provisioned by external mechanisms ways to configure those mechanisms to provide different QoS to applications e.g. causal, FIFO, reliable guarantees on group communication Identification of uniform interfaces and strategy patterns so that applications and the interfaces they access need not change but the mechanisms they access can keep changing

9 Approaches to Lightweight RT FT (2/2) Identification of interfaces for integrating external fault tolerance mechanisms e.g. interfaces between the fault analyzers and recovery managers Identification of RT-CORBA mappings for consistent scheduling mechanisms for both the middleware as well as the external fault tolerance mechanisms

10 An Example Architecture (1/2) Middleware is responsible for handling communication between the client and the server applications Common fault tolerance functionalities decoupled from the middleware and configured appropriately Interoperable protocols like DDS can be used as a means of communication amongst external fault tolerance mechanisms e.g. fault detectors and fault analyzers Middleware provides certain fault detectors that can help faster failure detections e.g. TCP timeout detections

11 An Example Architecture (2/2) Middleware needs to interact with the external mechanisms to provide failure detections as well as to find out the current status of the replication group Application developer can configure a fault tolerance functionality that is most suited for the application needs e.g. configuring a consistency mechanism that fits the application needs middleware mechanisms works with the configured functionality to ensure real-time QoS

12 Example Replication Strategy Example replication strategy that can provide predictable failure detection and recovery is semi- active replication Replica groups are connected in a chain members monitor the member in front of the chain failure detections by transport connection closure or group communication timeouts Replica, designated as primary processes the requests and sends state updates to the other members using a reliable multicast protocol

13 Unexplored Issues with the Lightweight RT FT Architecture (1/2) Issues on how to configure active replication away from the middleware mechanisms used for communication On the client side, portable interceptors may need to be added to catch exceptions arising from server process failures assuming not using FT- CORBA compliant ORBs Middleware connection management timeouts need to work closely with the recovery manager simultaneous failure detections

14 Unexplored Issues with the Lightweight RT FT Architecture (2/2) Externally configured consistency management mechanisms must work with the middleware connection management, in order to ensure strong consistency weaker consistency models for updating states to primary as state synchronization happens Middleware replication manager might be a single point of failure Process group replication needs help from middleware for synchronizing middleware state recovery granularity at the ORB/POA level

15 Effectively Using Lightweight RT FT Semantic configurability of Fault Tolerance requires collecting application requirements and expected execution profiles target system configurations and resource availabilities Efficient deployment of applications and their replicas appropriate support for resource allocations to applications – application placement algorithms real-time ordering of operations using well known dynamic/hybrid scheduling strategies Understand needs of the applications allows a chance to configure the external fault tolerance mechanisms most suited for the applications allows opportunities for run-time optimizations in providing fault tolerance

16 Lightweight CCM for Enterprise DRE Systems Creates a standard “virtual boundary” around application component implementations that interact only via well- defined interfaces Define standard container mechanisms needed to execute components in generic component servers Specify the infrastructure needed to configure & deploy components throughout a distributed system Container … … … … … provides capabilities for dynamic reconfiguration as well as updating of implementations, so that enterprise DRE system infrastructure can be adaptive to the needs of the applications provides opportunities for reuse of core business logic provides mechanisms to change system behavior at runtime and according to operating environment needs …

17 Applicability of Lightweight CCM to Lightweight RT FT Common fault tolerance functionalities like the fault notification can be encapsulated as Lightweight CCM components attributes can be set on those Lightweight CCM components to specify the QoS they can provide configure attributes using deployment and configuration tools addition of helper components to interact with these Lightweight CCM components interfaces to the helper methods need not change Lightweight CCM implementations can be changed to get different functionality or different QoS …

18 Concluding Remarks Enterprise DRE systems require various degree of deterministic behaviors even in the presence of faults or fluctuating available resources FT-CORBA is not suitable for enterprise DRE systems, esp. very heavyweight and hard to integrate with other systems hard to integrate real-time fault tolerance point solutions and hard to customize Lightweight RT FT CORBA RFP is trying to address the needs of FT/RT applications decoupling of middleware functionalities that can be customized using well known and researched fault tolerance techniques separation of interfaces for easier control using RT-CORBA coupled with Lightweight CCM, can be easily deployed, configured and managed according to the needs of the applications We would like to thank Robert Kukura (Raytheon), Paul Werme (NSWC) and Andrew Foster (Prismtech) for providing useful discussions and information about the Lightweight Real-time Fault Tolerance specification and existing fault tolerance practices and solutions in the industry