ABCSG - Dependable Systems - 01/06/2006 1 ABCSG Dependable Systems.


Agenda
- Dependable computing
  - Basic concepts
    - Definitions
    - Attributes
    - Threats
  - Means to attain dependability
    - Fault prevention
    - Fault removal
    - Fault forecasting
    - Fault tolerance
      -> Branch into techniques
      -> Branch into Coordinated Atomic Actions

Dependable Computing - Definition
- The ability to deliver service that can justifiably be trusted
or
- The ability of a system to avoid service failures that are more frequent or more severe than is acceptable

Dependable Computing - Attributes

Dependable Computing - Threats
- Anything that can influence the system in such a way that it falls outside the definition of dependable
  - Development phase
    - Physical world
    - Human developers
    - Development tools
    - Production and test facilities
  - Use phase
    - Physical world
    - Administrators
    - Users of services
    - Providers of services
    - Infrastructure
    - Intruders

Means - Fault prevention
- A failure is the result of an error
- An error is the result of a fault
  => Prevent faults = prevent failures
- Basically we all know how (right?)
  - Information hiding
  - Modularization
  - Strongly typed languages
  - ...

Means - Fault removal
- During development
  - (also test fault tolerance by fault injection)
- During use
  - Corrective maintenance
  - Preventive maintenance

Means - Fault forecasting
- Performing an evaluation of the system behavior with respect to fault occurrence or activation
- Qualitative evaluation
  - Identify the failure modes or the event combinations that would lead to system failure
- Quantitative evaluation
  - Determine, in terms of probabilities, the extent to which some of the attributes of dependability are satisfied

Means - Fault tolerance
- Fault prevention includes human activities and is thus imperfect => we need fault removal
- Fault removal includes human activities and is thus imperfect => we need fault forecasting
- Fault forecasting includes human activities and is thus imperfect => we need fault tolerance
- Fault tolerance includes human activities and is thus imperfect => systems will fail
... but a combination of all the aforementioned techniques gives the best chance of dependable computing
... so let's have a look at fault tolerance

Fault tolerance
- Recall that fault tolerance is one of the means to attain dependable systems
- Terminology and key concept
  - Fault -> Error -> Failure
  - Failure semantics
  - Redundancy
- Techniques
  - Sequential systems
  - Independent concurrent systems
  - Competitive concurrent systems
  - Cooperative concurrent systems
  - Hybrid systems

Fault tolerance - Terminology and key concept
- A failure is the observation of an erroneous system state
- An error is an erroneous system state, which might lead to a failure
- A fault is a system defect, which might lead to an error

Fault tolerance - Terminology and key concept
English
- A failure is a consequence of an error, which is the consequence of a fault
  - Fault => Error => Failure
Danish (where all three terms are rendered by the same word, "fejl")
- A "fejl" is the consequence of a "fejl", which is the consequence of a "fejl"
  - Fejl => Fejl => Fejl
  (Think about that one for a moment)

Fault tolerance - Terminology and key concept
- Between an error and a failure there is a window of opportunity: an error might lead to a failure, but it can still be handled before it does
- Redundancy is the key concept

Fault tolerance - Sequential systems
- Recovery blocks - redundant algorithms
- Retry blocks - redundant data
An acceptance test examines the system state to verify that the behavior is acceptable
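
The two sequential techniques can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every name here (recovery_block, retry_block, the example functions) is hypothetical, the algorithms are assumed not to modify their input (so no checkpoint has to be restored between attempts), and an exception is simply treated as a failed attempt.

```python
class NoAcceptableResult(Exception):
    """Raised when every attempt crashed or was rejected by the acceptance test."""


def recovery_block(alternates, acceptance_test, data):
    """Redundant algorithms: try each alternate on the same data in turn."""
    for algorithm in alternates:
        try:
            result = algorithm(data)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise NoAcceptableResult("all alternates failed")


def retry_block(algorithm, acceptance_test, data, re_expressions):
    """Redundant data: run one algorithm on the data, then on re-expressed data."""
    variants = [lambda d: d, *re_expressions]  # original data first
    for re_express in variants:
        try:
            result = algorithm(re_express(data))
        except Exception:
            continue
        if acceptance_test(result):
            return result
    raise NoAcceptableResult("all data re-expressions failed")


# Hypothetical example functions with seeded faults.
def faulty_sort(xs):
    return list(xs)  # seeded fault: forgets to sort


def fallback_sort(xs):
    return sorted(xs)


def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


def fragile_total(xs):
    if xs and xs[0] == 0:
        raise ValueError("seeded fault: cannot handle a leading zero")
    return sum(xs)
```

With these definitions, recovery_block([faulty_sort, fallback_sort], is_sorted, [3, 1, 2]) rejects the primary's output and falls back to the second algorithm, while retry_block(fragile_total, ..., [0, 1, 2], [reverse]) succeeds on the reversed list because reversing does not change the sum.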

Fault tolerance - Independent concurrent systems
- N-version programming - the parallel version of recovery blocks
- N-copy programming - the parallel version of retry blocks
The decision mechanism must decide whether one of the results can be considered correct
... and this is not an easy task!
  - Multiple correct results, floating-point precision, ...
  - Exact majority voter, mean voter, consensus voter, etc.
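
To see why the decision mechanism is hard, here is a rough Python sketch of two of the voters named above, plus a tolerance-based variant that is not on the slide and is added only to illustrate the floating-point problem. All function names are hypothetical, and the versions are assumed to have already run and returned their results as a list.

```python
from collections import Counter
from statistics import mean


class NoMajority(Exception):
    """Raised when no result is backed by more than half of the versions."""


def exact_majority_vote(results):
    """Exact majority voter: more than half of the results must be identical."""
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) / 2:
        return value
    raise NoMajority(f"no exact majority among {results!r}")


def mean_vote(results):
    """Mean voter: always answers, but a single wild result skews the output."""
    return mean(results)


def tolerant_majority_vote(results, tolerance):
    """Assumed variant: results within `tolerance` of a candidate count as agreeing."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if len(agreeing) > len(results) / 2:
            return mean(agreeing)
    raise NoMajority(f"no majority within {tolerance} among {results!r}")
```

For the three results [0.1 + 0.2, 0.3, 0.7] the exact voter finds no majority, because 0.1 + 0.2 and 0.3 differ in the last bit as floating-point numbers, whereas the tolerant variant with a small tolerance groups them and returns a value close to 0.3.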

Fault tolerance - Competitive concurrent systems
- Two or more processes are not aware of each other, but share some resources
- Each wants to live in its own environment, and a fault in one process should not affect the other processes
- Transactions
  - Atomicity / Consistency / Isolation / Durability
  - Provide backward error recovery
  - Together with exception handling, transactions can be used to provide forward error recovery
  - In self-checking transactional objects, methods are decorated with a precondition and a postcondition
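
A minimal Python sketch of backward error recovery with a self-checking operation follows. It is an assumption-laden illustration, not a transaction system: it covers only the atomicity aspect (no isolation between processes, no durability), and the names Account, transaction, withdraw and ConditionViolated are all hypothetical.

```python
import copy
from contextlib import contextmanager


class ConditionViolated(Exception):
    """Raised when a precondition or postcondition does not hold."""


class Account:
    def __init__(self, balance):
        self.balance = balance


@contextmanager
def transaction(obj):
    """Snapshot the object's state; restore it if the body raises (rollback)."""
    snapshot = copy.deepcopy(obj.__dict__)
    try:
        yield obj
    except Exception:
        obj.__dict__.clear()
        obj.__dict__.update(snapshot)  # backward error recovery
        raise


def withdraw(account, amount):
    """Self-checking operation guarded by a precondition and a postcondition."""
    if amount <= 0:
        raise ConditionViolated("precondition: amount must be positive")
    account.balance -= amount
    if account.balance < 0:
        raise ConditionViolated("postcondition: balance must not be negative")
```

If two withdrawals of 30 and 100 are made from an account holding 100 inside one `with transaction(account):` block, the second violates the postcondition, the exception propagates out of the block, and the balance is back at 100 afterwards. Catching the exception and taking a corrective action instead of re-raising would be the starting point for forward error recovery.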

Fault tolerance - Cooperative concurrent systems
- Several processes cooperate in executing a common job, and they are aware of each other
- Conversation
  - Works like a transaction involving several processes
  - It is an isolated environment for the participating processes: they are not allowed to communicate outside the conversation (information smuggling)
  - Ultimately either everybody commits or everybody rolls back to the state at the beginning of the conversation - backward error recovery
- Atomic actions
  - An atomic action is a conversation, but with the ability to do forward error recovery
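
The all-or-nothing outcome of a conversation can be sketched as follows. This is a deliberately simplified, sequential simulation: real participants run concurrently and exchange messages inside the conversation, which is not modelled here, and the names Participant and run_conversation are hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Participant:
    name: str
    state: Dict[str, Any]
    work: Callable[[Dict[str, Any]], None]             # mutates the state in place
    acceptance_test: Callable[[Dict[str, Any]], bool]  # checked at the exit line


def run_conversation(participants: List[Participant]) -> bool:
    """Return True if everybody commits, False if everybody was rolled back."""
    # Entry line: every participant checkpoints its state.
    checkpoints = [copy.deepcopy(p.state) for p in participants]
    try:
        for p in participants:
            p.work(p.state)
        # Exit line: all acceptance tests must pass.
        accepted = all(p.acceptance_test(p.state) for p in participants)
    except Exception:
        accepted = False
    if not accepted:
        # Backward error recovery: everybody returns to the entry state.
        for p, checkpoint in zip(participants, checkpoints):
            p.state = checkpoint
    return accepted
```

Note that a participant whose own acceptance test passed is still rolled back when another participant fails, which is exactly the "everybody commits or everybody rolls back" rule.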

Fault tolerance - Hybrid systems
- Models that support both competitive and cooperative concurrency
- Coordinated atomic actions
  - An atomic action, but with the possibility for the participants to access external objects
  - Atomic actions control cooperative concurrency and coordinated error recovery
  - Transactions control competitive concurrency, maintaining the consistency of the shared resources in case of failures

Coordinated Atomic Actions
... must wait for another day, I think time is up!