21/03/2006 Cryptologie, Sécurité des systèmes & Espionnage industriel 1 Calculs sécurisés adaptatifs sur infrastructure de calcul global Calculs sécurisés.

Slides:



Advertisements
Similar presentations
Distributed Systems Major Design Issues Presented by: Christopher Hector CS8320 – Advanced Operating Systems Spring 2007 – Section 2.6 Presentation Dr.
Advertisements

Christian Delbe1 Christian Delbé OASIS Team INRIA -- CNRS - I3S -- Univ. of Nice Sophia-Antipolis November Automatic Fault Tolerance in ProActive.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Teaser - Introduction to Distributed Computing
REDUNDANT ARRAY OF INEXPENSIVE DISCS RAID. What is RAID ? RAID is an acronym for Redundant Array of Independent Drives (or Disks), also known as Redundant.
Processor-oblivious parallel algorithms and scheduling Illustration on parallel prefix Jean-Louis Roch, Daouda Traore INRIA-CNRS Moais team - LIG Grenoble,
A system Performance Model Instructor: Dr. Yanqing Zhang Presented by: Rajapaksage Jayampthi S.
1 Building Reliable Web Services: Methodology, Composition, Modeling and Experiment Pat. P. W. Chan Department of Computer Science and Engineering The.
CSCE 715 Ankur Jain 11/16/2010. Introduction Design Goals Framework SDT Protocol Achievements of Goals Overhead of SDT Conclusion.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Diagnosis on Computational Grids for Detecting Intelligent Cheating Nodes Felipe Martins Rossana M. Andrade Aldri L. dos Santos Bruno SchulzeJosé N. de.
MPICH-V: Fault Tolerant MPI Rachit Chawla. Outline  Introduction  Objectives  Architecture  Performance  Conclusion.
Smart Redundancy for Distributed Computation George Edwards Blue Cell Software, LLC Yuriy Brun University of Washington Jae young Bang University of Southern.
2/23/2009CS50901 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat.
Reliability on Web Services Pat Chan 31 Oct 2006.
1 Client-Server versus P2P  Client-server Computing  Purpose, definition, characteristics  Relationship to the GRID  Research issues  P2P Computing.
Lecture 11 Reliability and Security in IT infrastructure.
1 of 14 1 Scheduling and Optimization of Fault- Tolerant Embedded Systems Viacheslav Izosimov Embedded Systems Lab (ESLAB) Linköping University, Sweden.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
March 24, 2003Upadhyaya – IWIA A Tamper-resistant Framework for Unambiguous Detection of Attacks in User Space Using Process Monitors R. Chinchani.
SafeScale project C. CERIN, J.L. PAZAT, J.L. ROCH, R. KERYELL
Pregel: A System for Large-Scale Graph Processing
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
RAID Shuli Han COSC 573 Presentation.
Developing Analytical Framework to Measure Robustness of Peer-to-Peer Networks Niloy Ganguly.
FMEA-technique of Web Services Analysis and Dependability Ensuring Anatoliy Gorbenko Vyacheslav Kharchenko Olga Tarasyuk National Aerospace University.
Lecture 3 – Parallel Performance Theory - 1 Parallel Performance Theory - 1 Parallel Computing CIS 410/510 Department of Computer and Information Science.
Introduction Distributed Algorithms for Multi-Agent Networks Instructor: K. Sinan YILDIRIM.
Network Aware Resource Allocation in Distributed Clouds.
SecureMR: A Service Integrity Assurance Framework for MapReduce Author: Wei Wei, Juan Du, Ting Yu, Xiaohui Gu Source: Annual Computer Security Applications.
On Probabilistic Snap-Stabilization Karine Altisen Stéphane Devismes University of Grenoble.
Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.
Young Suk Moon Chair: Dr. Hans-Peter Bischof Reader: Dr. Gregor von Laszewski Observer: Dr. Minseok Kwon 1.
Xiao Liu CS3 -- Centre for Complex Software Systems and Services Swinburne University of Technology, Australia Key Research Issues in.
A Proposal of Application Failure Detection and Recovery in the Grid Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2 1 Institute of Computer Science,
MODULE 12 Control Audit And Security Of Information System 12.1 Controls in Information systems 12.2 Need and methods of auditing Information systems 12.3.
Autonomous Replication for High Availability in Unstructured P2P Systems Francisco Matias Cuenca-Acuna, Richard P. Martin, Thu D. Nguyen
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
The Grid System Design Liu Xiangrui Beijing Institute of Technology.
Distributed Database Systems Overview
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
Practical Byzantine Fault Tolerance
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
Scalable Computing on Open Distributed Systems Jon Weissman University of Minnesota National E-Science Center CLADE 2008.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates Ali Ebnenasir and Jean Mayo {aebnenas, Department.
Oracle's Distributed Database Bora Yasa. Definition A Distributed Database is a set of databases stored on multiple computers at different locations and.
5 May CmpE 516 Fault Tolerant Scheduling in Multiprocessor Systems Betül Demiröz.
Trust-Sensitive Scheduling on the Open Grid Jon B. Weissman with help from Jason Sonnek and Abhishek Chandra Department of Computer Science University.
Analyzing the Vulnerability of Superpeer Networks Against Attack Niloy Ganguly Department of Computer Science & Engineering Indian Institute of Technology,
Mehmet Can Kurt, The Ohio State University Sriram Krishnamoorthy, Pacific Northwest National Laboratory Kunal Agrawal, Washington University in St. Louis.
Software Testing and Quality Assurance Practical Considerations (4) 1.
On-line adaptative parallel prefix computation Jean-Louis Roch, Daouda Traore, Julien Bernard INRIA-CNRS Moais team - LIG Grenoble, France Contents I.
Static Process Scheduling
CS 162 Section 10 Two-phase commit Fault-tolerant computing.
Fault Tolerance and Checkpointing - Sathish Vadhiyar.
Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Distributed Error- Confinement Shay Kutten (Technion) with Boaz Patt-Shamir (Tel Aviv U.) Yossi Azar (Tel Aviv U.)
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
Data Management on Opportunistic Grids
Distributed Systems – Paxos
PREGEL Data Management in the Cloud
EEC 688/788 Secure and Dependable Computing
Fault Tolerance Distributed Web-based Systems
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Replica Placement Model: We consider objects (and don’t worry whether they contain just data or code, or both) Distinguish different processes: A process.
Anand Bhat*, Soheil Samii†, Raj Rajkumar* *Carnegie Mellon University
Presentation transcript:

21/03/2006 Cryptologie, Sécurité des systèmes & Espionnage industriel 1 Calculs sécurisés adaptatifs sur infrastructure de calcul global Calculs sécurisés adaptatifs sur infrastructure de calcul global Thierry Gautier, Samir Jafar, Franck Leprévost+, Jean-Louis Roch, Sébastien Varrette+ and Axel Krings* Projet MOAIS (CNRS,INPG,INRIA,UJF) LIG - IMAG, Grenoble, France + Université du Luxembourg, Luxembourg * Idaho University, Moscow, Idaho.

Page: 2 u Large-Scale Global Computing Systems u Subject Application to Dependability Problems – Can be addressed in the design u Subject Application to Security Problems – Requires solutions from the area of survivability, security, fault-tolerance Target Application

Page: 3 u Computation intensive parallel application – Medical (mammography comparison) Typical Application [RAGTIME] store image

Page: 4 Global Computing Architecture Internet u Large-scale distributed systems (e.g. Grid, P2P) – Eg : BOINC [Berkeley Open Infrastructure for Network Computing] u Transparent allocation of resources User

Page: 5 u Dataflow Graph – G = ( v,  ) v finite set of vertices v i  set of edges e jk vertices v j, v k  v u Two kinds of tasks T i Tasks in the traditional sense D j Data tasks inputs and outputs Definitions and Assumptions

Page: 6 Resource allocation u Assumption on the application : – large number of operations to perform = W 1 (sequential work) – huge degree of parallelism = W  (critical time = parallel work on #procs=  ) – Global computing application framework : W  << W 1 u Allocation : Distributed randomized work-stealing schedule [Cilk98] [Athapascan98] – Local non-preemptive execution of tasks » New created tasks are pushed in a local queue. – When a resource becomes idle, it randomly selects another one that has ready tasks (greedy) and steals the oldest ready task u Provable performances (with huge probability) [Bender-Rabin02] » On-line adaptation to the global computing platform

Page: 7 u In the Survivability Community our general computing environment is referred to as Unbounded Environment – Lack of physical / logical bound – Lack of global administrative view of the system. What risks are we subjecting our applications to? Security issues for a global computation

Page: 8 Assumptions u Anything is possible! » and it will happen! u Malicious act will occur sooner or later u It is hard or impossible to predict the behavior of an attack

Page: 9 Two kinds of failures (1/2) Internet 1. Node failures – “fail stop” model User

Page: 10 Fault Tolerance Approaches u Simplified Taxonomy for Fault Tolerance Protocols u Stable memory to store checkpoints (replication, ECC,.. ) u Two “extreme” protocols (distributed, asynchronous) are distinguished : – Pessimistic : Systematic storage of all events / communications : » Large overhead but ensures small restart time [MPICH-V1] – Optimistic : only events that ensure causality relations are stored [Com. induced] » Overhead is reduced but more recomputations in case of fault [Satin 05] u Compromises : – Non-coordinated : periodic local checkpoint of the tasks queue – Coordinated : global checkpoint of the stacks FT Protocol DuplicationCheckpointing Uncoordinated Message-Logging Communication- induced PessimisticCoordinatedOptimisiticCausal

Page: 11 Pessimistic [SEL] storage versus non-coordinated com. induced [TIC] 17.6% 23.5% 18.7% Application : Quadratic Assignment Problem with Kaapi : QAP-Nugent 24 [Cung&al 05]

Page: 12 Two kinds of failures (2/2) Internet 2. Task forgery – “massive attacks” User worm,virus bad result

Page: 13 Fault Models u Simplified Fault Taxonomy u Fault-Behavior and Assumptions – Independence of faults – Common mode faults -> towards arbitrary faults! u Fault Sources – Trojan, virus, DOS, etc. – How do faults affect the overall system? Fault BenignMalicious SymmetricAsymmetric

Page: 14 u Attacks – single nodes, difficult to solve with certification strategies – solutions: e.g. intrusion detection systems (IDS) u Massive Attacks – affects large number of nodes – may spread fast (worm, virus) – may be coordinated (Trojan) u Impact of Attacks – attacks are likely to be widespread within neighborhood, e.g. subnet u Our focus: massive attacks – virus, trojan, DoS, etc. Attacks and their impact

Page: 15 u Mainly addressed for independent tasks u Current approaches – Simple checker [Blum97] – Voting [eg BOINC, – Spot-checking [Germain-Playez 2003, based on Wald test] – Blacklisting – Credibility-based fault-tolerance [Sarmenta 2003] – Partial execution on reliable resources (partitioning) [Gao-Malewicz 2004] – Re-execution on reliable resources u Certification of Computation to detect massive attacks Certification Against Attacks

Page: 16 u GCP includes workers, checkpoint server and verifiers Global Computing Platform (GCP)

Page: 17 u Monte Carlo certification: – a randomized algorithm that 1. takes as input E and an arbitrary , 0 <  ≤1 2. delivers n either CORRECT n or FAILED, together with a proof that E has failed – certification is with error  if the probability of answer CORRECT, when E has actually failed, is less than or equal to . u Interest –  : fixed by the user (tunable certification) – Number of executions by the verifiers is not to large with respect of the number of tasks Probabilistic Certification

Page: 18 Protocols MCT and EMCTs u The Basic Protocol: The Monte Carlo Test (MCT) [SBAC04] 1. Uniformly select one task T in G we know input i(T,E) and output o(T,E) of T from checkpoint server 2. Re-execute T on verifier, using i(T,E) as inputs, to get output ô(T,E) If o(T,E) ≠ ô(T,E) return FAILED 3. Return CORRECT u Results about extended MCT (EMCTs) [EIT-b 2005] – Number N of re-execution depends – where  G depends on the graph structure, the ratio of tasks forgeries and of the protocol – E.g.: For massive attack and independent tasks:  G = q

Page: 19 u How many independent executions of MCT are necessary to achieve certification of E with probability of error ≤  ? – Prob. that MCT selects a non-forged tasks is – N independent applications of MCT results in  ≤ (1 - q) N Certification of Independent Tasks

Page: 20 u Relationship between certification error and N Certification of Independent Tasks For q = 1% : 300 checks =>  < 5% 4611 checks =>  < checks =>  <

Page: 21 u Algorithm EMCT 1. Uniformly select one task T in G 2. Re-execute all T j in G ≤ (T), which have not been verified yet, with input i(T,E) on a verifier and return FAILED if for any T j we have o(T j,E) ≠ ô(T j,E) 3. Return CORRECT u Behavior – disadvantage: the entire predecessor graph needs to be re-executed – however: the cost depends on the graph » luckily our application graphs are mainly trees Task dependencies

Page: 22 u Results of independent tasks still hold, – but N hides the cost of verification » independent tasks: C = 1 » dependent tasks: C = |G ≤ (T)| Analysis of EMCT

Page: 23 For EMCT the entire predecessor graph had to be verified To reduce verification cost two approaches are considered next: 1. Verification with fractions of G ≤ (T) 2. Verification with fixed number of tasks in G ≤ (T) Reducing the cost of verification

Page: 24 u Number of effective initiators – this is the # of initiators as perceived by the algorithm – e.g. for EMCT an initiator in G ≤ (T) is always found, if it exists u Efficient massive attack detection in the framework W  << W 1 Results for pathological cases

Page: 25 u Programming an application on a Global computing platform : – Designing adaptive algorithm for efficient resource allocation u Managing resource resilience and crash faults : – Tuned fault-tolerance protocol to decrease overhead – Key problem : efficient distributed stable memory [ECC promising] u Managing malicious intrusions : – Detection of massive attacks » Efficient probabilistic certification – Protection against local attacks : » Redundant computations » Self fault-tolerant algorithms [eg Lamport sorting network] [Varrette06] Conclusion

Page: 26 Questions? [89] Samir Jafar, Varrette Sébastien, and Jean-Louis Roch. Using data-flow analysis for resilience and result checking in peer-to-peer computations. In IEEE DEXA'2004, Zaragoza, August [92] Sébastien Varrette, Jean-Louis Roch, and Franck Leprévost. Flowcert: Probabilistic certification for peer-to-peer computations. IEEE SBAC-PAD 2004, pages , Foz do Iguacu, Brazil, October [97] Axel W. Krings, Jean-Louis Roch, and Samir Jafar. Certification of large distributed computations with task dependencies in hostile environments. IEEE EIT 2005, Lincoln, May [99] Samir Jafar, Thierry Gautier, Axel W. Krings, and Jean-Louis Roch. A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing. EUROPAR'2005, Lisbonne, August [104]J.L Roch & AHA Team. Adaptive algorithms : theory and application. SIAM Parallel Processing 2006, San Francisc, February 2006