
1 Stabilization Enabling Technology Shlomi Dolev

2 Trustworthy Systems: Why is it So Hard? Corbató ’91: "It almost goes without saying that ambitious systems never quite work as expected." http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm From the Pentium manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition."

3 Mars Rover - Spirit. "…The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems… The operating system is Wind River Systems' VxWorks… …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot." http://www.eetimes.com/sys/news/OEG20040220S0046

4 Linux and Windows do not Stabilize

5 Self-Stabilization. Self-healing, Self-managing, Self-*: Recovery Oriented Computing [Berkeley, Stanford]; Autonomic Computing [IBM]. Self-Stabilization: self-stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra ’74].

6 Well Established Theory !

7 Self-Stabilization. The combination and type of faults cannot be fully anticipated in on-going systems. Any on-going system must be self-stabilizing (or manually monitored).

8 First Self-Stabilizing Algorithm: Token Passing [Dij74]

9 Token Passing
    P_1: do forever
        if x_1 = x_n then
            x_1 := (x_1 + 1) mod (n + 1)
    P_i (i ≠ 1): do forever
        if x_i ≠ x_{i-1} then
            x_i := x_{i-1}

10 Token Passing Cont. Surely works when we start in x_1 = x_2 = … = x_n = 0. One processor may change its state at a time: {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} …
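The legal execution listed above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function names `enabled`, `step` and `run` are illustrative, processors are indexed from 0 (so index 0 plays the role of P_1), and the scheduler always picks the first enabled processor.

```python
def enabled(x):
    """Return the 0-based indices of processors that currently hold a token."""
    n = len(x)
    holders = [0] if x[0] == x[n - 1] else []      # P_1 holds a token when x_1 = x_n
    holders += [i for i in range(1, n) if x[i] != x[i - 1]]
    return holders


def step(x, i):
    """Let processor i execute one iteration of its loop; return the new state."""
    n = len(x)
    y = list(x)
    if i == 0:
        if x[0] == x[n - 1]:
            y[0] = (x[0] + 1) % (n + 1)
    elif x[i] != x[i - 1]:
        y[i] = x[i - 1]
    return y


def run(x, steps):
    """Run `steps` moves, always scheduling the first enabled processor."""
    trace = [list(x)]
    for _ in range(steps):
        x = step(x, enabled(x)[0])
        trace.append(x)
    return trace


trace = run([0, 0, 0, 0, 0], 10)
for state in trace:
    print(state)
```

Starting from the all-zero state, every state in the trace has exactly one enabled processor, which is the token holder.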

11 Token Passing: Faults. Transient faults (soft errors, wrong CRC, unexpected temporary severe conditions, etc.) assign each processor an arbitrary state (in the range of its state space). For example, in {3; 4; 4; 1; 0}, p_2, p_4 and p_5 all have tokens! Will the system ever recover?

12 Token Passing: Automatic Recovery. p_1 changes state infinitely often. Otherwise, let s_1 be the fixed state of p_1: p_2 eventually copies s_1 from p_1, then p_3 eventually copies s_1 from p_2, then ... p_n eventually copies s_1 from p_{n-1}, and then x_1 = x_n, so p_1 changes state, a contradiction. p_1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...

13 Token Passing: Automatic Recovery Cont. In any initial state at least one state value is missing, because there are n processors but n+1 possible values. For example, in {4; 4; 1; 0; 2}, the values 3 and 5 are missing. Once p_1 reaches a missing value, e.g., 5, all the other processors must copy 5 before p_1 reads 5 from p_n and changes state to 0.
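The recovery argument can be checked experimentally. The sketch below is an assumption-laden illustration, not the talk's code: `token_holders`, `move` and `stabilize` are hypothetical names, indices are 0-based, and the scheduler is modelled as a seeded random choice among the processors that hold a token (one processor moves at a time).

```python
import random


def token_holders(x):
    """0-based indices of the processors whose guard is true (they hold a token)."""
    holders = [0] if x[0] == x[-1] else []
    return holders + [i for i in range(1, len(x)) if x[i] != x[i - 1]]


def move(x, i):
    """Apply the move of token holder i in place."""
    if i == 0:
        x[0] = (x[0] + 1) % (len(x) + 1)
    else:
        x[i] = x[i - 1]


def stabilize(x, rng, max_steps=10_000):
    """Run until exactly one token remains; return (number of moves, final state)."""
    x = list(x)
    for steps in range(max_steps + 1):
        holders = token_holders(x)
        if len(holders) == 1:
            return steps, x
        move(x, rng.choice(holders))
    raise RuntimeError("did not stabilize within max_steps")


rng = random.Random(0)                       # fixed seed, chosen arbitrarily
print(token_holders([3, 4, 4, 1, 0]))        # the faulty state from slide 11
steps, final = stabilize([3, 4, 4, 1, 0], rng)
print(steps, final, token_holders(final))
```

For n = 5 the whole state space (6^5 states) is small enough to test exhaustively with the same `stabilize` function.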

14 Will It Stabilize With mod (n - 2)? Mod 3, n = 5: {0,0,2,1,0}, p_1 moves: {1,0,2,1,0}, p_5 moves: {1,0,2,1,1}, p_4 moves: {1,0,2,2,1}, p_3 moves: {1,0,0,2,1}, p_2 moves: {1,1,0,2,1}, which is the initial state +1 mod 3!
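The counterexample can be replayed mechanically. In this sketch (names such as `move` and `tokens` are illustrative, indices are 0-based, and the modulus `k` is passed explicitly) each scheduled processor must have its guard enabled, otherwise an exception is raised.

```python
def move(x, i, k):
    """Execute the enabled move of processor i with counters taken mod k."""
    y = list(x)
    if i == 0:
        if x[0] != x[-1]:
            raise ValueError("P_1 is not enabled")
        y[0] = (x[0] + 1) % k
    else:
        if x[i] == x[i - 1]:
            raise ValueError(f"P_{i + 1} is not enabled")
        y[i] = x[i - 1]
    return y


def tokens(x):
    """Number of processors that hold a token in state x."""
    return (x[0] == x[-1]) + sum(x[i] != x[i - 1] for i in range(1, len(x)))


start = [0, 0, 2, 1, 0]
schedule = [0, 4, 3, 2, 1]          # p_1, p_5, p_4, p_3, p_2 as on the slide

state = start
for i in schedule:
    state = move(state, i, 3)

shifted = [(v + 1) % 3 for v in start]
print(state, shifted, tokens(start), tokens(state))
```

Because the resulting state is the initial state shifted by one, the same schedule can be repeated forever and the number of tokens never drops to one.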

15 Is Self-Stabilization a Toy?

16 Stabilization Stack: Self-Stabilizing Microprocessor [DH04, DH06]; Self-Stabilizing Operating System [DY04]; Self-Stabilization Preserving Compiler [DH05]; Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03]; Recovery Oriented Programming [BD05].

17 Implementation Bottleneck. Ask Intel, AMD, IBM to design a self-stabilizing microprocessor… Technology for converting an off-the-shelf processor to be self-stabilizing [DH06]. Ask Microsoft, IBM, Red Hat to convert existing OS code to be self-stabilizing… Stabilizing Virtual Machine [DY07].

18 Enforcing stabilization by resetting. Processors behave correctly after reset. Periodic reset ensures correct behavior, but damages closure… Careful solutions are needed.

19 Periodic Reset Monitor. Find a location P in the OS code that is reached at least once every T time units. At P: save the necessary information to RAM, request a reset and loop forever. The stabilizing watchdog accepts the request and resets the processor. Upon reset: restore the information and continue. The stabilizing watchdog verifies that a reset is performed at least every T + epsilon time.
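A minimal sketch of the watchdog side of this scheme, assuming a discrete tick model. The class name `StabilizingWatchdog`, the `tick` method and the clamping of the counter are illustrative assumptions, not the design from [DH06]; the point is only that a corrupted counter cannot postpone the reset beyond the period.

```python
class StabilizingWatchdog:
    """Toy watchdog: forces a reset at least once every `period` + 1 ticks."""

    def __init__(self, period):
        self.period = period      # stands for T + epsilon, assumed hardwired
        self.counter = period     # soft state, may be corrupted by a fault

    def tick(self, reset_requested=False):
        """Advance one tick; return True if the processor is reset on this tick."""
        # An out-of-range counter is forced back into [0, period], so an
        # arbitrary value cannot delay the next reset indefinitely.
        if not 0 <= self.counter <= self.period:
            self.counter = self.period
        if reset_requested or self.counter == 0:
            self.counter = self.period
            return True
        self.counter -= 1
        return False


wd = StabilizingWatchdog(period=5)
wd.counter = 10**9                # simulate a transient fault
ticks_until_reset = 0
while True:
    ticks_until_reset += 1
    if wd.tick():
        break
print(ticks_until_reset)
```

An OS that reaches location P simply calls `tick(reset_requested=True)` in this model; if it never does, the counter expiry triggers the reset instead.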

20 Implementation using the Intel XScale core. Used in numerous processors: network, I/O, handheld, cellular, etc. RISC architecture (ARMv5 compatible). The debug interface allows interaction between the WD (watchdog) and the OS. An external debug break is used for notifying the OS of the upcoming reset.

21 Up to now: a virtual self-stabilizing processor on top of a commercial-quality processor. Towards repeating the concept in OSs and VMMs (enforcing configuration and protecting critical operations).

22 Toward Self-Stabilizing Operating System (SOS) Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza

23 Basic Directions. Black-box: take an existing OS (Unix, Windows, RTOS) and add a stabilization layer. Carefully tailoring a tiny kernel: processor scheduling, memory management, device drivers, hosting Byzantine processes.

24 Assumptions. Every configuration (processor/memory) is possible. At least some program code is hardwired (in ROM) and is correct: Harvard model. Processor: the instruction manual (e.g. x86/IA-32) defines a transition function. The processor is self-stabilizing [DH04].

25 Black Box. Periodic Reset, Re-install and Execute (RRE): watchdog timer (self-stabilizing); periodic processor reset; during bootstrap the OS is re-installed from ROM. Weak self-stabilization: E = (c_i, a_i, c_{i+1}, …, RRE, c_1, a_1, c_2, a_2, …, c_i, a_i, c_{i+1}, …, RRE, c_1, a_1, c_2, a_2, …). Is it always acceptable? Alternative: periodically re-install the code only, and add consistency checks and enforcement.

26 Tailored Kernel: Tiny Scheduler, Tiny Memory Manager. Requirements: self-stabilizing; fair; preserving process stabilization (e.g. validity of the P.C. value).

27 Tiny SOS Scheduler
    ; increase task
    mov word ax, [currentProc]
    and ax, PROC_MASK
    ...
    ; load task state
    ...
    ; restore ip
    mov ax, [bx+4]
    ; validate ip
    and ax, IP_MASK
    mov word [ss:STACK_TOP], ax
    ; restore general registers
    mov cx, word [bx+12]
    mov dx, word [bx+14]
    mov si, word [bx+16]
    mov di, word [bx+18]
~70 lines of real machine assembly code. 16-bit real mode & 32-bit protected mode. Standard build and emulation tools (Nasm, ld, Bochs). Detailed proof of requirement preservation.
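The two `and` instructions in the excerpt suggest a masking idea: whatever value a corrupted variable holds, masking forces it back into a valid range. The Python sketch below illustrates that idea only. The constants, the table layout and the function `schedule_next` are assumptions made for illustration (a power-of-two number of processes is assumed so that a bit mask yields a valid index); it is not a translation of the elided assembly.

```python
NUM_PROCS = 4                    # assumed: a power of two, so a mask suffices
PROC_MASK = NUM_PROCS - 1
IP_MASK = 0x0FFF                 # assumed: each process owns a 4 KiB code window


def schedule_next(current_proc, table):
    """Pick the next process and a validated instruction pointer.

    `current_proc` and the stored ip values may be arbitrary (corrupted);
    masking guarantees the results are always in range.
    """
    nxt = (current_proc + 1) & PROC_MASK     # "increase task" + validate index
    ip = table[nxt]["ip"] & IP_MASK          # "restore ip" + "validate ip"
    return nxt, ip


table = [{"ip": 0x0010}, {"ip": 0xFFFF1234}, {"ip": 0x0200}, {"ip": 0x7ABC}]

current = 0xBEEF                 # corrupted scheduler state
visited = []
for _ in range(NUM_PROCS):
    current, ip = schedule_next(current, table)
    visited.append((current, hex(ip)))
print(visited)
```

Within `NUM_PROCS` consecutive invocations every process index is visited once, which is the fairness property the requirements slide asks for.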

29 Tiny SOS Scheduler (state diagram). States: Any State; Establish Scheduler Consistency; Next Process Validated & Ready; Process(ing). Transitions: Some Error; Clock tick / execute next.

30 Tiny SOS Scheduler (state diagram). States: Any State; Establish Scheduler Consistency; Next Process Validated & Ready; Process(ing). Transitions: NMI / load PC with scheduler handler; Clock tick / execute next.

31 Sketch of Proof. In every execution E, the scheduler code is started, and executed from its first instruction to its last, infinitely often. In every execution E of the scheduler, each process is executed infinitely often. The self-stabilizing scheduler preserves the stabilization of processes.

32 Talk Outline: Self-Stabilizing Microprocessor [DH06]; Self-Stabilizing Operating System [DY04]; Self-Stabilization Preserving Compiler [DH05]; Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03]; Recovery Oriented Programming [BD05].

33 Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel

34 The Gap. We need a transformation between an input program P written in a high-abstraction language, e.g., (D)ASM, and an output program Q in a machine language, say, JVM. Existing compilers? P and Q behave the same when started in the initial state. What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?

35 Trivial Example. A statement of the form "for each i in {0..9} do f(i)" may be compiled to:
    mov ax, 10
    mov cx, 0
loop1:
    push cx
    call f
    inc cx
    cmp cx, ax
    jne loop1
Start with cx=12 inside the loop… Moreover: any runtime mechanism can get stuck / inconsistent.
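To see why starting with cx=12 inside the loop is a problem, the loop can be emulated with a 16-bit counter. This is a sketch: `run_loop` is a hypothetical helper, f is reduced to a call counter, and the "robust" variant (continuing only while cx is below the limit, as an unsigned `jb` would) is an illustrative alternative, not the compiler's actual output.

```python
WORD = 1 << 16                    # cx is a 16-bit register and wraps around


def run_loop(cx, keep_looping, limit=10):
    """Emulate the loop body entered at `push cx`; return how often f is called."""
    calls = 0
    while True:
        calls += 1                        # push cx / call f
        cx = (cx + 1) % WORD              # inc cx
        if not keep_looping(cx, limit):   # cmp cx, ax / conditional jump
            return calls


def jne(cx, limit):               # exit only on exact equality
    return cx != limit


def jb(cx, limit):                # exit as soon as cx is not below the limit
    return cx < limit


print(run_loop(0, jne), run_loop(0, jb))      # fault-free start
print(run_loop(12, jne), run_loop(12, jb))    # corrupted start, cx = 12
```

With the equality test, the corrupted counter has to wrap around the whole 16-bit range before the loop exits; with the range test, the loop exits after a single extra call.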

36 Stabilization Preserving Compiler: a closer look. Ensuring that Q eventually behaves as P. (Diagram: state space of P; state space of Q.)

37 The Transformation. (Diagram.) Source: variable declarations; upon condition_1 do Statement_1; …; upon condition_n do Statement_n. Output: enforce invariants; scheduler.

38 Self-Stabilization Preserving Compiler: Summary Front end of compiler for ASM. Self Stabilization preserving compiler. Language with clear semantics from any state. New demands for a compiler.

39 Talk Outline: Self-Stabilizing Microprocessor [DH04]; Self-Stabilizing Operating System [DY04]; Self-Stabilization Preserving Compiler [DH05]; Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03]; Recovery Oriented Programming [BD05].

40 Self-Stabilization and Evolving Systems. Real-world systems cannot be verified exhaustively… We enforce safety and liveness specifications: a contract between the client, project manager and programmers that is checked online! Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing. A change can be made to the implementation/specification to support evolving environments.

41 Self-Stabilizing Recoverer for Eventual Byzantine Software Olga Brukman, Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Hillel Kolodner, Haifa Research Labs IBM, Israel

42 Software Contains Bugs Heisenbugs, corrupt states, leaked resources are common… Correct and faultless SW is hard Long-lived running programs, e.g., OS Usually software is tested when starting from initial state and considering limited time scenarios.

43 Fault Model Reflecting Reality Software packages can be trusted to work as required after restart. Eventual Byzantine software. System administrators and users use reboot to deal with faults.

44 Middleware Architecture. (Diagram labels: OS; Kernel; OMR; 1, 2, …, n.)

45 Monitor-Restarter for Process and Subsystem. (Diagram labels: 1, 2, ….)

46 Restart Actions: Mature Approach. A subsystem waits for the completion of a restart of its components. The restart action may vary, depending on the component's internal state: reschedule; roll back; kill & restart. A few restart attempts are made, with increasingly drastic restart actions.
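The escalation policy can be sketched as a loop over actions ordered from mildest to most drastic. Everything here is hypothetical: the component is modelled as a dictionary, and `recover`, `reschedule`, `roll_back`, `kill_and_restart` and `healthy` are illustrative names rather than parts of the system described in the talk.

```python
def reschedule(component):
    """Mildest action: nothing in the component's state is changed."""


def roll_back(component):
    """Restore the last checkpoint (which may itself be corrupted)."""
    component["state"] = component["checkpoint"]


def kill_and_restart(component):
    """Most drastic action: return to the trusted initial state."""
    component["state"] = "initial"
    component["checkpoint"] = "initial"


def healthy(component):
    return component["state"] != "corrupt"


def recover(component, actions, is_healthy):
    """Try each action in order; return the name of the one that worked."""
    for name, action in actions:
        action(component)
        if is_healthy(component):
            return name
    return None


ACTIONS = [("reschedule", reschedule),
           ("roll back", roll_back),
           ("kill & restart", kill_and_restart)]

component = {"state": "corrupt", "checkpoint": "corrupt"}
used = recover(component, ACTIONS, healthy)
print(used, component)
```

Ordering the actions this way keeps recovery cheap in the common case, while the final action relies on the assumption that software works as required after a restart.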

47 Computational Model: rsf-execution. An execution E is an rsf (restart supporting fair) execution iff E is a fair execution in which every subsystem sub_i that is initialised during E respects its specification function ss_i. Requirement: every rsf-execution E has a suffix in which the system respects its specification function ss.

48 Tools for Implementation: Black Box Approach. The software package is a black box. The package is monitored by recording its I/O (e.g., strace in Linux). Monitors are independent of the specific implementation.

49 Tools for Implementation – Transparent Box Approach Software package implementation tool is known. Run-Time Reflection tools are used to monitor and restart the package. Possible in Java, C++, CORBA, COM.

50 Practical Experience: Printers. Problem: a corrupted pdf, doc or ps file is sent to the printing server. The printer can't print the file, which causes retries by the printing server, so the printer is "stuck" on one job. Predicate for the printing server: restrict the number of retries, try format conversions, send an error message to the user.

51 Recovery Oriented Programming Olga Brukman and Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel

52 Towards Robust Software. Programming: structural programming, OOD, design patterns… Testing and debugging: unit testing [JUnit, CppUnit]… Design by Contract (Eiffel)… Formal specification languages: ASM, IO Automata, NURPL. Model checking. Online recovery: ROC [PBB02]; Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software [BDK03] (black-box software packages).

53 Our Contribution Program invariants derived from design specifications. Checked every time invariant variables are updated. Automatic code generation for invariant verification and recovery upon invariant violation. Invariants are verified during runtime. Change of invariant variable is pre-checked in sand-box. Violation is prevented and replaced with a recovery action.

54 Our Contribution Cont. The recovery action is chosen depending on the current state and history: roll back & resume; wait; reschedule; kill & restart.
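The pre-check-then-recover pattern described on these two slides can be sketched as follows. This is an assumed, simplified model: `GuardedState`, its `update` method and the example invariant are illustrative, the "sand-box" is just a deep copy of the variables, and the recovery action shown is the simplest one (keep the committed values and resume).

```python
import copy


class GuardedState:
    """Variables whose updates are pre-checked against an invariant."""

    def __init__(self, values, invariant, recovery):
        self.values = dict(values)
        self.invariant = invariant
        self.recovery = recovery
        self.history = []                     # rejected updates, for later decisions

    def update(self, name, value):
        trial = copy.deepcopy(self.values)    # sand-box copy of the variables
        trial[name] = value
        if self.invariant(trial):
            self.values = trial               # commit only a safe update
            return True
        self.history.append((name, value))    # violation prevented...
        self.recovery(self)                   # ...and replaced by a recovery action
        return False


def queue_invariant(v):
    return 0 <= v["used"] <= v["capacity"]


def roll_back_and_resume(guarded):
    """Committed values were never touched, so there is nothing to undo."""


q = GuardedState({"used": 3, "capacity": 8}, queue_invariant, roll_back_and_resume)
ok = q.update("used", 5)
rejected = q.update("used", 9)
print(ok, rejected, q.values, q.history)
```

A recovery function that inspects `history` could escalate to wait, reschedule or kill & restart, in line with choosing the action from the current state and history.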

55 External Monitoring. Monitor the whole task, to handle a transient fault after which the invariant variables are never changed (so no invariant checks are done). This is a liveness problem: monitor over time.

56 Talk Conclusions. Self-stabilization is an effective paradigm for creating robust systems. A rigorous approach for designing basic system components: microprocessor; virtual machine monitor; operating system; compiler; evolving and recovery-oriented software.

59 Self-Stabilizing Virtual Machine Hypervisor Architecture for Resilient Cloud Alexander Binun, Shlomi Dolev, Reuven Yagel {binun,dolev,yagel}@cs.bgu.ac.il Martin Kahil, Mark Bloch, Boaz Menuhin {kahilm,mbloch}@post.bgu.ac.il, boaz.menuhin@gmail.com Ben-Gurion University of the Negev Marc Lacoste, Thierry Coupaye, Aurelien Wailly {marc.lacoste,thierry.coupaye,aurelien.wailly}@orange.com Orange Labs, Paris

60 INTRODUCTION

61 Virtualization. Virtual machines (VM): a guest OS runs apps. Hypervisor (HV): hardware for VMs. Assume that the HV is in the host OS (e.g. KVM). Malware (e.g. rootkits) gets into the HV from a VM.

62 Terminology. Transient failures (TFs) yield arbitrary changes of the system state. Causes: SEU (Single Event Upset), limitations of error detection algorithms. We do not want to enumerate them exactly, as we risk forgetting a scenario. We had better assume a resulting arbitrary state.

63 Terminology, Cont. Admissible execution: the minimal requirements for a system to be recoverable, e.g., less than one third of the processors are under attack. Limited faults are possible during the automatic recovery from unanticipated events in the past: CPU failures, message losses.

64 Self-Stabilization. Legal execution: the desired behavior. Safe state: every execution starting from it is legal. A self-stabilizing algorithm reaches a safe state in finite time: in every admissible execution, from any arbitrary system state, without external intervention.

65 Self-Stabilization, Cont. The program is stored in ROM: it is not subject to changes and can be periodically reloaded. The system state can undergo unpredicted changes… …following which the system should converge to a safe state.

66 Example: Token Passing Algorithm
    P_1: do forever
        X_1 = X_N  =>  X_1 := (X_1 + 1) mod (N + 1)      (atomic step)
    P_i, i ≠ 1: do forever
        X_i ≠ X_{i-1}  =>  X_i := X_{i-1}                (atomic step)
(Diagram: ring of processors P_1, P_2, P_3, P_4 with variables X_1, X_2, X_3, X_4.) Code: write-protected. Variables: may be corrupted. Legal execution: exactly one P_i changes X_i at a time, in infinitely many states. A safe state: X_1 = X_2 = … = X_N; the only possible execution from it is one in which exactly one P_i changes X_i at a time.

67 Token Passing: Self Stabilization. Legal execution (x_1 x_2 x_3 … x_N): {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; … {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; … {2; 2; 2; 2; 2}. Failure: start from arbitrary values: {3; 7; 2; 3; 0}; … {4; …… }; … {M; …… }; … {M; M; M; M; M}. Safe: P_1 eventually increments x_1. In N rounds x_1 propagates along the chain and reaches P_N, and then x_1 increases. In some state S, x_1 gets a unique (not yet encountered) value M after x_1 has been incremented several times. Then M is propagated to the other processors, reaching a safe state.

68 OUR APPROACH – MAIN PRINCIPLE

69 Rootkit Activity – Current State of Art

74 Our approach: a software watchdog brings the system into a safe state. Periodic "I'm Alive" (frequent); reboot otherwise. Every software component (host, guests) can be corrupted, and the watchdog as well…

75 Reload the system from ROM upon a hardware timer signal. Hardware interrupt, then consistency check: tampered => reboot (SMM, ROM). The validator is write-protected by hardware means: it is the one that guarantees self-stabilization; it runs rarely, as it is very time-consuming. The watchdog quickly detects small problems: it runs frequently and efficiently.
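One way to picture the consistency check is as a comparison between the running code image and a write-protected reference copy. The sketch below is an assumption for illustration only: the `Validator` class is hypothetical, a `bytes` object stands in for ROM, and SHA-256 from Python's standard `hashlib` is chosen arbitrarily as the fingerprint; the slide does not say how the real validator compares images.

```python
import hashlib


def fingerprint(image: bytes) -> bytes:
    """Digest of a code image (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(image).digest()


class Validator:
    """Stands in for the hardware-protected validator."""

    def __init__(self, rom_image: bytes):
        self.rom_image = bytes(rom_image)            # immutable reference copy
        self.expected = fingerprint(self.rom_image)

    def check(self, ram_image: bytearray) -> str:
        """Return "ok", or reload the image from ROM and return "reboot"."""
        if fingerprint(bytes(ram_image)) == self.expected:
            return "ok"
        ram_image[:] = self.rom_image                # tampered => reload
        return "reboot"


rom = b"\x90" * 64 + b"hypervisor code"
ram = bytearray(rom)
validator = Validator(rom)

first = validator.check(ram)       # untouched image
ram[3] ^= 0xFF                     # simulate tampering or a soft error
second = validator.check(ram)
print(first, second, bytes(ram) == rom)
```

Hashing the whole image is slow compared with a liveness ping, which matches the split on the slide between a rarely run validator and a frequently run watchdog.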

76 Existing Infrastructure. (Diagram: the user asks the VM Manager to create / delete VMs; VM_1 and VM_2 each contain user apps and a guest OS; T_1 (VM_1), T_2 (VM_2).)

77 Existing Infrastructure. (Diagram: hypervisor with OS scheduler, pool T_1, T_2, VMTable of (VM_i, State_i), I/O drivers, inter-VM traffic; hardware: CPU.) Steps: 1. Schedule VM. 2. Activate VM. 3. Run CPU during some time. 4. Save CPU state & stop.

78 Our Architecture: Watchdog. (Diagram: hypervisor with OS scheduler, pool T_1, T_2, VMTable of (VM_i, State_i), I/O drivers; hardware: CPU.) Stabilization Manager: on a periodic interrupt, the watchdog checks the scheduler state, the VM state and the traffic state. Safe state? => "I'm Alive". Not alive for a while => reboot.

79 Our Architecture: Hardware-facilitated integrity checking. (Diagram: hypervisor with OS scheduler, pool T_1, T_2, VMTable of (VM_i, State_i), I/O drivers; hardware: CPU, timer.) Stabilization Manager: on a periodic interrupt (every second), the watchdog checks the scheduler state, the VM state and the traffic state. Safe state? => "I'm Alive". Not alive for a while => reboot. Integrity checker: on a timer interrupt (every day); integrity failure => reboot.

80 Implementation: Employ external tools for examining VMs. (Diagram: VM; benign output => "I am Alive".)

81 Implementation: Employ external tools for examining VMs (cont). (Diagram: VM; report malware => alarm; kill, suspend, etc.)

82 Future Work. Test the prototype on real malware collections, e.g. TechGainer [1]. Intelligent safety enforcement: if the situation is not severely dangerous, restart only the malfunctioning fragments (reset a malfunctioning printer instead of rebooting the computer). Guarded Commands [2] as a basis for the specification language: {(guard, action), …}, where the guard is a safety check and the action is the enforcement.
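A specification of the form {(guard, action), …} could look roughly like the sketch below. All names are hypothetical, and one reading of the slide is assumed: a guard returns True when its safety check fails, and the paired action then enforces safety for that fragment only.

```python
def enforce(state, commands):
    """Run every command whose guard reports a safety violation.

    Returns the names of the actions that fired, in order.
    """
    fired = []
    for name, guard, action in commands:
        if guard(state):
            action(state)
            fired.append(name)
    return fired


def printer_stuck(state):
    return state["printer_retries"] > 3


def reset_printer(state):
    state["printer_retries"] = 0


def host_corrupted(state):
    return not state["host_ok"]


def reboot_host(state):
    state["host_ok"] = True
    state["printer_retries"] = 0


COMMANDS = [
    ("reset printer", printer_stuck, reset_printer),   # mild, local enforcement
    ("reboot host", host_corrupted, reboot_host),      # drastic enforcement
]

state = {"printer_retries": 7, "host_ok": True}
fired = enforce(state, COMMANDS)
print(fired, state)
```

Here only the malfunctioning printer is reset; the drastic action stays dormant because its guard does not fire.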

83 Future Work. Make the architecture support distributed cloud infrastructures, e.g. OpenStack [3]. Are there competitors? Azure [5]: recovery through replication; however, the replica synchronization algorithm may suffer from transient faults too.

