Kernfach System Software WS04/05


Kernfach System Software WS04/05 P. Reali M. Corti

Introduction: Admin
Lecture
- Mo 13-14, IFW A 36
- We 10-12, IFW A 36
Exercises (always on Thursday)
- 14-15, IFW A34, C. Tuduce (E)
- 14-15, IFW C42, V. Naoumov (E)
- 15-16, IFW A32.1, I. Chihaia (E)
- 15-16, RZ F21, C. Tuduce (E)
- 16-17, IFW A34, T. Frey (E)
- 16-17, IFW A32.1, K. Skoupý (E)
System-Software WS 04/05

Introduction: Additional Info
Internet
- Homepage: http://www.cs.inf.ethz.ch/ssw/
- Inforum vis site
Textbooks & Co.
- Lecture slides
- A. Tanenbaum, Modern Operating Systems
- Silberschatz / Galvin, Operating System Concepts
- Selected articles and book chapters

Introduction: Exercises
Exercises are optional (feel free to shoot yourself in the foot)
Weekly paper exercises
- test the knowledge acquired in the lecture
- identify troubles early
- exercise questions are similar to the exam ones
Monthly programming assignment
- feel the gap between theory and practice

Introduction: Exam
- Sometime in March 2005
- Written, 3 hours
- Allowed aids: 2 A4 page summary, calculator
- Official Q&A session 2 weeks before the exam

Introduction: Lecture Goals
Operating system concepts
- bottom-up approach
- not an operating system course
- learn the most important concepts
- feel the complexity of operating systems
- there's no silver bullet!
Basic knowledge for other lectures / term assignments
- Compilerbau
- Component Software
- ....
- OS-related assignments

Introduction: What is an operating system?
An operating system has two goals:
Provide an abstraction of the hardware
- ABI (application binary interface)
- API (application programming interface)
- hide details
Manage resources
- time and space multiplexing
- resource protection

Introduction: Operating system target machines
Targets
- mainframes
- servers
- multiprocessors
- desktops
- real-time systems
- embedded systems
Different goals and requirements!
- memory efficiency
- reaction time
- abstraction level
- resources
- security
- ...

Introduction: Memory vs. Speed Tradeoff
Example: retrieve a list of names (N = # names, n = name length)

    Structure     memory     time
    Array         N*n        N
    List          N(n+4)     N/2
    Bin. Tree     N(n+8)     log(N)
    Hash Table    3*N*n      1

Introduction: Operating System as resource manager
... in the beginning was the hardware!
Most relevant resources:
- CPU
- Memory
- Storage
- Network

Introduction: Lecture Topics
(Figure: lecture topics arranged by abstraction level above the hardware resources CPU, Memory, Disk, Network)
Topics shown: Runtime support, Object-Oriented Runtime Support, Coroutine, Thread, Process, Scheduling, Concurrency Support, Virtual Machine, Memory Management, Virtual Memory, Demand Paging, Garbage Collection, File System, Distributed File-System, Distributed Object-System

Introduction: A word of warning....
Most of the topics may seem simple..... and in fact they are!
Problems are mostly due to:
- complexity when integrating the system
- low-level ("bit fiddling") details
- bootstrapping (X needs Y, Y needs X)

Introduction: Bootstrapping (Aos)
(Figure: Aos module hierarchy by level, with the modules Processor, Locks, Memory, Storage, Interrupts, Traps, Modules, Timers, SMP, Active)

Introduction: Lecture Topics Overview
(Figure: timeline from Oct'04 to Feb'05)
Topics: Runtime Support, Virtual Addressing, Memory Management, Concurrency, Disc / Filesystem, Distributed Obj. System, Case Study: JVM

Run-time Support: Overview
Support for programming abstractions
Procedures
- calling conventions
- parameters
Object-oriented model
- objects
- methods (dynamic dispatching)
Exception handling
... more ...

Run-time Support: Application Binary Interface (ABI)
Objects a, b, c, ... with methods P, Q, R, ... and internal procedures p, q, r, ...
Call sequence:
    Call a.P
    Call b.Q
    Call b.q
    Return b.q
    Call c.R
    Return c.R
    Return b.Q
    Return a.P
(Figure: the stack of Procedure Activation Frames (PAF) after each step, e.g. a.P, b.Q, b.q while b.q runs and a.P, b.Q, c.R while c.R runs; the Stack Pointer (SP) marks the top)

Run-time Support: Procedure Activation Frame
Call
- Caller: Save Registers, Push Parameters, Save PC, Branch
- Callee: Save FP, FP := SP, Allocate Locals
Return
- Callee: Remove Locals, Restore FP, Restore PC
- Caller: Remove Parameters, Restore Registers
Frame layout (from the caller frame towards the Stack Pointer):
- params
- PC
- FP' (dynamic link; the Frame Pointer (FP) points here)
- locals (the Stack Pointer (SP) points to the end)
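The call and return steps above can be traced on a toy stack. This is only a sketch under simplifying assumptions: the stack is a Python list that grows by appending, register saving is omitted, and the `Machine` class and its method names are hypothetical, not taken from the lecture.

```python
class Machine:
    """Toy stack machine (hypothetical); the stack grows by appending."""

    def __init__(self):
        self.stack = []
        self.fp = None  # frame pointer: index of the saved-FP slot

    def call(self, params, return_pc, n_locals):
        # Caller: push parameters, then save the return address (PC).
        self.stack.extend(params)
        self.stack.append(return_pc)
        # Callee: save the caller's FP (dynamic link), FP := SP, allocate locals.
        self.stack.append(self.fp)
        self.fp = len(self.stack) - 1
        self.stack.extend([0] * n_locals)

    def ret(self, n_params):
        # Callee: remove locals, restore FP, restore PC.
        del self.stack[self.fp + 1:]
        self.fp = self.stack.pop()
        return_pc = self.stack.pop()
        # Remove the parameters.
        if n_params:
            del self.stack[-n_params:]
        return return_pc


m = Machine()
m.call(params=[10, 20], return_pc=100, n_locals=2)  # e.g. a.P
m.call(params=[7], return_pc=200, n_locals=1)       # e.g. b.Q
depth_in_q = len(m.stack)
pc_after_q = m.ret(n_params=1)
pc_after_p = m.ret(n_params=2)
```

After both returns the stack is empty again, which is the invariant a calling convention has to guarantee.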

Run-time Support: Procedure Activation Frame, Optimizations
Many optimizations are possible
- use registers instead of stack
- register windows
- procedure inlining
- use SP instead of FP addressing

Run-time Support: Procedure Activation Frame (Oberon / x86)
Caller:
    push params
    call P                   ; = push pc; pc := P
Callee:
    push fp
    mov fp, sp
    sub sp, size(locals)
    ...
    mov sp, fp
    pop fp
    ret size(params)         ; = pop pc; add sp, size(params)

Run-time Support: Calling Convention
Convention between caller and callee
How are parameters passed
- data layout
- left-to-right, right-to-left
- registers
- register window
Stack layout
- dynamic link
- static link
Register saving
- reserved registers

Run-time Support: Calling Convention (Oberon)
Parameter passing:
- on stack (exception: Oberon/PPC uses registers)
- left-to-right
- self (methods only) as last parameter
- structs and arrays passed as reference, value-parameters copied by the callee
Stack
- dynamic link
- static link as last parameter (for local procedures)
Registers
- saved by caller

Run-time Support: Calling Convention (C)
Parameter passing:
- on stack
- right-to-left
- arrays passed as reference (arrays are pointers!)
Stack
- dynamic link
Registers
- some saved by caller

Run-time Support: Calling Convention (Java)
Parameter passing
- left-to-right
- self as first parameter
- parameters pushed as operands
- parameters accessed as locals
- access through symbolic, type-safe operations

Run-time Support: Object Oriented Support, Definitions
(Figure: class hierarchy with Obj0, Obj, ObjA, ObjB)
    Obj x = new ObjA();
- static type of x is Obj
- dynamic type of x is ObjA
- x is compiled as being compatible with Obj, but executes as ObjA (polymorphism)
- static and dynamic type can be different => the system must keep track of the dynamic type with a hidden "type descriptor"

Run-time Support: Polymorphism
Type is statically known!
    VAR t: Triangle; s: Square; o: Figure;
    BEGIN
      t.Draw();
      s.Draw();
      o.Draw();
    END;
Type is discovered at runtime!
    WHILE p # NIL DO
      p.Draw(); p := p.next
    END;

Run-time Support: Object Oriented Support, Definitions
(Figure: class hierarchy with Obj0, Obj, ObjA, ObjB)
    Obj x = new ObjA();
    if (x IS ObjA) { ... }   // type test
    ObjA y = (ObjA)x;        // type cast
    x = y;                   // type coercion (automatic conversion)

Run-time Support: Object Oriented Support (High-level Java)
Type Test Implementation for
    .... a IS T ....
is equivalent to
    if (a != null) {
      Class c = a.getClass();
      while ((c != null) && (c != T)) {
        c = c.getSuperclass();
      }
      return c == T;
    } else {
      return false;
    }

Run-time Support: Type Descriptors
    struct TypeDescriptor {
      int level;
      type[] extensions;
      method[] methods;
    }
    class Object {
      TypeDescriptor type;
    }
- many type-descriptor layouts are possible
- the layout depends on the optimizations chosen

Run-time Support: Type Tests and Casts
Hierarchy: Obj0, Obj, ObjA, ObjB. Each type descriptor (TD) stores its "extension level" and a table of its ancestors:

    TD          level   0:     1:    2:     3:
    TD(Obj0)    0       Obj0   NIL   NIL    NIL
    TD(Obj)     1       Obj0   Obj   NIL    NIL
    TD(ObjA)    2       Obj0   Obj   ObjA   NIL

(obj IS T) is implemented as
    obj.type.extension[ T.level ] = T

    mov EAX, obj
    mov EAX, -4[EAX]
    cmp T, -4 * T.level - 8[EAX]
    bne ....
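The two type-test implementations (walking the superclass chain versus indexing the extension table) can be compared in a small sketch. The Python class below is a hypothetical model, not the Oberon layout: the table size of 4 matches the tables above, and placing ObjB directly under Obj is an assumption.

```python
class TypeDescriptor:
    MAX_LEVEL = 4  # assumed fixed table size, as in the 4-entry tables above

    def __init__(self, name, base=None):
        self.name = name
        self.base = base
        self.level = 0 if base is None else base.level + 1
        # extensions[k] is the ancestor at extension level k (None stands for NIL)
        self.extensions = [None] * self.MAX_LEVEL
        if base is not None:
            self.extensions[:base.level + 1] = base.extensions[:base.level + 1]
        self.extensions[self.level] = self


def is_type_walk(td, t):
    """Linear type test: follow the superclass chain until t or the root."""
    while td is not None and td is not t:
        td = td.base
    return td is t


def is_type_table(td, t):
    """Constant-time type test: obj.type.extension[T.level] = T."""
    return td.extensions[t.level] is t


Obj0 = TypeDescriptor("Obj0")
Obj = TypeDescriptor("Obj", Obj0)
ObjA = TypeDescriptor("ObjA", Obj)
ObjB = TypeDescriptor("ObjB", Obj)  # assumption: ObjB extends Obj
```

Both functions give the same answer; the table version trades a few words per type descriptor for a test that does not depend on the depth of the hierarchy.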

Run-time Support: Object Oriented Support (High-level Java)
Method Call Implementation for
    .... a.M(.....) ....
is equivalent to
    Class[] parTypes = new Class[parValues.length];
    for (int i = 0; i < parValues.length; i++) {
      parTypes[i] = parValues[i].getClass();
    }
    Class c = a.getClass();
    Method m = c.getDeclaredMethod("M", parTypes);
    res = m.invoke(a, parValues);
Use the method implementation for the actual class (dynamic type)

Run-time Support: Handlers / Function Pointers
    TYPE
      SomeType = POINTER TO SomeTypeDesc;
      Handler = PROCEDURE (self: SomeType; param: Par);
      SomeTypeDesc = RECORD
        handler: Handler;
        next: SomeType;
      END
(Figure: a list starting at root; each record has a next field and a handler field pointing to PROC Q or PROC R)
Advantages:
- instance bound
- can be changed at run-time
Disadvantages:
- memory usage
- bad integration (explicit self)
- non constant

Run-time Support: Method tables (vtables)
Idea: have a per-type table of function pointers.
    TYPE
      A = OBJECT
        PROCEDURE M0;
        PROCEDURE M1;
      END A;
      B = OBJECT (A)
        PROCEDURE M0;   (* overrides A.M0 *)
        PROCEDURE M2;   (* new *)
      END B;

    A.MethodTable      B.MethodTable
    0: A.M0            0: B.M0   (B.M0 overrides A.M0)
    1: A.M1            1: A.M1
                       2: B.M2   (B.M2 is new)

- New methods add a new entry in the method table
- Overrides replace an entry in the method table
- Each method has a unique entry number

Run-time Support: Method tables
    TYPE
      A = OBJECT PROCEDURE M0; PROCEDURE M1; END A;
      B = OBJECT (A) PROCEDURE M2; END B;

    A.MethodTable      B.MethodTable
    0: A.M0            0: B.M0
    1: A.M1            1: A.M1
                       2: B.M2

Virtual Dispatch: o.M0 becomes
    call o.Type.Methods[0]

    mov eax, VALUE(o)
    mov eax, type[eax]
    mov eax, off + 4*mno[eax]
    call eax
(Figure: object o with its Fields and a Type pointer to the type descriptor holding the method table)
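A minimal sketch of virtual dispatch through per-type method tables, assuming plain Python lists as tables and hypothetical function names (`a_m0`, `dispatch`, ...). It mirrors the rule that an override replaces an entry while a new method appends one.

```python
def a_m0(self): return "A.M0"
def a_m1(self): return "A.M1"
def b_m0(self): return "B.M0"
def b_m2(self): return "B.M2"


A_METHODS = [a_m0, a_m1]        # 0: A.M0, 1: A.M1

B_METHODS = list(A_METHODS)     # B starts with a copy of A's table
B_METHODS[0] = b_m0             # override: replace entry 0
B_METHODS.append(b_m2)          # new method: append entry 2


class Obj:
    def __init__(self, methods):
        self.type_methods = methods  # stands in for o.Type.Methods


def dispatch(o, mno, *args):
    """call o.Type.Methods[mno], passing the receiver as self."""
    return o.type_methods[mno](o, *args)
```

The caller only needs the method number, which is known at compile time; which code runs is decided by the table the object points to.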

Run-time Support: Oberon Type Descriptors
Type descriptor (td) contents:
- td size
- type name
- method table (mth table): for method invocation
- superclass table (ext table): for type checks
- obj size: for object allocation
- pointers in object (ptr offsets): for garbage collection
Object contents: type desc pointer, obj fields
The type descriptor is also an object!

Run-time Support: Interfaces, itables
    interface A { void m(); }
    interface B { void p(); }

    Object x;
    A y = (A)x;     // does x implement A?
    y.m();          // multiple itables: how is the right itable discovered?
x has a method table (itable) for each implemented interface

Run-time Support: Interface support
How to retrieve the right method table (if any)?
- Global table indexed by [class, interface]
- Local (per type) table / list indexed by [interface]
Many optimizations are available
- use the usual trick: enumerate interfaces

Run-time Support: Interface support (I)
The call is expensive because it requires traversing a list: O(N) complexity
(Figure: the type descriptor points to the method table (vtable) and to a list of interfaces Intf0 ... Intf7, each with its own method table (itable))
    Intf0 y = (Intf0)x;
    y.M();
becomes
    interface i = x.type.interfaces;
    while ((i != null) && (i != Intf0)) {
      i = i.next;
    }
    if (i != null) i.method[mth_nr]();

Run-time Support: Interface support (II)
Lookup is fast (O(1)), but wastes memory
(Figure: the type descriptor holds the vtable and a sparse array "interfaces" with slots 1..7; only some slots, e.g. 2 and 7, point to an itable)
    Intf0 y = (Intf0)x;
    y.M();
becomes
    interface i = x.type.interfaces[Intf0];
    if (i != null) i.method[mth_nr]();

Run-time Support: Interface Implementation (III)
Overlap the interface table indexes
(Figure: type descriptors t and u, each with a vtable and a sparse interfaces array with slots 1..7; t uses slots 2 and 7 (itable t,2 and itable t,7), u uses slots 0 and 2 (itable u,0 and itable u,2))

Run-time Support: Interface Implementation (III)
Overlapped interface table index
(Figure: two type descriptors, each with a vtable, whose interfaces arrays overlap in memory and point to their itables)

Run-time Support: Interface Implementation (III)
Overlapped interface tables
    Intf0 y = (Intf0)x;
    y.M();
becomes
    i = x.type.interfaces[Intf0];
    if ((i != null) && (i in x.type)) i.method[mth_nr]();
(Figure: type descriptor with vtable and interfaces array overlapping the itables of other types)

Run-time Support: Exceptions
Source:
    void catchOne() {
      try {
        tryItOut();
      } catch (TestExc e) {
        handleExc(e);
      }
    }
Bytecode of catchOne():
     0 aload_0
     1 invokevirtual tryItOut()
     4 return
     5 astore_1
     6 aload_0
     7 aload_1
     8 invokevirtual handleExc
    11 return
ExceptionTable:
    From   To   Target   Type
    0      4    5        TestExc

Run-time Support: Exception Handling / Zero Overhead
Source (handlers at pchandler1 and pchandler2, protected range pcstart..pcend):
    try { ..... } catch (Exp1 e) { } catch (Exp2 e) { }
Global Exception Table:
    start     end     exception   handler
    pcstart   pcend   Exp1        pchandler1
    pcstart   pcend   Exp2        pchandler2
Handler:
    void ExceptionHandler(state) {
      pc = state.pc; exc = state.exception; i = 0;
      while (!Match(table[i], pc, exc)) {
        i++;
        if (i == TableLength) {
          PopActivationFrame(state);
          pc = state.pc; i = 0;
        }
      }
      state.pc = table[i].pchandler;
      ResumeExecution(state);
    }

Run-time Support: Exception Handling / Zero Overhead
- exception table filled by the loader / linker
- traverse the whole table for each stack frame
- the system has a default handler for uncaught exceptions
- no exceptions => no overhead
- the exception case is expensive
- system optimized for the normal case
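The table-driven lookup can be sketched as follows. This is an illustration under assumptions: frames are modelled as a list of program counters (innermost last), the protected range is taken as half-open [start, end) as in JVM exception tables, and `find_handler` is a hypothetical name.

```python
class TestExc(Exception):
    pass


def find_handler(table, frame_pcs, exc):
    """Search the global exception table, frame by frame.

    table:     rows (pc_start, pc_end, exception_type, pc_handler)
    frame_pcs: pc of each activation frame, innermost frame last
    Returns (frame_index, pc_handler), or None if no handler matches
    (the system's default handler would take over).
    """
    for depth in range(len(frame_pcs) - 1, -1, -1):  # pop frame by frame
        pc = frame_pcs[depth]
        for start, end, exc_type, handler in table:  # traverse whole table
            if start <= pc < end and isinstance(exc, exc_type):
                return depth, handler
    return None


# The table of catchOne() from the bytecode example: From 0, To 4, Target 5.
TABLE = [(0, 4, TestExc, 5)]
```

Nothing in this scheme runs on the normal path; all the cost (one table scan per frame) is paid only when an exception is actually raised.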

Run-time Support: Exception Handling / Fast Handling
- push catch descriptors on the stack
- add code instrumentation
- use an exception stack to keep track of the handlers
Source (handlers at pchandler1 and pchandler2):
    try { ..... } catch (Exp1 e) { } catch (Exp2 e) { }
Instrumented code:
    try {
      save (FP, SP, Exp1, pchandler1)
      save (FP, SP, Exp2, pchandler2)
      .....
      remove catch descr.
      jump end
    } catch (Exp1 e) {
    } catch (Exp2 e) {
      remove catch descr.
      jump end
    }
    end:

Run-time Support: Exception Handling / Fast Handling
    void ExceptionHandler(ThreadState state) {
      int FP, SP, handler; Exception e;
      do {
        retrieve(FP, SP, e, handler);  // pop next exception descriptor from exception stack
      } while (!Match(state.exp, e));
      state.fp = FP;       // set frame to the one
      state.sp = SP;       // containing the handler
      state.pc = handler;  // resume with the handler
      ResumeExecution(state);
    }
- can resume in a different activation frame

Run-time Support: Exception Handling / Fast Handling
Code instrumentation
- insert exception descriptor at try
- remove descriptor before catch
Fast exception handling
Overhead even when no exceptions occur
System optimized for the exception case

Virtual Addressing: Overview
Virtual addressing: abstraction of the MMU (Memory Management Unit)
Work with virtual addresses, where address_real = f(address_virtual)
Provides decoupling from real memory
- virtual memory
- demand paging
- separated address spaces

Virtual Addressing: Pages
- Memory as an array of pages
- Programs use and run in virtual address spaces
- Real memory: pool of page frames
(Figure: virtual address-space 1 and virtual address-space 2, each a sequence of numbered pages with unmapped ranges and unmapped (invalid) pages; a mapping assigns each mapped page to a page frame at some memory address in real memory)

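Virtual Addressing: Page mapping
Virtual Address -> Real Address
- virtual address = (page-no, off), real address = (frame, off)
- TLB (Translation Lookaside Buffer): associative cache of (page-no, frame) pairs
- Page Table: indexed by page-no, yields the page frame; located through the Page Table Ptr Register
- MMU: maps (PT, VA) to RA
(Figure: translation path from the virtual address through the TLB or the page table to the page frame in real memory)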

Virtual Addressing: Definitions
- page: smallest unit in the virtual address space
- page frame: unit in the physical memory
- page table: table mapping pages into page frames
- page fault: access to a non-mapped page
- working set: pages a process is currently using

Virtual Addressing: Alternate Page Mapping (64 bit address space)
- Multilevel page tables: multipart virtual address (pno1, pno2, off); pno1 indexes the 1st level table, pno2 the 2nd level table
- Page table as (B*-)Tree
- Inverted page table: hash table with entries (pr, vp, pf) or unassigned; the key (pr, vp) of process N is hashed, collisions continue at the next probe

Virtual Addressing: What for?
Decoupling from real memory
- virtual memory (cheat: use more virtual memory than the available real memory)
- dynamically allocated contiguous memory blocks (for multiple stacks in multitasking systems)
- some optimizations: null reference checks, garbage collection (using the dirty flag)
Virtual addressing is not for free!
- address mapping may require additional memory accesses
- the page table takes space

Virtual Addressing: Virtual Memory
Use secondary storage (disc) to keep currently unused pages (swapping)
The page table usually keeps some per-page flags
- invalid: page not mapped
- referenced: page has been referenced
- dirty: page has been modified
Accessing an invalid page causes a page-fault interrupt
- select the page frame to be swapped out (victim or candidate)
- swap in the requested page

Virtual Addressing: Virtual Memory / Demand Paging
(Figure: "Page-out" moves the victim page from real memory to disc and the victim is set to invalid in the page table; "Page-in" loads the requested page from disc into real memory)

Virtual Addressing: Demand Paging Sequence
TLB / MMU:
    IF VA IN TLB THEN RETURN RA
    ELSE
      Access Page Table;
      IF Page invalid THEN Page-Fault
      ELSE RETURN RA
      END
    END
OS Page-Fault Handler:
    IF Free Page Frame exists THEN Assign frame to VA
    ELSE
      Search victim page;
      IF victim page modified THEN page-out to secondary storage END;
      Invalidate victim page
    END;
    Page-in from secondary storage;
    Reset invalid flag
Expected time to translate VA into RA:
    E[t] = P_TLB * t_TLB + P_PT * t_PT + P_disc * t_disc

Virtual Addressing: Example
- Page size: 4 KB
- Address size: 32 bits
- Page offset: 12 bits (4 KB = 2^12)
- Page number: 20 bits (32 - 12)
- Addressable memory: 2^32 = 4 GB
- Page table size: 2^20 * 32 bits = 4 MB
- Real memory: 128 MB
- Page table overhead: ca. 3%
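The numbers in this example can be re-derived with a few lines of arithmetic. The constant names below are illustrative only.

```python
PAGE_SIZE = 4 * 1024           # 4 KB
ADDRESS_BITS = 32
ENTRY_BITS = 32                # one 32-bit page table entry per page
REAL_MEMORY = 128 * 2**20      # 128 MB

offset_bits = PAGE_SIZE.bit_length() - 1              # log2(4096)
page_number_bits = ADDRESS_BITS - offset_bits
addressable_bytes = 2**ADDRESS_BITS
page_table_bytes = 2**page_number_bits * ENTRY_BITS // 8
overhead = page_table_bytes / REAL_MEMORY             # fraction of real memory
```

The overhead of roughly 3% is per address space: every process with its own page table pays it again, which is one motivation for the alternate page mappings.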

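Virtual Addressing: Example
Memory access for
    mov EAX, @Addr
- with probability P_TLB: TLB hit
- with probability 1 - P_TLB: page table access (1 memory read), then
  - with probability 1 - P_PF (no page fault): memory
  - with probability P_PF (page fault): disc (1 disc read, 1 disc write)

    E[t] = P_TLB * t_TLB + (1 - P_TLB) * (t_PT + P_PF * t_disc + (1 - P_PF) * t_mem)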
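To get a feeling for the formula, it can be evaluated directly. The function below is a literal transcription; all timing values used with it are made-up placeholders, not measurements from the lecture.

```python
def expected_access_time(p_tlb, p_pf, t_tlb, t_pt, t_mem, t_disc):
    """E[t] = P_TLB*t_TLB + (1-P_TLB)*(t_PT + P_PF*t_disc + (1-P_PF)*t_mem)."""
    miss_cost = t_pt + p_pf * t_disc + (1 - p_pf) * t_mem
    return p_tlb * t_tlb + (1 - p_tlb) * miss_cost


# Hypothetical timings in nanoseconds: TLB 1, page table 100, memory 100, disc 10 ms.
always_hit = expected_access_time(1.0, 0.5, 1, 100, 100, 10_000_000)
never_hit_no_fault = expected_access_time(0.0, 0.0, 1, 100, 100, 10_000_000)
```

Because t_disc is several orders of magnitude larger than the other terms, even a very small page-fault probability dominates the expected time.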

Virtual Addressing: Demand Paging: Page Replacement
Optimal strategy (Longest Unused)
- Take the page that will remain unused for the longest time
- Requires an oracle
NRU: "Not Recently Used"
- Reset the referenced flag at each tick
- Create page categories (good candidate to bad candidate) from the ref and mod flags
- Choose the best candidate
(Table: preference classes 3, 2, 1, ... by ref and mod flag)

Virtual Addressing: Demand Paging: Page Replacement (2)
LRU: "Least Recently Used"
- Assumption: not used in the past ==> not used in the future
- Hardware implementation: 64-bit time-stamp for each page
- Software implementation: "Aging" algorithm
- Choose the page with the lowest value
(Figure: aging values of the pages at ticks t(i) and t(i+1); the reference flag is set if the page was accessed)

Virtual Addressing: Demand Paging: Page Replacement (3)
LRC: "Least Recently Created" (FIFO)
- Page lifespan as metric (old pages are swapped out)
- Chain sorted by creation time
- Bad handling of often-used pages
- Fix: "second chance" when accessed (ref flag set) during the last tick
    cur := earliest;
    WHILE cur.ref DO
      cur.ref := FALSE; cur := cur.next
    END
(Figure: chain of pages from earliest, linked by next, each with a Ref-Flag)
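The second-chance loop can be sketched with a queue instead of a linked chain. This is one possible variant, stated as an assumption: a page that gets its second chance is moved to the end of the queue as if newly created; `second_chance_victim` is a hypothetical name.

```python
from collections import deque


def second_chance_victim(pages):
    """Pick and remove the victim from `pages`.

    pages: deque of [page_no, ref_flag] entries, earliest created first.
    A page whose ref flag is set gets a second chance: the flag is
    cleared and the page is re-queued as if it had just been created.
    """
    while pages[0][1]:
        page = pages.popleft()
        page[1] = False
        pages.append(page)
    return pages.popleft()[0]


chain = deque([[1, True], [2, False], [3, True]])
victim = second_chance_victim(chain)
```

The loop always terminates: in the worst case every flag is cleared once and the originally earliest page is chosen, which degrades to plain FIFO.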

Virtual Addressing: Demand Paging: Page Replacement (4)
Strategies:
- optimal
- LRU / NRU / LRC
Exceptions:
- "page pinning": the page cannot be swapped out
- kernel code

Virtual Addressing: Example
- Working set: {1, 2, 3, 4}
- Accessed pages: 1, 2, 1, 3, 4, 1, 2, 3, 4
- Available page frames: 3
(Table: frame contents after each page access for the Ideal, FIFO and LRU strategies, with page faults marked "PF!". Recoverable frame contents: Ideal 1,2 / 1,2,3 / 1,2,4 / 2,3,4; FIFO 3,4,1 / 4,1,2; LRU 1,3,4 / 1,4,2 / 4,2,3)
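The comparison in this example can be replayed with a small simulator. It is a sketch with hypothetical function names; ties in the optimal strategy are broken by frame order, which is an arbitrary choice and may differ from the slide's table.

```python
def count_faults(accesses, n_frames, choose_victim):
    """Replay `accesses` with `n_frames` page frames; return the fault count."""
    frames, faults = [], 0          # frames are kept in load order
    for i, page in enumerate(accesses):
        if page in frames:
            continue
        faults += 1
        if len(frames) == n_frames:
            frames.remove(choose_victim(frames, accesses, i))
        frames.append(page)
    return faults


def fifo(frames, accesses, i):
    return frames[0]                # earliest loaded page


def lru(frames, accesses, i):
    past = accesses[:i]
    # The victim is the page whose last access lies furthest back.
    return min(frames, key=lambda p: max(k for k, q in enumerate(past) if q == p))


def optimal(frames, accesses, i):
    future = accesses[i + 1:]
    # The victim is the page whose next use is furthest away (or never comes).
    return max(frames, key=lambda p: future.index(p) if p in future else len(future))


ACCESSES = [1, 2, 1, 3, 4, 1, 2, 3, 4]
results = {s.__name__: count_faults(ACCESSES, 3, s) for s in (optimal, lru, fifo)}
```

For this access sequence the simulator counts 5 faults for the optimal strategy, 7 for LRU and 8 for FIFO.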

Demand Paging: Belady's Anomaly
LRC strategy, page access sequence 0 1 2 3 0 1 4 0 1 2 3 4

3 page frames: 9 page faults
    Access    0  1  2  3  0  1  4  0  1  2  3  4
    Frame 1   0  1  2  3  0  1  4  4  4  2  3  3
    Frame 2   -  0  1  2  3  0  1  1  1  4  2  2
    Frame 3   -  -  0  1  2  3  0  0  0  1  4  4
    Victim    -  -  -  0  1  2  3  -  -  0  1  -
    Fault     x  x  x  x  x  x  x  -  -  x  x  -

4 page frames: 10 page faults
    Access    0  1  2  3  0  1  4  0  1  2  3  4
    Frame 1   0  1  2  3  3  3  4  0  1  2  3  4
    Frame 2   -  0  1  2  2  2  3  4  0  1  2  3
    Frame 3   -  -  0  1  1  1  2  3  4  0  1  2
    Frame 4   -  -  -  0  0  0  1  2  3  4  0  1
    Victim    -  -  -  -  -  -  0  1  2  3  4  0
    Fault     x  x  x  x  -  -  x  x  x  x  x  x

Belady's Anomaly: more page frames cause more page faults
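The anomaly is easy to reproduce with a few lines of FIFO simulation. The function name `fifo_faults` is illustrative.

```python
from collections import deque


def fifo_faults(accesses, n_frames):
    """Count page faults under FIFO (LRC) replacement."""
    frames, faults = deque(), 0
    for page in accesses:
        if page not in frames:
            faults += 1
            if len(frames) == n_frames:
                frames.popleft()    # evict the earliest created page
            frames.append(page)
    return faults


SEQ = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
faults_3 = fifo_faults(SEQ, 3)
faults_4 = fifo_faults(SEQ, 4)
```

With this sequence `faults_3` is 9 and `faults_4` is 10, matching the two tables.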

Demand Paging: How many page frames per process?
Even distribution
- Every process has the same amount of memory
Thrashing
- every memory access causes a page-fault
- not enough page-frames for the current working-set
- the system is swapping instead of running
(Figure: CPU load (up to 100 %) against process count 1, 2, ..., n, n+1)

Demand Paging: How many page frames per process? (2)
Depending on the process needs (1): use the working set
- Page frames are assigned according to the process' working-set size.
- Swap out a process when not enough memory is available.
Page accesses: 1 3 2 2 3 3 1 2 2 3 3 3 4 2 2 1 1 1 2 1 3 3 3 1 3 1 2 3 4 1
(Figure: a sliding window over the page accesses yields the working set, e.g. { 2, 3, 4 } and { 1, 2, 3, 4 })
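A sliding-window working set can be computed as below. The window length of 6 accesses is an assumption chosen for illustration; the slide does not state the window size.

```python
def working_set(accesses, t, window):
    """Pages referenced in the last `window` accesses up to and including index t."""
    return set(accesses[max(0, t - window + 1): t + 1])


ACCESSES = [1, 3, 2, 2, 3, 3, 1, 2, 2, 3, 3, 3, 4, 2, 2,
            1, 1, 1, 2, 1, 3, 3, 3, 1, 3, 1, 2, 3, 4, 1]

ws_middle = working_set(ACCESSES, t=14, window=6)   # assumed window size
ws_end = working_set(ACCESSES, t=29, window=6)
```

The size of the set, not its contents, is what the frame allocation uses: a process is given as many page frames as its current working set has pages.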

Demand Paging: How many page frames per process? (3)
Depending on the process needs (2): use the page-fault rate
- page-fault rate HIGH: swap out one process
- page-fault rate LOW: swap in
(Figure: page-fault rate over time between the LOW and HIGH thresholds)

Virtual Addressing: Aos/Bluebottle, Memory Layout Example
Layout (bottom to top): Kernel, Heap (up to 2 GB), Stacks (2 GB to 4 GB)
Stacks
- 128 KB per stack
- max. 32768 active objects
- first stack page allocated on process creation
    PROCEDURE PageFault;
    BEGIN
      IF adr > 2GB THEN add page to stack
      ELSE Exception(NilTrap)
      END
    END PageFault;

Virtual Addressing: Example: UNIX, Fork
A UNIX program consists of code (text, read-only) and data (read-write)
(Figure: after Fork(), the page tables of process A and process B share the read-only text pages; data pages are shared "copy on write", so a write creates a private copy data')

Virtual Addressing: OS Control
- Oberon: no virtual memory
- Windows: virtual memory configuration, Task Manager
- Linux: swap partition / swap files, ps / top

Virtual Addressing: Segmentation
e.g. Intel x86
Problem
- 640 KB max memory
- 16 bit addresses (i.e. 64 KB)
Solution
- work in a segment
- code / data segments
- check segment boundaries
    Addr_real = Seg_base + Offset
(Figure: real memory with a code segment and a data segment, each delimited by segment base and segment limit)

Virtual Addressing: Summary
Virtual addresses: address_real = f(address_virtual)
Decoupling from real memory
- virtual memory
- demand paging
- separate address spaces
Keywords
- page, page frame, page table, page fault
- page flags: dirty, used, unmapped
- page replacement strategy: LRC, LRU, ideal, ...
- swapping, thrashing, Belady's anomaly

Memory Management: Overview
Abstractions for applications
- heap
- memory blocks ( << memory pages)
Operations:
- Allocate
- Deallocate
Topics:
- memory organization: free lists, allocation strategies
- deallocation: explicit, or garbage collection (type-aware, conservative, copying / moving, incremental, generational)

Memory Management: Objects on the heap
Object instances: a, b, c, d, ...
Sequence:
    NEW(a)        dynamic allocation
    NEW(b)
    NEW(c)
    DISPOSE(b)    explicit disposal
    NEW(d)
    NEW(e)
(Figure: the "heap" after the sequence; in both case 1 and case 2 there is not enough space for e)

Memory Management: Problem overview
Problems
- Heap size limitation (e, case 1)
- External fragmentation (e, case 2)
- Dangling pointers (a points to b)
Solutions
- System-managed list of free blocks ("free list")
- Vector of blocks with fixed size (bitmap, with 0 = free, 1 = used)
- Automated detection and reclamation of unused blocks ("garbage collection")

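Memory Management: Theory: 50% rule
Assumption: stable state with M free blocks and N allocated blocks
50%-Rule: M = 1/2 N
Classify the allocated blocks by their neighbours: A blocks have a free block on both sides, B blocks on one side, C blocks on neither side.
    N = A + B + C
    M = 1/2 (2A + B + e),  e = 0, 1, or 2
Block disposal: M changes by ΔM = (C - A) / N on average
Block allocation: M decreases by ΔM = 1 - p on average (p = splitting likelihood)
In the stable state both changes cancel:
    (C - A) / N = 1 - p
    C - A - N + pN = 0
Substituting into the expression for M:
    2M = 2A + B + e
    2M = 2A + N - A - C + e
    2M = N + A - C + e
    2M = pN + e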

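Memory Management: Theory: Memory Fragmentation
Critical point, using the 50%-Rule (b allocated blocks of average size B, hence b/2 free blocks of average size F, heap size H):
    (b/2)*F = H - b*B
With F = α*B and ρ = b*B / H:
    (α/2)*b*B = H - b*B   <=>   H/(b*B) = 1 + α/2
    α = 2/ρ - 2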

Memory Management: Free-list management with a Bitmap
Idea
- partition the heap in blocks of size s
- use a bitmap to track allocated blocks
- bitmap[i] = true <=> block i is allocated
Problems
- internal fragmentation: block sizes are rounded up to the next multiple of s
- map size: the bitmap needs (heap_size / s) bits
- loss due to internal fragmentation

Memory Management: Free-list management with a list
List organization
- sorted / non-sorted: merging of empty blocks is simpler with a sorted list
- one list / many lists (per size): search is simpler, merging is more difficult
- management data (size, next pointer) stored in the free block
Operations
- Allocation
- Disposal with merge: find free blocks next to the current block, merge into a bigger free block

Memory Management: Memory allocation strategies
- block splitting: if a free block is bigger than the requested block, then it is split
- first-fit: use the first free block which is big enough
- best-fit: take the smallest fitting block => causes a lot of fragmentation
- worst-fit: take the biggest available block
- quick-fit: best-fit but with multiple free-lists (one per block size) => fast allocation!
(Figure: heap with free and used blocks and internal fragmentation)
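First-fit and best-fit differ only in how the free block is selected; splitting is the same for both. The sketch below models the free list as a Python list of (address, length) pairs, which is a simplification of a real in-heap list, and all function names are hypothetical.

```python
def first_fit(free_blocks, size):
    """Index of the first free block that is big enough, or None."""
    for i, (_, length) in enumerate(free_blocks):
        if length >= size:
            return i
    return None


def best_fit(free_blocks, size):
    """Index of the smallest free block that is big enough, or None."""
    fitting = [i for i, (_, length) in enumerate(free_blocks) if length >= size]
    return min(fitting, key=lambda i: free_blocks[i][1], default=None)


def allocate(free_blocks, size, strategy):
    """Allocate `size` units; returns the block address or None."""
    i = strategy(free_blocks, size)
    if i is None:
        return None
    addr, length = free_blocks[i]
    if length > size:
        # Block splitting: the remainder stays in the free list.
        free_blocks[i] = (addr + size, length - size)
    else:
        del free_blocks[i]
    return addr


free_a = [(0, 100), (200, 30), (300, 50)]
free_b = list(free_a)
addr_first = allocate(free_a, 30, first_fit)
addr_best = allocate(free_b, 30, best_fit)
```

In this run first-fit splits the 100-unit block and leaves a 70-unit remainder, while best-fit consumes the exactly fitting 30-unit block.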

Memory Management: Buddy System (for fast block merging)
- Blocks have size 2^k
- A block of size 2^i has address j*2^i (the last i bits are 0)
- Blocks with addresses x = j*2^i and (j XOR 1)*2^i are buddies (they can be merged into a block of size 2^(i+1))
- buddy = x XOR 2^i
Example: b1 = xxxx 0 0000 and b2 = xxxx 1 0000 are buddies
(Figure: a block of 64 is split into 32 32, then 16 16 32, then 16 8 8 32; splitting goes from 2^(k+1) to 2^k to 2^(k-1), merging in the opposite direction)
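The buddy address rule translates directly into bit operations. Function names are illustrative; sizes and addresses are in arbitrary units.

```python
def buddy_of(addr, size):
    """Address of the buddy of the block at `addr` with size 2^i: x XOR 2^i."""
    assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
    assert addr % size == 0, "a block of size 2^i has address j*2^i"
    return addr ^ size


def merge(addr, size):
    """Merge a block with its buddy; returns (address, size) of the result."""
    return min(addr, buddy_of(addr, size)), 2 * size
```

For example, the 16-unit blocks at addresses 32 and 48 are buddies and merge into the 32-unit block at address 32, whereas the 16-unit blocks at 16 and 32 are adjacent but not buddies.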

Memory Management: Buddy System (for fast block merging)
Problem: only buddies can be merged
Cascading merge
(Figure: in the layout 16 16 32 the two 16-blocks are buddies, but the second 16-block and the 32-block are not; freeing in 16 8 8 32 merges 8 8 into 16, then 16 16 into 32, then 32 32 into 64)

Memory Management: Buddy System (for fast block merging)
Allocation: allocate(8)
    32 32
    split:      16 16 32
    split:      8 8 16 32
    quick fit:  8 8 16 32   (one of the 8-blocks is allocated)

Memory Management: Example: Oberon / Aos
- Block size = k*32
- free-lists for k = 1..9, one list for blocks > 9*32
- Allocate: quick-fit, splitting may be required
- Free-list management and block merging are done by the Garbage Collector
(Figure: free-lists for sizes 32, 64, 96, ..., k*32 in the initial state and after ALLOCATE(50), which takes a block from the 64 list)

Memory Management: Garbage Collection
Two steps:
Free block detection
- type-aware: the collector is aware of the types traversed, i.e. knows which values are pointers
- conservative: the collector doesn't know which values are pointers
Block disposal
- return unused blocks to the free-lists
GC characteristics
- incremental: gc is performed in small steps to minimize program interruption
- moving / copying / compacting: blocks are moved around
- generational: blocks are grouped in generations; different treatment or collection priority
Barriers
- read: intercept and check every pointer read operation
- write: intercept and check every pointer write operation

Memory Management: Garbage Collection: Reference Counting
Every object has a reference counter rc
- rc = 0 => the object is "garbage"
Problems
- overhead
- no support for circular structures
Useful for...
- module hierarchies
- DAG structures (e.g. LISP)
Write barrier for q := p (p, q pointers to objects):
    INC p.rc
    DEC q.rc
    IF q.rc = 0 THEN Collect q^ END;
    q := p
(Figure: a circular structure among the objects M, A, B, C, D in which rc >= 1 always holds)
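The write barrier and the problem with circular structures can be demonstrated in a few lines. This is a toy model: objects are Python instances with an explicit `rc` field, "collecting" just records the name, and all names are hypothetical.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.rc = 0          # reference counter
        self.fields = {}     # pointer fields of the object


collected = []


def collect(obj):
    """Reclaim obj and release everything it points to."""
    collected.append(obj.name)
    for child in list(obj.fields.values()):
        release(child)
    obj.fields.clear()


def release(obj):
    if obj is not None:
        obj.rc -= 1                      # DEC q.rc
        if obj.rc == 0:
            collect(obj)                 # IF q.rc = 0 THEN Collect q^


def assign(holder, field, p):
    """holder.field := p, instrumented with the write barrier."""
    if p is not None:
        p.rc += 1                        # INC p.rc
    release(holder.fields.get(field))    # old value q
    holder.fields[field] = p             # q := p


root = Obj("root")
root.rc = 1                              # the root is always referenced

a, b = Obj("a"), Obj("b")
assign(root, "x", a)
assign(a, "next", b)
assign(root, "x", None)                  # a and b become garbage

c, d = Obj("c"), Obj("d")
assign(root, "y", c)
assign(c, "next", d)
assign(d, "next", c)                     # cycle c -> d -> c
assign(root, "y", None)                  # unreachable, but never collected
```

Incrementing `p.rc` before decrementing the old value keeps the self-assignment `q := q` safe. The cycle `c`, `d` keeps `rc >= 1` forever, which is exactly the "no support for circular structures" problem.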

Memory Management Garbage Collection: Mark & Sweep Mark-Phase (Garbage Detection) Compute the Root-set consisting of global pointers (statics) in each module local pointers on the stack in each PAF temporary pointers in the CPU’s registers Traverse the graph of the live objects starting from the root-set with depth-first strategy; mark all reached objects. Sweep-Phase (Garbage Collection) Linear heap traversal. Non-marked blocks are inserted into free-lists. Optimization: lazy sweeping (sweep during allocation, allocation gets slower) System-Software WS 04/05
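A minimal sketch of the two phases, assuming a heap modelled as a plain list of `Block` objects (a made-up class) and an explicit stack for the depth-first traversal:

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.pointers = []         # outgoing references
        self.marked = False


def mark(roots):
    """Mark phase: depth-first traversal from the root set."""
    stack = list(roots)
    while stack:
        block = stack.pop()
        if not block.marked:
            block.marked = True
            stack.extend(block.pointers)


def sweep(heap):
    """Sweep phase: linear heap traversal; unmarked blocks go to the free list."""
    free_list = []
    for block in heap:
        if block.marked:
            block.marked = False   # reset for the next collection
        else:
            free_list.append(block)
    return free_list
```

Unlike reference counting, an unreachable cycle is never reached from the roots and is therefore swept. The explicit stack is exactly the extra memory that the pointer rotation technique avoids.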

Memory Management Garbage Collection: root-set Run-time support from the object system: hidden data structures with (compiler-generated) information about pointers (metadata). Conservative approach: guess which values could be pointers and treat them as such. [Figure: an instance pointer leads to an Object Instance whose Type Tag refers to a Type Descriptor listing pointer offsets (off1, off2); a global pointer leads to Module Data, described by a Module Descriptor listing pointer offsets]

Memory Management Garbage Collection: Mark with Pointer Rotation/1 Problem: garbage collection is called when free memory is low, but mark may require a lot of memory. Solution: pointer rotation algorithm (Deutsch, Schorr, Waite): memory efficient; iterative; structures are temporarily inconsistent; non-concurrent; non-incremental

Memory Management Garbage Collection: Mark with Pointer Rotation/2 Simple case: list traversal q p q p p.link System-Software WS 04/05
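For the list case, the rotation of (p.link, q, p) can be sketched in Python. The `Node` class and the two-pass structure (one pass to mark, a second pass to restore the links) are illustrative assumptions rather than the slide's exact procedure.

```python
class Node:
    def __init__(self, name, link=None):
        self.name = name
        self.link = link
        self.marked = False


def rotate_through(head, visit):
    """Walk a list with no extra stack by rotating (p.link, q, p)."""
    q, p = None, head
    while p is not None:
        visit(p)
        # The right-hand side is evaluated first, then p.link, q, p are assigned
        # in that order, so p.link still refers to the old p.
        p.link, q, p = q, p, p.link
    return q                       # last node; the list is now reversed


def mark_list(head):
    def set_mark(node):
        node.marked = True

    last = rotate_through(head, set_mark)             # forward pass: mark, links reversed
    return rotate_through(last, lambda node: None)    # backward pass: links restored
```

Between the two passes the list is reversed, which is why the slides call the structures temporarily inconsistent and the algorithm non-concurrent.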

Memory Management Garbage Collection: Mark with Pointer Rotation/3 Generic case: structure traversal q p q p System-Software WS 04/05

Memory Management Garbage Collection: Memory Compaction MS .NET nextavail pointer: partitions the heap between allocated and free space. Allocate: increment nextavail. The Garbage Collector performs memory compaction. [Figure: nextavail advanced by ALLOC and moved back by GC]

Memory Management Garbage Collection: Stop & Copy Partition heap in from and to regions Collection: traverse objects in from, copy to to leave forwarding pointer behind requires read barrier swap from and to Characteristics copying incremental (generational) access p instrument code with read barrier IF p is moved THEN replace p with forwarding pointer END; access p System-Software WS 04/05
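The copy step with forwarding pointers can be sketched as below. The to-space is modelled as a Python list and the scan pointer as an index; `Obj`, `copy`, and `collect` are made-up names, and the read barrier is left out because the mutator is stopped during this simplified collection.

```python
class Obj:
    def __init__(self, name, fields=None):
        self.name = name
        self.fields = fields or []     # references to other objects
        self.forward = None            # forwarding pointer, set once copied


def copy(obj, to_space):
    """Copy obj into to-space unless it was already moved; return the new copy."""
    if obj.forward is None:
        clone = Obj(obj.name, list(obj.fields))
        obj.forward = clone            # leave a forwarding pointer behind
        to_space.append(clone)
    return obj.forward


def collect(roots):
    to_space = []
    new_roots = [copy(root, to_space) for root in roots]
    scan = 0
    while scan < len(to_space):        # objects before scan are fully processed
        obj = to_space[scan]
        obj.fields = [copy(field, to_space) for field in obj.fields]
        scan += 1
    return new_roots, to_space         # the old from-space can now be reused
```

Objects that are never reached are simply not copied, so garbage costs nothing to dispose of, and the survivors end up compacted in the to-space.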

Memory Management Garbage Collection: Stop & Copy 1 2 from to from to 3 4 from to to from System-Software WS 04/05

Memory Management Garbage Collection: Concurrent GC User Process „Stop-and-Go“ Approach „Incremental“ Approach Mutator GC Mutator GC Mutator Mutator Mutator Mutator Mutator Real-Time Constraint GC System-Software WS 04/05

Memory Management Garbage Collection: Tricolor marking „Wave-front“ Model State Color already traversed, behind wave black being traversed, on the wave grey not reached yet, in front of the wave white System-Software WS 04/05

Memory Management Garbage Collection: Tricolor marking / Isolation Mutator can change pointers at any time Critical case: black  white Remedy Write-Barrier color B gray color W gray Write Barrier B W unreachable System-Software WS 04/05
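The critical black → white case and the "color W gray" remedy can be sketched as follows. The names `Obj`, `write_ref`, and `mark_step` are hypothetical; the alternative remedy (reverting B to gray) is not shown.

```python
WHITE, GREY, BLACK = "white", "grey", "black"


class Obj:
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.refs = []


def write_ref(src, dst, grey_list):
    """Mutator stores a pointer src -> dst; the write barrier greys a white target."""
    src.refs.append(dst)
    if src.color == BLACK and dst.color == WHITE:
        dst.color = GREY               # dst is put back on the wave front
        grey_list.append(dst)


def mark_step(grey_list):
    """One incremental step of the collector: blacken one grey object."""
    obj = grey_list.pop()
    for ref in obj.refs:
        if ref.color == WHITE:
            ref.color = GREY
            grey_list.append(ref)
    obj.color = BLACK
```

Without the barrier, a white object reachable only from a black one would never be visited and would be reclaimed although it is live.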

Memory Management Garbage Collection: Baker's Treadmill Heap: doubly-linked chain of objects. [Figure: Free-Space, From-Space, To-Space; curscan]

Memory Management Garbage Collection: Baker's Treadmill [Figure: Free-Space, To-Space, From-Space; curscan; conservative allocation, progressive allocation]

Memory Management Garbage Collection: Baker's Treadmill [Figure: Free-Space, To-Space, From-Space; curscan; collect reference]

Memory Management Garbage Collection: Baker's Treadmill State transitions after GC is complete: From-Space + Free-Space → Free-Space; To-Space → From-Space. Fragmentation: external: not removed; internal: depends on supported block sizes. Allocation: conservative: black; progressive: white. [Figure: NEW(x), x, curscan, y, Root Set, NEW(y)]

Memory Management Generational Garbage Collection Collect where garbage is most likely to be found. Expected object life: young → short life (temp data); old → long life. Generations G0, G1, G2 with GC frequency: G0 high, G1 medium, G2 low. Special handling for pointers across different generations is required. [Figure: objects A to J distributed over the generations G0, G1, G2]

Memory Management Garbage Collection: Finalization Finalization (after-use cleanup) User-defined routine when object is collected Establish Consistency save buffers flush caches Release Resources close connections release file descriptors Dangers: Resurrection of objects: objects added to live structures Finalization sequence is undefined System-Software WS 04/05

Memory Management Garbage Collection: .NET Finalization Example Queue Rules: objects with finalizer belong to older generation finalizer only called once (ReRegisterForFinalize) FinalizationQueue: live object with finalizer FreachableQueue: collected objects to be finalized Finalization executed by different process for security reasons garbage D B C A B A GC E A Finalization Queue D C B E Freachable Queue A B thread System-Software WS 04/05

Memory Management Garbage Collection: Weak Pointers Objects referenced only through a weak pointer can be collected by the GC in case of need Used for Caches and Buffers Implementation Weak Pointers are not registered to the GC Use a weak reference table (indirect access) garbage in use garbage weak reference weak reference table System-Software WS 04/05

Memory Management Garbage Collection: Weak Pointers Example Oberon: internal file list system must keep track of open files to avoid buffer duplication file descriptor must be collected once user has no more reference to it use weak pointer in the system (otherwise would keep file alive!) System-Software WS 04/05
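The file-list idea maps directly onto Python's `weakref.WeakValueDictionary`. This is only an analogy to the Oberon mechanism; the class `FileDescriptor` and the function `open_file` are invented for the example, and prompt reclamation after the last strong reference disappears is CPython behaviour.

```python
import gc
import weakref


class FileDescriptor:
    def __init__(self, name):
        self.name = name


# The system's list of open files holds only weak references,
# so the list itself never keeps a descriptor alive.
open_files = weakref.WeakValueDictionary()


def open_file(name):
    """Return the existing descriptor for name, or create and register a new one."""
    fd = open_files.get(name)
    if fd is None:
        fd = FileDescriptor(name)
        open_files[name] = fd
    return fd
```

Opening the same name twice returns the same descriptor, which avoids buffer duplication; once the user drops the last reference, the entry disappears from `open_files`.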

Memory Management Object Pools Application keeps a pool of preallocated object instances; handles allocation and disposal Simulation discrete events Buffers in a file system Provide dynamic allocation in real-time system PROCEDURE NewT (VAR p: ObjectT); BEGIN IF freeT = NIL THEN NEW(p) ELSE p := freeT; freeT := freeT.next END END NewT; PROCEDURE DisposeT (p: ObjectT); BEGIN p.next := freeT; freeT := p END DisposeT; System-Software WS 04/05

Garbage Collection, Recap GC kinds: compacting, copying, incremental, generational. Helpers: write barrier, read barrier, forwarding pointer, pointer rotation. Algorithms: Ref-Count, Mark & Sweep, Stop & Copy, Mark & Copy (.NET), Baker's Treadmill, Dijkstra / Lamport, Steele

Distributed Object Systems Overview Goals object-based approach hide communication details Advantages more space more CPU redundancy locality Problems Coherency ensure that same object definition is used Interoperability serialization type consistency type mapping Object life-time distributed garbage collection System-Software WS 04/05

Distributed Object Systems Architecture Naming Service Client Server Call Context Application Proxy Stub Impl. Message Object Broker Object Broker Impl. Skeleton IDL-Compiler IDL IDL-Compiler System-Software WS 04/05

Remote Procedure Invocation Overview Problem: send structured information from A to B; A and B may have different memory layouts ("endianness"). How is 0x1234 (2 bytes) represented in memory? Big-Endian (network byte ordering): MSB before LSB, stored as 12 34; IBM, Motorola, Sparc. Little-Endian (little end first): LSB before MSB, stored as 34 12; VAX, Intel
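The two layouts of 0x1234 can be reproduced with Python's standard `struct` module, where `>` selects big-endian, `<` little-endian, and `!` network byte order:

```python
import struct

value = 0x1234
big = struct.pack(">H", value)       # big-endian: MSB first
little = struct.pack("<H", value)    # little-endian: LSB first

print(big.hex(), little.hex())       # 1234 3412

# Network byte order ("!") is big-endian, so it decodes the big-endian bytes.
decoded = struct.unpack("!H", big)[0]
print(hex(decoded))                  # 0x1234
```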

Definitions Serialization: conversion of an object's instance into a byte stream. Deserialization: conversion of a stream of bytes into an object's instance. Marshaling: gathering and conversion (may require serialization) to an appropriate format of all relevant data, e.g. in a remote method call; includes details like name representation.

Remote Procedure Invocation Protocol Overview big-endian representation Protocols RPC + XDR (Sun) RFC 1014, June 1987 RFC 1057, June 1988 IIOP / CORBA (OMG) V2.0, February 1997 V3.0, August 2002 SOAP / XML (W3C) V1.1, May 2000 ... XDR Type System [unsigned] Integer (32-bit) [unsigned] Hyper-Integer (64-bit) Enumeration (unsigned int) Boolean (Enum) Float / Double (IEEE 32/64-bit) Opaque String Array (fix + variable size) Structure Union Void System-Software WS 04/05
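A few XDR encodings can be written by hand with `struct`, assuming the RFC 1014 rules that every item is big-endian and occupies a multiple of 4 bytes. The helper names below are invented for this sketch and cover only part of the type system listed above.

```python
import struct


def xdr_int(value):
    return struct.pack(">i", value)          # 32-bit signed, big-endian


def xdr_hyper(value):
    return struct.pack(">q", value)          # 64-bit signed, big-endian


def xdr_bool(value):
    return xdr_int(1 if value else 0)        # boolean is an enumeration


def xdr_string(text):
    data = text.encode("ascii")
    padding = (4 - len(data) % 4) % 4        # pad to a multiple of 4 bytes
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


print(xdr_int(1).hex())          # 00000001
print(xdr_string("abc").hex())   # 0000000361626300
```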

Remote Procedure Invocation RPC Protocol Remote Procedure Call Marshalling of procedure parameters Message Format Authentication Naming Client Server PROCEDURE P(a, b, c) pack parameters send message to server await response unpack response Server unpack parameters find procedure invoke pack response send response P(a, b, c) System-Software WS 04/05
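The client and server steps can be simulated in a single process. The wire format (procedure id, argument count, 32-bit integer arguments), the table `PROCEDURES`, and the direct function call that stands in for the network are all assumptions made for this sketch.

```python
import struct

# Server side: procedures that can be invoked remotely, found by id.
PROCEDURES = {1: lambda a, b, c: a + b + c}


def pack_call(proc_id, *args):
    """Client: pack the procedure id and integer parameters into a message."""
    header = struct.pack(">II", proc_id, len(args))
    return header + b"".join(struct.pack(">i", arg) for arg in args)


def server_handle(message):
    """Server: unpack parameters, find procedure, invoke, pack response."""
    proc_id, count = struct.unpack_from(">II", message, 0)
    args = struct.unpack_from(">%di" % count, message, 8)
    result = PROCEDURES[proc_id](*args)
    return struct.pack(">i", result)


def call(proc_id, *args):
    """Client stub: send the message, await the response, unpack it."""
    reply = server_handle(pack_call(proc_id, *args))   # stands in for the network
    return struct.unpack(">i", reply)[0]


print(call(1, 1, 2, 3))   # 6
```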

Distributed Object Systems Details References vs. values: client receives a reference to the remote object; data values are copied to the client for efficiency reasons; decide whether an object is sent as a reference or as a value: by value: serializable (Java, .NET), valuetype (CORBA); by reference: MarshalByRefObject (.NET), java.rmi.Remote (Java), default (CORBA). Object creation: server creates objects; client creates objects; server can return references. Object instances: one object for all requests; one object for each request; one object per proxy. Conversation state: stateless; stateful

Distributed Object Systems Distr. Object Systems vs. Service Architecture Distributed Object System: object-oriented model; object references; stateful / stateless; tight coupling; internal communication between an application's tiers. Service Architecture: OO-model / RPC; service references; stateless; loose coupling; external communication between applications

Distributed Object Systems Distr. Object Systems vs. Service Architecture Distributed object system (components / objects): stateful and stateless conversation; transactions. Services: remote procedure calls; stateless conversation (session?); messages. Coupling, from tight to loose: Remoting, RMI; CORBA; Web Services. Environment: from homogeneous to heterogeneous

Distributed Object Systems Type Mapping Interoperability Type System Type System 1 Type System 2 Possible Types Possible Types Possible Types Mappable Types Mappable Types Interop Subset System-Software WS 04/05

Distributed Object Systems Type Mapping, Example Java Type System CORBA Type System CLS Type System char enum enum double double double char wchar char union union union custom implementation custom implementation System-Software WS 04/05

Distributed Object Systems Examples Standards OMG CORBA IIOP Web Services SOAP Frameworks Java RMI (Sun) DCOM (Microsoft) .NET Remoting (Microsoft) IIOP.NET System-Software WS 04/05

Distributed Object Systems CORBA Common Object Request Broker Architecture                                                                                                       Client Application Object Remote Architecture Object Skeleton Client Stub Interface Repository Implementation Repository Object Adaptor CORBA Runtime CORBA Runtime Client Server „Object-Bus“ ORB ORB GIOP/IIOP TCP/IP Socket System-Software WS 04/05

Distributed Object Systems CORBA CORBA (Common Object Request Broker Architecture) is a standard from the OMG (Object Management Group). CORBA is useful for: building distributed object systems; heterogeneous environments; tight integration. CORBA defines: an object-oriented type system; an interface definition language (IDL); an object request broker (ORB); an inter-ORB protocol (IIOP) to serialize data and marshal method invocations; language mappings for Java, C++, Ada, COBOL, Smalltalk, Lisp, Python; ... and many additional standards and interfaces for distributed security, transactions, ...

Distributed Object Systems CORBA Basic Types integers 16-, 32-, 64bit integers (signed and unsigned) IEEE floating point 32-, 64-bit and extended-precision numbers fixed point char, string 8bit and wide boolean opaque (8bit), any enumerations Compound Types struct union sequence (variable-length array) array (fixed-length) interface concrete (pass-by-reference) abstract (pure definition) value type pass-by-value abstract (no state) Operations in / out / inout parameters raises Attributes System-Software WS 04/05

Distributed Object Systems CORBA / General Inter-ORB Protocol (GIOP) CDR (Common Data Representation): variable byte ordering; aligned primitive types; all CORBA types supported. IIOP (Internet IOP): GIOP over TCP/IP; defines the Interoperable Object Reference (IOR): host, port, key. Message format: defined in IDL; messages: Request, Reply, CancelRequest, LocateRequest, LocateReply, CloseConnection, MessageError, Fragment; byte ordering flag. Connection management: request multiplexing; asymmetrical / bidirectional connections

Distributed Object Systems CORBA / GIOP Message in IDL
    module GIOP {
      struct Version { octet major; octet minor; };
      enum MsgType_1_0 {
        Request, Reply, CancelRequest,
        LocateRequest, LocateReply,
        CloseConnection, MessageError
      };
      struct MessageHeader {
        char Magic[4];               // the four characters "GIOP"
        Version GIOP_Version;
        boolean byte_order;          // byte ordering flag
        octet message_type;          // one of MsgType_1_0
        unsigned long message_size;  // size of the message following the header
      };
    }; // end of module GIOP
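The byte ordering flag decides how the rest of the message is read. The sketch below assumes the GIOP 1.0 header layout as I understand it from the OMG specification (12 bytes: magic, version, byte-order flag, a one-octet message type, a 32-bit message size); the helpers `pack_header` and `unpack_header` and the two-entry table `MSG_CODES` are hypothetical.

```python
import struct

# Only two message codes are listed; this table is a hypothetical helper.
MSG_CODES = {"Request": 0, "Reply": 1}


def pack_header(msg_type, body_size, little_endian=False, version=(1, 0)):
    flag = 1 if little_endian else 0              # byte ordering flag
    order = "<" if little_endian else ">"
    return (b"GIOP" + bytes(version) + bytes([flag, MSG_CODES[msg_type]])
            + struct.pack(order + "I", body_size))


def unpack_header(data):
    magic, major, minor, flag, code = struct.unpack_from(">4sBBBB", data, 0)
    order = "<" if flag else ">"                  # the flag selects the size's byte order
    (size,) = struct.unpack_from(order + "I", data, 8)
    return magic, (major, minor), bool(flag), code, size


header = pack_header("Request", 256)
print(header.hex())   # 47494f500100000000000100
```

The sender writes in its native byte order and the receiver swaps only if needed, which is what "variable byte ordering" in CDR refers to.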

Distributed Object Systems CORBA Services System-level services defined in IDL Provide functionality required by most applications Naming Service Allows local or remote objects to be located by name Given a name, returns an object reference Hierarchical directory-like naming tree Allows getting initial reference of object Event Service Allows objects to dynamically register interest in an event Object will be notified when event occurs Push and pull models ... and more Trader, LifeCycle, Persistence, Transaction, Security System-Software WS 04/05

Distributed Object Systems WebServices Service-oriented architecture Rely on existing protocols SOAP messaging protocol WSDL service description protocol UDDI service location protocol Web Services SOAP HTTP TCP/IP System-Software WS 04/05

Distributed Object Systems SOAP Simple Object Access Protocol communication protocol XML-based describes object values XML Schemas as interface description language basic types string, boolean, decimal, float, double, duration, datetime, time, date, hexBinary, base64Binary, URI, Qname, NOTATION structured types list, union SOAP Message SOAP Envelope SOAP Header SOAP Body Method Call packed as structure messages are self-contained no external object references System-Software WS 04/05

Distributed Object Systems SOAP Message SOAP Envelope SOAP Header SOAP Body Example float Multiply(float a, float b); System-Software WS 04/05

Distributed Object Systems SOAP Example (Request) POST /quickstart/aspplus/samples/services/MathService/CS/MathService.asmx HTTP/1.1 Host: samples.gotdotnet.com Content-Type: text/xml; charset=utf-8 Content-Length: length SOAPAction: "http://tempuri.org/Multiply" <?xml version="1.0" encoding="utf-8"?> <soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"> <soap:Body> <Multiply xmlns="http://tempuri.org/"> <a>float</a> <b>float</b> </Multiply> </soap:Body> </soap:Envelope> System-Software WS 04/05

Distributed Object Systems SOAP Example (Answer) HTTP/1.1 200 OK Content-Type: text/xml; charset=utf-8 Content-Length: length <?xml version="1.0" encoding="utf-8"?> <soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"> <soap:Body> <MultiplyResponse xmlns="http://tempuri.org/"> <MultiplyResult>float</MultiplyResult> </MultiplyResponse> </soap:Body> </soap:Envelope> System-Software WS 04/05

Distributed Object Systems SOAP Example (Service Description-1) <?xml version="1.0" encoding="utf-8"?> <definitions ....> <types> <s:schema elementFormDefault="qualified" targetNamespace="http://tempuri.org/"> <s:element name="Multiply"> <s:complexType><s:sequence> <s:element minOccurs="1" maxOccurs="1" name="a" type="s:float" /> <s:element minOccurs="1" maxOccurs="1" name="b" type="s:float" /> </s:sequence></s:complexType> </s:element> </s:schema> </types> <message name="MultiplySoapIn"> <part name="parameters" element="s0:Multiply" /> </message> System-Software WS 04/05

Distributed Object Systems SOAP Example (Service Description-2) <binding name="MathServiceSoap" type="s0:MathServiceSoap"> <soap:binding transport="http://schemas.xmlsoap.org/soap/http" style="document" /> <operation name="Multiply"> <soap:operation soapAction="http://tempuri.org/Multiply" style="document" /> <input><soap:body use="literal" /></input> <output><soap:body use="literal" /></output> </operation> </binding> <service name="MathService"> <port name="MathServiceSoap" binding="s0:MathServiceSoap"> <soap:address location="http://samples.gotdotnet.com/quickstart/aspplus/samples/services/MathService/CS/MathService.asmx" /> </port> </service> </definitions> System-Software WS 04/05

Distributed Object Systems WebServices Comments XML (easily readable) system independent standard stateless (encouraged design pattern) bloated big messages (but easily compressed) requires expensive parsing Constraints Services no object references server-activated servant Goes over HTTP requires web server System-Software WS 04/05

Distributed Object Systems WebService Future Use SOAP-Header to store additional information about message or context Many standards to come... WS-Security WS-Policy WS-SecurityPolicy WS-Trust WS-SecureConversation WS-Addressing System-Software WS 04/05

Distributed Object Systems Java RMI Java Remote Method Invocation Object Client Application Lookup Register Lookup Register Object Stub Object Stub Remote Architecture Remote References Remote References Transport Layer Transport Layer Client Server Network TCP/IP Socket System-Software WS 04/05

Distributed Object Systems Java RMI Details Framework supports various implementations e.g. RMI/IIOP mapping limited to the Java type system, workarounds needed uses reflection to inspect objects System-Software WS 04/05

Distributed Object-Systems Low-Level Details: Java RMI/IIOP Common Type-System restricted CORBA Marshalling name mapping remote objects only references Interface Description Language (IDL) java to IDL mapping Message representation Underlying protocol IIOP (CORBA) System-Software WS 04/05

Distributed Object Systems Microsoft DCOM Distributed Component Object Model [Figure: Client: Application, Object Proxy, COM Runtime, SCM, Registry; Server: Object, Object Stub, COM Runtime, SCM, Registry, OXID Resolver, Ping Server; Remote Architecture; SCMs and Registration; RPC Channel; Network]

Distributed Object Systems Microsoft .NET Remoting new Instance() or Activator.GetObject(...) Client Transparent Proxy Channel Channel Instance Application Domain Boundary ObjRef IChannelInfo ChannelInfo; IEnvoyInfo EnvoyInfo; IRemotingTypeInfo TypeInfo; string URI; Network

Distributed Object Systems Microsoft .NET Remoting Client Proxy Dispatcher Instance channel channel Message Chan.Sink(s) Message Chan.Sink(s) custom operations Instance s = new Instance(); s.DoSomething(); Formatter Formatter serialize object Stream Chan.Sink(s) Stream Chan.Sink(s) custom operations Transport Sink Transport Sink handle communication Network System-Software WS 04/05

Distributed Object Systems Microsoft .NET Remoting Activation client one instance per activation server / Singleton one instance of object server / SingleCall one instance per call Leases (Object Lifetimes) renew lease on call set maximal object lifetime Serialization SOAP Warning: non-standard types, only for .NET use binary user defined Transport TCP HTTP System-Software WS 04/05

Distributed Object Systems Microsoft .NET Remoting (Object Marshalling) MarshalByRefObject: remoted by reference; the client receives an ObjRef object, which is a "pointer" to the original object. [Serializable]: all fields of the instance are cloned to the client; [NonSerialized] fields are ignored. ISerializable: the object has a method to define its own serialization. [Figure: by reference, AppDomain 1 holds Obj and AppDomain 2 holds a Proxy created from the serialized ObjRef; by value, AppDomain 1 holds Obj and AppDomain 2 holds a copy Obj' created from the serialized fields fld1 ... fldn]

Distributed Object Systems Microsoft .NET Remoting, Activation “stateless” Server-Side Activation (Well-Known Objects) Singleton Objects only one instance is allocated to process all requests SingleCall Objects one instance per call is allocated Client-Side Activation Client Activated Objects the client allocates and controls the object on the server “stateful” System-Software WS 04/05

Distributed Object Systems Microsoft .NET Remoting, Limitations Server-Activated Objects object configuration limited to the default constructor Client-Activated Objects class must be instantiated, no access over interface class hierarchy limitations use Factory Pattern to get interface reference to allow parametrization of the constructor Furthermore... interface information is lost when passing an object reference to another machine no control over the channel which channel is used which peer is allowed to connect System-Software WS 04/05

Distributed Object Systems Case Study: IIOP.NET Opensource project based on ETH-Diploma thesis http://iiop-net.sourceforge.net/ IIOP.NET (marketing) „Provide seamless interoperability between .NET and CORBA-based peers (including J2EE)“ IIOP.NET (technical) .NET remoting channel implementing the CORBA IIOP protocol Compiler to make .NET stubs from IDL definitions IDL definition generator from .NET metadata System-Software WS 04/05

Distributed Object Systems Case Study: IIOP.NET [Figure: client and server (J2EE, Java, CORBA objects) connected via IIOP] Binary IIOP rather than SOAP: transparent reuse of existing servers; tight coupling; object-level granularity; efficiency. Runtime: standard .NET remoting channel for IIOP: transport sink, formatter, type-mapper. Build tools: IDL → CLS compiler; CLS → IDL generator. [Figure: Java Type System, IDL Type System, CLS Type System; Possible Types; IDL Mappable Types; Interop Subset]

Distributed Object Systems Case Study: IIOP.NET, Interoperability Application This is what we want Services Distributed Transaction Coordinator, Active Directory, … Conversation Activation model (EJB, MBR), global naming, distributed garbage collection, conversational state,… Contextual Data Interception Layer SessionID, TransactionID, cultureID, logical threadID … Message Format RPC, IIOP, HTTP, SOAP, proprietary binary format, messages, unknown data (exceptions), encryption Communication Protocol exchange raw data (bytes) across machines Data Model give structure to data type system / object model Message Format define encoding, serialization, allowed messages (invocation, ...) Contextual Data additional information (context), hidden information support for upper layers application infrastructure Conversation interaction model, message workflow Services Common external services Application the whole universe! Data Model Type system, mapping and conversion issues Communication Protocols TCP/UDP, Byte stream, point-to-point communication System-Software WS 04/05

Distributed Object Systems Case Study: IIOP.NET, Granularity Service Service Component Component Object Object Object Object Component Object Object Service Component Object Granularity Message-based Interface, Stateless Strongly-typed Interface, Stateless or Stateful Implementation Dependency, Stateful Coupling, Interaction System-Software WS 04/05

Distributed Object Systems Case Study: IIOP.NET [Figure: release timeline] Releases: 1.0.0 08.05.03; 1.1.0 28.05.03; 1.2.0 08.06.03; 1.2.1 15.06.03; 1.2.2 23.06.03; 1.2.3 01.07.03; 1.3.0 14.07.03; 1.3.1 04.08.03; 1.4.0 25.08.03; 1.4.1 21.09.03; 1.5.0 12.10.03; 1.5.1 01.12.03; 1.6.0 20.03.04. Articles: 1st article 01.07.03; 2nd article 27.08.03

Distributed Object Systems Case Study: IIOP.NET, Performance Test Case: WebSphere 5.0.1 as server Clients IBM SOAP-RPC Web Services IBM Java RMI/IIOP IIOP.NET Response time receiving 100 beans from server WS: 4.0 seconds IIOP.NET: 0.5 seconds when sending many more beans, WS are then 200% slower than IIOP.NET Source: posted on IIOP.NET forum System-Software WS 04/05

Processes and Threads Introduction CPU as resource, provide abstraction to it Allow multiprogramming pseudo-parallelism (single-processors) real parallelism (multi-processors) Required abstractions multiple activities -- execution of instructions protection of resources synchronization of activities Topics coroutines processes threads scheduling fairness starvation synchronization deadlocks System-Software WS 04/05

Processes and Threads Multithreading Stack 2 Stack 1 Call a.run Call b.Q Call b.q Return b.q Call c.R Return c.R Return b.Q Return a.run Call c.run Call d.Q Call d.q Return d.q Call e.R Return e.R Return d.Q Return c.run e.R b.q Thread 1 Thread 2 b.q d.Q b.Q c.run a.run 1 2 time 2 2 1 1 time time System-Software WS 04/05

Processes and Threads Coroutines (1) each activity has its own stack, address-space is shared explicit context switch (stack only) under programmer‘s control uses Transfer call switch to another coroutine System-Software WS 04/05
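Explicit transfer between coroutines can be imitated with Python generators. This is an analogy only: each generator keeps its own frame, and yielding the name of the next coroutine plays the role of the Transfer call. The trampoline `run` and the coroutines `ping` and `pong` are invented for the example.

```python
def run(coroutines, start):
    """Trampoline: resume the named coroutine until one of them terminates."""
    current = start
    while current is not None:
        try:
            current = next(coroutines[current])   # resume; get the transfer target
        except StopIteration:
            current = None                        # a coroutine ended: stop


trace = []


def ping():
    for i in range(3):
        trace.append(("ping", i))
        yield "pong"                              # Transfer(pong)


def pong():
    for i in range(3):
        trace.append(("pong", i))
        yield "ping"                              # Transfer(ping)


run({"ping": ping(), "pong": pong()}, "ping")
print(trace)
```

As with real coroutines, the switch happens only where the programmer wrote it, so no locking is needed between the two activities.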

Processes and Threads Coroutines (2) [Figure: Subroutines: Call, Call, Return, Return; Coroutines: Start, Transfer, Start, Transfer]

Processes and Threads Coroutines (3) TYPE Coroutine = POINTER TO RECORD FP: LONGINT; stack: POINTER TO ARRAY OF SYSTEM.BYTE; END; VAR cur: Coroutine; (* Current Coroutine *) PROCEDURE Transfer*(to: Coroutine); BEGIN SYSTEM.GETREG(SYSTEM.EBP, cur.FP); cur := to; SYSTEM.PUTREG(SYSTEM.EBP, cur.FP); END Transfer; PUSH EBP SUB ESP, 4 save FP restore FP MOV ESP, EBP POP EBP RET 4 System-Software WS 04/05

Processes and Threads Coroutines (4) to’ SP FP PC’ FP’ locals stackQ stackP Q pcx FP” Transfer(Q) to’ SP FP PC’ FP’ locals stackQ stackP to’ SP FP PC’ FP’ locals stackQ stackP Q pcx FP” FP := Q.FP FP stackQ stackP Q pcx locals FP” SP return jump at PC’ System-Software WS 04/05

Processes and Threads Coroutines (5) Current stack: current execution state. All other stacks: the top PAF (procedure activation frame) contains the last Transfer call. Start: create a stack with a fake Transfer-like PAF. PROCEDURE Start(C: Coroutine; size: LONGINT); VAR tos: LONGINT; BEGIN NEW(C.stack, size); tos := SYSTEM.ADR(C.stack[0])+LEN(C.stack); SYSTEM.PUT(tos-4, 0); (* par = null *) SYSTEM.PUT(tos-8, 0); (* PC’ = null, not allowed to return *) SYSTEM.PUT(tos-12, 0); (* FP’ *) C.FP := tos-12 (* saved FP of the new coroutine *) END Start;

Processes and Threads Problems caused by multitasking Concurrent access to resources protection limit access to a resource synchronization synchronize task with resource state or other task Concurrent access to CPU task priorities scheduling One problem’s solution is another problem’s cause.... deadlocks fairness deadlines / periodicity constraints System-Software WS 04/05

Processes and Threads Protection: Mutual Exclusion Mutual exclusion: only one activity is allowed to access a resource at a time. Disable interrupts (single CPU only, avoids switches). Locks: flag (lock taken / lock free); spin lock (uses busy waiting); exclusive lock; read-write lock (multiple readers, one writer)

Processes and Threads Protection: Monitor Shared resources as Monitor resources are passive objects execution of critical sections inside monitor is mutually exclusive Global Monitor Lock Shared Monitor Lock for read-access (optional) monitor as a special module [original version (Hoare, Brinch Hansen)] object instance as monitor method and code block granularity Java, C#, Active Oberon, ... Resource task P task Q acquire acquire release release System-Software WS 04/05

Processes and Threads Protection one waiting queue per resource is required Simplistic implementation with coroutines Non-reentrant lock (no recursion allowed) PROCEDURE Acquire(r: Resource); BEGIN IF r.taken THEN InsertList(r.waiting, cur); SwitchToNextRoutine() ELSE r.taken := TRUE END END Acquire; PROCEDURE Release(r: Resource); BEGIN next := GetFromList(r.waiting); IF next # NIL THEN InsertList(ready , next); Transfer(GetNextTask()); ELSE r.taken := FALSE END END Release; System-Software WS 04/05

Processes and Threads Protection Shared resource as Process synchronization during communication Communicating Sequential Processes (CSP) C.A.R. Hoare (1978) Model of communication „Rendez-vous“ between two processes P!x (send x to process P) Q?y (ask y from process Q) Used in Ada, Occam task P task Q task P task Q P?x Q!z Q!z P?x System-Software WS 04/05

Processes and Threads Protection Some variations on the theme.... Reentrant Locks Readers / Writers one writer or multiple readers allowed Binary Semaphores one activity can get the resource Generic Semaphores N activities are allowed to get the resource System-Software WS 04/05

Processes and Threads Synchronization Wait on a condition / state Signals with Send/Wait Methods Require cooperation from all processes Example: Producer/Consumer with conditions nonempty/nonfull Semantic of Send Send-and-Pass vs. Send-and-Continue Generic system-handled conditions (Active Oberon) AWAIT(x > y); Wait on partner process CSP System-Software WS 04/05

Processes and Threads Synchronization: Implementation Example Process list double-chained list of all coroutines cur points to current (running) coroutine each signal has a LIFO list ready C1 C4 C2 link s C3 C5 cur Signal System-Software WS 04/05

Processes and Threads Synchronization: Implementation Example Terminate: cur.next.prev := cur.prev; cur.prev.next := cur.next; Schedule. Schedule: prev := cur; WHILE ~cur.ready & (cur.next # prev) DO cur := cur.next END; IF cur.ready THEN Transfer(cur) ELSE (*deadlock*) END

Processes and Threads Synchronization: Implementation Example Init(s) s := NIL Wait(s) cur.link := s; s := cur; cur.ready := FALSE; Schedule (*to next ready from cur*) Send(s) IF s # NIL THEN (*send-and-pass*) cur := s; s.ready := TRUE; s := s.link END; Schedule (*to next ready from cur*) System-Software WS 04/05

Processes and Threads Active Oberon: Bounded Buffer Buffer* = OBJECT VAR data: ARRAY BufLen OF INTEGER; in, out: LONGINT; (* Put - insert element into the buffer *) PROCEDURE Put* (i: INTEGER); BEGIN {EXCLUSIVE} (*AWAIT ~full *) AWAIT ((in + 1) MOD BufLen # out); data[in] := i; in := (in + 1) MOD BufLen END Put; (* Get - get element from the buffer *) PROCEDURE Get* (VAR i: INTEGER); BEGIN {EXCLUSIVE} (*AWAIT ~empty *) AWAIT (in # out); i := data[out]; out := (out + 1) MOD BufLen END Get; PROCEDURE & Init; BEGIN in := 0; out := 0; END Init; END Buffer; System-Software WS 04/05
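For comparison, the same bounded buffer can be sketched in Python, where the conditions that Active Oberon re-evaluates automatically must be waited for and signalled explicitly through a `threading.Condition`. The attribute name `inp` replaces `in`, which is a reserved word in Python.

```python
import threading


class Buffer:
    def __init__(self, buf_len):
        self.data = [None] * buf_len
        self.buf_len = buf_len
        self.inp = 0                     # index of the next free slot
        self.out = 0                     # index of the next element to remove
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:                  # corresponds to {EXCLUSIVE}
            # AWAIT ~full
            self.cond.wait_for(lambda: (self.inp + 1) % self.buf_len != self.out)
            self.data[self.inp] = item
            self.inp = (self.inp + 1) % self.buf_len
            self.cond.notify_all()       # explicit signal, implicit in Active Oberon

    def get(self):
        with self.cond:
            # AWAIT ~empty
            self.cond.wait_for(lambda: self.inp != self.out)
            item = self.data[self.out]
            self.out = (self.out + 1) % self.buf_len
            self.cond.notify_all()
            return item
```

As in the Oberon version, one slot always stays unused so that `inp == out` unambiguously means "empty".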

Processes and Threads CSP: Bounded Buffer (I) [bounded_buffer || producer || consumer] producer :: *[<produce item> bounded_buffer ! item; ] consumer :: *[bounded_buffer ? item; <consume item> ] (Source: Geoff Coulson, Lancaster University)

Processes and Threads CSP: Bounded Buffer (II) buffer: (0..9) item; in, out: integer; in := 0; out := 0; *[ in < out+10; producer ? buffer(in mod 10) -> in := in + 1; || out < in; consumer ! buffer(out mod 10) -> out := out + 1; ] System-Software WS 04/05

Processes and Threads Process State Process states Running: actually using the CPU Ready: waiting for a CPU Blocked: unable to run, waiting for external event Process state transitions wait for external event system scheduler external event happens Running 3 1 2 Blocked Ready 4 System-Software WS 04/05

Processes and Threads Process State (Active Oberon) Active Oberon provides monitor-like object protection and conditions. Conditions are checked by the system; no explicit help or knowledge from the user is required (no x.Signal). States: Running, Awaiting Object, Awaiting Condition, Ready

Activities Program (static concept) ≠ Process (dynamic) Processes, jobs, tasks, threads (differences later) program code context: program counter (PC) and registers stack pointer state [new] running waiting ready [terminated] stack data section (heap) System-Software WS 04/05

Processes vs. Threads Process or job (heavyweight) code address space processor state private data (stack+registers) can have multiple threads Thread (lightweight) shared code shared address space processor state private data (stack+registers) Process: task or activity on a computer Kernel CPU System-Software WS 04/05

Processes vs. Threads: Example HEAP 1 HEAP 2 HEAP STACK 1 STACK 2 STACK 1 STACK 2 PROC instr … PROC 1 instr … PROC 2 instr … System-Software WS 04/05

Multitasking Programmed events that can cause a task switch protection (locks) acquire release synchronization wait on a condition send a signal (send-and-pass) System events that can cause a task switch voluntary switch (“yield”, task termination) process with higher priority becomes available consumption of the allowed time quantum synchronous asynchronous task preemption System-Software WS 04/05

Preemption Assign each process a time-quantum (normally in the order of tens of ms) Asynchronous task switches can happen at any time! task can be in the middle of a computation save whole CPU state (registers, flags, ...) Perform switch on resource conflict on synchronization request on timer-interrupt (time-quantum is over) System-Software WS 04/05

Context switch
  Scheduler invocation:
    preemption → interrupt
    cooperation → explicit call
  Operations:
    store the process state (PC, regs, …)
    choose the next process (strategy)
    [accounting]
    restore the state of the next process (regs, SP, PC, …)
    jump to the restored PC
  A context switch is usually expensive:
    1–1000 µs, depending on the system and the number of processes
    hardware optimizations (e.g., multiple sets of registers: SPARC, DECSYSTEM-20)

Scheduling algorithms Three categories of environments: batch systems (e.g., VPP, DOS) usually non-preemptive (i.e., task is not stopped by scheduler, only synchronous switches) interactive systems (UNIX, Windows, Mac OS) cooperative or preemptive no task allowed to have the CPU forever real-time systems (PathWorks, RT Linux) timing constraints (deadlines, periodicity) System-Software WS 04/05

Scheduling Performance CPU utilization Throughput number of jobs per time unit minimize context switch penalty Turnaround time = exit time - arrival time execution, wait, I/O Response time = start time - request time Waiting time (I/O, waiting, …) Fairness System-Software WS 04/05

Scheduling algorithm goals All systems Fairness give every task a chance Policy enforcement Balance keep all subsystems busy Interactive systems Response time respond quickly Proportionality meet user’s expectations Batch systems Throughput maximize number of jobs Turnaround time minimize time in system CPU utilization keep CPU busy Real-time systems Meet deadlines avoid losing data Predictability avoid degradation Hard- vs. soft-real-time systems System-Software WS 04/05

Batch Scheduling Algorithms Choose task to run (task is usually not preempted) First Come First Serve (FCFS) fair, may cause long waiting times Shortest Job First (SJF) requires knowledge about job length Longest Response Ratio response ratio = (time in the system / CPU time) depends on the waiting time Highest Priority First with or without preemption Mixed the priority is adjusted dynamically (time in queue, length, priority, …) ETH-VPP is a batch system! Which algorithm does it use? System-Software WS 04/05

Preemptive Scheduling Algorithms
  Time sharing: each task has a predefined time quantum
  Round-Robin: schedule the next task on the ready list (P1 → P2 → P3 → P4 → P1 …)
  Quantum choice:
    small: may cause frequent switches
    big: may cause slow response
  Implicit assumption: all tasks have the same importance

Preemptive Scheduling Algorithms Priority scheduling process with highest priority is scheduled first Variants multilevel queue scheduling one list per priority, use round-robin on list dynamic priorities proportional to time in system inversely proportional to part of quantum used make time quantum proportional to priority System-Software WS 04/05

Real-Time Scheduling Algorithms
  Tasks need to meet their deadlines!
  Task cost is (or should be) known
  Two kinds of tasks: aperiodic, periodic
  Reservation: the scheduler decides whether the system has enough resources for the task
  Algorithms:
    Rate Monotonic Scheduling: assign static priorities (priority proportional to frequency)
    Earliest Deadline First: the task with the closest deadline is chosen

Scheduling Algorithm Example Situation: Tasks P1, P2, P3, P4 Arrive at time t = 0 Priority: P1 highest, P4 lowest Time to process: 10, 2, 5, 3 System-Software WS 04/05

Scheduling Algorithm Example Highest Priority First P1 P2 P3 P4 10 12 17 20 System-Software WS 04/05

Scheduling Algorithm Example Shortest Job First P1 P2 P3 P4 2 5 10 20 System-Software WS 04/05

Scheduling Algorithm Example Timesharing with quantum = 2 P1 P2 P3 P4 2 4 6 8 10 12 14 16 18 20 13 System-Software WS 04/05

Scheduling Algorithm Example
  Timesharing with quantum → 0
  [Figure: timeline for P1, P2, P3, P4 with ticks at 8, 11, 15, 20; tasks first run at 1/4 speed, then at 1/3, then at 1/2]
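The ticks 8, 11, 15, 20 can be reproduced with a small calculation. The sketch below assumes ideal processor sharing (every ready task gets an equal share of the CPU, no switch cost); the function name processor_sharing is made up for this illustration.

```python
from fractions import Fraction

def processor_sharing(work):
    """Return the finish time of each task when ready tasks share the CPU equally."""
    remaining = {name: Fraction(w) for name, w in work.items()}
    finish = {}
    now = Fraction(0)
    while remaining:
        n = len(remaining)
        # The task with the least remaining work finishes next.
        shortest = min(remaining.values())
        now += shortest * n  # each task progresses at rate 1/n
        for name in list(remaining):
            remaining[name] -= shortest
            if remaining[name] == 0:
                finish[name] = now
                del remaining[name]
    return finish

finish = processor_sharing({"P1": 10, "P2": 2, "P3": 5, "P4": 3})
average_turnaround = sum(finish.values()) / len(finish)
```

With the processing times 10, 2, 5, 3 from the example, P2 finishes at 8, P4 at 11, P3 at 15 and P1 at 20, which gives an average turnaround of 13.5.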

Scheduling Algorithm Example: Results
  Situation: tasks P1, P2, P3, P4 arrive at time t = 0
  Priority: P1 highest, P4 lowest
  Time to process: 10, 2, 5, 3
  Results (averages)               turnaround   response time
    Highest Priority First         14.75        9.75
    Shortest Job First              9.25        4.25
    Timesharing with quantum = 2   12.75        3.0
    Timesharing with quantum → 0   13.5         0
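The two non-preemptive rows can be checked with a few lines of Python. This is a sketch under the slide's assumptions (all tasks arrive at t = 0, no switch cost); the helper name averages is hypothetical.

```python
def averages(order, work):
    """Average turnaround and response time for a non-preemptive run order.

    All tasks arrive at t = 0, so turnaround equals the finish time
    and response time equals the start time.
    """
    now = 0
    turnaround = []
    response = []
    for name in order:
        response.append(now)
        now += work[name]
        turnaround.append(now)
    n = len(order)
    return sum(turnaround) / n, sum(response) / n

work = {"P1": 10, "P2": 2, "P3": 5, "P4": 3}
hpf = averages(["P1", "P2", "P3", "P4"], work)    # highest priority first
sjf = averages(sorted(work, key=work.get), work)  # shortest job first
```

Here hpf evaluates to (14.75, 9.75) and sjf to (9.25, 4.25), matching the first two result rows.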

Scheduling Examples
  UNIX
    preemption
    32 priority levels (round robin)
    each second the priorities are recomputed (CPU usage, nice level, last run)
  BSD
    similar
    every 4th tick priorities are recomputed (usage estimation)
  Windows NT
    “real time” priorities: fixed, may run forever
    variable: dynamic priorities, preemption
    idle: last choice (swap manager)

Scheduling Examples: Quantum & Priorities Win2K: quantum = 20ms (professional) 120ms (user), configurable depending on type (I/O bound) BSD: quantum = 100ms priority = f(load,nice,timelast) Linux: quantum = quantum / 2 + priority f(quantum, nice) System-Software WS 04/05

Scheduling Problems
  Starvation: a task is never scheduled (although ready) → “fairness”
  Deadlock: no task is ready (nor will any ever become ready) → detection + recovery, or avoidance

Deadlock Conditions
  Example: thread A holds resource R and wants S; thread B holds S and wants R
  Coffman conditions for a deadlock (1971):
    Mutual exclusion
    Hold and wait
    No resource preemption
    Circular wait (cycle)

Deadlock Remedies Coarser lock granularity: use a single lock for all resources (e.g., Linux 2.0-2.4 “Big Kernel Lock”) Locking order: resources are ordered resource locking according to the resource order (ticketing) Two-phase-locking: try to acquire all the resources if successful, lock them; otherwise free them and try again System-Software WS 04/05
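The locking-order remedy can be illustrated as follows. This is a minimal sketch: ordering locks by id() is just one convenient global order chosen for the example, and the names acquire_in_order, release_all and worker are hypothetical.

```python
import threading

def acquire_in_order(*locks):
    """Acquire all locks following one global order (here: by id)."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

def release_all(ordered):
    """Release the locks in reverse acquisition order."""
    for lock in reversed(ordered):
        lock.release()

r, s = threading.Lock(), threading.Lock()
state = {"counter": 0}

def worker(first, second):
    for _ in range(1000):
        held = acquire_in_order(first, second)  # request order does not matter
        state["counter"] += 1                   # protected by both locks
        release_all(held)

a = threading.Thread(target=worker, args=(r, s))
b = threading.Thread(target=worker, args=(s, r))
a.start(); b.start()
a.join(); b.join()
```

Although the two threads name the resources in opposite order, both always lock them in the same global order, so a circular wait cannot form.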

Deadlock Detection, Prevention & Recovery
  Deadlock detection: the system keeps a graph of locks and tries to detect cycles
    time consuming
    the graph has to be kept consistent with the actual state
  Deadlock prevention (avoidance): remove one of the four Coffman conditions (→ cycles)
  Recovery:
    kill processes and reclaim the resources
    rollback: requires saving the states of the processes regularly
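Cycle detection on such a graph can be sketched with a depth-first search. The representation below (a dictionary mapping each task to the tasks it is waiting for) and the function name find_cycle are assumptions made for this illustration, not how any particular system stores its lock graph.

```python
def find_cycle(wait_for):
    """Return a list of tasks forming a cycle in the wait-for graph, or None."""
    visiting, done = set(), set()

    def visit(task, path):
        if task in done:
            return None
        if task in visiting:
            return path[path.index(task):]  # the part of the path that loops
        visiting.add(task)
        for other in wait_for.get(task, ()):
            cycle = visit(other, path + [task])
            if cycle:
                return cycle
        visiting.discard(task)
        done.add(task)
        return None

    for task in wait_for:
        cycle = visit(task, [])
        if cycle:
            return cycle
    return None

# A waits for B, B waits for C, C waits for A: circular wait.
deadlocked = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
# A waits for B, but B waits for nobody: no deadlock.
fine = find_cycle({"A": ["B"], "B": []})
```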

Simple Deadlock Scenario
  Example: resources R, S, T; tasks A, B, C require { R, S }, { S, T }, { T, R } respectively
  Case 1: sequential execution, no deadlock
    A: +R +S -R -S
    B: +S +T -S -T
    C: +T +R -T -R

Simple Deadlock Scenario
  Case 2: interleaving, deadlock
    A: +R +S
    B: +S +T
    C: +T +R
  [Figure: resource graph over A, B, C and R, S, T containing a cycle]

Complex Deadlock Scenario Case with 6 resources and 7 tasks graphical representation R A B C S D T E F U V is this a case of deadlock? W G System-Software WS 04/05

Deadlock Avoidance Strategy in Bluebottle Processors Timers Each Kernel Module has a lock to protect its data When multiple locks are needed, acquire them according to the module hierarchy Threads Traps Interrupts Modules Module Hierarchy Blocks Memory Locks Configuration Module Lock System-Software WS 04/05

Priority Inversion A high-priority task can be blocked by a lower priority one. Example: High Medium Low waiting running ready System-Software WS 04/05

Priority Inversion Big problem for RTOS Solutions priority inheritance low-priority task holding resource inherits priority of high-priority task wanting the resource priority ceilings each resource has a priority corresponding to the highest priority of the users +1 the priority of the resource is transferred to the locking process can be used instead of semaphores System-Software WS 04/05

Example: Mars Pathfinder (1996–1998)
  VxWorks real-time system: preemptive, priorities
  Communication bus: shared resource (mutexes)
  Low-priority task (short): meteorological data gathering
  Medium-priority task (long): communication
  High-priority task: bus manager
  Detection: watchdog on bus activity → system reset
  Fix: activate priority inheritance via a patch uploaded on the fly (no memory protection)

Locking on Multiprocessor Machines Real parallelism! Cannot “disable interrupts” like on single processor machines (could stop every task, but not efficient) Software solutions Peterson, Dekker, ... Hardware support bus locking atomic instructions (Test And Set, Compare And Swap) System-Software WS 04/05

Locking on multiprocessor machines
  Test And Set
    TAS s:  IF s = 0 THEN s := 1 ELSE CC := TRUE END
  Compare and Swap (Intel)
    CAS R1, R2, A   (R1: expected value, R2: new value, A: address)
      IF R1 = M[A] THEN M[A] := R2; CC := TRUE
      ELSE R1 := M[A]; CC := FALSE
      END
  These instructions are atomic even on multiprocessors!
  They usually achieve this by locking the data bus.

Example: Semaphores on SMP
  Counter s: available resources
  Binary semaphores with TAS
    Spinning (busy wait):   Try: TAS s; JMP Try; [CS]
    Blocking:               TAS s; JMP Queuing; [CS]

Example: Semaphores on SMP
  Counter s: available resources
  Generic semaphores with CAS
    P(S): { S := S - 1 } IF S < 0 THEN jump queuing END
    V(S): { S := S + 1 } IF S <= 0 THEN jump dequeuing END
  P(s), enter CS:
          LOAD R1 ← s
    TryP: MOVE R1 → R2
          DEC R2
          CAS R1, R2, s
          BNE TryP
          CMP R2, 0
          BN Queuing
    [CS]
  V(s), exit CS:
          LOAD R1 ← s
    TryV: MOVE R1 → R2
          INC R2
          CAS R1, R2, s
          BNE TryV
          CMP R2, 0
          BNP Dequeuing

Spin-Locks: the Bluebottle/i386 way (simplified version)
  PROCEDURE AcquireSpinTimeout(VAR locked: BOOLEAN);
  CODE {SYSTEM.i386}
    MOV EBX, locked[EBP]   ; EBX := ADR(locked)
    MOV AL, 1              ; AL := 1
    CLI                    ; switch interrupts off before acquiring lock
  test:
    XCHG [EBX], AL         ; set and read the lock atomically; LOCK prefix implicit
    CMP AL, 1              ; was locked?
    JE test                ; retry
    ..
  END AcquireSpinTimeout;
  Legend: CLI = clear interrupt flag; XCHG = exchange; EBP = base pointer; EBX = base register; AL = 8-bit part of the EAX accumulator

Active Objects in Active Oberon
  Z = OBJECT
    VAR myT: T; i: INTEGER;                  (* state *)
    PROCEDURE & NEW (t: T);                  (* initializer *)
    BEGIN myT := t END NEW;
    PROCEDURE P (u: U; VAR v: V);            (* method *)
    BEGIN { EXCLUSIVE } i := 1 END P;        (* mutual exclusion *)
  BEGIN { ACTIVE }                           (* object activity *)
    BEGIN { EXCLUSIVE } AWAIT (i > 0); END   (* condition *)
  END Z;

Active Oberon Runtime Structures NIL CPUs Running 1 Lock Queue Wait Queue Awaiting Object Awaiting Assertion 2 Ready Queue Ready Ready System-Software WS 04/05

Active Oberon Implementation NIL 7 NEW Create object; Create process; Set to ready Running 2 3 Awaiting Object 6 Awaiting Assertion Preempt Set to ready; Run next ready 6 1 1 Ready 7 END Run next ready 4 5 1 NIL System-Software WS 04/05

Active Oberon Implementation
  Enter Monitor
    IF monitor lock set THEN
      Put me in monitor obj wait list; Run next ready
    ELSE set monitor lock
    END
  Exit Monitor
    Find first asserted x in wait list;
    IF x found THEN set x to ready
    ELSE
      Find first x in obj wait list;
      IF x found THEN set x to ready ELSE clear monitor lock END
    END;
    Run next ready
  [Figure: state diagram (NIL, Ready, Running, Awaiting Object, Awaiting Assertion) with numbered transitions 1 to 7]

Active Oberon Implementation NIL 7 Running 2 3 3 AWAIT Put me in monitor assn wait list; Call Exit monitor Awaiting Object 6 Awaiting Assertion 1 Ready 4 5 NIL System-Software WS 04/05

Case Study: Windows CE 3.0
  Real-time constraints
    Reaction time on events
    Execution time
  Threads with priorities and time quanta
    Priorities: 0 (high), …, 255 (low)
    Time quanta in ms; default 100 ms; 0 → no quantum
  Single processor
  [Figure: timeline of threads with priorities p and q < p, with a switch at the end of the quantum]

Case Study: Windows CE 3.0 Interrupt Handling
  ISR (Interrupt Service Routine)
    1st-level handling
    kernel mode, uses kernel stack
    installed at boot time
    creates event on demand
    preempted by ISR with higher priority
  IST (Interrupt Service Thread)
    2nd-level handling
    user mode
    awaits events
  [Figure: IRQ → ISR (NK.EXE, kernel mode) → event → IST (user mode)]

Case Study: Windows CE 3.0 Synchronization on common resources: Critical sections: enter, leave operations Semaphores and mutexes (binary semaphores) Synchronization is performed with system/library calls (they are not part of a language). Priority inversion avoidance priority inheritance (thread inherits priority of task wanting the resource) CS [ ] [ ] [ ] System-Software WS 04/05

Case Study: Java Activities are mapped to threads (no processes) Synchronization in the language locks signals Threads provided by the library Scheduling depends on the JVM System-Software WS 04/05

Case Study: Java
  public class MyThread extends Thread {
    public void run() {
      System.out.println("Running");
    }
    public static void main(String[] arguments) {
      MyThread t = new MyThread();
      t.start();
    }
  }

Case Study: Java
  public class MyThread implements Runnable {
    public void run() {
      System.out.println("Running");
    }
    public static void main(String[] arguments) {
      Thread t = new Thread(new MyThread());
      t.start();
    }
  }

Case Study: Java Protection with monitor-like objects with method granularity public synchronized void someMethod() with statement granularity synchronized(anObject) { ... } Synchronization with signals wait() (with optional time-out) notify() / notifyAll() (“send and continue” pattern) System-Software WS 04/05

Case Study: Java
  private Object o;

  public synchronized void consume() {
    while (o == null) {
      try { wait(); } catch (InterruptedException e) {}
    }
    use(o);
    o = null;
    notifyAll();
  }

  public synchronized void produce(Object p) {
    while (o != null) {
      try { wait(); } catch (InterruptedException e) {}
    }
    o = p;
    notifyAll();
  }

Case Study: POSIX Threads Standard interface for threads in C Mostly UNIX, possible on Windows Provided by a library (libpthread) and not part of the language. IEEE POSIX 1003.1c standard (1995) Various implementations (both user and kernel level) System-Software WS 04/05

Case Study: POSIX Threads
  #include <pthread.h>

  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  void *run(void *arg) {
    pthread_mutex_lock(&m);
    /* critical section */
    pthread_mutex_unlock(&m);
    pthread_exit(NULL);
  }

  int main(int argc, char *argv[]) {
    pthread_t t;
    pthread_create(&t, NULL, run, NULL);
    pthread_join(t, NULL);   /* wait for the thread to finish */
    return 0;
  }

File Systems

File Systems - Overview Hardware File abstraction File organization File systems Oberon Unix FAT Distributed file systems NFS AFS Special topics Error recovery ISAM B* Trees System-Software WS 04/05

Hardware: the ATA Bus
  Standards
    ATA / IDE (1986): Advanced Technology Attachment / Integrated Drive Electronics
    ATA-2 / EIDE
    ATA-4 / ATAPI: ATA Packet Interface (SCSI command set)
    ATA-5: UDMA 66
    ATA-6: UDMA 100
    ATA-7: UDMA 133; SATA
  Bus with 2 devices (master / slave)
  Low-level interface
    head / cylinder / sector
    support for LBA (logical block addressing)
  PIO mode: read byte by byte through a hardware port
  DMA mode: use DMA transfer

Hardware: the SCSI Bus SCSI: Small Computer Systems Interface SCSI-2 Fast SCSI Wide SCSI SCSI-3 Bus with 8 devices wide: 16 / 32 devices bus arbitration disconnected mode Device kinds direct access CD-ROM ... Block-oriented access read-block, write-block Transfer mode selection asynchronous (hand-shake) synchronous (period / offset) System-Software WS 04/05

Hardware: Hard Disk Organization Addressing cylinder (c) head (h) sector (s) Addressing sector (c, h, s) block (LBA) track (cylinder) sector surface (head) rotation axis System-Software WS 04/05

Hardware: Example
  Current disk example: ATA-100
    250 GB
    512 bytes per sector (488·10^6 sectors)
    8 MB cache
    8.9 ms average seek time
    7200 rpm

Hardware: Hard Disk Improvements Interleaving optimize sequential sector access Read-ahead Caching Sector defect management cylinder 1 5 4 2 6 7 3 System-Software WS 04/05

Hardware: Disk Scheduling Disk controllers have a queue of pending requests: type: read or write block number: translated into the (h,c,s)-tuple memory address (where to copy from and to) amount to be transferred (byte or block count) System-Software WS 04/05

Hardware: Disk Scheduling Performance: minimize head movements, maximize throughput Scheduling is now in the hardware First-come, first-served (FCFS) Shortest-seek-time-first (SSTF) SCAN (elevator) & C-SCAN LOOK & C-LOOK System-Software WS 04/05

Hardware: Disk Scheduling Example (head position, track number): queue = 31, 72, 4, 18, 147, 193, 199, 153, 114, 72 System-Software WS 04/05
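To make the comparison concrete, the total head movement for this queue can be computed for FCFS and SSTF. The sketch below assumes an initial head position of track 50, which is not given in the transcript; the function names are made up for the example.

```python
def fcfs(head, queue):
    """Total head movement when requests are served in arrival order."""
    total = 0
    for track in queue:
        total += abs(track - head)
        head = track
    return total

def sstf(head, queue):
    """Total head movement when the closest pending request is served next."""
    pending = list(queue)
    total = 0
    while pending:
        nearest = min(pending, key=lambda t: abs(t - head))
        total += abs(nearest - head)
        head = nearest
        pending.remove(nearest)
    return total

queue = [31, 72, 4, 18, 147, 193, 199, 153, 114, 72]
start = 50  # assumed initial head position, not taken from the slide
fcfs_total = fcfs(start, queue)
sstf_total = sstf(start, queue)
```

Under this assumption FCFS moves the head across 450 tracks and SSTF across 241, which shows why the order of service matters.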


Abstractions Block: array of sectors some systems call them “clusters” user configured reduces address space increases access speed causes internal fragmentation Disk: array of sectors File: stream of bytes sequential access random access stored on disk mapping byte to block block allocation management System-Software WS 04/05

Abstraction Layers Abstractions Implementations FAT File System Oberon OpenFile, WriteFile, ReadFile, SeekFile, CloseFile ISO 9660 Volume ext3 ReadBlock, WriteBlock AllocateBlock, FreeBlock NTFS Disk ATA driver ReadSector, WriteSector SCSI driver System-Software WS 04/05

File Organization How can we map groups of blocks into files? How do we manage free space? How can I jump to a certain location? Operation: read n bytes at position p. System-Software WS 04/05

File Organization: Contiguous Allocation File is a group of contiguous blocks Simple management Fast transfers IBM MVS (mainframe) start length System-Software WS 04/05

File Organization: Contiguous Allocation external fragmentation allocation how much space does a file need? first fit, best fit, …? file growth (error? move? extensions?) preallocation: internal fragmentation start length System-Software WS 04/05

File Organization: Linked Allocation File is a linked list of blocks no external fragmentation no growth problems Problems sequential files only (positioning requires traversal) space for pointers (1TB, 5B addr., 1% with 512B blocks) reliability (lost pointers) start System-Software WS 04/05

File Organization: Linked Allocation
  Clusters: series of contiguous blocks
    faster (fewer jumps)
    less space wasted for pointers
    internal fragmentation

File Organization: Linked Allocation Pointer tables the list of pointers is stored in a separate table can be cached usually is stored twice (reliability) FAT (MS-DOS, OS/2, Windows, solid-state memory) start System-Software WS 04/05

File Organization: Indexed Allocation Index with block addresses Fast access for random-access files No external fragmentation Problems high management overhead limited file size (depending on the index structure) pointer overhead file System-Software WS 04/05

File Organization: Indexed Allocation Variation: linked list of indexes Advantage: no file size limitation Disadvantage: Index lookup requires sequential traversal of index list file System-Software WS 04/05

File Organization: Indexed Allocation multi-level indexes (index of indexes) UNIX Advantage: fast index lookup Disadvantage: limited file size file System-Software WS 04/05

File Organization: Indexed Allocation
  Example: blocks of 2 KB, addresses of 4 B (512 entries per index block)
    First-level index block: 512 entries · 2 KB = 1 MB
    Second-level index block: 512 · 512 entries · 2 KB = 0.5 GB
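The arithmetic behind this example, written out as a short calculation (the variable names are chosen for this illustration):

```python
BLOCK_SIZE = 2 * 1024   # 2 KB blocks, as in the example
ADDRESS_SIZE = 4        # 4-byte block addresses

# Number of block addresses that fit into one index block.
entries = BLOCK_SIZE // ADDRESS_SIZE

# Bytes reachable through one first-level index block.
first_level = entries * BLOCK_SIZE

# Bytes reachable through a second-level index block (index of indexes).
second_level = entries * entries * BLOCK_SIZE
```

This gives 512 entries per index block, 1 MB through one first-level index block and 512 MB (0.5 GB) through one second-level index block.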

Free Space Management
  Bitmap (e.g., HFS): bit vector to mark free blocks
    simple
    needs caching
  Linked lists: list of free blocks (similar to linked allocation)
  Grouping: free blocks contain n addresses of free blocks (similar to multilevel indexing)
  Counting: list of 2-tuples (start, length) describing runs of free blocks

Case Study: Oberon File System Disk module: controller driver block management FileDir module: maps files to locations implemented with B-trees garbage collection (files) the directory is the root set anonymous (nonregistered) files are collected Files module: allows user operations (read, create, write, …) access is performed through riders  Files FileDir Disk System-Software WS 04/05

Case Study: Oberon File System Characteristics Block size = 1KB File organization multilevel index: 64 direct 12 1st level indirect 672 data bytes in file header Block allocation allocation table created at boot-time (partition GC) no collection at run-time (partition fills up!) designed to optimize small files System-Software WS 04/05

Case Study: Oberon File System Block = 1KB d d d d 64 blocks d 1 d d d i1 d d d d 63 i2 d d d d d d 75 i1 d d (672B) 12 index blocks with 256 data blocks each System-Software WS 04/05

Case Study: Oberon File System Free block management: bitmap Garbage collection at startup 11111111111111111111111111111111 8 16 24 startup / GC 11010010011110111101110100011100 8 16 24 allocate 16,17 11010010011110110001110100011100 8 16 24 allocate 19 11010010011110110000110100011100 8 16 24 System-Software WS 04/05
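The bit strings above can be replayed with a tiny bitmap allocator. This sketch reads the slide's bitmap as 1 = free and 0 = used (allocation clears a bit) and keeps the bitmap as a string for readability; the function names are hypothetical and unrelated to the actual Oberon Disk module.

```python
def allocate(bitmap, block):
    """Mark `block` as used in a free-block bitmap where '1' means free."""
    if bitmap[block] != "1":
        raise ValueError(f"block {block} is not free")
    return bitmap[:block] + "0" + bitmap[block + 1:]

def first_free(bitmap):
    """Index of the first free block, or -1 if none is left."""
    return bitmap.find("1")

after_gc = "11010010011110111101110100011100"   # state after startup / GC
step1 = allocate(allocate(after_gc, 16), 17)    # allocate 16, 17
step2 = allocate(step1, 19)                     # allocate 19
```

A real implementation would pack the bits into machine words and cache the bitmap in memory, but the state transitions are the same as in the three bit strings shown.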

Case Study: Oberon File System Internals “Rider”: current read or write position Buffer (cache) for consistency (each file sees the write operations on it) File Handle Rider f R R R “Hint” Buffer f System-Software WS 04/05

Case Study: Oberon RAM Disk
  File = POINTER TO Header;
  Index = POINTER TO Sector;
  Rider = RECORD
    eof: BOOLEAN;
    file: File;
    pos: LONGINT;
    adr: LONGINT
  END;
  Header = RECORD
    mark: LONGINT;
    name: FileDir.Name;
    len, time, date: LONGINT;
    ext: ARRAY 12 OF Index;
    sec: ARRAY 64 OF SectorTable
  END;
  Layout: the header holds the primary sector table (sec), which points to sectors 0 - 63, and the ext table; index sector 0 points to sectors 64 - 319, index sector 1 points to sectors 320 - 575

Case Study: Oberon RAM Disk
  (* SS = Sector Size, STS = Sector Table Size, XS = Index Size *)
  PROCEDURE Read(VAR r: Rider; VAR x: SYSTEM.BYTE);
    VAR m: INTEGER;
  BEGIN
    IF r.pos < r.file.len THEN
      SYSTEM.GET(r.adr, x); INC(r.adr); INC(r.pos);
      IF r.adr MOD SS = 0 THEN (* end of sector *)
        m := SHORT(r.pos DIV SS);
        IF m < STS THEN r.adr := r.file.sec[m]
        ELSE r.adr := r.file.ext[(m-STS) DIV XS].x[(m-STS) MOD XS]
        END
      END
    ELSE x := 0X; r.eof := TRUE
    END
  END Read;

Case Study: Oberon RAM Disk (overwrite)
  PROCEDURE Write(VAR r: Rider; x: SYSTEM.BYTE);
    VAR k, m, n: INTEGER; ix: LONGINT;
  BEGIN
    IF r.pos < r.file.len THEN
      m := SHORT(r.pos DIV SS); INC(r.pos);
      IF m < STS THEN r.adr := r.file.sec[m]
      ELSE r.adr := r.file.ext[(m-STS) DIV XS].x[(m-STS) MOD XS]
      END
    ....
    END;
    SYSTEM.PUT(r.adr, x); INC(r.adr);
  END Write;

Case Study: Oberon RAM Disk (expand)
  IF r.pos < r.file.len THEN ....
  ELSE
    IF r.adr MOD SS = 0 THEN
      m := SHORT(r.pos DIV SS);
      IF m < STS THEN
        Kernel.AllocSector(0, r.adr); r.file.sec[m] := r.adr
      ELSE
        n := (m-STS) DIV XS; k := (m-STS) MOD XS;
        IF k = 0 THEN
          Kernel.AllocSector(0, ix); r.file.ext[n] := SYSTEM.VAL(Index, ix)
        END;
        Kernel.AllocSector(0, r.adr); r.file.ext[n].x[k] := r.adr
      END
    END;
    INC(r.pos); r.file.len := r.pos
  END;
  SYSTEM.PUT(r.adr, x); INC(r.adr);

Case Study: UNIX, inodes File system: files and directories (files with a special content) A file is represented by an inode Inode: file owner file type regular / directory / special access permissions access time reference count (links) table of contents file size Inode table of contents 10 (12) direct blocks 1 indirect block 1 double indirect block 1 triple indirect block System-Software WS 04/05

Case Study: UNIX, inodes type access refc info 1 i1 d i1 d i1 d 10 11 12 i2 i1 d i2 i1 d i2 i1 d inode i3 i2 i1 d i3 i2 i1 d i3 i2 i1 d System-Software WS 04/05
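How a logical block number is routed through the direct, single, double and triple indirect blocks can be sketched as follows. The parameters are assumptions for illustration only: 10 direct blocks as on the slide, and 256 addresses per index block (for instance 1 KB blocks with 4-byte addresses). The function is named after bmap but is not the kernel routine.

```python
def bmap(block_no, n_direct=10, per_index=256):
    """Map a logical block number to (level, indexes into the index blocks)."""
    if block_no < n_direct:
        return ("direct", (block_no,))
    block_no -= n_direct
    if block_no < per_index:
        return ("single", (block_no,))
    block_no -= per_index
    if block_no < per_index ** 2:
        return ("double", divmod(block_no, per_index))
    block_no -= per_index ** 2
    if block_no < per_index ** 3:
        top, rest = divmod(block_no, per_index ** 2)
        return ("triple", (top,) + divmod(rest, per_index))
    raise ValueError("block number beyond the maximum file size")
```

For example, bmap(9) stays within the direct blocks, bmap(10) is the first entry of the single indirect block, and bmap(10 + 256) is the first entry reached through the double indirect block.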

Case Study: UNIX, directories
  Directories are normal files with a special content
  The data part contains a list of (inode number, name) entries
  Every directory has two special entries
    .   the directory itself
    ..  the parent directory

Case Study: UNIX, inodes type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 type: file blocks: 42, 103 owner: root ref count: 1 inodes disk block block 132 block 406 block 42 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 .tcshrc 6 mbox data block 103 data inode # name System-Software WS 04/05

Case Study: UNIX, soft and hard links
  Hard links
    two directory entries with the same inode number
    each file has a reference counter
    example: the entries file and hardlink both refer to inode 42
  Soft links
    the directory entry points to a special file with the path of the linked file
    example: file is inode 42; softlink is inode 43, a special file containing the path of file

Case Study: UNIX, hard links inode 2 inode 3 inode 5 type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 type: file blocks: 42, 103 owner: root ref count: 2 inodes disk block block 132 block 406 block 42 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 mails 5 mbox data block 103 data System-Software WS 04/05

Case Study: UNIX, soft links inode 2 inode 3 type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 inode 5 block 42 type: file blocks: 42 owner: root ref count: 1 data block 132 block 406 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 mbox 6 mails inode 6 block 43 type: file blocks: 43 owner: root ref count: 1 /root/mbox System-Software WS 04/05

Case Study: UNIX, Volume Layout A volume (partition) contains boot block bootstrap code super block size max file free space … inodes data blocks boot block super block inode list data blocks System-Software WS 04/05

Case Study: UNIX, Functions Core functions bread read block bwrite write block iget get inode from disk iput put inode to disk bmap map (inode, offset) to disk block namei convert path name to inode System-Software WS 04/05

Case Study: UNIX, namei
  namei (path)
    if (absolute path) inode = root;
    else inode = current directory inode;
    while (more path to process) {
      read directory (inode);
      if match(directory, name component) {
        inode = directory[name component];
        iget(inode);
      } else {
        return no inode;
      }
    }
    return inode;
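The same lookup can be tried out on the directory contents from the earlier inode example (inode 2 is /, inode 3 is /root/). This is a toy model: directories are plain dictionaries held in memory, and the names directories and ROOT_INODE are made up for the sketch.

```python
ROOT_INODE = 2

# Directory contents keyed by inode number, following the inode example.
directories = {
    2: {".": 2, "..": 2, "bin": 4, "root": 3},
    3: {".": 3, "..": 2, ".tcshrc": 5, "mbox": 6},
}

def namei(path, cwd=ROOT_INODE):
    """Resolve a path name to an inode number, or None if it does not exist."""
    inode = ROOT_INODE if path.startswith("/") else cwd
    for component in path.split("/"):
        if not component:
            continue  # skip empty parts produced by "/" or "//"
        entries = directories.get(inode)
        if entries is None or component not in entries:
            return None  # not a directory, or no such name
        inode = entries[component]
    return inode
```

For instance, namei("/root/mbox") yields inode 6, and the relative path "mbox" resolves to the same inode when the current directory is inode 3.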

Case Study: FAT FATnn: nn corresponds to the FAT size in bits FAT12, FAT16, FAT32 used by MS-DOS and Windows for disks and floppies Volume Layout boot block FAT1 FAT2 root directory data System-Software WS 04/05

Case Study: FAT, Example 1 2 EOF 3 4 12 5 FREE 6 9 7 BAD 8 11 10 13 … disk size 6 9 11 10 File 1: 4 12 File 2: 8 3 File 3: System-Software WS 04/05

Case Study: FAT, Directory Information about files is kept in the directory File name (8) Extension (3) A D V S H R Reserved (10) Time (2) Date (2) First block (2) File size (4) System-Software WS 04/05

Case Study: FAT, Max. Partition Size
  Block size   FAT-12   FAT-16    FAT-32
  0.5 KB       2 MB
  1 KB         4 MB
  2 KB         8 MB     128 MB
  4 KB         16 MB    256 MB    1 TB
  8 KB                  512 MB    2 TB
  16 KB                 1024 MB
  32 KB                 2048 MB

File System Mounting More than one volume mounted in the same directory tree. afs ethz.ch home corti / usr bin floppy mnt dos cd System-Software WS 04/05

Virtual File System Support for several file systems disk based network special VFS: unifies the system calls Mirrors the traditional UNIX file system model Applications VFS ext3 FAT NFS AFS proc pts ext3 FAT NFS AFS proc pts System-Software WS 04/05

File System Mounting
  Each file system type has a method table
  System calls are indirect function calls through the method table
  Common interface (open, write, readdir, lock, …)
  Each file is associated with a method table

File System Mounting: Special Files Devices disks memory USB devices serial ports … Kernel communication (e.g., proc) Uniform interface (open, close, read, write) Uniform protection (user, groups) System-Software WS 04/05

File Systems: Protection Restrict: access (who), operations (what), management FAT: flags in the directory e.g., read only execution based on name UNIX: restrictions in inodes based on users and groups operations: read, write, execute directories: manage files not so flexible VMS: access lists list of users and rights per file System-Software WS 04/05

Distributed File Systems

Distributed File Systems (DFS) Clients, servers and storage are dispersed among machines in a distributed system. Client Client Client Server Server Server Client Client Client Server System-Software WS 04/05

Overview Naming (dynamic): location transparency: file name does not reveal the file location location independence: file name does not change when storage is moved Caching (efficiency) write-through delayed-write write-on-close Consistency client-initiated: poll server for changes server-initiated: notify clients System-Software WS 04/05

Naming
  Simple approaches: a file is identified by a (host, path) pair
    Ibis (host:path)
    SMB (\\host\path)
  Transparent: remote directories are mounted in the local file system
    not uniform (the mount point is not defined)
    NFS (/mnt/home, /home/)
    SMB (\\host\path mounted on Z:)
  Global name structure: uniform and transparent naming
    AFS (/afs/cell/path)

Caching Reduces network and disk load Consistency problems Granularity: How much? Big/small chunks of data? Entire files? Big: +hit ratio, +hit penalty, +consistency problems Location: memory: +diskless stations, +speed disk: +cheaper, +persistent hybrid Space consumption on the clients System-Software WS 04/05

Caching Policies: write-through: +reliability, -performance (cache is effective only for read operations) delayed-write: +write speed, +unnecessary writes eliminated, -reliability write when the cache is full (+performance, -long time in the cache) regular intervals write-on-close System-Software WS 04/05

Consistency Is my cached copy up-to-date? Client-initiated approach: the client performs validity checks when? open/fixed intervals/every access Server-initiated approach: the server keeps track of cached files (parts) notifies the clients when conflicts are detected should the server allow conflicts? System-Software WS 04/05

Stateless and Stateful Servers Stateful: the server keeps track of each accessed file session IDs (e.g., identifying an inode on the server) fast simple requests caches fewer disk accesses read ahead volatile server crash: rebuild structures (recovery protocol) client crash: orphan detection and elimination System-Software WS 04/05

Stateless and Stateful Servers Stateless: each request is self-contained request: file and position complex requests need for uniform low-level naming scheme (to avoid name translations) need idempotent operations (same results if repeated) absolute byte counts No locking possible System-Software WS 04/05

File Replication A file can be present on failure independent machines Naming scheme manages the mapping same high-level name different low-level names Transparency Consistency System-Software WS 04/05

Distributed File-Systems (mainstream) NFS: Network File System (Sun) AFS: Andrew File System (CMU) SMB: Server Message Block (Microsoft) NCFS: Network Computer FS (Oberon) System-Software WS 04/05

Network File System (NFS) UNIX - based (Sun) mount file system from another machine into local directory stateless (no open/close) uses UDP to communicate based on RPC and XDR (External Data Representation) every operation is a remote procedure call known problems: no caching no disconnected mode efficiency security: IP based System-Software WS 04/05

NFS: Example. The server exports /home (exports entry: /home/ client(rw)); the client runs
  mount -t nfs server:/home /home
[Figure: directory trees of server and client (/, etc, home); after the mount, the server's /home with the directories reali and corti is visible under /home on the client.]

NFS. No special servers (each machine can act as a server and as a client). Cascading mounts are allowed:
  mount -t nfs server1:/home /home
  mount -t nfs server2:/projects/corti /home/corti/projects
Limited scalability (limited number of exports).

NFS: Stateless Protocol. Each request contains a unique file identifier and an absolute offset. No concurrency control (locking has to be performed by the applications). Committed information is assumed to be on disk (the server cannot cache writes).

Network File System (NFS) [Figure: layered architecture. Client: system call layer, virtual file system layer, local file system and NFS client, RPC / XDR. Server: virtual file system layer, NFS server and local file system, RPC / XDR. Client and server are connected by the network (UDP).]

Remote Procedure Invocation: Overview. Problem: send structured information from A to B; A and B may have different memory layouts (byte-order problems). How is 0x1234 (2 bytes) represented in memory? Big-endian (MSB before LSB, network byte order): 12 34; IBM, Motorola, SPARC. Little-endian (LSB before MSB, "little end first"): 34 12; VAX, Intel.
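
The two layouts of 0x1234 can be reproduced with `java.nio.ByteBuffer`, which lets the byte order be chosen explicitly. This is only an illustration; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Returns the two bytes of `value` in memory order for the given byte order.
    static String hexBytes(short value, ByteOrder order) {
        byte[] bytes = ByteBuffer.allocate(2).order(order).putShort(value).array();
        return String.format("%02x %02x", bytes[0], bytes[1]);
    }

    public static void main(String[] args) {
        System.out.println("big-endian:    " + hexBytes((short) 0x1234, ByteOrder.BIG_ENDIAN));
        System.out.println("little-endian: " + hexBytes((short) 0x1234, ByteOrder.LITTLE_ENDIAN));
    }
}
```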

Marshalling / Serialization. Marshalling: packing one or more data items into a buffer using a standard representation. Presentation layer (OSI). RPC + XDR (Sun): RFC 1014, June 1987; RFC 1057, June 1988. IIOP / CORBA (OMG): V2.0, February 1997; V3.0, August 2002. SOAP / XML (W3C): V1.1, May 2000. XDR type system: [unsigned] integer (32-bit), [unsigned] hyper integer (64-bit), enumeration (unsigned int), boolean (enum), float / double (IEEE 32/64-bit), opaque, string, array (fixed and variable size), structure, union, void.
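
A rough sketch of marshalling in the style of XDR, assuming only two of its rules: integers are 4 bytes in big-endian order, and a string is a 4-byte length followed by the bytes, zero-padded to a multiple of 4. The class `XdrSketch` and its methods are illustrative and not part of any RPC library.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class XdrSketch {
    // 32-bit integer: DataOutputStream.writeInt already writes big-endian.
    static void putInt(DataOutputStream out, int value) throws IOException {
        out.writeInt(value);
    }

    // String: length, then the bytes, then zero padding up to a multiple of 4.
    static void putString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.writeInt(bytes.length);
        out.write(bytes);
        int padding = (4 - bytes.length % 4) % 4;
        for (int i = 0; i < padding; i++) {
            out.writeByte(0);
        }
    }

    // Packs one integer and one string into a single buffer.
    static byte[] marshal(int id, String name) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            putInt(out, id);
            putString(out, name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] message = marshal(1, "hello");
        StringBuilder hex = new StringBuilder();
        for (byte b : message) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

Because both sides agree on this representation, the receiver can decode the buffer regardless of its own byte order or memory layout.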

RPC Protocol. Remote procedure call: marshalling of procedure parameters, message format, authentication, naming. Client, calling procedure P(a, b, c): pack parameters, send message to server, await response, unpack response. Server: unpack parameters, find procedure, invoke P(a, b, c), pack response, send response.
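
The client and server steps above can be sketched in-process. The procedure number `PROC_ADD`, the message layout and the direct method call that stands in for the network are all assumptions made for this illustration.

```java
import java.nio.ByteBuffer;

class RpcSketch {
    static final int PROC_ADD = 1;   // hypothetical procedure number

    // Client stub: pack parameters, "send", await and unpack the response.
    static int add(int a, int b) {
        ByteBuffer request = ByteBuffer.allocate(12);
        request.putInt(PROC_ADD).putInt(a).putInt(b);
        byte[] response = serverDispatch(request.array()); // stands in for send + await
        return ByteBuffer.wrap(response).getInt();
    }

    // Server: unpack parameters, find the procedure, invoke it, pack the response.
    static byte[] serverDispatch(byte[] message) {
        ByteBuffer in = ByteBuffer.wrap(message);
        int procedure = in.getInt();
        switch (procedure) {
            case PROC_ADD: {
                int a = in.getInt();
                int b = in.getInt();
                return ByteBuffer.allocate(4).putInt(a + b).array();
            }
            default:
                throw new IllegalArgumentException("unknown procedure " + procedure);
        }
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));
    }
}
```

To the caller, `add` looks like a local procedure; only the stub knows that the parameters travel as a byte message.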

NFS [Figure: client and server communicate through the RPC protocol; each lookup, read and write on the client becomes a lookup, read or write request to the server.]

NFS Efficiency. Stateless protocols are inherently slow. Caching: directory lookups, file blocks (data), file attributes (inodes); read-ahead; delayed write. Trade-off between speed and consistency: it is possible that two machines see different data.

NFS: Security. Exports based on IP addresses: low security, low granularity. Data is not encrypted. Permissions based on user and group ID: uniform naming needed (e.g., NIS).

Andrew File System (AFS). 1983, CMU (later IBM, now open source). Scalable (>5000 workstations): the network is divided into clusters (cells). Client/user mobility (files are accessible from everywhere). Security: encrypted communication (Kerberos). Protection: access control lists. Heterogeneity: clear interface to the server.

Andrew File System (AFS). A server provides a cell. World-wide addressing scheme (name, cell). The client caches a whole file; synchronization with the server happens on file open and close. AFS is efficient: low network overhead. Stateful: consistency is implemented with callbacks. Holding a callback means that the client is in sync with the server; on store, the server changes the callbacks.

AFS: Logical View [Figure: directory tree with a private space (usr, bin) and a shared space rooted at afs; mount points in the shared space lead to volumes.]

AFS: Physical View [Figure: clients and servers connected by a network and grouped into cells such as ethz.ch, epfl.ch and cmu.edu.]

AFS [Figure: client and server communicate through the RPC protocol only for open and close; read and write operate on the client's cache.]

AFS: Consistency. Interaction only when opening and closing files. Writes are not visible on other machines before a close. Clients assume that cached files are up-to-date. Servers keep track of caching by the clients (callbacks): clients are notified in case of changes.
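
A toy model of the callback idea, not the actual AFS protocol: the server remembers which clients cache a file, and a store by one client breaks the callbacks of the others, so they fetch again on their next open. All class and method names are invented for the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CallbackDemo {
    static final class Server {
        private final Map<String, String> files = new HashMap<>();
        private final Map<String, Set<Client>> callbacks = new HashMap<>();
        int fetches = 0;                         // how often a client had to contact the server to open

        String fetch(Client client, String name) {
            fetches++;
            callbacks.computeIfAbsent(name, k -> new HashSet<>()).add(client);
            return files.getOrDefault(name, "");
        }

        void store(Client writer, String name, String data) {
            files.put(name, data);
            for (Client c : callbacks.getOrDefault(name, new HashSet<>())) {
                if (c != writer) {
                    c.breakCallback(name);       // notify every other caching client
                }
            }
            Set<Client> holders = new HashSet<>();
            holders.add(writer);                 // only the writer is still in sync
            callbacks.put(name, holders);
        }
    }

    static final class Client {
        private final Server server;
        private final Map<String, String> cache = new HashMap<>();

        Client(Server server) {
            this.server = server;
        }

        // Open: use the cached copy if the callback is still valid, otherwise fetch the whole file.
        String open(String name) {
            if (!cache.containsKey(name)) {
                cache.put(name, server.fetch(this, name));
            }
            return cache.get(name);
        }

        // Close after writing: the new contents become visible to others only now.
        void close(String name, String newContents) {
            cache.put(name, newContents);
            server.store(this, name, newContents);
        }

        void breakCallback(String name) {
            cache.remove(name);
        }
    }

    static String scenario() {
        Server server = new Server();
        Client a = new Client(server);
        Client b = new Client(server);
        a.open("f");                  // fetch 1
        b.open("f");                  // fetch 2
        a.open("f");                  // served from the cache, no server contact
        a.close("f", "v1");           // store breaks b's callback
        String seenByB = b.open("f"); // fetch 3
        return seenByB + "/" + server.fetches;
    }

    public static void main(String[] args) {
        System.out.println(scenario());
    }
}
```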

AFS: Kerberos. Kerberos (Cerberus: the three-headed dog guarding Hades). Authentication, accounting, audit. Needham-Schroeder shared-key protocol. Distributed. AFS: communication is encrypted.

AFS: Protection. Access lists:
  %> fs listacl thesis
  Access list for thesis is
  Normal rights:
    system:anyuser l
    trg rlidwk
    corti rlidwka
It is possible to allow (or deny) access to users or customized groups. Restriction on: read, write, lookup, insert, administer, lock and delete. Supports UNIX control bits.
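
The rights strings in the listing use one letter per right (r = read, l = lookup, i = insert, d = delete, w = write, k = lock, a = administer). A small sketch, with made-up class and enum names, that decodes such a string:

```java
import java.util.EnumSet;

class AclDemo {
    enum Right {
        READ('r'), LOOKUP('l'), INSERT('i'), DELETE('d'), WRITE('w'), LOCK('k'), ADMINISTER('a');

        final char letter;

        Right(char letter) {
            this.letter = letter;
        }
    }

    // Decodes a rights string such as "rlidwk" into a set of rights.
    static EnumSet<Right> parse(String rights) {
        EnumSet<Right> set = EnumSet.noneOf(Right.class);
        for (char ch : rights.toCharArray()) {
            boolean known = false;
            for (Right r : Right.values()) {
                if (r.letter == ch) {
                    set.add(r);
                    known = true;
                }
            }
            if (!known) {
                throw new IllegalArgumentException("unknown right: " + ch);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println("system:anyuser " + parse("l"));
        System.out.println("trg            " + parse("rlidwk"));
        System.out.println("corti          " + parse("rlidwka"));
    }
}
```

In the listing, only corti holds the administer right, i.e., only corti may change the access list itself.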

Network Fallacies. The Eight Fallacies of Distributed Computing (Peter Deutsch): (1) The network is reliable. (2) Latency is zero. (3) Bandwidth is infinite. (4) The network is secure. (5) The network topology doesn’t change. (6) There is one administrator. (7) Transport cost is zero. (8) The network is homogeneous.

General Principles (Satyanarayanan). From DFSs we learned the following lessons: move computations to the clients; use caching whenever possible; special files (e.g., temporary files) can be treated specially; build scalable systems; trust the fewest possible entities; batch work if possible.

Kernel Structure

Introduction. The kernel performs “dangerous” operations: page table mapping, scheduling. The kernel must be protected against malicious user code: access to other processes’ data, increasing its own processes’ priority. The kernel must have more rights than user code. Solution: distinguish between kernel mode and user mode; access the kernel through system calls; the system calls define the interface to the kernel.

Kernel Protection [Figure: applications in user space reach the kernel (drivers, memory manager, file systems) only through system calls.]

Kernel Protection. Means: hardware support, separate address spaces. Privileged instructions: supervisor mode. Separate address spaces: a user process has no access to kernel structures; memory / functions are accessed through symbolic names; the user has no access to hardware.

Kernel Protection. Privileged instructions in user mode generate a trap. Mode switch: interrupts, gated calls (user-generated software interrupt calls). Parameters: stack, registers. Examples: Intel x86: 4 protection levels (code/segment attribute), interrupt. PowerPC: 2 levels (CPU attribute), special instruction.

Linux System Calls (Intel). System calls are wrapped in libraries (e.g., libc). The library function writes the parameters in registers (up to 5 parameters) or on the stack (more than 5), writes the system call number in EAX, and calls int 0x80. The kernel jumps to the corresponding function in sys_call_table.

Linux System Calls. Examples:
  pid_t fork(void): creates a child process
  ssize_t write(int fd, const void *buf, size_t count): writes count bytes from buf to fd
  int kill(pid_t pid, int sig): sends a signal to a process
  int gettimeofday(struct timeval *tv, struct timezone *tz): gets the current time
  int open(const char *pathname, int flags): opens a file
  int ioctl(int d, int request, ...): manipulates special devices
  …

Windows System Calls. Layered system: a system call must be performed through a wrapper (NTDLL.DLL). The system call position in the KiSystemServiceTable is not known (it depends on the build). [Figure: the application calls WriteFile() in KERNEL32.DLL, which calls NtWriteFile() in NTDLL.DLL, which executes int 0x2e to reach the KiSystemServiceTable.]

Kernel Design: API vs. System Calls. Linux: system calls are clearly specified (POSIX standard); system calls do not change; about 100 calls. Windows: system calls are hidden; only the Win32 API is published; Win32 is the standard; “thousands” of API calls, still growing; some API calls are handled in user space. More than one API: POSIX, OS/2.

Protection and SMP. What happens when two processes (on two CPUs) enter kernel mode at the same time? Big kernel lock: not allowed (OpenBSD, NetBSD). Fine-grained locks in the kernel (FreeBSD 5, Linux 2.6). [Figure: processes on CPU 1 and CPU 2 both execute int 0x80.]

Kernel Structure. Monolithic kernel: big mess, no structure, one big block, fast; MS-DOS (no protection), original UNIX. Micro-kernel (AIX, OS X). Layered system: layer n uses functions from layer n-1; OS/2 (some degree of layering). Virtual machine: defines an artificial environment for programs. Client-server: tiny communication microkernel to access various services.

Monolithic Kernels [Figure: monolithic vs. micro-kernel structure. Both show user-level applications on top of scheduler, signal handling, file system, swapping and virtual memory, which in turn sit on top of terminal controllers, device drivers and memory controllers.]

Layered Systems. THE operating system. A layer uses only functions from the layers below. What goes where? Less efficient. Layers, top to bottom: user programs, I/O buffering, console drivers, memory management, CPU scheduling, hardware.

Virtual Machines. VM operating system (IBM). Slow and difficult to implement. Complete protection, no sharing of resources. Useful for development and research; compatibility. [Figure: several groups of processes, each running on its own virtual machine, on top of the hardware.]

Design: Kernel or User Space? Big monolithic kernel: fast (fewer switches), less protection. Examples: HTTP server in the Linux kernel, graphics routines in Windows. Modular kernels and micro-kernels: structured, more separation, code moved to user space, less efficient, more secure. Example: user-level drivers.

Virtual Machines. Machine specification in software: instruction set, memory layout, virtual devices, .... Examples: JVM (Java Virtual Machine); .NET / Mono; VMware (the specified machine is a whole PC; allows multiple PC environments on the same machine); IBM VM/370.

Case Study: JVM

Virtual Machines. What is a machine? It does something (... useful), is programmable, and is concrete (hardware). What is a virtual machine? A machine that is not concrete: a software emulation of a physical computing environment. Reality is somewhat fuzzy! Is a Pentium II a machine? Hardware and software are logically equivalent (A. Tanenbaum). [Figure: instructions pass through a decoder that turns them into operations Op1, Op2, Op3 for a RISC core.]

Virtual Machine, Intermediate Language. Pascal P-Code (1975): stack-based processor; strongly typed machine language; compiler with one front end and many back ends; UCSD implementations for Apple ][, PDP-11, Z80. Modula M-Code (1980): high code density; Lilith as microprogrammed virtual processor. JVM, Java Virtual Machine (1995): Write Once, Run Everywhere; interpreters, JIT compilers, HotSpot compiler. Microsoft .NET (2000): language interoperability.

JVM Case Study. Compiler (Java to bytecode). Interpreter, ahead-of-time compiler, JIT. Dynamic loading and linking. Exception handling. Memory management, garbage collection. OO model with single inheritance and interfaces. System classes to provide an OS-like implementation. Components: compiler, class loader, runtime system.

JVM: Type System. Primitive types: byte, short, int, long, float, double, char, reference; boolean is mapped to int. Object types: classes, interfaces, arrays. Single class inheritance, multiple interface implementation. Arrays: anonymous types, subclasses of java.lang.Object.

JVM: Java Byte-Code (t stands for the type prefix of the instruction). Memory access: tload / tstore, ttload / ttstore, tconst, getfield / putfield, getstatic / putstatic. Operations: tadd / tsub / tmul / tdiv, tshifts. Conversions: f2i / i2f / i2l / .... Stack manipulation: dup / dup2 / dup_x1 / .... Control: ifeq / ifne / iflt / ...., if_icmpeq / if_acmpeq, invokestatic, invokevirtual, invokeinterface, athrow, treturn. Allocation: new / newarray. Casting: checkcast / instanceof.

JVM: Java Byte-Code Example: bipush. Operation: push byte. Format: bipush byte. Forms: bipush = 16 (0x10). Operand stack: ... => ..., value. Description: the immediate byte is sign-extended to an int value. That value is pushed onto the operand stack.

JVM: Machine Organization. Virtual processor: stack machine, no registers, typed instructions, no memory addresses (only symbolic names). Runtime data areas: pc register; stack (locals, parameters, return values); heap; method area (code, runtime constant pool); native method stack.

JVM: Execution Example. Program: iload 5; iload 6; iadd; istore 4. [Figure: locals and operand stack over time. iload 5 pushes v5; iload 6 pushes v6; iadd replaces both with v5+v6; istore 4 pops the sum into local v4.]
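
The example can be replayed with a toy interpreter. This is only a sketch: the opcode constants and the instruction encoding (opcode, operand pairs) are invented here and have nothing to do with the real class file format.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MiniStackMachine {
    // Hypothetical opcode numbers for the three instructions used in the example.
    static final int ILOAD = 0, ISTORE = 1, IADD = 2;

    // Executes the program on an operand stack; each instruction is {opcode, operand}.
    static void run(int[][] program, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int[] instruction : program) {
            switch (instruction[0]) {
                case ILOAD:
                    stack.push(locals[instruction[1]]);   // push local variable
                    break;
                case ISTORE:
                    locals[instruction[1]] = stack.pop(); // pop into local variable
                    break;
                case IADD: {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right);             // replace two operands by their sum
                    break;
                }
                default:
                    throw new IllegalArgumentException("unknown opcode " + instruction[0]);
            }
        }
    }

    // iload 5; iload 6; iadd; istore 4 with v5 = 3 and v6 = 4.
    static int example() {
        int[] locals = new int[7];
        locals[5] = 3;
        locals[6] = 4;
        int[][] program = { {ILOAD, 5}, {ILOAD, 6}, {IADD, 0}, {ISTORE, 4} };
        run(program, locals);
        return locals[4];
    }

    public static void main(String[] args) {
        System.out.println("v4 = " + example());
    }
}
```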

JVM: Reflection. Load and manipulate unknown classes at runtime. java.lang.Class: getFields, getMethods, getConstructors. java.lang.reflect.Field: set, get, setInt, getInt, setFloat, getFloat, ..... java.lang.reflect.Method: getModifiers, invoke. java.lang.reflect.Constructor.

JVM: Reflection – Example
import java.lang.reflect.*;

public class ReflectionExample {
    public static void main(String args[]) {
        try {
            // load the class named on the command line
            Class<?> c = Class.forName(args[0]);
            // list all methods declared by that class
            Method m[] = c.getDeclaredMethods();
            for (int i = 0; i < m.length; i++) {
                System.out.println(m[i].toString());
            }
        } catch (Throwable e) {
            System.err.println(e);
        }
    }
}

JVM: Java Weaknesses. Size of the transitive closure of java.lang.Object (number of classes), by library version:
  1.1             47
  1.2             178
  1.3             180
  1.4             248
  5 (1.5)         280
  classpath 0.03  299
Example of how the closure grows: class Object { public String toString(); .... } class String { public String toUpperCase(Locale loc); .... } public final class Locale implements Serializable, Cloneable { .... }

JVM: Java Weaknesses. Class static initialization. A class T is initialized when: an instance of T is created (T tmp = new T();); a static method of T is invoked (T.staticMethod();); a nonconstant static field of T is used or assigned, i.e., a static field that is not both final and initialized with a compile-time constant (T.someField = 42;). Problem: circular dependencies in static initialization code. Class A: static { x = B.f(); }. Class B: static { y = A.f(); }.
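
A small sketch of such a cycle, using static field initializers instead of the static blocks on the slide (the classes and fields are hypothetical). Initialization of A triggers initialization of B, which reads A.x while A is still being initialized and therefore sees the default value 0.

```java
class InitCycleDemo {
    static class A {
        static int x = B.y + 1;   // forces initialization of B
    }

    static class B {
        static int y = A.x + 1;   // A is still being initialized here: A.x reads as 0
    }

    // Touches A first, then B, and returns the values both fields ended up with.
    static int[] observe() {
        return new int[] { A.x, B.y };
    }

    public static void main(String[] args) {
        int[] values = observe();
        System.out.println("A.x = " + values[0] + ", B.y = " + values[1]);
    }
}
```

When A is touched first, the result is A.x = 2 and B.y = 1; if some other code had touched B first, the values would be swapped. The outcome depends on which class happens to be used first.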

JVM: Java Weaknesses
interface Example {
    final static String labels[] = {"A", "B", "C"};
}
Hidden static initializer:
    labels = new String[3];
    labels[0] = "A";
    labels[1] = "B";
    labels[2] = "C";
Warning: in Java, final means write-once! Interfaces may contain code.

JVM: Memory Model. The JVM specification defines a memory model: it defines the relationship between variables and the underlying memory and is meant to guarantee the same behavior on every JVM. The compiler is allowed to reorder operations unless synchronized or volatile is specified.

JVM: Reordering. Reads and writes to ordinary variables can be reordered.
public class Reordering {
    int x = 0, y = 0;

    public void writer() {
        x = 1;
        y = 2;
    }

    public void reader() {
        int r1 = y;
        int r2 = x;   // without synchronization, r1 == 2 and r2 == 0 is a possible outcome
    }
}

JVM: Memory Model. synchronized: in addition to specifying a monitor, it defines a memory barrier: acquiring the lock implies an invalidation of the caches; releasing the lock implies a write-back of the caches. synchronized blocks on the same object are ordered. volatile: the order among accesses to volatile variables is guaranteed (but not among volatile and other variables).

JVM: Double Checked Lock. Singleton:
public class SomeClass {
    private Resource resource = null;

    // synchronized: correct, but every call pays for the lock
    public synchronized Resource getResource() {
        if (resource == null) {
            resource = new Resource();
        }
        return resource;
    }
}

JVM: Double Checked Lock. Double checked locking:
public class SomeClass {
    private Resource resource = null;

    public Resource getResource() {
        if (resource == null) {              // first check, without the lock
            synchronized (this) {
                if (resource == null) {      // second check, with the lock held
                    resource = new Resource();
                }
            }
        }
        return resource;
    }
}

JVM: Double Checked Lock. Thread 1 and Thread 2 execute the same code concurrently:
public class SomeClass {
    private Resource resource = null;

    public Resource getResource() {
        if (resource == null) {
            synchronized (this) {
                if (resource == null) {
                    resource = new Resource();
                }
            }
        }
        return resource;
    }
}
The object is instantiated but not yet initialized! While Thread 1 is still inside the synchronized block, Thread 2 can already see a non-null resource at the first check and return an object whose fields have not been written yet.

JVM: Immutable Objects are not Immutable. Immutable objects: all fields are primitives or references to immutable objects; all fields are final. Example (simplified): java.lang.String contains an array of characters, the length, and an offset; example: s = "abcd", length = 2, offset = 2, string = "cd". String s1 = "/usr/tmp"; String s2 = s1.substring(4); // should contain "/tmp". Sequence: s2 is instantiated, the fields are initialized (to 0), the array is copied, the fields are written by the constructor. What happens if instructions are reordered?

JVM: Reordering Volatile and Nonvolatile Stores. Volatile reads and writes are totally ordered among threads, but not with respect to normal variables. Example:
    volatile boolean initialized = false;
    SomeObject o = null;
Thread 1:
    o = new SomeObject();
    initialized = true;
Thread 2:
    while (!initialized) { sleep(); }
    o.field = 42;   // ? the write to o may not be visible yet

JVM: JSR 133. Java Community Process: Java memory model revision. Final means final. Volatile fields cannot be reordered.
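
Under the revised memory model (Java 5 and later), double checked locking becomes correct if the field is declared volatile. The following is a sketch with a hypothetical `Resource` class, not code from the lecture.

```java
class LazyResource {
    // Hypothetical resource; the final field stands for state set up by the constructor.
    static final class Resource {
        final int value = 42;
    }

    // volatile is essential: it keeps the reference from being published
    // before the constructor's writes are visible to other threads.
    private static volatile Resource resource;

    static Resource getResource() {
        Resource local = resource;               // one volatile read on the fast path
        if (local == null) {
            synchronized (LazyResource.class) {
                local = resource;
                if (local == null) {             // second check, with the lock held
                    local = new Resource();
                    resource = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(getResource() == getResource());
        System.out.println(getResource().value);
    }
}
```

A simpler alternative is to rely on class initialization (a static final field in a holder class), which the JVM already performs lazily and thread-safely.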

Java JVM: Execution. Interpreted (e.g., Sun JVM): bytecode instructions are interpreted sequentially; the VM emulates the Java Virtual Machine; slower; quick startup. Just-in-time compilers (e.g., Sun JVM, IBM Jikes RVM): bytecode is compiled to native code at load time (or later); code can be optimized (at compile time or later); quicker; slow startup. Ahead-of-time compilers (e.g., GCJ): bytecode is compiled to native code offline; quick execution; static compilation.

JVM: Loader – The Classfile Format. Class file structure: version, constant pool, flags, super class, interfaces, fields, methods, attributes. Constants: values (String / Integer / Float / ...), references (Field / Method / Class / ...). Attributes: ConstantValue, Code, Exceptions.

JVM: Class File Format
class HelloWorld {
    public static void printHello() {
        System.out.println("hello, world");
    }

    public static void main (String[] args) {
        HelloWorld myHello = new HelloWorld();
        myHello.printHello();
    }
}

JVM: Class File (Constant Pool)
  String       hello, world
  Class        HelloWorld
  Class        java/io/PrintStream
  Class        java/lang/Object
  Class        java/lang/System
  Methodref    HelloWorld.<init>()
  Methodref    java/lang/Object.<init>()
  Fieldref     java/io/PrintStream java/lang/System.out
  Methodref    HelloWorld.printHello()
  Methodref    java/io/PrintStream.println(java/lang/String )
  NameAndType  <init> ()V
  NameAndType  out Ljava/io/PrintStream;
  NameAndType  printHello ()V
  NameAndType  println (Ljava/lang/String;)V
  Unicode      ()V
  Unicode      (Ljava/lang/String;)V
  Unicode      ([Ljava/lang/String;)V
  Unicode      <init>
  Unicode      Code
  Unicode      ConstantValue
  Unicode      Exceptions
  Unicode      HelloWorld
  Unicode      HelloWorld.java
  Unicode      LineNumberTable
  Unicode      Ljava/io/PrintStream;
  Unicode      LocalVariables
  Unicode      SourceFile
  Unicode      hello, world
  Unicode      java/io/PrintStream
  Unicode      java/lang/Object
  Unicode      java/lang/System
  Unicode      main
  Unicode      out
  Unicode      printHello

JVM: Class File (Code)
Methods
0 <init>()
     0 ALOAD0
     1 INVOKESPECIAL [7] java/lang/Object.<init>()
     4 RETURN
1 PUBLIC STATIC main(java/lang/String [])
     0 NEW [2] HelloWorld
     3 DUP
     4 INVOKESPECIAL [6] HelloWorld.<init>()
     7 ASTORE1
     8 INVOKESTATIC [9] HelloWorld.printHello()
    11 RETURN
2 PUBLIC STATIC printHello()
     0 GETSTATIC [8] java/io/PrintStream java/lang/System.out
     3 LDC1 hello, world
     5 INVOKEVIRTUAL [10] java/io/PrintStream.println(java/lang/String )
     8 RETURN

JVM: Compilation – Pattern Expansion. Each byte code is translated according to fixed patterns: easy, but limited knowledge. Example (pseudocode):
switch (o) {
    case ICONST<n>: generate("push n"); PC++; break;
    case ILOAD<n>:  generate("push off_n[FP]"); PC++; break;
    case IADD:      generate("pop -> R1"); generate("pop -> R2");
                    generate("add R1, R2 -> R1"); generate("push R1"); PC++; break;
    …
}

JVM: Optimizing Pattern Expansion. Main idea: use an internal virtual stack; stack values are consts / fields / locals / array fields / registers / ...; flush the stack as late as possible. Example: iload 4; iload 5; iadd; istore 6. Emitted code: MOV EAX, off4[FP]; ADD EAX, off5[FP]; MOV off6[FP], EAX. [Figure: contents of the virtual stack (local4, local5, EAX) as iload 4, iload 5, iadd and istore 6 are processed.]

JVM: Compiler Comparison. Bytecode: iload_4; iload_5; iadd; istore_6.
Pattern expansion (5 instructions, 9 memory accesses):
    push off4[FP]
    push off5[FP]
    pop EAX
    add 0[SP], EAX
    pop off6[FP]
Optimized (3 instructions, 3 memory accesses):
    mov EAX, off4[FP]
    add EAX, off5[FP]
    mov off6[FP], EAX
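
Both translation schemes can be sketched as tiny code generators that emit assembly text. This is a simplification made for illustration: it knows only iload, iadd and istore, uses a single register (EAX), gives up when a second register would be needed, and ignores stores to a local that is still pending on the virtual stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PatternExpansionDemo {
    // Plain pattern expansion: every bytecode maps to a fixed instruction sequence.
    static List<String> expand(String[] bytecode) {
        List<String> out = new ArrayList<>();
        for (String instruction : bytecode) {
            String[] parts = instruction.split(" ");
            switch (parts[0]) {
                case "iload":  out.add("push off" + parts[1] + "[FP]"); break;
                case "iadd":   out.add("pop EAX"); out.add("add 0[SP], EAX"); break;
                case "istore": out.add("pop off" + parts[1] + "[FP]"); break;
                default: throw new IllegalArgumentException("unsupported: " + instruction);
            }
        }
        return out;
    }

    // Virtual stack: loads only record where the value lives; code is emitted as late as possible.
    static List<String> expandWithVirtualStack(String[] bytecode) {
        List<String> out = new ArrayList<>();
        Deque<String> virtualStack = new ArrayDeque<>();
        for (String instruction : bytecode) {
            String[] parts = instruction.split(" ");
            switch (parts[0]) {
                case "iload":
                    virtualStack.push("off" + parts[1] + "[FP]");   // nothing emitted yet
                    break;
                case "iadd": {
                    String right = virtualStack.pop();
                    String left = virtualStack.pop();
                    if (left.equals("EAX") && right.equals("EAX")) {
                        throw new UnsupportedOperationException("needs a second register");
                    }
                    if (right.equals("EAX")) {                      // addition commutes
                        right = left;
                        left = "EAX";
                    }
                    if (!left.equals("EAX")) {
                        requireFreeEax(virtualStack);
                        out.add("mov EAX, " + left);
                    }
                    out.add("add EAX, " + right);
                    virtualStack.push("EAX");
                    break;
                }
                case "istore": {
                    String source = virtualStack.pop();
                    if (!source.equals("EAX")) {
                        requireFreeEax(virtualStack);
                        out.add("mov EAX, " + source);
                    }
                    out.add("mov off" + parts[1] + "[FP], EAX");
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported: " + instruction);
            }
        }
        return out;
    }

    // The sketch has one register only; refuse to overwrite it while its value is still needed.
    private static void requireFreeEax(Deque<String> virtualStack) {
        if (virtualStack.contains("EAX")) {
            throw new UnsupportedOperationException("needs a second register");
        }
    }

    public static void main(String[] args) {
        String[] program = { "iload 4", "iload 5", "iadd", "istore 6" };
        expand(program).forEach(System.out::println);
        System.out.println("--");
        expandWithVirtualStack(program).forEach(System.out::println);
    }
}
```

For the example program, the first generator emits five instructions and the second three, matching the comparison above.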

Linking (General). A compiled program contains references to external code (libraries). After loading the code, the system needs to link the code to the library: identify the calls to external code; locate the callees (and load them if necessary); patch the loaded code. Two options: the code contains a list of call sites for each callee; or the calls to external code are jumps to a procedure linkage table, which is then patched (double indirection).

Linking (General) [Figure: code before and after linking with per-callee call-site lists; the unresolved jump targets are patched with the addresses of proc 0 and proc 1 (100 and 101).]

Linking (General) [Figure: code before and after linking with a procedure linkage table; the code jumps through the table entries &p0 and &p1, which are patched with the addresses of proc 0 and proc 1 (100 and 101).]

JVM: Linking. Bytecode interpreter: references to other objects are made through the JVM (e.g., invokevirtual, getfield, …). Native code (ahead-of-time compiler): static linking, classic native linking. JIT compiler: only some classes are compiled; calls could reference classes that are not yet loaded or compiled (delayed compilation); code instrumentation.

JVM: Methods and Fields Resolution. Methods and fields are accessed through special VM functions (e.g., invokevirtual, getfield, …). The parameters of the special call define the target. The parameters are indexes into the constant pool. The VM checks if the call is legal and if the target is present.

JVM: JIT – Linking and Instrumentation. Use code instrumentation to detect the first access of static fields and methods. Example: class A { .... ...B.x } class B { int x; }. The access B.x is instrumented as CheckClass(B); B.x, where CheckClass(B) performs: IF ~B.initialized THEN Initialize(B) END;

Compilation and Linking Overview (C) [Figure: C headers and C source are translated by the compiler into an object file; the linker combines the object files; the loader produces the loaded code.]

Compilation and Linking Overview (Oberon) [Figure: the compiler translates the Oberon source into an object file (object & symbol); the loader / linker loads it and links it against the already loaded modules.]

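Compilation and Linking Overview (Java) [Figure: the compiler translates the Java source into a class file; the class loader, with loader, linker and JIT compiler, turns class files into loaded classes; the reflection API gives access to the loaded classes.]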

Jaos. Jaos (Java on Active Object System) is a Java virtual machine for the Bluebottle system. Goals: implement a JVM for the Bluebottle system; show that the Bluebottle kernel is generic enough to support more than one system; interoperability between the Active Oberon and Java languages; interoperability between the Oberon System and the Java APIs.

Jaos (Interoperability Framework) [Figure: Oberon path: Oberon source, compiler, object & symbol file, Oberon loader and linker, loaded modules. Java path: class file, class file loader, JIT compiler, linker, loaded classes. Both paths produce metadata, which is used by the Oberon browser and the Java reflection API.]

JVM: Verification. The compiler generates “good” code .... that could be changed before reaching the JVM: need for verification. Verification makes the VM simpler (fewer run-time checks): no operand stack overflow; loads / stores are valid; VM types are correct; no pointer forging; no violation of access restrictions; objects are accessed as what they are (type); local variables are initialized before load; …

JVM: Verification. Pass 1 (Loading): class file version check; class file format check; class file complete. Pass 2 (Linking): final classes are not subclassed; every class has a superclass (except Object); constant pool references; constant pool names.

JVM: Verification. Pass 3 (Linking): byte-code verification, delayed for performance reasons. For each operation in the code (independent of the path): the operand stack size is the same; accessed variable types are correct; method parameters are appropriate; field assignments have correct types; opcode arguments are appropriate. Pass 4 (Run time). The first time a type is referenced: load types when referenced; check access visibility; class initialization. First member access: the member exists; the member type is the same as declared; the current method has the right to access the member.

JVM: Byte-Code Verification. Branch destinations must exist. Opcodes must be legal. Access only existing locals. Code does not end in the middle of an instruction. Types in byte-code must be respected. Execution cannot fall off the end of the code. Exception handler begin and end are sound.

Addendum: Security

Security. Internal protection; problems: memory protection, file system accesses. External protection: accessibility; problems: program threats.

Security: Program Threats. Trojan horses: a code segment that misuses its environment: mail attachments; web downloads (e.g., SEXY.EXE, which formats your hard disk); programs with the same name as common utilities; misleading names (e.g., README.TXT.EXE). Trap door (in programs or compilers): an intentional hole in the software.

Security: System Threats. Worms: a standalone program that spawns other processes (copies of itself) to reduce system performance. Example: the Morris worm (1988) exploited holes in rsh, finger and sendmail to gain access to other machines; once on the other machine it was able to replicate itself. Worms are used by spammers to spread and distribute spamming applications. Viruses: similar to worms but embedded in other programs; they usually infect other programs and the boot sector.

Security: System Threats. Denial of service: perform many requests to steal all the available resources; often distributed (using worms). Example: SYN flooding attacks: the attacker tries to connect; the victim answers with a synchronize-and-acknowledge (SYN-ACK) packet and waits for the acknowledgment. Countermeasures: active filtering; request dropping; cookie-based protocols (requests must be authenticated); stateless protocols.
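
A sketch of the idea behind cookie-based, stateless protocols: the server answers the first request with a keyed hash of the client address and a time slot, and allocates resources only when the client sends the cookie back. The secret, the cookie layout and all names are assumptions made for this example; it is not the TCP SYN cookie algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class StatelessCookieDemo {
    // Hypothetical server secret; a real server would generate it randomly and rotate it.
    private static final byte[] SECRET = "demo-secret".getBytes(StandardCharsets.UTF_8);

    // The cookie depends only on the secret, the client address and the time slot,
    // so the server does not have to remember anything about the pending request.
    static byte[] cookie(String clientAddress, long timeSlot) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(SECRET, "HmacSHA256"));
            return mac.doFinal((clientAddress + "|" + timeSlot).getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recomputes the cookie and compares it with the one the client presented.
    static boolean verify(String clientAddress, long timeSlot, byte[] presented) {
        return MessageDigest.isEqual(cookie(clientAddress, timeSlot), presented);
    }

    public static void main(String[] args) {
        byte[] c = cookie("10.0.0.1", 5);
        System.out.println(verify("10.0.0.1", 5, c));   // genuine client
        System.out.println(verify("10.0.0.2", 5, c));   // cookie replayed from another address
    }
}
```

An attacker who floods the server with requests from forged addresses never receives the cookies and therefore cannot make the server commit resources.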

Security: System Threats. Badly implemented and designed software: lpr (setuid) with an option to delete the printed file; mkdir (first create the inode, then change the owner): it was possible to change the inode before the chown; … Further problems: buffer overflows; passwords in memory or swap files; insecure protocols (FTP, SMTP); missing sanity checks (syscalls, commands in input, …); short keys and passwords; proprietary protocols.

Bad design: A very recent example. Texas Instruments produces RFID tags offering cryptographic functionality, used for cars and electronic payments: 40-bit keys, proprietary protocol. Attack from Johns Hopkins University and RSA Labs: less than 2 hours for 5 keys, less than $3500.

Security: Buffer Overflows. Overwrite a function’s return address:
void foo(int p1, int p2) {
    char array[10];
    strcpy(array, someinput);   /* no length check: a long input overwrites FP and RET */
}
[Figure: stack frame of foo: p1 & p2, RET, FP, array.]
Avoid strcpy and check the length, e.g., with strncpy.

Security: Monitoring. Check for suspicious patterns: login times, audit logs. Periodic scans for security holes (bad passwords, set-uid programs, changes to system programs). System integrity checks (checksums for executable files) [tripwire]. Network services: monitor network activity.

Example: Firewalling. Many applications use network sockets to communicate (even on a single machine). Many applications are not protected. Solution: filter all the incoming connections by default and allow only the trusted ones.

Security: (some) Design Principles. Open systems (programs and protocols). Default is deny access. Check for current authority (timeouts, …). Give the least privilege possible. Simple protection mechanisms. Do not ask too much of the users (or they will avoid protecting themselves).

Security and Systems: Some Examples. Enhancements to memory management: Intel XD bit, AMD NX bit. Pages are marked according to their content (data or code); an exception is generated if the PC is moved to a data address; prevents some buffer overflow attacks; dynamically generated code has to be generated through special system calls. Windows XP SP2, Linux, BSD, …

Security and Systems: Some Examples. SELinux: National Security Agency (USA); patches to the Linux kernel to enforce mandatory access control; open source; independent from the traditional UNIX roles (users and groups); configurable policies restricting what a program is able to do.

Security and Systems: Some Examples. OpenBSD: audit process (proactive bug search); random gaps in the stack; ProPolice: gcc puts a random integer on the stack in a call prologue and checks it when returning; W^X: pages are writable xor executable.

Security and Systems: Some Examples. OpenBSD: randomized shared library order and addresses; mmap() and malloc() return randomized addresses; guard pages between objects; privilege separation and revocation.

Privilege Separation. An unprivileged child process is used to contain and restrict the effects of programming errors, e.g., OpenSSH. [Figure: timeline. The privileged monitor listens on *:22; on a network connection it forks an unprivileged child for network processing (key exchange, authentication), which sends auth requests to the monitor and receives auth results; after authentication the state is exported and the monitor forks a user child for user request processing, which requests a PTY from the monitor (pass PTY) and handles the user network data.]