
1 Kernfach System Software WS04/05
P. Reali M. Corti

2 Introduction Admin
Lecture:
  Mo 13-14 IFW A 36
  We IFW A 36
Exercises: always on Thursday
  14-15 IFW A34 C. Tuduce (E)
  14-15 IFW C42 V. Naoumov (E)
  15-16 IFW A32.1 I. Chihaia (E)
  15-16 RZ F21 C. Tuduce (E)
  16-17 IFW A34 T. Frey (E)
  16-17 IFW A32.1 K. Skoupý (E)
System-Software WS 04/05

3 Introduction Additional Info
Internet:
  Homepage
  Inforum
  vis site
Textbooks & Co.:
  Lecture slides
  A. Tanenbaum, Modern Operating Systems
  Silberschatz / Galvin, Operating System Concepts
  Selected articles and book chapters

4 Introduction Exercises
Exercises are optional (feel free to shoot yourself in the foot)
Weekly paper exercises:
  test the knowledge acquired in the lecture
  identify troubles early
  exercise questions are similar to the exam ones
Monthly programming assignment:
  feel the gap between theory and practice

5 Introduction Exam
Sometime in March 2005
Written, 3 hours
Allowed aids:
  2 A4 pages of summary
  calculator
Official Q&A session 2 weeks before the exam

6 Introduction Lecture Goals
Operating system concepts:
  bottom-up approach
  this is not an operating system course
  learn the most important concepts
  feel the complexity of operating systems
  there's no silver bullet!
Basic knowledge for other lectures / term assignments:
  Compilerbau (Compiler Construction)
  Component Software
  ...
  OS-related assignments

7 Introduction What is an operating system?
An operating system has two goals:
Provide an abstraction of the hardware:
  ABI (application binary interface)
  API (application programming interface)
  hide details
Manage resources:
  time and space multiplexing
  resource protection

8 Introduction Operating system target machines
Targets:
  mainframes
  servers
  multiprocessors
  desktops
  real-time systems
  embedded systems
Different goals and requirements!
  memory efficiency
  reaction time
  abstraction level
  resources
  security
  ...

9 Introduction Memory vs. Speed Tradeoff
Example: retrieve a list of names (N = number of names, n = name length)

Structure     Memory     Time
Array         N*n        N
List          N(n+4)     N/2
Bin. Tree     N(n+8)     log(N)
Hash Table    3*N*n      1

10 Introduction Operating System as resource manager
... in the beginning was the hardware!
Most relevant resources:
  CPU
  Memory
  Storage
  Network

11 Introduction Lecture Topics
[Diagram: lecture topics arranged by abstraction level on top of the hardware resources CPU, Memory, Disk, Network]
Topics shown: Virtual Machine, Process, Thread, Coroutine, Scheduling, Concurrency Support, Object-Oriented Runtime Support, Runtime Support, Garbage Collection, Memory Management, Virtual Memory, Demand Paging, File System, Distributed File-System, Distributed Object-System

12 Introduction A word of warning....
Most of the topics may seem simple... and in fact they are!
Problems are mostly due to:
  complexity when integrating the system
  low-level ("bit fiddling") details
  bootstrapping (X needs Y, Y needs X)

13 Introduction Bootstrapping (Aos)
[Diagram: Aos module hierarchy, ordered by level]
Modules shown, top to bottom: SMP, Timers, Active, Traps, Interrupts, Modules, Storage, Memory, Locks, Processor

14 Introduction Lecture Topics
[Timeline: Oct'04, Nov'04, Dec'04, Jan'05, Feb'05]
Topics in order: Overview, Runtime Support, Virtual Addressing, Memory Management, Distributed Obj. System, Concurrency, Concurrency, Disc / Filesystem, Case Study: JVM

15 Run-time Support Overview
Support for programming abstractions:
Procedures:
  calling conventions
  parameters
Object-oriented model:
  objects
  methods (dynamic dispatching)
Exception handling
... more ...

16 Run-time Support Application Binary Interface (ABI)
Objects a, b, c, … with methods P, Q, R, … and internal procedures p, q, r, …
Each call pushes a Procedure Activation Frame (PAF) on the stack; the Stack Pointer (SP) marks the top.

Call sequence     Stack after the step
Call a.P          a.P
Call b.Q          a.P b.Q
Call b.q          a.P b.Q b.q
Return b.q        a.P b.Q
Call c.R          a.P b.Q c.R
Return c.R        a.P b.Q
Return b.Q        a.P
Return a.P        (empty)

17 Run-time Support Procedure Activation Frame
Call:
  Caller: save registers, push parameters, save PC, branch
  Callee: save FP, FP := SP, allocate locals
Return:
  Callee: remove locals, restore FP, restore PC
  Caller: remove parameters, restore registers
Frame layout, from the Stack Pointer (SP) towards the caller frame:
  locals
  dynamic link FP' (the Frame Pointer FP points here)
  PC
  params
  caller frame

18 Run-time Support Procedure Activation Frame, Optimizations
Many optimizations are possible:
  use registers instead of the stack
  register windows
  procedure inlining
  use SP instead of FP addressing

19 Run-time Support Procedure Activation Frame (Oberon / x86)
Caller:
  push params
  call P                  ; push pc; pc := P
Callee, entry:
  push fp
  mov fp, sp
  sub sp, size(locals)
  ...
Callee, exit:
  mov sp, fp
  pop fp
  ret size(params)        ; pop pc; add sp, size(params)

20 Run-time Support Calling Convention
Convention between caller and callee:
How are parameters passed:
  data layout
  left-to-right, right-to-left
  registers
  register window
Stack layout:
  dynamic link
  static link
Register saving:
  reserved registers

21 Run-time Support Calling Convention (Oberon)
Parameter passing:
  on stack (exception: Oberon/PPC uses registers)
  left-to-right
  self (methods only) as last parameter
  structs and arrays passed as reference; value parameters are copied by the callee
Stack:
  dynamic link
  static link as last parameter (for local procedures)
Registers:
  saved by caller

22 Run-time Support Calling Convention (C)
Parameter passing:
  on stack
  right-to-left
  arrays passed as reference (arrays are pointers!)
Stack:
  dynamic link
Registers:
  some saved by caller

23 Run-time Support Calling Convention (Java)
Parameter passing:
  left-to-right
  self as first parameter
  parameters pushed as operands
  parameters accessed as locals
  access through symbolic, type-safe operations

24 Run-time Support Object Oriented Support, Definitions
Class hierarchy: Obj0, Obj, ObjA, ObjB
Polymorphism:
  Obj x = new ObjA();
  static type of x is Obj
  dynamic type of x is ObjA
  x is compiled as being compatible with Obj, but executes as ObjA
Static and dynamic type can be different ==> the system must keep track of the dynamic type with a hidden "type descriptor".

25 Run-Time Support Polymorphism
  VAR t: Triangle; s: Square; o: Figure;
  BEGIN
    t.Draw(); s.Draw();    (* type is statically known! *)
    o.Draw();              (* type is discovered at run time! *)
  END;

  WHILE p # NIL DO p.Draw(); p := p.next END;

26 Run-time Support Object Oriented Support, Definitions
Class hierarchy: Obj0, Obj, ObjA, ObjB
  Obj x = new ObjA();
  if (x IS ObjA) { ... }    // type test
  ObjA y = (ObjA)x;         // type cast
  x = y;                    // type coercion (automatic conversion)

27 Run-time Support Object Oriented Support (High-level Java)
Type test implementation for  a IS T:
  if (a != null) {
      Class c = a.getClass();
      while ((c != null) && (c != T)) {
          c = c.getSuperclass();    // walk up the class hierarchy
      }
      return c == T;
  } else {
      return false;
  }

28 Run-Time Support Type Descriptors
  struct TypeDescriptor {
      int      level;
      type[]   extensions;
      method[] methods;
  }
  class Object {
      TypeDescriptor type;
  }
Many type-descriptor layouts are possible; the layout depends on the optimizations chosen.

29 Run-Time Support Type Tests and Casts
Each type descriptor (TD) holds a table of its base types, indexed by "extension level".

Level    TD(Obj0)    TD(Obj)    TD(ObjA)
0        Obj0        Obj0       Obj0
1        NIL         Obj        Obj
2        NIL         NIL        ObjA
3        NIL         NIL        NIL

(obj IS T)  is implemented as  obj.type.extension[ T.level ] = T
  mov EAX, obj
  mov EAX, -4[EAX]
  cmp T, -4 * T.level - 8[EAX]
  bne ....
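The constant-time type test can be sketched as follows. This is a minimal illustration, not the Oberon implementation: the class `TypeDescriptor`, the helper `is_type`, the fixed table size of 4, and the assumption that ObjB extends Obj are all hypothetical.

```python
class TypeDescriptor:
    """Hypothetical type descriptor: name, extension level, table of base types."""
    MAX_LEVELS = 4  # assumed fixed table size, matching levels 0..3 in the table

    def __init__(self, name, base=None):
        self.name = name
        self.level = 0 if base is None else base.level + 1
        # Start from a copy of the base type's table, then enter this type
        # at its own extension level.
        self.extensions = list(base.extensions) if base else [None] * self.MAX_LEVELS
        self.extensions[self.level] = self


def is_type(obj_type, t):
    """obj IS T: one indexed load and one comparison, independent of hierarchy depth."""
    return obj_type.extensions[t.level] is t


obj0 = TypeDescriptor("Obj0")
obj = TypeDescriptor("Obj", obj0)
obj_a = TypeDescriptor("ObjA", obj)
obj_b = TypeDescriptor("ObjB", obj)   # assumption: ObjB extends Obj
```

Compared with walking the superclass chain, the cost does not grow with the depth of the hierarchy; the price is one table per type.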

30 Run-time Support Object Oriented Support (High-level Java)
Method call implementation for  a.M(.....):
  Class[] parTypes = new Class[params.length];
  for (int i = 0; i < params.length; i++) {
      parTypes[i] = params[i].getClass();
  }
  Class c = a.getClass();
  Method m = c.getDeclaredMethod("M", parTypes);
  res = m.invoke(a, params);
Use the method implementation for the actual class (dynamic type).

31 Run-Time Support Handlers / Function Pointers
  TYPE
    SomeType = POINTER TO SomeTypeDesc;
    Handler = PROCEDURE (self: SomeType; param: Par);
    SomeTypeDesc = RECORD
      handler: Handler;
      next: SomeType;
    END
[Diagram: list starting at root; each node has a handler field pointing to PROC Q or PROC R, and a next field]
Advantages:
  instance bound
  can be changed at run-time
Disadvantages:
  memory usage
  bad integration (explicit self)
  non constant

32 Run-Time Support Method tables (vtables)
Idea: have a per-type table of function pointers.
  TYPE
    A = OBJECT
      PROCEDURE M0;
      PROCEDURE M1;
    END A;
    B = OBJECT (A)
      PROCEDURE M0;
      PROCEDURE M2;
    END B;
A.MethodTable:    0: A.M0    1: A.M1
B.MethodTable:    0: B.M0    1: A.M1    2: B.M2
  B.M0 overrides A.M0
  B.M2 is new
New methods add a new entry in the method table.
Overrides replace an entry in the method table.
Each method has a unique entry number.

33 Run-Time Support Method tables
  TYPE
    A = OBJECT
      PROCEDURE M0;
      PROCEDURE M1;
    END A;
    B = OBJECT (A)
      PROCEDURE M0;
      PROCEDURE M2;
    END B;
A.MethodTable:    0: A.M0    1: A.M1
B.MethodTable:    0: B.M0    1: A.M1    2: B.M2
Virtual dispatch:
  o.M0;    is compiled as    call o.Type.Methods[0]
    mov eax, VALUE(o)
    mov eax, type[eax]
    mov eax, off + 4*mno[eax]
    call eax
[Diagram: object o with its Fields and a hidden Type field pointing to the method table]
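Method-table dispatch can be sketched in a few lines. The sketch assumes that B overrides M0 and adds M2; the names `A_METHODS`, `B_METHODS`, `Instance` and `call` are illustrative and not taken from any run-time system.

```python
def a_m0(self): return "A.M0"
def a_m1(self): return "A.M1"
def b_m0(self): return "B.M0"
def b_m2(self): return "B.M2"

# Method table of A: the entry number is the index.
A_METHODS = [a_m0, a_m1]

# Method table of B: copy of A's table, override slot 0, append slot 2.
B_METHODS = list(A_METHODS)
B_METHODS[0] = b_m0
B_METHODS.append(b_m2)


class Instance:
    """Hypothetical object layout: a hidden type field plus the user fields."""
    def __init__(self, method_table):
        self.type = method_table


def call(o, mno):
    """Virtual dispatch: call o.Type.Methods[mno] with o as self."""
    return o.type[mno](o)


a, b = Instance(A_METHODS), Instance(B_METHODS)
```

Because an override keeps the entry number of the method it replaces, the call site `call(o, 0)` works unchanged for instances of A and of B.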

34 Run-Time Support Oberon Type Descriptors
[Diagram: Oberon object and type descriptor layout]
Object:
  type desc (hidden pointer to the type descriptor)
  obj fields
Type descriptor (the type descriptor is also an object, with its own type desc):
  td size
  obj size: for object allocation
  type name
  ext table (superclass table): for type checks
  mth table (method table): for method invocation
  ptr offsets (pointers in object): for garbage collection

35 Run-Time Support Interfaces, itables
  interface A { void m(); }
  interface B { void p(); }
  Object x;
  A y = (A)x;    // does x implement A?
  y.m();         // multiple itables: how is the right itable discovered?
x has a method table (itable) for each implemented interface.

36 Run-Time Support Interface support
How to retrieve the right method table (if any)?
  global table indexed by [class, interface]
  local (per type) table / list indexed by [interface]
Many optimizations are available:
  use the usual trick: enumerate interfaces

37 Run-Time Support Interface support (I)
Type descriptor: method table (vtable) plus a linked list of interfaces (Intf0, ..., Intf7), each with its own method table (itable).
  Intf0 y = (Intf0)x;
  y.M();
is implemented as
  interface i = x.type.interfaces;
  while ((i != null) && (i != Intf0)) {
      i = i.next;
  }
  if (i != null) i.method[mth_nr]();
The call is expensive because it requires traversing a list: O(N) complexity.

38 Run-Time Support Interface support (II)
Type descriptor: vtable plus an interfaces array indexed by interface number (1..7); only the slots of the implemented interfaces are filled (here itable2 and itable7). Sparse array!
  Intf0 y = (Intf0)x;
  y.M();
is implemented as
  interface i = x.type.interfaces[Intf0];
  if (i != null) i.method[mth_nr]();
Lookup is fast (O(1)), but wastes memory.

39 Run-Time Support Interface Implementation (III)
Overlap the interface table indexes.
[Diagram: type descriptors t and u, each with a vtable and an interfaces array indexed 1..7; entries shown: itable t,2, itable t,7, itable u,0, itable u,2]

40 Run-Time Support Interface Implementation (III)
Overlapped interface table index.
[Diagram: two type descriptors, each with its own vtable and itables, whose interfaces arrays are stored overlapped in one shared index]

41 Run-Time Support Interface Implementation (III)
Overlapped interface tables.
  Intf0 y = (Intf0)x;
  y.M();
is implemented as
  i = x.type.interfaces[Intf0];
  if ((i != null) && (i in x.type)) i.method[mth_nr]();
[Diagram: type descriptor with vtable and an interfaces index into a shared area holding the itables of several types]

42 Run-Time Support Exceptions
  void catchOne() {
      try {
          tryItOut();
      } catch (TestExc e) {
          handleExc(e);
      }
  }
Compiled code for catchOne():
   0 aload_0
   1 invokevirtual tryItOut
   4 return
   5 astore_1
   6 aload_0
   7 aload_1
   8 invokevirtual handleExc
  11 return
Exception table:
  From    To    Target    Type
  0       4     5         TestExc

43 Run-Time Support Exception Handling / Zero Overhead
  try {                     // pcstart
      .....
  }                         // pcend
  catch (Exp1 e) { }        // pchandler1
  catch (Exp2 e) { }        // pchandler2

Global exception table:
  start      end      exception    handler
  pcstart    pcend    Exp1         pchandler1
  pcstart    pcend    Exp2         pchandler2

  void ExceptionHandler(state) {
      pc = state.pc; exc = state.exception;
      i = 0;
      while (!Match(table[i], pc, exc)) {
          i++;
          if (i == TableLength) {
              // no handler in this frame: unwind one frame and search again
              PopActivationFrame(state);
              pc = state.pc;
              i = 0;
          }
      }
      state.pc = table[i].pchandler;
      ResumeExecution(state);
  }
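The table search with stack unwinding can be sketched as follows. This is a simplified model under stated assumptions: `Entry`, `find_handler` and the list `return_pcs` (standing in for the chain of activation frames) are hypothetical names, and exception types are matched by equality only, without subclass matching.

```python
from collections import namedtuple

# One row of the global exception table: the handler covers pcs in [start, end).
Entry = namedtuple("Entry", "start end exception handler")


def find_handler(table, return_pcs, pc, exc):
    """Return (handler_pc, frames_popped), or None if no handler matches.

    return_pcs lists the pcs of the calling frames, innermost caller last;
    popping from it stands in for PopActivationFrame.
    """
    frames = list(return_pcs)
    popped = 0
    while True:
        for entry in table:                 # traverse the whole table per frame
            if entry.start <= pc < entry.end and entry.exception == exc:
                return entry.handler, popped
        if not frames:                      # stack exhausted: uncaught exception
            return None
        pc = frames.pop()                   # unwind one activation frame
        popped += 1


table = [Entry(0, 4, "TestExc", 5), Entry(100, 120, "Exp2", 130)]
```

No code runs on entry to or exit from a try block; all the work happens in the search, which is why the exceptional path is the expensive one.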

44 Run-Time Support Exception Handling / Zero Overhead
exception table filled by the loader / linker
traverse the whole table for each stack frame
system has a default handler for uncaught exceptions
no exceptions ==> no overhead
exception case is expensive
system optimized for the normal case

45 Run-Time Support Exception Handling / Fast Handling
Source:
  try {
      .....
  } catch (Exp1 e) {        // pchandler1
  } catch (Exp2 e) {        // pchandler2
  }
Add code instrumentation: push catch descriptors on the stack.
  try {
      save (FP, SP, Exp1, pchandler1)
      save (FP, SP, Exp2, pchandler2)
      .....
      remove catch descr.
      jump end
  } catch (Exp1 e) {
  } catch (Exp2 e) {
      remove catch descr.
      jump end
  }
  end:
Use an exception stack to keep track of the handlers.

46 Run-Time Support Exception Handling / Fast Handling
  void ExceptionHandler(ThreadState state) {
      int FP, SP, handler;
      Exception e;
      do {
          // pop next exception descriptor from the exception stack
          retrieve(FP, SP, e, handler);
      } while (!Match(state.exp, e));
      state.fp = FP;         // set frame to the one
      state.sp = SP;         // containing the handler
      state.pc = handler;    // resume with the handler
      ResumeExecution(state);
  }
Can resume in a different activation frame.

47 Run-Time Support Exception Handling / Fast Handling
code instrumentation:
  insert exception descriptor at try
  remove descriptor before catch
fast exception handling
overhead even when no exceptions occur
system optimized for the exception case

48 Virtual Addressing Overview
Virtual addressing: abstraction of the MMU (Memory Management Unit)
Work with virtual addresses, where address_real = f(address_virtual)
Provides decoupling from real memory:
  virtual memory
  demand paging
  separated address spaces

49 Virtual Addressing Pages
Memory as an array of pages; programs use and run in virtual address spaces.
[Diagram: virtual address spaces 1 and 2 mapped onto real memory]
  virtual address space: numbered pages; a page is either mapped to a page frame or unmapped (invalid); whole ranges may be unmapped
  real memory: pool of page frames
  mapping: page ==> page frame

50 Virtual Addressing Page mapping
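Virtual address ==> real address
  virtual address = (page-no, off)
  real address = (frame, off)
MMU:
  TLB (Translation Lookaside Buffer): associative cache with entries (PT, VA, RA); maps page-no directly to frame
  Page table: located through the Page Table Ptr register, indexed by page-no, yields the frame
The offset is carried over unchanged; frame and off together address a location inside a page frame in real memory.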

51 Virtual Addressing Definitions
page: smallest unit in the virtual address space
page frame: unit in the physical memory
page table: table mapping pages into page frames
page fault: access to a non-mapped page
working set: pages a process is currently using

52 Virtual Addressing Alternate Page Mapping
Alternatives for a 64-bit address space:
Multilevel page tables: multipart virtual address (pno1, pno2, off); pno1 indexes the 1st-level table, pno2 the 2nd-level table
Page table as (B*-)tree
Inverted page table: one entry (pr, vp, pf) per page frame, or unassigned; the pair (pr, vp) is hashed into a hash table and the next entry is probed on a mismatch
  pr = process, vp = virtual page, pf = page frame

53 Virtual Addressing What for?
Decoupling from real memory:
  virtual memory (cheat: use more virtual memory than the available real memory)
  dynamically allocated contiguous memory blocks (for multiple stacks in multitasking systems)
Some optimizations:
  null reference checks
  garbage collection (using dirty flag)
Virtual addressing is not for free!
  address mapping may require additional memory accesses
  page table takes space

54 Virtual Addressing Virtual Memory
Use secondary storage (disc) to keep currently unused pages (swapping)
Page table usually keeps some per-page flags:
  invalid: page not mapped
  referenced: page has been referenced
  dirty: page has been modified
Accessing an invalid page causes a page-fault interrupt:
  select page frame to be swapped out (victim or candidate)
  swap in requested page frame

55 Virtual Addressing Virtual Memory / Demand Paging
[Diagram: demand paging between real memory and disc]
  "page-out": the victim page is written to disc and set to invalid in the page table
  "page-in": the requested page is read from disc into real memory

56 Virtual Addressing Demand Paging Sequence
MMU / TLB:
  IF VA IN TLB THEN RETURN RA
  ELSE
    Access Page Table;
    IF Page invalid THEN Page-Fault
    ELSE RETURN RA
    END
  END
OS page-fault handler:
  IF Free Page Frame exists THEN
    Assign frame to VA
  ELSE
    Search victim page;
    IF victim page modified THEN page-out to secondary storage END;
    Invalidate victim page
  END;
  Page-in from secondary storage;
  Reset invalid flag
Expected time to translate VA into RA:
  $E[t] = P_{TLB} \cdot t_{TLB} + P_{PT} \cdot t_{PT} + P_{disc} \cdot t_{disc}$

57 Virtual Addressing Example
Page size: 4 KB
Address size: 32 bits
  page offset: 12 bits (4 KB = 2^12)
  page number: 20 bits (32 - 12)
  addressable memory: 2^32 = 4 GB
  page table size: 2^20 * 32 bits = 4 MB
Real memory: 128 MB
  page table overhead: ca. 3% (4 MB / 128 MB)
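The figures in this example can be recomputed with a few lines of arithmetic. The variable names are illustrative; the 32-bit page table entry is the size used in the example.

```python
PAGE_SIZE = 4 * 1024           # 4 KB pages
ADDRESS_BITS = 32              # width of a virtual address
ENTRY_BITS = 32                # width of one page table entry
REAL_MEMORY = 128 * 1024**2    # 128 MB of real memory

offset_bits = PAGE_SIZE.bit_length() - 1          # log2(4096) = 12
page_number_bits = ADDRESS_BITS - offset_bits     # 32 - 12 = 20
addressable = 2**ADDRESS_BITS                     # 4 GB
page_table_bytes = 2**page_number_bits * ENTRY_BITS // 8   # 4 MB
overhead = page_table_bytes / REAL_MEMORY         # 1/32, about 3 %
```

The overhead is per address space: every process with its own full single-level page table pays the 4 MB again, which motivates multilevel and inverted tables.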

58 Virtual Addressing Example
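[Diagram: a mov instruction accesses memory either through the TLB (probability P_TLB) or, on a TLB miss (1 - P_TLB), through the page table; without a page fault (1 - P_page fault) this costs 1 memory read, with a page fault (P_page fault) 1 disc write and 1 disc read]
$E[t] = P_{TLB}\, t_{TLB} + (1 - P_{TLB})\,\bigl(t_{PT} + P_{PF}\, t_{disc} + (1 - P_{PF})\, t_{mem}\bigr)$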
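To get a feeling for the formula, it can be evaluated directly. The function name and the sample timings are made up for illustration; they are not measurements.

```python
def expected_access_time(p_tlb, t_tlb, t_pt, p_pf, t_disc, t_mem):
    """E[t] = P_TLB*t_TLB + (1 - P_TLB)*(t_PT + P_PF*t_disc + (1 - P_PF)*t_mem)."""
    miss_cost = t_pt + p_pf * t_disc + (1 - p_pf) * t_mem
    return p_tlb * t_tlb + (1 - p_tlb) * miss_cost


# Hypothetical timings in nanoseconds: TLB 1, page table walk 100,
# memory read 100, disc access 10 ms.
fast = expected_access_time(0.99, 1, 100, 0.0, 10_000_000, 100)
slow = expected_access_time(0.99, 1, 100, 0.001, 10_000_000, 100)
```

Even a small page-fault probability dominates the result, because the disc term is several orders of magnitude larger than all the others.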

59 Virtual Addressing Demand Paging: Page Replacement
Optimal strategy (longest unused):
  take the page that will remain unused for the longest time
  requires an oracle
NRU, "Not Recently Used":
  reset the referenced flag at each tick
  create page categories (good candidate to bad candidate)
  choose the best candidate
[Table: candidate preference (Pref) for each combination of the ref and mod flags]

60 Virtual Addressing Demand Paging: Page Replacement (2)
LRU, "Least Recently Used":
  Assumption: not used in the past ==> not used in the future
  Hardware implementation: 64-bit time-stamp for each page
  Software implementation: "aging" algorithm
    each page keeps a value t, updated at every tick from t(i) to t(i+1) using the reference flag
    the reference flag is set if the page was accessed
    choose the page with the lowest value
[Diagram: bit patterns of t(i) and t(i+1) together with the reference flag]

61 Virtual Addressing Demand Paging: Page Replacement (3)
"Least Recently Created" LRC (FIFO):
  page lifespan as metric (old pages are swapped out)
  chain sorted by creation time
  bad handling of often-used pages
  Fix: "second chance" when the page was accessed (ref flag set) during the last tick
    cur := earliest;
    WHILE cur.ref DO
      cur.ref := FALSE;
      cur := cur.next
    END
[Diagram: chain of pages starting at earliest, linked by next, each with a ref flag]
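A runnable variant of the second-chance loop is sketched here. It differs slightly from the pseudocode: instead of advancing a cursor through a circular chain, a referenced page is moved to the end of the list, which is the usual queue formulation. The function name and the list-of-pairs representation are assumptions.

```python
def second_chance_victim(pages):
    """Pick a victim from pages, a list of [page, ref] pairs, earliest first.

    A page whose ref flag is set gets a second chance: the flag is cleared
    and the page is treated as newly created. The list is updated in place.
    """
    while pages[0][1]:
        page, _ = pages.pop(0)
        pages.append([page, False])   # clear ref flag, move to the young end
    return pages.pop(0)[0]            # first page with a clear flag is evicted


chain = [[1, True], [2, False], [3, True]]
victim = second_chance_victim(chain)
```

The loop always terminates: after at most one full pass every flag has been cleared, so in the worst case the strategy degenerates to plain FIFO.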

62 Virtual Addressing Demand Paging: Page Replacement (4)
Strategies:
  optimal
  LRU / NRU / LRC
Exceptions:
  "page pinning": the page cannot be swapped out
  kernel code

63 Virtual Addressing Example
Working set: {1, 2, 3, 4}
Accessed pages: 1, 2, 1, 3, 4, 1, 2, 3, 4
Available page frames: 3
[Table: frame contents after each page access for the Ideal, FIFO and LRU strategies, page faults marked "PF!"]
  Ideal: {1,2}, {1,2,3}, {1,2,4}, {2,3,4}
  FIFO: ..., {3,4,1}, {4,1,2}, ...
  LRU: ..., {1,3,4}, {1,4,2}, {4,2,3}
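The example can be replayed with a small simulator. This is a sketch for checking the fault counts, not an efficient implementation; `simulate` and the three victim-selection functions are hypothetical names.

```python
def simulate(accesses, frames, choose_victim):
    """Count page faults; choose_victim(resident, history, future) names the page to evict."""
    resident, faults = [], 0          # resident is kept in load order, oldest first
    for i, page in enumerate(accesses):
        if page not in resident:
            faults += 1
            if len(resident) == frames:
                resident.remove(choose_victim(resident, accesses[:i], accesses[i + 1:]))
            resident.append(page)
    return faults


def fifo(resident, history, future):
    return resident[0]                # page that was loaded first


def lru(resident, history, future):
    # Page whose most recent access lies furthest in the past.
    return min(resident, key=lambda p: len(history) - 1 - history[::-1].index(p))


def optimal(resident, history, future):
    # Page whose next access is furthest in the future (or never comes).
    return max(resident, key=lambda p: future.index(p) if p in future else len(future))


ACCESSES = [1, 2, 1, 3, 4, 1, 2, 3, 4]
counts = {f.__name__: simulate(ACCESSES, 3, f) for f in (optimal, fifo, lru)}
```

For the access sequence of the example and 3 page frames the sketch counts 5 page faults for the optimal strategy, 8 for FIFO and 7 for LRU.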

64 Demand Paging Belady’s Anomaly
Page access sequence, LRC strategy:
  3 page frames: 9 page faults
  4 page frames: 10 page faults
[Tables: frame contents per access for 3 and for 4 page frames, victims marked x]
Belady's anomaly: more page frames cause more page faults.
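The anomaly is easy to reproduce. The access sequence used on the slide is not preserved here, so the sketch assumes the classic textbook sequence 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, which yields the same fault counts; `fifo_faults` is an illustrative name.

```python
from collections import deque


def fifo_faults(accesses, frames):
    """Count page faults under FIFO (LRC) replacement."""
    queue, faults = deque(), 0
    for page in accesses:
        if page not in queue:
            faults += 1
            if len(queue) == frames:
                queue.popleft()       # evict the page that was created first
            queue.append(page)
    return faults


SEQUENCE = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # assumed sequence
with_3 = fifo_faults(SEQUENCE, 3)
with_4 = fifo_faults(SEQUENCE, 4)
```

With this sequence 3 page frames give 9 faults and 4 page frames give 10. Strategies such as LRU do not show the effect, because the pages held with n frames are always a subset of those held with n+1 frames.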

65 Demand Paging How many page frames per process?
Even distribution: every process has the same amount of memory
Thrashing:
  every memory access causes a page fault
  not enough page frames for the current working set
  the system is swapping instead of running
[Plot: CPU load (up to 100 %) against process count (1, 2, ..., n, n+1)]

66 Demand Paging How many page frames per process? (2)
Depending on the process needs (1): use the working set
  page frames are assigned according to the process' working-set size
  swap out a process when not enough memory is available
[Diagram: sliding window over the page access sequence; the working set changes from { 2, 3, 4 } to { 1, 2, 3, 4 } as the window moves]

67 Demand Paging How many page frames per process? (3)
Depending on the process needs (2): use the page-fault rate
[Plot: page-fault rate over time between a LOW and a HIGH threshold]
  rate above HIGH: swap out one process
  rate below LOW: swap in

68 Virtual Addressing Aos/Bluebottle, Memory Layout Example
[Diagram: 4 GB address space; kernel and heap below 2 GB, stacks between 2 GB and 4 GB]
Stacks:
  128 KB per stack
  max active objects
  first stack page allocated on process creation
  PROCEDURE PageFault;
  BEGIN
    IF adr > 2GB THEN add page to stack
    ELSE Exception(NilTrap)
    END
  END PageFault;

69 Virtual Addressing Example: UNIX, Fork
A UNIX program consists of a read-only code (text) part and a read-write data part.
Fork(): process B gets its own page table next to that of process A
  text pages: shared, read-only
  data pages: shared until written, then copied (data'): "copy on write"

70 Virtual Addressing OS Control
Oberon:
  no virtual memory
Windows:
  Virtual Memory configuration
  Task Manager
Linux:
  swap partition / swap files
  ps / top

71 Virtual Addressing Segmentation
e.g. Intel x86
Problem:
  640 KB max memory
  16-bit addresses (i.e. 64 KB)
Solution: work in a segment
  code / data segments
  check segment boundaries
  Addr_real = Seg_base + Offset
[Diagram: real memory with a code segment and a data segment, each described by segment base and segment limit]

72 Virtual Addressing Summary
Virtual addresses, address_real = f(address_virtual)
Decoupling from real memory:
  virtual memory
  demand paging
  separate address spaces
Keywords:
  page, page frame, page table, page fault
  page flags: dirty, used, unmapped
  page replacement strategy: LRC, LRU, ideal, ...
  swapping
  thrashing, Belady's anomaly

73 Memory Management Overview
Abstractions for applications:
  heap
  memory blocks ( << memory pages)
Operations:
  Allocate
  Deallocate
Topics:
  memory organization
    free lists
    allocation strategies
  deallocation
    explicit
    garbage collection: type-aware, conservative, copying / moving, incremental, generational

74 Memory Management Objects on the heap
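Object instances: a, b, c, d, …
Sequence:
  NEW(a)  NEW(b)  NEW(c)    dynamic allocation
  DISPOSE(b)                explicit disposal
  NEW(d)  NEW(e)
[Diagram of the "heap": a, c and d allocated, the block of b free; e cannot be placed!]
  Case 1: not enough space for e
  Case 2: enough free space in total, but no free block large enough for e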

75 Memory Management Problem overview
Problems:
  heap size limitation (e, case 1)
  external fragmentation (e, case 2)
  dangling pointers (a points to b)
Solutions:
  system-managed list of free blocks ("free list")
  vector of blocks with fixed size (bitmap, with 0 = free, 1 = used)
  automated detection and reclamation of unused blocks ("garbage collection")

76 Memory Management Theory: 50% rule
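Assumption: stable state with N allocated blocks and M free blocks
50% rule: M = 1/2 N
Classify the allocated blocks by their neighbours: A blocks have a free block on both sides, B blocks on one side, C blocks on neither side.
  $N = A + B + C$
  $M = \tfrac{1}{2}(2A + B + \varepsilon)$, with $\varepsilon = 0, 1,$ or $2$
Block disposal: $\Delta M = (C - A)/N$
Block allocation (p = splitting likelihood): $\Delta M = -(1 - p)$
In the stable state both changes cancel:
  $(C - A)/N = 1 - p$
  $C - A - N + pN = 0$
Hence:
  $2M = 2A + B + \varepsilon$
  $2M = 2A + N - A - C + \varepsilon$
  $2M = N + A - C + \varepsilon$
  $2M = pN + \varepsilon$
For $p \approx 1$ and negligible $\varepsilon$ this gives $M = \tfrac{1}{2} N$.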

77 Memory Management Theory: Memory Fragmentation
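Heap of size H with b allocated blocks of average size B and, by the 50% rule, b/2 free blocks of average size F. Write $F = \alpha B$ and $\rho = bB/H$ (fraction of the heap in use).
Critical point, using the 50% rule:
  $\tfrac{b}{2} F = H - bB$
  $\tfrac{\alpha}{2}\, bB = H - bB$
  $\dfrac{H}{bB} = 1 + \tfrac{\alpha}{2}$
  $\alpha = \dfrac{2}{\rho} - 2$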

78 Memory Management Free-list management with a Bitmap
Idea:
  partition the heap in blocks of size s
  use a bitmap to track allocated blocks
  bitmap[i] = true  <==>  block i allocated
Problems:
  internal fragmentation: the block size is rounded up to the next multiple of s
  map size: (heap_size / s) bits
[Diagram: loss due to internal fragmentation]

79 Memory Management Free-list management with a list
List organization:
  sorted / non-sorted
    merging of empty blocks is simpler with a sorted list
  one list / many lists (per size)
    search is simpler, merging is more difficult
  management data stored in the free block
    size, next pointer
Operations:
  Allocation
  Disposal with merge
    find free blocks next to the current block, merge into a bigger free block

80 Memory Management Memory allocation strategies
block splitting: if a free block is bigger than the requested block, it is split
first-fit: use the first free block which is big enough
best-fit: take the smallest fitting block ==> causes a lot of fragmentation
worst-fit: take the biggest available block
quick-fit: best-fit, but with multiple free-lists (one per block size) ==> fast allocation!
[Diagram: free and used blocks, internal fragmentation]

81 Memory Management Buddy System (for fast block merging)
Blocks have size 2^k
A block of size 2^i has address j*2^i (the last i bits are 0)
Blocks with addresses x = j*2^i and (j XOR 1)*2^i are buddies (they can be merged into a block of size 2^(i+1))
  buddy = x XOR 2^i
Split: a block of size 2^(k+1) yields two blocks of size 2^k; merge is the reverse
[Diagram: 64 is split into 32 32, then 16 16 32, then 16 8 8 32; the addresses b1 and b2 of two buddies differ in one bit]
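The buddy computation is a single XOR. The sketch below uses hypothetical helper names and assumes block addresses are relative to the start of the heap.

```python
def buddy_of(address, size):
    """Address of the buddy of the block at address; size must be a power of two."""
    assert size & (size - 1) == 0, "size must be a power of two"
    assert address % size == 0, "block must be aligned to its size"
    return address ^ size             # flips exactly the bit that tells the buddies apart


def merge(address, size):
    """Merge a block with its buddy; returns (address, size) of the merged block."""
    return min(address, buddy_of(address, size)), size * 2


def split(address, size):
    """Split a block into its two halves, which are buddies of each other."""
    half = size // 2
    return (address, half), (address + half, half)
```

For example, the two 8-byte blocks at addresses 32 and 40 are buddies and merge into the 16-byte block at 32, whereas the 8-byte blocks at 40 and 48 are adjacent but not buddies.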

82 Memory Management Buddy System (for fast block merging)
Problem: only buddies can be merged
[Diagram: in 16 16 32, adjacent free blocks that are not buddies cannot be merged; buddies can]
Cascading merge:
  16 8 8 32
  16 16 32
  32 32

83 Memory Management Buddy System (for fast block merging)
Allocation, allocate(8):
  32 32
  split: 16 16 32
  split: 8 8 16 32
  quick fit: 8 8 16 32 (one block of size 8 is allocated)

84 Memory Management Example: Oberon / Aos
Block size = k*32
  free-lists for k = 1..9, one list for blocks > 9*32
Allocate:
  quick-fit, splitting may be required
Free-list management and block merging are done by the garbage collector
[Diagram: free-lists for 32, 64, 96, ..., k * 32 in the initial state and after ALLOCATE(50), which returns an allocated block of size 64]

85 Memory Management Garbage Collection
Two steps:
  Free block detection
    type-aware: the collector is aware of the types traversed, i.e. it knows which values are pointers
    conservative: the collector doesn't know which values are pointers
  Block disposal
    return unused blocks to the free-lists
GC characteristics:
  incremental: gc is performed in small steps to minimize program interruption
  moving / copying / compacting: blocks are moved around
  generational: blocks are grouped in generations; different treatment or collection priority
Barriers:
  read: intercept and check every pointer read operation
  write: intercept and check every pointer write operation

86 Memory Management Garbage Collection: Reference Counting
Every object has a reference counter rc
rc = 0 ==> the object is "garbage"
Write barrier for q := p (p, q pointers to objects):
  INC p.rc
  DEC q.rc
  IF q.rc = 0 THEN Collect q^ END;
  q := p
Problems:
  overhead
  no support for circular structures
Useful for:
  module hierarchies
  DAG structures (e.g. LISP)
[Diagram: modules M, A, B, C, D; objects in a circular structure keep rc >= 1]
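The write barrier and the problem with circular structures can be shown in a small model. The class `Obj`, the functions `assign` and `release`, and the list `collected` are hypothetical; the root object is pinned by giving it a count of 1.

```python
class Obj:
    """Hypothetical heap object with a reference counter and named pointer fields."""
    def __init__(self, name):
        self.name, self.rc, self.fields = name, 0, {}


collected = []                        # names of reclaimed objects, in order


def release(obj):
    obj.rc -= 1
    if obj.rc == 0:                   # garbage: reclaim it and drop its references
        collected.append(obj.name)
        for child in obj.fields.values():
            if child is not None:
                release(child)


def assign(holder, field, target):
    """Write barrier for holder.field := target."""
    old = holder.fields.get(field)
    if target is not None:
        target.rc += 1                # INC first, so that x.f := x.f is safe
    holder.fields[field] = target
    if old is not None:
        release(old)                  # DEC, collect if the counter reaches 0


root = Obj("root"); root.rc = 1       # pinned root
a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
assign(root, "x", a); assign(a, "next", b)      # chain root -> a -> b
assign(root, "x", None)                         # a and b are reclaimed
assign(root, "y", c); assign(c, "next", d); assign(d, "next", c)   # cycle c <-> d
assign(root, "y", None)                         # cycle unreachable, but rc stays 1
```

After the last assignment neither c nor d can be reached from the root, yet both counters remain at 1, so a pure reference-counting collector never reclaims them.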

87 Memory Management Garbage Collection: Mark & Sweep
Mark phase (garbage detection):
  Compute the root-set, consisting of
    global pointers (statics) in each module
    local pointers on the stack in each PAF
    temporary pointers in the CPU's registers
  Traverse the graph of the live objects starting from the root-set with a depth-first strategy; mark all reached objects.
Sweep phase (garbage collection):
  Linear heap traversal. Non-marked blocks are inserted into the free-lists.
  Optimization: lazy sweeping (sweep during allocation, allocation gets slower)
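Both phases fit in a few lines when the heap is modelled as a graph. This is a sketch under simplifying assumptions: blocks are plain names, `refs` maps a block to the blocks it points to, and the mark stack is an ordinary list, so it does not address the memory cost of marking.

```python
def mark(roots, refs):
    """Depth-first traversal from the root-set; returns the set of marked blocks."""
    marked, stack = set(), list(roots)
    while stack:
        block = stack.pop()
        if block not in marked:
            marked.add(block)
            stack.extend(refs.get(block, ()))
    return marked


def sweep(heap, marked):
    """Linear heap traversal; returns the non-marked blocks for the free-lists."""
    return [block for block in heap if block not in marked]


heap = ["a", "b", "c", "d", "e"]
refs = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}   # two cycles, e isolated
live = mark(["a"], refs)
free = sweep(heap, live)
```

Unlike reference counting, the unreachable cycle c, d is reclaimed, because liveness is decided by reachability from the root-set alone.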

88 Memory Management Garbage Collection: root-set
Run-time support from the object system: hidden data structures with (compiler-generated) information about pointers (metadata).
  instance pointers: object instance ==> type tag ==> type descriptor with pointer offsets (off1, off2, ...)
  global pointers: module data ==> module descriptor with pointer offsets (off1, off2, ...)
Conservative approach: guess which values could be pointers and treat them as such.

89 Memory Management Garbage Collection: Mark with Pointer Rotation/1
Problem: garbage collection is called when free memory is low, but mark may require a lot of memory
Solution: pointer rotation algorithm (Deutsch, Schorr, Waite)
  memory efficient
  iterative
  structures are temporarily inconsistent
  non-concurrent
  non-incremental

90 Memory Management Garbage Collection: Mark with Pointer Rotation/2
Simple case: list traversal
[Diagram: pointers q and p advance along the list while p.link is rotated to point back to q]

91 Memory Management Garbage Collection: Mark with Pointer Rotation/3
Generic case: structure traversal
[Diagram: pointers q and p before and after one rotation step in a structure with several pointer fields]

92 Memory Management Garbage Collection: Memory Compaction
MS .NET:
  nextavail pointer: partitions the heap between allocated and free space
  Allocate: increment nextavail
  Garbage collector performs memory compaction
[Diagram: nextavail moves forward on ALLOC and back after GC]

93 Memory Management Garbage Collection: Stop & Copy
Partition the heap in from and to regions
Collection:
  traverse the objects in from, copy them to to
  leave a forwarding pointer behind (requires a read barrier)
  swap from and to
Characteristics: copying, incremental, (generational)
Instrument the code with a read barrier; every  access p  becomes:
  IF p is moved THEN replace p with forwarding pointer END;
  access p

94 Memory Management Garbage Collection: Stop & Copy
[Diagram in four steps (1 to 4): live objects are copied from the from region to the to region, then the roles of from and to are swapped]

95 Memory Management Garbage Collection: Concurrent GC
User process = mutator
"Stop-and-go" approach: mutator and GC alternate; the mutator is stopped for the whole collection
"Incremental" approach: the GC runs in small steps between mutator steps, so that a real-time constraint on the pause can be met

96 Memory Management Garbage Collection: Tricolor marking
"Wave-front" model

State                                    Color
already traversed, behind the wave       black
being traversed, on the wave             grey
not reached yet, in front of the wave    white

97 Memory Management Garbage Collection: Tricolor marking / Isolation
The mutator can change pointers at any time
Critical case: pointer from black to white (B ==> W); W would stay unmarked and be treated as unreachable
Remedy: write barrier
  color B gray, or
  color W gray

98 Memory Management Garbage Collection: Baker's Treadmill
Heap: doubly linked chain of objects
[Diagram: ring divided into Free-Space, From-Space and To-Space, with the pointers cur and scan]

99 Memory Management Garbage Collection: Baker's Treadmill
[Diagram: ring with Free-Space, To-Space and From-Space; allocation at cur, either conservative or progressive; scan marks the scanning position]

100 Memory Management Garbage Collection: Baker's Treadmill
[Diagram: collecting a reference moves the object from From-Space to To-Space; cur and scan advance around the ring with Free-Space, To-Space and From-Space]

101 Memory Management Garbage Collection: Baker's Treadmill
State transitions after GC is complete:
  From-Space + Free-Space ==> Free-Space
  To-Space ==> From-Space
Fragmentation:
  external: not removed
  internal: depends on the supported block sizes
Allocation:
  conservative: black
  progressive: white
[Diagram: NEW(x) and NEW(y) allocate x and y at cur; scan proceeds from the root set]

102 Memory Management Generational Garbage Collection
Collect where garbage is most likely to be found
Expected object life:
  young ==> short life (temp data)
  old ==> long life
Generations G0, G1, G2:

Gen    GC frequency
G0     high
G1     medium
G2     low

Special handling is required for pointers across different generations
[Diagram: objects A to J; survivors of a collection move from G0 to G1 and from G1 to G2]

103 Memory Management Garbage Collection: Finalization
Finalization (after-use cleanup): user-defined routine called when the object is collected
Establish consistency:
  save buffers
  flush caches
Release resources:
  close connections
  release file descriptors
Dangers:
  resurrection of objects: objects added to live structures
  finalization sequence is undefined

104 Memory Management Garbage Collection: .NET Finalization Example
Rules:
  objects with a finalizer belong to an older generation
  a finalizer is only called once (ReRegisterForFinalize)
  Finalization queue: live objects with a finalizer
  Freachable queue: collected objects to be finalized
  finalization is executed by a different process (thread) for security reasons
[Diagram: at GC, garbage objects with a finalizer move from the Finalization queue to the Freachable queue, where a separate thread finalizes them]

105 Memory Management Garbage Collection: Weak Pointers
Objects referenced only through a weak pointer can be collected by the GC in case of need
Used for caches and buffers
Implementation:
  weak pointers are not registered to the GC
  use a weak reference table (indirect access)
[Diagram: weak references go through the weak reference table; entries point to objects in use or to garbage]

106 Memory Management Garbage Collection: Weak Pointers Example
Oberon: internal file list system must keep track of open files to avoid buffer duplication file descriptor must be collected once user has no more reference to it use weak pointer in the system (otherwise would keep file alive!) System-Software WS 04/05

107 Memory Management Object Pools
Application keeps a pool of preallocated object instances; handles allocation and disposal Simulation discrete events Buffers in a file system Provide dynamic allocation in real-time system PROCEDURE NewT (VAR p: ObjectT); BEGIN IF freeT = NIL THEN NEW(p) ELSE p := freeT; freeT := freeT.next END END NewT; PROCEDURE DisposeT (p: ObjectT); BEGIN p.next := freeT; freeT := p END DisposeT; System-Software WS 04/05
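The same free-list pool can be sketched in Python. The class names `ObjectT` and `Pool` and the `allocated` counter are assumptions added for illustration; `new` and `dispose` mirror NewT and DisposeT.

```python
class ObjectT:
    def __init__(self):
        self.next = None      # link used only while the object sits in the pool


class Pool:
    def __init__(self):
        self.free = None      # head of the free list (freeT on the slide)
        self.allocated = 0    # number of real allocations, for illustration

    def new(self):
        """Reuse a pooled instance if there is one, otherwise allocate."""
        if self.free is None:
            self.allocated += 1
            return ObjectT()
        p = self.free
        self.free = p.next
        p.next = None
        return p

    def dispose(self, p):
        """Push the instance back onto the free list."""
        p.next = self.free
        self.free = p
```

After a `dispose`, the next `new` returns the same instance without touching the allocator, which is what makes the pattern attractive in real-time code.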

108 Garbage Collection, Recap
GC kinds: compacting copying incremental generational Helpers: write barrier read barrier forwarding pointer pointer rotation Algorithms: Ref-Count Mark & Sweep Stop & Copy Mark & Copy (.NET) Baker's Treadmill Dijkstra / Lamport Steele

109 Distributed Object Systems Overview
Goals object-based approach hide communication details Advantages more space more CPU redundancy locality Problems Coherency ensure that same object definition is used Interoperability serialization type consistency type mapping Object life-time distributed garbage collection System-Software WS 04/05

110 Distributed Object Systems Architecture
Naming Service Client Server Call Context Application Proxy Stub Impl. Message Object Broker Object Broker Impl. Skeleton IDL-Compiler IDL IDL-Compiler System-Software WS 04/05

111 Remote Procedure Invocation Overview
Problem: send structured information from A to B; A and B may have different memory layouts ("endianness"). How is 0x1234 (2 bytes) represented in memory? Big-Endian: MSB before LSB, bytes 12 34 (network byte ordering); IBM, Motorola, Sparc. Little-Endian: LSB before MSB, bytes 34 12 (little end first); VAX, Intel.
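The two layouts of 0x1234 can be checked with Python's standard `struct` module, where `>` selects big-endian, `<` little-endian and `!` network order. A minimal sketch:

```python
import struct

value = 0x1234

big = struct.pack(">H", value)       # MSB first
little = struct.pack("<H", value)    # LSB first
network = struct.pack("!H", value)   # network byte order

# A receiver must know the sender's convention to decode correctly.
misread = struct.unpack("<H", big)[0]
```

Decoding the big-endian bytes as little-endian yields 0x3412, which is exactly the kind of error a wire format has to rule out.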

112 Definitions
Serialization: conversion of an object's instance into a byte stream. Deserialization: conversion of a stream of bytes into an object's instance. Marshaling: gathering and conversion (may require serialization) to an appropriate format of all relevant data, e.g., in a remote method call; includes details like name representation.

113 Remote Procedure Invocation Protocol Overview
Protocols: RPC + XDR (Sun): RFC 1014, June 1987; RFC 1057, June 1988. IIOP / CORBA (OMG): V2.0, February 1997; V3.0, August 2002. SOAP / XML (W3C): V1.1, May 2000. ... XDR Type System (big-endian representation): [unsigned] Integer (32-bit); [unsigned] Hyper-Integer (64-bit); Enumeration (unsigned int); Boolean (Enum); Float / Double (IEEE 32/64-bit); Opaque; String; Array (fixed + variable size); Structure; Union; Void
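A few of the XDR types can be encoded by hand to show the big-endian, 4-byte-aligned layout. This is a small sketch covering only integers, hypers, booleans and ASCII strings; the function names are made up for the example.

```python
import struct


def xdr_int(n):
    """32-bit signed integer, big-endian."""
    return struct.pack(">i", n)


def xdr_hyper(n):
    """64-bit signed integer, big-endian."""
    return struct.pack(">q", n)


def xdr_bool(b):
    """Booleans are encoded like the integers 0 and 1."""
    return xdr_int(1 if b else 0)


def xdr_string(s):
    """Length prefix, then the bytes, zero-padded to a multiple of 4."""
    data = s.encode("ascii")
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad
```

Every item occupies a multiple of four bytes, so a decoder can walk the stream without alignment fix-ups.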

114 Remote Procedure Invocation RPC Protocol
Remote Procedure Call Marshalling of procedure parameters Message Format Authentication Naming Client Server PROCEDURE P(a, b, c) pack parameters send message to server await response unpack response Server unpack parameters find procedure invoke pack response send response P(a, b, c) System-Software WS 04/05
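The client and server steps can be sketched in a single process. The assumptions here are loud ones: JSON stands in for the marshalling format, a direct function call stands in for the network, and the names `server_handle`, `call` and `PROCEDURES` are invented for the example.

```python
import json


def multiply(a, b):
    return a * b


# Server side: table used to find the procedure by name.
PROCEDURES = {"multiply": multiply}


def server_handle(request_bytes):
    """Unpack parameters, find procedure, invoke, pack response."""
    request = json.loads(request_bytes.decode("utf-8"))
    procedure = PROCEDURES[request["proc"]]
    result = procedure(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


def call(proc, *args, transport=server_handle):
    """Client stub: pack parameters, send, await response, unpack."""
    request = json.dumps({"proc": proc, "args": list(args)}).encode("utf-8")
    response = transport(request)      # a real stub would use a socket here
    return json.loads(response.decode("utf-8"))["result"]
```

To the caller, `call("multiply", 6, 7)` looks like a local procedure call; only bytes cross the boundary between stub and server.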

115 Distributed Object Systems Details
References vs. Values: the client receives a reference to a remote object; data values are copied to the client for efficiency reasons. Decide whether an object is sent as a reference or as a value: by value: serializable (Java, .NET), valuetype (CORBA); by reference: MarshalByRefObject (.NET), java.rmi.Remote (Java), default (CORBA). Object creation: server creates objects; client creates objects; server can return references. Object instances: one object for all requests; one object for each request; one object per proxy. Conversation state: stateless; stateful.

116 Distributed Object Systems Distr. Object Systems vs. Service Architecture
Dist. Object System: object-oriented model; object references; stateful / stateless; tight coupling; internal communication between an application's tiers. Service Architecture: OO-model / RPC; service references; stateless; loose coupling; external communication between applications.

117 Distributed Object Systems Distr. Object Systems vs. Service Architecture
components / objects (distributed object system) stateful and stateless conversation transactions coupling Remoting RMI tight CORBA Web Services services remote procedure calls stateless conversation (session?) message loose environment homogeneous heterogeneous

118 Distributed Object Systems Type Mapping
Interoperability Type System Type System 1 Type System 2 Possible Types Possible Types Possible Types Mappable Types Mappable Types Interop Subset System-Software WS 04/05

119 Distributed Object Systems Type Mapping, Example
Java Type System CORBA Type System CLS Type System char enum enum double double double char wchar char union union union custom implementation custom implementation System-Software WS 04/05

120 Distributed Object Systems Examples
Standards OMG CORBA IIOP Web Services SOAP Frameworks Java RMI (Sun) DCOM (Microsoft) .NET Remoting (Microsoft) IIOP.NET System-Software WS 04/05

121 Distributed Object Systems CORBA
Common Object Request Broker Architecture                                                                                                       Client Application Object Remote Architecture Object Skeleton Client Stub Interface Repository Implementation Repository Object Adaptor CORBA Runtime CORBA Runtime Client Server „Object-Bus“ ORB ORB GIOP/IIOP TCP/IP Socket System-Software WS 04/05

122 Distributed Object Systems CORBA
CORBA is a standard from the OMG (Object Management Group): Common Object Request Broker Architecture. CORBA is useful for... building distributed object systems; heterogeneous environments; tight integration. CORBA defines... an object-oriented type system; an interface definition language (IDL); an object request broker (ORB); an inter-ORB protocol (IIOP) to serialize data and marshal method invocations; language mappings for Java, C++, Ada, COBOL, Smalltalk, Lisp, Python; ... and many additional standards and interfaces for distributed security, transactions, ...

123 Distributed Object Systems CORBA
Basic Types integers 16-, 32-, 64bit integers (signed and unsigned) IEEE floating point 32-, 64-bit and extended-precision numbers fixed point char, string 8bit and wide boolean opaque (8bit), any enumerations Compound Types struct union sequence (variable-length array) array (fixed-length) interface concrete (pass-by-reference) abstract (pure definition) value type pass-by-value abstract (no state) Operations in / out / inout parameters raises Attributes System-Software WS 04/05

124 Distributed Object Systems CORBA / General Inter-ORB Protocol (GIOP)
CDR (Common Data Representation): variable byte ordering; aligned primitive types; all CORBA types supported. IIOP (Internet IOP): GIOP over TCP/IP; defines the Interoperable Object Reference (IOR): host, port, key. Message Format: defined in IDL. Messages: Request, Reply; CancelRequest; LocateRequest, LocateReply; CloseConnection; MessageError; Fragment. Byte ordering flag. Connection Management: request multiplexing; asymmetrical / bidirectional connections.

125 Distributed Object Systems CORBA / GIOP Message in IDL
module GIOP { struct Version { octet major; octet minor; }; enum MsgType_1_0 { Request, Reply, CancelRequest, LocateRequest, LocateReply, CloseConnection, MessageError }; struct MessageHeader { char Magic[4]; Version GIOP_Version; boolean byte_order; octet message_type; unsigned long message_size; }; }; // end of module GIOP
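The 12-byte GIOP 1.0 header can be packed and unpacked with `struct`. This sketch assumes the standard field order (magic, version, byte-order flag, one-octet message type, four-byte message size); the helper names are invented, and only the first two message type codes are used.

```python
import struct

GIOP_REQUEST = 0   # first entry of the message type enumeration
GIOP_REPLY = 1     # second entry


def pack_header(msg_type, body_size, little_endian=False, version=(1, 0)):
    """Build the 12-byte header; the flag tells the reader how to decode the size."""
    size_format = "<I" if little_endian else ">I"
    return (b"GIOP"
            + bytes(version)
            + bytes([1 if little_endian else 0, msg_type])
            + struct.pack(size_format, body_size))


def unpack_header(header):
    """Return (version, little_endian, msg_type, body_size)."""
    if len(header) != 12 or header[:4] != b"GIOP":
        raise ValueError("not a GIOP header")
    major, minor, flag, msg_type = header[4:8]
    little_endian = bool(flag & 1)
    (size,) = struct.unpack("<I" if little_endian else ">I", header[8:12])
    return (major, minor), little_endian, msg_type, size
```

Because the sender states its own byte order in the header, the receiver swaps bytes only when the two sides differ, which is what "variable byte ordering" buys.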

126 Distributed Object Systems CORBA Services
System-level services defined in IDL Provide functionality required by most applications Naming Service Allows local or remote objects to be located by name Given a name, returns an object reference Hierarchical directory-like naming tree Allows getting initial reference of object Event Service Allows objects to dynamically register interest in an event Object will be notified when event occurs Push and pull models ... and more Trader, LifeCycle, Persistence, Transaction, Security System-Software WS 04/05

127 Distributed Object Systems WebServices
Service-oriented architecture Rely on existing protocols SOAP messaging protocol WSDL service description protocol UDDI service location protocol Web Services SOAP HTTP TCP/IP System-Software WS 04/05

128 Distributed Object Systems SOAP
Simple Object Access Protocol communication protocol XML-based describes object values XML Schemas as interface description language basic types string, boolean, decimal, float, double, duration, datetime, time, date, hexBinary, base64Binary, URI, Qname, NOTATION structured types list, union SOAP Message SOAP Envelope SOAP Header SOAP Body Method Call packed as structure messages are self-contained no external object references System-Software WS 04/05

129 Distributed Object Systems SOAP Message
SOAP Envelope SOAP Header SOAP Body Example float Multiply(float a, float b); System-Software WS 04/05

130 Distributed Object Systems SOAP Example (Request)
POST /quickstart/aspplus/samples/services/MathService/CS/MathService.asmx HTTP/1.1 Host: samples.gotdotnet.com Content-Type: text/xml; charset=utf-8 Content-Length: length SOAPAction: "http://tempuri.org/Multiply" <?xml version="1.0" encoding="utf-8"?> <soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"> <soap:Body> <Multiply xmlns="http://tempuri.org/"> <a>float</a> <b>float</b> </Multiply> </soap:Body> </soap:Envelope> System-Software WS 04/05
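A request body of this shape can be built and read back with Python's standard `xml.etree.ElementTree`. The sketch covers only the envelope, not the HTTP headers; the namespaces are the ones shown in the request, while `build_request` and `parse_request` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SERVICE_NS = "http://tempuri.org/"


def build_request(a, b):
    """Wrap a Multiply(a, b) call in a SOAP envelope and body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}Multiply")
    ET.SubElement(call, f"{{{SERVICE_NS}}}a").text = repr(float(a))
    ET.SubElement(call, f"{{{SERVICE_NS}}}b").text = repr(float(b))
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text):
    """Server side: extract the two parameters from the envelope."""
    root = ET.fromstring(xml_text)
    call = root.find(f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}Multiply")
    a = float(call.find(f"{{{SERVICE_NS}}}a").text)
    b = float(call.find(f"{{{SERVICE_NS}}}b").text)
    return a, b
```

The method call travels as a plain structure of values, so the message is self-contained and carries no object references.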

131 Distributed Object Systems SOAP Example (Answer)
HTTP/1.1 200 OK Content-Type: text/xml; charset=utf-8 Content-Length: length <?xml version="1.0" encoding="utf-8"?> <soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"> <soap:Body> <MultiplyResponse xmlns="http://tempuri.org/"> <MultiplyResult>float</MultiplyResult> </MultiplyResponse> </soap:Body> </soap:Envelope>

132 Distributed Object Systems SOAP Example (Service Description-1)
<?xml version="1.0" encoding="utf-8"?> <definitions ....> <types> <s:schema elementFormDefault="qualified" targetNamespace="http://tempuri.org/"> <s:element name="Multiply"> <s:complexType><s:sequence> <s:element minOccurs="1" maxOccurs="1" name="a" type="s:float" /> <s:element minOccurs="1" maxOccurs="1" name="b" type="s:float" /> </s:sequence></s:complexType> </s:element> </s:schema> </types> <message name="MultiplySoapIn"> <part name="parameters" element="s0:Multiply" /> </message> System-Software WS 04/05

133 Distributed Object Systems SOAP Example (Service Description-2)
<binding name="MathServiceSoap" type="s0:MathServiceSoap"> <soap:binding transport="http://schemas.xmlsoap.org/soap/http" style="document" /> <operation name="Multiply"> <soap:operation soapAction="http://tempuri.org/Multiply" style="document" /> <input><soap:body use="literal" /></input> <output><soap:body use="literal" /></output> </operation> </binding> <service name="MathService"> <port name="MathServiceSoap" binding="s0:MathServiceSoap"> <soap:address location="http://samples.gotdotnet.com/quickstart/aspplus/samples/services/MathService/CS/MathService.asmx" /> </port> </service> </definitions> System-Software WS 04/05

134 Distributed Object Systems WebServices
Comments XML (easily readable) system independent standard stateless (encouraged design pattern) bloated big messages (but easily compressed) requires expensive parsing Constraints Services no object references server-activated servant Goes over HTTP requires web server System-Software WS 04/05

135 Distributed Object Systems WebService Future
Use SOAP-Header to store additional information about message or context Many standards to come... WS-Security WS-Policy WS-SecurityPolicy WS-Trust WS-SecureConversation WS-Addressing System-Software WS 04/05

136 Distributed Object Systems Java RMI
Java Remote Method Invocation Object Client Application Lookup Register Lookup Register Object Stub Object Stub Remote Architecture Remote References Remote References Transport Layer Transport Layer Client Server Network TCP/IP Socket System-Software WS 04/05

137 Distributed Object Systems Java RMI Details
Framework supports various implementations e.g. RMI/IIOP mapping limited to the Java type system, workarounds needed uses reflection to inspect objects System-Software WS 04/05

138 Distributed Object-Systems Low-Level Details: Java RMI/IIOP
Common Type-System restricted CORBA Marshalling name mapping remote objects only references Interface Description Language (IDL) java to IDL mapping Message representation Underlying protocol IIOP (CORBA) System-Software WS 04/05

139 Distributed Object Systems Microsoft DCOM
Distributed Common Object Model Client Application Object Object Proxy Remote Architecture Object Stub COM Runtime SCMs and Registration COM Runtime SCM SCM Client Registry Registry Server OXID Resolver RPC Channel Network Ping Server System-Software WS 04/05

140 Distributed Object Systems Microsoft .NET Remoting
new Instance() or Activator.GetObject(...) Client Transparent Proxy Channel Channel Instance Application Domain Boundary ObjRef IChannelInfo ChannelInfo; IEnvoyInfo EnvoyInfo; IRemotingTypeInfo TypeInfo; string URI; Network

141 Distributed Object Systems Microsoft .NET Remoting
Client Proxy Dispatcher Instance channel channel Message Chan.Sink(s) Message Chan.Sink(s) custom operations Instance s = new Instance(); s.DoSomething(); Formatter Formatter serialize object Stream Chan.Sink(s) Stream Chan.Sink(s) custom operations Transport Sink Transport Sink handle communication Network System-Software WS 04/05

142 Distributed Object Systems Microsoft .NET Remoting
Activation client one instance per activation server / Singleton one instance of object server / SingleCall one instance per call Leases (Object Lifetimes) renew lease on call set maximal object lifetime Serialization SOAP Warning: non-standard types, only for .NET use binary user defined Transport TCP HTTP System-Software WS 04/05

143 Distributed Object Systems Microsoft .NET Remoting (Object Marshalling)
MarshalByRefObjects: remoted by reference; the client receives an ObjRef object, which is a "pointer" to the original object. [Serializable]: all fields of the instance are cloned to the client; [NonSerialized] fields are ignored. ISerializable: the object has a method to define its own serialization. [Figure: by reference, AppDomain 2 holds a proxy built from a serialized ObjRef to Obj in AppDomain 1; by value, AppDomain 2 holds a copy Obj' built from the serialized fields fld1 ... fldn]

144 Distributed Object Systems Microsoft .NET Remoting, Activation
“stateless” Server-Side Activation (Well-Known Objects) Singleton Objects only one instance is allocated to process all requests SingleCall Objects one instance per call is allocated Client-Side Activation Client Activated Objects the client allocates and controls the object on the server “stateful” System-Software WS 04/05

145 Distributed Object Systems Microsoft .NET Remoting, Limitations
Server-Activated Objects object configuration limited to the default constructor Client-Activated Objects class must be instantiated, no access over interface class hierarchy limitations use Factory Pattern to get interface reference to allow parametrization of the constructor Furthermore... interface information is lost when passing an object reference to another machine no control over the channel which channel is used which peer is allowed to connect System-Software WS 04/05

146 Distributed Object Systems Case Study: IIOP.NET
Open-source project based on an ETH diploma thesis. IIOP.NET (marketing): „Provide seamless interoperability between .NET and CORBA-based peers (including J2EE)“. IIOP.NET (technical): a .NET remoting channel implementing the CORBA IIOP protocol; a compiler to make .NET stubs from IDL definitions; an IDL definition generator from .NET metadata.

147 Distributed Object Systems Case Study: IIOP.NET
IIOP rather than SOAP: transparent reuse of existing servers; tight coupling; object-level granularity; efficiency. Runtime: standard .NET remoting channel for IIOP (transport sink, formatter, type-mapper). Build tools: IDL → CLS compiler; CLS → IDL generator. [Figure: client and server (J2EE, Java, CORBA objects) connected through binary IIOP; the Java, IDL and CLS type systems overlap in the IDL-mappable types, whose intersection is the interop subset]

148 Distributed Object Systems Case Study: IIOP.NET, Interoperability
Application This is what we want Services Distributed Transaction Coordinator, Active Directory, … Conversation Activation model (EJB, MBR), global naming, distributed garbage collection, conversational state,… Contextual Data Interception Layer SessionID, TransactionID, cultureID, logical threadID … Message Format RPC, IIOP, HTTP, SOAP, proprietary binary format, messages, unknown data (exceptions), encryption Communication Protocol exchange raw data (bytes) across machines Data Model give structure to data type system / object model Message Format define encoding, serialization, allowed messages (invocation, ...) Contextual Data additional information (context), hidden information support for upper layers application infrastructure Conversation interaction model, message workflow Services Common external services Application the whole universe! Data Model Type system, mapping and conversion issues Communication Protocols TCP/UDP, Byte stream, point-to-point communication System-Software WS 04/05

149 Distributed Object Systems Case Study: IIOP.NET, Granularity
Service Service Component Component Object Object Object Object Component Object Object Service Component Object Granularity Message-based Interface, Stateless Strongly-typed Interface, Stateless or Stateful Implementation Dependency, Stateful Coupling, Interaction System-Software WS 04/05

150 Distributed Object Systems Case Study: IIOP.NET
[Figure: timeline of IIOP.NET releases 1.0 to 1.6, with the 1st and 2nd articles marked]

151 Distributed Object Systems Case Study: IIOP.NET, Performance
Test Case: WebSphere as server Clients IBM SOAP-RPC Web Services IBM Java RMI/IIOP IIOP.NET Response time receiving 100 beans from server WS: 4.0 seconds IIOP.NET: 0.5 seconds when sending many more beans, WS are then 200% slower than IIOP.NET Source: posted on IIOP.NET forum System-Software WS 04/05

152 Processes and Threads Introduction
CPU as resource, provide abstraction to it Allow multiprogramming pseudo-parallelism (single-processors) real parallelism (multi-processors) Required abstractions multiple activities -- execution of instructions protection of resources synchronization of activities Topics coroutines processes threads scheduling fairness starvation synchronization deadlocks System-Software WS 04/05

153 Processes and Threads Multithreading
Stack 2 Stack 1 Call a.run Call b.Q Call b.q Return b.q Call c.R Return c.R Return b.Q Return a.run Call c.run Call d.Q Call d.q Return d.q Call e.R Return e.R Return d.Q Return c.run e.R b.q Thread 1 Thread 2 b.q d.Q b.Q c.run a.run 1 2 time 2 2 1 1 time time System-Software WS 04/05

154 Processes and Threads Coroutines (1)
each activity has its own stack, address-space is shared explicit context switch (stack only) under programmer‘s control uses Transfer call switch to another coroutine System-Software WS 04/05
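Explicit transfer of control can be imitated in Python with generators. This is a loose analogy rather than the stack-switching mechanism of the lecture: each generator keeps its own frame, and a small dispatcher (`run`, a made-up name) performs the switch that Transfer does directly.

```python
def ping(log):
    for i in range(3):
        log.append(f"ping {i}")
        yield "pong"          # plays the role of Transfer(pong)


def pong(log):
    for i in range(3):
        log.append(f"pong {i}")
        yield "ping"          # plays the role of Transfer(ping)


def run(start, coroutines):
    """Resume whichever coroutine the current one names, until one finishes."""
    current = start
    while True:
        try:
            current = next(coroutines[current])
        except StopIteration:
            return


log = []
run("ping", {"ping": ping(log), "pong": pong(log)})
```

Switches happen only at the `yield` statements, so scheduling stays entirely under the programmer's control, as with Transfer.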

155 Processes and Threads Coroutines (2)
Subroutines: Call, Call, Return, Return. Coroutines: Start, Transfer, Start, Transfer.

156 Processes and Threads Coroutines (3)
TYPE Coroutine = POINTER TO RECORD FP: LONGINT; stack: POINTER TO ARRAY OF SYSTEM.BYTE; END; VAR cur: Coroutine; (* Current Coroutine *) PROCEDURE Transfer*(to: Coroutine); BEGIN SYSTEM.GETREG(SYSTEM.EBP, cur.FP); cur := to; SYSTEM.PUTREG(SYSTEM.EBP, cur.FP); END Transfer; PUSH EBP SUB ESP, 4 save FP restore FP MOV ESP, EBP POP EBP RET 4 System-Software WS 04/05

157 Processes and Threads Coroutines (4)
to’ SP FP PC’ FP’ locals stackQ stackP Q pcx FP” Transfer(Q) to’ SP FP PC’ FP’ locals stackQ stackP to’ SP FP PC’ FP’ locals stackQ stackP Q pcx FP” FP := Q.FP FP stackQ stackP Q pcx locals FP” SP return jump at PC’ System-Software WS 04/05

158 Processes and Threads Coroutines (5)
Current stack: current execution state. All other stacks: the top PAF (procedure activation frame) contains the last Transfer call. Start: create a stack with a fake Transfer-like PAF. PROCEDURE Start(C: Coroutine; size: LONGINT); VAR tos: LONGINT; BEGIN NEW(C.stack, size); tos := SYSTEM.ADR(C.stack[0])+LEN(C.stack); SYSTEM.PUT(tos-4, 0); (* par = null *) SYSTEM.PUT(tos-8, 0); (* PC’ = null, not allowed to return *) SYSTEM.PUT(tos-12, 0); (* FP’ *) C.FP := tos-12; (* saved FP of the new coroutine *) END Start;

159 Processes and Threads Problems caused by multitasking
Concurrent access to resources protection limit access to a resource synchronization synchronize task with resource state or other task Concurrent access to CPU task priorities scheduling One problem’s solution is another problem’s cause.... deadlocks fairness deadlines / periodicity constraints System-Software WS 04/05

160 Processes and Threads Protection: Mutual Exclusion
Mutual Exclusion: only one activity is allowed to access one resource at a time. Disable interrupts (single CPU only, avoids switches). Locks: flag (lock taken / lock free); spin lock (uses busy waiting); exclusive lock; read-write lock (multiple readers, one writer).

161 Processes and Threads Protection: Monitor
Shared resources as Monitor resources are passive objects execution of critical sections inside monitor is mutually exclusive Global Monitor Lock Shared Monitor Lock for read-access (optional) monitor as a special module [original version (Hoare, Brinch Hansen)] object instance as monitor method and code block granularity Java, C#, Active Oberon, ... Resource task P task Q acquire acquire release release System-Software WS 04/05

162 Processes and Threads Protection
one waiting queue per resource is required Simplistic implementation with coroutines Non-reentrant lock (no recursion allowed) PROCEDURE Acquire(r: Resource); BEGIN IF r.taken THEN InsertList(r.waiting, cur); SwitchToNextRoutine() ELSE r.taken := TRUE END END Acquire; PROCEDURE Release(r: Resource); BEGIN next := GetFromList(r.waiting); IF next # NIL THEN InsertList(ready , next); Transfer(GetNextTask()); ELSE r.taken := FALSE END END Release; System-Software WS 04/05

163 Processes and Threads Protection
Shared resource as Process synchronization during communication Communicating Sequential Processes (CSP) C.A.R. Hoare (1978) Model of communication „Rendez-vous“ between two processes P!x (send x to process P) Q?y (ask y from process Q) Used in Ada, Occam task P task Q task P task Q P?x Q!z Q!z P?x System-Software WS 04/05

164 Processes and Threads Protection
Some variations on the theme.... Reentrant Locks Readers / Writers one writer or multiple readers allowed Binary Semaphores one activity can get the resource Generic Semaphores N activities are allowed to get the resource System-Software WS 04/05
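A generic semaphore that admits N activities can be tried out with Python's `threading.Semaphore`. The sketch below is only a demonstration harness: the worker body, the sleep and the counters are assumptions added to make the limit observable.

```python
import threading
import time

N = 2                                   # activities allowed at the same time
slots = threading.Semaphore(N)          # generic (counting) semaphore
state_lock = threading.Lock()           # binary lock protecting the counters
active = 0
max_active = 0
completed = 0


def worker():
    global active, max_active, completed
    with slots:                         # blocks while N activities hold the resource
        with state_lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.01)                # pretend to use the resource
        with state_lock:
            active -= 1
            completed += 1


threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With N = 1 the semaphore degenerates to the binary case: only one activity can get the resource.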

165 Processes and Threads Synchronization
Wait on a condition / state Signals with Send/Wait Methods Require cooperation from all processes Example: Producer/Consumer with conditions nonempty/nonfull Semantic of Send Send-and-Pass vs. Send-and-Continue Generic system-handled conditions (Active Oberon) AWAIT(x > y); Wait on partner process CSP System-Software WS 04/05

166 Processes and Threads Synchronization: Implementation Example
Process list double-chained list of all coroutines cur points to current (running) coroutine each signal has a LIFO list ready C1 C4 C2 link s C3 C5 cur Signal System-Software WS 04/05

167 Processes and Threads Synchronization: Implementation Example
Terminate: cur.next.prev := cur.prev; cur.prev.next := cur.next; Schedule. Schedule: prev := cur; WHILE ~cur.ready & cur.next # prev DO cur := cur.next END; IF cur.ready THEN Transfer(cur) ELSE (*deadlock*) END

168 Processes and Threads Synchronization: Implementation Example
Init(s) s := NIL Wait(s) cur.link := s; s := cur; cur.ready := FALSE; Schedule (*to next ready from cur*) Send(s) IF s # NIL THEN (*send-and-pass*) cur := s; s.ready := TRUE; s := s.link END; Schedule (*to next ready from cur*) System-Software WS 04/05

169 Processes and Threads Active Oberon: Bounded Buffer
Buffer* = OBJECT VAR data: ARRAY BufLen OF INTEGER; in, out: LONGINT; (* Put - insert element into the buffer *) PROCEDURE Put* (i: INTEGER); BEGIN {EXCLUSIVE} (*AWAIT ~full *) AWAIT ((in + 1) MOD BufLen # out); data[in] := i; in := (in + 1) MOD BufLen END Put; (* Get - get element from the buffer *) PROCEDURE Get* (VAR i: INTEGER); BEGIN {EXCLUSIVE} (*AWAIT ~empty *) AWAIT (in # out); i := data[out]; out := (out + 1) MOD BufLen END Get; PROCEDURE & Init; BEGIN in := 0; out := 0; END Init; END Buffer; System-Software WS 04/05
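For comparison, the same ring buffer can be written in Python, where the conditions must be signalled explicitly instead of being re-evaluated by the system as with AWAIT. This is a sketch using `threading.Condition`; the `transfer` helper is a hypothetical test harness.

```python
import threading


class BoundedBuffer:
    def __init__(self, size):
        self.data = [None] * size
        self.inp = 0                       # "in" is a Python keyword
        self.out = 0
        self.cond = threading.Condition()  # plays the role of {EXCLUSIVE} plus AWAIT

    def put(self, item):
        with self.cond:
            # AWAIT ~full
            self.cond.wait_for(lambda: (self.inp + 1) % len(self.data) != self.out)
            self.data[self.inp] = item
            self.inp = (self.inp + 1) % len(self.data)
            self.cond.notify_all()         # explicit signal, unlike Active Oberon

    def get(self):
        with self.cond:
            # AWAIT ~empty
            self.cond.wait_for(lambda: self.inp != self.out)
            item = self.data[self.out]
            self.out = (self.out + 1) % len(self.data)
            self.cond.notify_all()
            return item


def transfer(items, size=4):
    """Send a list of items through the buffer from a producer to a consumer thread."""
    buf = BoundedBuffer(size)
    received = []
    producer = threading.Thread(target=lambda: [buf.put(x) for x in items])
    consumer = threading.Thread(target=lambda: received.extend(buf.get() for _ in items))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received
```

As in the Oberon version, one slot stays unused so that `in = out` can mean "empty" without an extra counter.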

170 Processes and Threads CSP: Bounded Buffer (I)
[bounded_buffer || producer || consumer] producer :: *[<produce item> bounded_buffer ! item; ] consumer :: *[bounded_buffer ? item; <consume item> ] (Source: Geoff Coulson, Lancaster University)

171 Processes and Threads CSP: Bounded Buffer (II)
buffer: (0..9) item; in, out: integer; in := 0; out := 0; *[ in < out+10; producer ? buffer(in mod 10) -> in := in + 1; || out < in; consumer ! buffer(out mod 10) -> out := out + 1; ] System-Software WS 04/05

172 Processes and Threads Process State
Process states: Running: actually using the CPU. Ready: waiting for a CPU. Blocked: unable to run, waiting for an external event. Process state transitions: (1) Running → Blocked: wait for external event; (2) Running → Ready and (3) Ready → Running: system scheduler; (4) Blocked → Ready: external event happens.

173 Processes and Threads Process State (Active Oberon)
Active Oberon provides monitor-like object protection conditions Condition are checked by the system. No explicit help or knowledge from user is required (no x.Signal) Running Awaiting Object Awaiting Condition Ready System-Software WS 04/05

174 Activities Program (static concept) ≠ Process (dynamic)
Processes, jobs, tasks, threads (differences later) program code context: program counter (PC) and registers stack pointer state [new] running waiting ready [terminated] stack data section (heap) System-Software WS 04/05

175 Processes vs. Threads Process or job (heavyweight)
code address space processor state private data (stack+registers) can have multiple threads Thread (lightweight) shared code shared address space processor state private data (stack+registers) Process: task or activity on a computer Kernel CPU System-Software WS 04/05

176 Processes vs. Threads: Example
HEAP 1 HEAP 2 HEAP STACK 1 STACK 2 STACK 1 STACK 2 PROC instr PROC 1 instr PROC 2 instr System-Software WS 04/05

177 Multitasking Programmed events that can cause a task switch
protection (locks) acquire release synchronization wait on a condition send a signal (send-and-pass) System events that can cause a task switch voluntary switch (“yield”, task termination) process with higher priority becomes available consumption of the allowed time quantum synchronous asynchronous task preemption System-Software WS 04/05

178 Preemption Assign each process a time-quantum (normally in the order of tens of ms) Asynchronous task switches can happen at any time! task can be in the middle of a computation save whole CPU state (registers, flags, ...) Perform switch on resource conflict on synchronization request on timer-interrupt (time-quantum is over) System-Software WS 04/05

179 Context switch
Scheduler invocation: preemption → interrupt; cooperation → explicit call. Operations: store the process state (PC, regs, …); choose the next process (strategy); [accounting]; restore the state of the next process (regs, SP, PC, …); jump to the restored PC. A context switch is usually expensive: 1–1000 µs, depending on the system and the number of processes; hardware optimizations exist (e.g., multiple sets of registers: SPARC, DECSYSTEM-20).

180 Scheduling algorithms
Three categories of environments: batch systems (e.g., VPP, DOS) usually non-preemptive (i.e., task is not stopped by scheduler, only synchronous switches) interactive systems (UNIX, Windows, Mac OS) cooperative or preemptive no task allowed to have the CPU forever real-time systems (PathWorks, RT Linux) timing constraints (deadlines, periodicity) System-Software WS 04/05

181 Scheduling Performance
CPU utilization Throughput number of jobs per time unit minimize context switch penalty Turnaround time = exit time - arrival time execution, wait, I/O Response time = start time - request time Waiting time (I/O, waiting, …) Fairness System-Software WS 04/05

182 Scheduling algorithm goals
All systems Fairness give every task a chance Policy enforcement Balance keep all subsystems busy Interactive systems Response time respond quickly Proportionality meet user’s expectations Batch systems Throughput maximize number of jobs Turnaround time minimize time in system CPU utilization keep CPU busy Real-time systems Meet deadlines avoid losing data Predictability avoid degradation Hard- vs. soft-real-time systems System-Software WS 04/05

183 Batch Scheduling Algorithms
Choose task to run (task is usually not preempted) First Come First Serve (FCFS) fair, may cause long waiting times Shortest Job First (SJF) requires knowledge about job length Longest Response Ratio response ratio = (time in the system / CPU time) depends on the waiting time Highest Priority First with or without preemption Mixed the priority is adjusted dynamically (time in queue, length, priority, …) ETH-VPP is a batch system! Which algorithm does it use? System-Software WS 04/05

184 Preemptive Scheduling Algorithms
Time sharing: each task has a predefined time quantum. Round-Robin: schedule the next task on the ready list. Quantum choice: small: may cause frequent switches; big: may cause slow response. Implicit assumption: all tasks have the same importance. [Figure: circular ready list P1, P2, P3, P4 with a "next" pointer]

185 Preemptive Scheduling Algorithms
Priority scheduling process with highest priority is scheduled first Variants multilevel queue scheduling one list per priority, use round-robin on list dynamic priorities proportional to time in system inversely proportional to part of quantum used make time quantum proportional to priority System-Software WS 04/05

186 Real-Time Scheduling Algorithms
Tasks need to meet their deadlines! Task cost is known (or should be). Two kinds of tasks: aperiodic and periodic. Reservation: the scheduler decides whether the system has enough resources for the task. Algorithms: Rate Monotonic Scheduling: assign static priorities (priority proportional to frequency). Earliest Deadline First: the task with the closest deadline is chosen.
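The reservation step and the two priority rules can be sketched as follows. The admission tests are not from the slides: they are the classical utilization bounds for independent periodic tasks on one CPU with deadlines equal to periods (sufficient for rate-monotonic, exact for EDF). Tasks are assumed to be `(cost, period)` pairs and all function names are illustrative.

```python
def utilization(tasks):
    """Sum of cost / period over all periodic tasks."""
    return sum(cost / period for cost, period in tasks)


def rms_admissible(tasks):
    """Sufficient (not necessary) bound for rate-monotonic scheduling."""
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)


def edf_admissible(tasks):
    """EDF can schedule the set whenever the CPU is not overloaded."""
    return utilization(tasks) <= 1.0


def rms_priority_order(tasks):
    """Static priorities: the shortest period (highest frequency) comes first."""
    return sorted(tasks, key=lambda task: task[1])


def edf_pick(ready):
    """ready maps task name -> absolute deadline; choose the closest one."""
    return min(ready, key=ready.get)
```

A task set with utilization between the rate-monotonic bound and 1 is rejected by the first test but accepted by the second, which is one way to see the price of static priorities.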

187 Scheduling Algorithm Example
Situation: Tasks P1, P2, P3, P4 Arrive at time t = 0 Priority: P1 highest, P4 lowest Time to process: 10, 2, 5, 3 System-Software WS 04/05

188 Scheduling Algorithm Example
Highest Priority First: P1 runs 0–10, P2 10–12, P3 12–17, P4 17–20.

189 Scheduling Algorithm Example
Shortest Job First: P2 runs 0–2, P4 2–5, P3 5–10, P1 10–20.

190 Scheduling Algorithm Example
Timesharing with quantum = 2: round-robin in the order P1, P2, P3, P4 with switches every 2 time units; P2 completes at t = 4, P4 at t = 13, P3 at t = 16, P1 at t = 20.

191 Scheduling Algorithm Example
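Timesharing with quantum → 0: all four tasks run at 1/4 speed until P2 completes at t = 8; the remaining three run at 1/3 until P4 completes at t = 11; the remaining two run at 1/2 until P3 completes at t = 15; P1 then runs alone and completes at t = 20.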

192 Scheduling Algorithm Example: Results
Situation: Tasks P1, P2, P3, P4 arrive at time t = 0. Priority: P1 highest, P4 lowest. Time to process: 10, 2, 5, 3. Results (turnaround and response time) to be filled in for: Highest Priority First; Shortest Job First; Timesharing with quantum = 2; Timesharing with quantum → 0.
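The empty results table can be filled in by simulating the four policies. The sketch assumes that turnaround is completion time minus arrival time (arrival is 0 for all tasks), that response time is the time of the first run, and that round-robin visits the tasks in the order P1 to P4; the function names are invented.

```python
BURST = {"P1": 10, "P2": 2, "P3": 5, "P4": 3}   # listed from highest to lowest priority


def run_in_order(order):
    """Non-preemptive execution in the given order; returns (start, finish) maps."""
    t, start, finish = 0, {}, {}
    for name in order:
        start[name] = t
        t += BURST[name]
        finish[name] = t
    return start, finish


def round_robin(quantum):
    remaining, queue = dict(BURST), list(BURST)
    t, start, finish = 0, {}, {}
    while queue:
        name = queue.pop(0)
        start.setdefault(name, t)
        run = min(quantum, remaining[name])
        t += run
        remaining[name] -= run
        if remaining[name] > 0:
            queue.append(name)
        else:
            finish[name] = t
    return start, finish


def processor_sharing():
    """Limit quantum -> 0: n unfinished tasks each progress at rate 1/n."""
    remaining, t, finish = dict(BURST), 0, {}
    while remaining:
        shortest = min(remaining.values())
        t += shortest * len(remaining)
        for name in list(remaining):
            remaining[name] -= shortest
            if remaining[name] == 0:
                finish[name] = t
                del remaining[name]
    return {name: 0 for name in BURST}, finish


def mean(values):
    return sum(values.values()) / len(values)


schedules = {
    "Highest Priority First": run_in_order(list(BURST)),
    "Shortest Job First": run_in_order(sorted(BURST, key=BURST.get)),
    "Timesharing, quantum = 2": round_robin(2),
    "Timesharing, quantum -> 0": processor_sharing(),
}
# name -> (mean turnaround, mean response time)
results = {name: (mean(finish), mean(start)) for name, (start, finish) in schedules.items()}
```

Under these assumptions the mean turnaround and response times are 14.75 / 9.75 for Highest Priority First, 9.25 / 4.25 for Shortest Job First, 13.25 / 3.0 for quantum = 2 and 13.5 / 0 for quantum → 0: Shortest Job First minimizes turnaround, while timesharing minimizes response time.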

193 Scheduling Examples
UNIX: preemption; 32 priority levels (round robin); each second the priorities are recomputed (CPU usage, nice level, last run). BSD: similar; every 4th tick priorities are recomputed (usage estimation). Windows NT: “real time” priorities: fixed, may run forever; variable: dynamic priorities, preemption; idle: last choice (swap manager).

194 Scheduling Examples: Quantum & Priorities
Win2K: quantum = 20ms (professional) 120ms (user), configurable depending on type (I/O bound) BSD: quantum = 100ms priority = f(load,nice,timelast) Linux: quantum = quantum / 2 + priority f(quantum, nice) System-Software WS 04/05

195 Scheduling Problems Starvation: a task is never scheduled (although ready) → “fairness”. Deadlock: no task is ready (nor will any ever become ready) → detection + recovery, or avoidance.

196 Deadlock Conditions Coffman conditions for a deadlock (1971):
A holds R B wants R T Thread R1 T1 T2 R Resource R2 A wants S B holds S Coffman conditions for a deadlock (1971): Mutual exclusion Hold and wait No resource preemption Circular wait (cycle) System-Software WS 04/05

197 Deadlock Remedies Coarser lock granularity:
use a single lock for all resources (e.g., Linux “Big Kernel Lock”) Locking order: resources are ordered resource locking according to the resource order (ticketing) Two-phase-locking: try to acquire all the resources if successful, lock them; otherwise free them and try again System-Software WS 04/05
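The locking-order remedy can be sketched with POSIX mutexes. This is only an illustration: `lock_pair` and `unlock_pair` are hypothetical helper names, and ordering locks by address is one possible choice of global order, not necessarily the one meant on the slide.

```c
#include <pthread.h>
#include <stdint.h>

/* Acquire two mutexes in a fixed global order (here: by address), so that
 * two threads can never each hold one lock while waiting for the other. */
int lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) return pthread_mutex_lock(a);

    pthread_mutex_t *first  = (uintptr_t)a < (uintptr_t)b ? a : b;
    pthread_mutex_t *second = (first == a) ? b : a;

    int rc = pthread_mutex_lock(first);
    if (rc != 0) return rc;
    rc = pthread_mutex_lock(second);
    if (rc != 0) pthread_mutex_unlock(first);   /* do not keep a partial set */
    return rc;
}

/* Release both mutexes; the release order does not matter for deadlock freedom. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b) pthread_mutex_unlock(b);
}
```

Because every caller ends up locking the lower address first, `lock_pair(&r, &s)` and `lock_pair(&s, &r)` take the locks in the same order, which removes the circular-wait condition for this pair.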

198 Deadlock Detection, Prevention & Recovery
Deadlock detection: the system keeps a graph of locks and tries to detect cycles; this is time consuming, and the graph has to be kept consistent with the actual state. Deadlock prevention (avoidance): remove one of the four Coffman conditions → cycles. Recovery: kill processes and reclaim the resources; rollback: requires saving the states of the processes regularly.

199 Simple Deadlock Scenario
Example: Resources R, S, T. Tasks A, B, C require { R, S }, { S, T }, { T, R } respectively. Case 1: Sequential execution, no deadlock. A: +R +S -R -S; B: +S +T -S -T; C: +T +R -T -R.

200 Simple Deadlock Scenario
Case 2: Interleaving, deadlock A +R +S B +S +T +T +R C C R A T S B System-Software WS 04/05

201 Complex Deadlock Scenario
Case with 6 resources and 7 tasks graphical representation R A B C S D T E F U V is this a case of deadlock? W G System-Software WS 04/05

202 Deadlock Avoidance Strategy in Bluebottle
Processors Timers Each Kernel Module has a lock to protect its data When multiple locks are needed, acquire them according to the module hierarchy Threads Traps Interrupts Modules Module Hierarchy Blocks Memory Locks Configuration Module Lock System-Software WS 04/05

203 Priority Inversion A high-priority task can be blocked by a lower priority one. Example: High Medium Low waiting running ready System-Software WS 04/05

204 Priority Inversion Big problem for RTOS Solutions
priority inheritance low-priority task holding resource inherits priority of high-priority task wanting the resource priority ceilings each resource has a priority corresponding to the highest priority of the users +1 the priority of the resource is transferred to the locking process can be used instead of semaphores System-Software WS 04/05

205 Example: Mars Pathfinder (1996–1998)
VxWorks real-time system: preemptive, priorities Communication bus: shared resource (mutexes) Low priority task (short): meteorological data gathering Medium priority task (long): communication High priority: bus manager Detection: watchdog on bus activity  system reset Fix: activate priority inheritance via an uploaded on-the-fly patch (no memory protection). System-Software WS 04/05

206 Locking on Multiprocessor Machines
Real parallelism! Cannot “disable interrupts” like on single processor machines (could stop every task, but not efficient) Software solutions Peterson, Dekker, ... Hardware support bus locking atomic instructions (Test And Set, Compare And Swap) System-Software WS 04/05

207 Locking on multiprocessor machines
Test And Set: TAS s: IF s = 0 THEN s := 1 ELSE CC := TRUE END. Compare and Swap (Intel): CAS R1, R2, A (R1: expected value, R2: new value, A: address): IF R1 = M[A] THEN M[A] := R2; CC := TRUE ELSE R1 := M[A]; CC := FALSE END. These instructions are atomic even on multiprocessors! They usually do so by locking the data bus.
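For comparison, the same two primitives are available portably through C11 `<stdatomic.h>`. The sketch below is an assumption-laden illustration, not the lecture's code: `struct spinlock`, `spin_lock`, `sem_try_p` and `sem_v` are hypothetical names, and `sem_try_p` is a non-blocking variant that fails instead of queuing the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin lock built on test-and-set. */
struct spinlock { atomic_flag held; };

void spin_init(struct spinlock *l)
{
    atomic_flag_clear(&l->held);
}

void spin_lock(struct spinlock *l)
{
    /* test_and_set returns the previous value: true means the lock was taken. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* busy wait */
    }
}

void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Non-blocking P on a counting semaphore using compare-and-swap.
 * Returns true if a unit was taken, false if none was available. */
bool sem_try_p(atomic_int *s)
{
    int expected = atomic_load(s);
    while (expected > 0) {
        /* On failure, expected is refreshed with the current value,
         * which mirrors "R1 := M[A]" in the CAS description. */
        if (atomic_compare_exchange_weak(s, &expected, expected - 1))
            return true;
    }
    return false;
}

/* V: release one unit. */
void sem_v(atomic_int *s)
{
    atomic_fetch_add(s, 1);
}
```

The retry loop in `sem_try_p` has the same shape as the load, modify, CAS, branch-if-failed sequence used for generic semaphores: the update is only published if nobody changed the counter in between.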

208 Example: Semaphores on SMP
Counter s: available resources Binary Semaphores with TAS Spinning (busy wait) Try TAS s JMP Try CS TAS s JMP Queuing CS Blocking System-Software WS 04/05

209 Example: Semaphores on SMP
Counter s: available resources. Generic semaphores with CAS. P(S): { S := S - 1 } IF S < 0 THEN jump queuing END. V(S): { S := S + 1 } IF S <= 0 THEN jump dequeuing END. Usage: P(s); enter CS; exit CS; V(s).
P(s):  Load R1 ← s; TryP: MOVE R1 → R2; DEC R2; CAS R1, R2, s; BNE TryP; CMP R2, 0; BN Queuing
[CS]
V(s):  Load R1 ← s; TryV: MOVE R1 → R2; INC R2; CAS R1, R2, s; BNE TryV; CMP R2, 0; BNP Dequeuing

210 Spin-Locks: the Bluebottle/i386 way
PROCEDURE AcquireSpinTimeout(VAR locked: BOOLEAN);
CODE {SYSTEM.i386}
  MOV EBX, locked[EBP]  ; EBX := ADR(locked)
  MOV AL, 1             ; AL := 1
  CLI                   ; switch interrupts off before acquiring lock
test:
  XCHG [EBX], AL        ; set and read the lock atomically. LOCK prefix implicit.
  CMP AL, 1             ; was locked?
  JE test               ; retry
  ..
END AcquireSpinTimeout;
Legend: CLI = Clear Interrupt Flag; EBP = base pointer; XCHG = exchange; AL = low 8 bits of EAX (accumulator); EBX = base register. Simplified version.

211 Active Objects in Active Oberon
Z = OBJECT
  VAR myT: T; i: INTEGER;                (* State *)
  PROCEDURE & NEW (t: T);                (* Initializer *)
  BEGIN myT := t END NEW;
  PROCEDURE P (u: U; VAR v: V);          (* Method *)
  BEGIN { EXCLUSIVE } i := 1 END P;      (* Mutual Exclusion *)
BEGIN { ACTIVE }                         (* Object Activity *)
  BEGIN { EXCLUSIVE } AWAIT (i > 0); END (* Condition *)
END Z;

212 Active Oberon Runtime Structures
NIL CPUs Running 1 Lock Queue Wait Queue Awaiting Object Awaiting Assertion 2 Ready Queue Ready Ready System-Software WS 04/05

213 Active Oberon Implementation
NIL 7 NEW Create object; Create process; Set to ready Running 2 3 Awaiting Object 6 Awaiting Assertion Preempt Set to ready; Run next ready 6 1 1 Ready 7 END Run next ready 4 5 1 NIL System-Software WS 04/05

214 Active Oberon Implementation
Enter Monitor: IF monitor lock set THEN put me in monitor obj wait list; run next ready ELSE set monitor lock END. Exit Monitor: find first asserted x in wait list; IF x found THEN set x to ready ELSE find first x in obj wait list; IF x found THEN set x to ready ELSE clear monitor lock END END; run next ready. (State diagram: NIL, Ready, Running, Awaiting Object, Awaiting Assertion; transitions numbered 1 to 7.)

215 Active Oberon Implementation
NIL 7 Running 2 3 3 AWAIT Put me in monitor assn wait list; Call Exit monitor Awaiting Object 6 Awaiting Assertion 1 Ready 4 5 NIL System-Software WS 04/05

216 Case Study: Windows CE 3.0 Real-time constraints
Reaction time on events. Execution time. Threads with priorities and time quanta. Priorities: 0 (high), …, 255 (low). Time quanta in ms: default 100 ms; 0 → no quantum. Single processor. (Figure: threads p and q with q < p; end of quantum.)

217 Case Study: Windows CE 3.0 Interrupt Handling
ISR (Interrupt Service Routine) 1st level handling Kernel mode, uses kernel stack Installed at boot-time Creates event on-demand Preempted by ISR with higher priority IST (Interrupt Service Thread) 2nd level handling User mode Awaits events User Modus IST Event IRQ Event NK.EXE ISR Kernel Modus System-Software WS 04/05

218 Case Study: Windows CE 3.0 Synchronization on common resources:
Critical sections: enter, leave operations Semaphores and mutexes (binary semaphores) Synchronization is performed with system/library calls (they are not part of a language). Priority inversion avoidance priority inheritance (thread inherits priority of task wanting the resource) CS [ ] [ ] [ ] System-Software WS 04/05

219 Case Study: Java Activities are mapped to threads (no processes)
Synchronization in the language locks signals Threads provided by the library Scheduling depends on the JVM System-Software WS 04/05

220 Case Study: Java
public class MyThread extends Thread {
    public void run() {
        System.out.println("Running");
    }
    public static void main(String[] arguments) {
        MyThread t = new MyThread();
        t.start();
    }
}

221 Case Study: Java
public class MyThread implements Runnable {
    public void run() {
        System.out.println("Running");
    }
    public static void main(String[] arguments) {
        Thread t = new Thread(new MyThread());
        t.start();
    }
}

222 Case Study: Java Protection with monitor-like objects
with method granularity public synchronized void someMethod() with statement granularity synchronized(anObject) { ... } Synchronization with signals wait() (with optional time-out) notify() / notifyAll() (“send and continue” pattern) System-Software WS 04/05

223 Case Study: Java
private Object o;
public synchronized void consume() {
    while (o == null) {
        try { wait(); } catch (InterruptedException e) {}
    }
    use(o);
    o = null;
    notifyAll();
}
public synchronized void produce(Object p) {
    while (o != null) {
        try { wait(); } catch (InterruptedException e) {}
    }
    o = p;
    notifyAll();
}

224 Case Study: POSIX Threads
Standard interface for threads in C. Mostly UNIX, possible on Windows. Provided by a library (libpthread) and not part of the language. IEEE POSIX 1003.1c standard (1995). Various implementations (both user and kernel level).

225 Case Study: POSIX Threads
#include <pthread.h>
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
void *run(void *arg) {
    pthread_mutex_lock(&m);
    // critical section
    pthread_mutex_unlock(&m);
    pthread_exit(NULL);
}
int main(int argc, char *argv[]) {
    pthread_t t;
    pthread_create(&t, NULL, run, NULL);
    pthread_join(t, NULL);  // wait for the thread before exiting
    return 0;
}

226 File Systems

227 File Systems - Overview
Hardware File abstraction File organization File systems Oberon Unix FAT Distributed file systems NFS AFS Special topics Error recovery ISAM B* Trees System-Software WS 04/05

228 Hardware: the ATA Bus ATA / IDE (1986) ATA-2 / EIDE ATA-4 / ATAPI
Advanced Technology Attachment Integrated Drive Electronics ATA-2 / EIDE ATA-4 / ATAPI ATA Packet Interface (SCSI command set) ATA-5 UDMA 66 ATA-6 UDMA 100 SATA ATA-7 UDMA 133 bus with 2 devices master / slave low-level interface head / cylinder / sector support for LBA (logical block addressing) PIO mode read byte by byte through hardware port DMA mode use DMA transfer System-Software WS 04/05

229 Hardware: the SCSI Bus SCSI: Small Computer Systems Interface SCSI-2
Fast SCSI Wide SCSI SCSI-3 Bus with 8 devices wide: 16 / 32 devices bus arbitration disconnected mode Device kinds direct access CD-ROM ... Block-oriented access read-block, write-block Transfer mode selection asynchronous (hand-shake) synchronous (period / offset) System-Software WS 04/05

230 Hardware: Hard Disk Organization Addressing cylinder (c) head (h)
sector (s) Addressing sector (c, h, s) block (LBA) track (cylinder) sector surface (head) rotation axis System-Software WS 04/05

231 Hardware: Example Current disk example: ATA-100 250GB
512 bytes per sector (488·10^6 sectors), 8 MB cache, 8.9 ms average seek time, 7200 rpm

232 Hardware: Hard Disk Improvements
Interleaving optimize sequential sector access Read-ahead Caching Sector defect management cylinder 1 5 4 2 6 7 3 System-Software WS 04/05

233 Hardware: Disk Scheduling
Disk controllers have a queue of pending requests: type: read or write block number: translated into the (h,c,s)-tuple memory address (where to copy from and to) amount to be transferred (byte or block count) System-Software WS 04/05

234 Hardware: Disk Scheduling
Performance: minimize head movements, maximize throughput Scheduling is now in the hardware First-come, first-served (FCFS) Shortest-seek-time-first (SSTF) SCAN (elevator) & C-SCAN LOOK & C-LOOK System-Software WS 04/05

235 Hardware: Disk Scheduling
Example (head position, track number): queue = 31, 72, 4, 18, 147, 193, 199, 153, 114, 72 System-Software WS 04/05
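To compare policies on this request queue, one can total the head movement each policy causes. The sketch below covers FCFS and SSTF only; the function names are hypothetical, and the initial head position is a parameter because the transcript does not preserve it.

```c
#include <stdlib.h>
#include <stdbool.h>

/* Total head movement (in tracks) when requests are served in arrival order. */
long fcfs_distance(int head, const int *queue, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += labs((long)queue[i] - head);
        head = queue[i];
    }
    return total;
}

/* Total head movement when the pending request closest to the head is
 * always served next. Ties go to the request that arrived first.
 * Returns -1 if the bookkeeping array cannot be allocated. */
long sstf_distance(int head, const int *queue, size_t n)
{
    if (n == 0) return 0;
    bool *done = calloc(n, sizeof *done);
    if (done == NULL) return -1;

    long total = 0;
    for (size_t served = 0; served < n; served++) {
        size_t best = n;                      /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            if (done[i]) continue;
            if (best == n ||
                labs((long)queue[i] - head) < labs((long)queue[best] - head))
                best = i;
        }
        total += labs((long)queue[best] - head);
        head = queue[best];
        done[best] = true;
    }
    free(done);
    return total;
}
```

Assuming, purely for illustration, that the head starts at track 50, the queue above costs 450 tracks of movement under FCFS and 241 under SSTF.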

236 Hardware: Disk Scheduling

237 Abstractions Block: array of sectors some systems call them “clusters”
user configured reduces address space increases access speed causes internal fragmentation Disk: array of sectors File: stream of bytes sequential access random access stored on disk mapping byte to block block allocation management System-Software WS 04/05

238 Abstraction Layers Abstractions Implementations FAT File System Oberon
OpenFile, WriteFile, ReadFile, SeekFile, CloseFile ISO 9660 Volume ext3 ReadBlock, WriteBlock AllocateBlock, FreeBlock NTFS Disk ATA driver ReadSector, WriteSector SCSI driver System-Software WS 04/05

239 File Organization How can we map groups of blocks into files?
How do we manage free space? How can I jump to a certain location? Operation: read n bytes at position p. System-Software WS 04/05

240 File Organization: Contiguous Allocation
File is a group of contiguous blocks Simple management Fast transfers IBM MVS (mainframe) start length System-Software WS 04/05

241 File Organization: Contiguous Allocation
external fragmentation allocation how much space does a file need? first fit, best fit, …? file growth (error? move? extensions?) preallocation: internal fragmentation start length System-Software WS 04/05

242 File Organization: Linked Allocation
File is a linked list of blocks no external fragmentation no growth problems Problems sequential files only (positioning requires traversal) space for pointers (1TB, 5B addr., 1% with 512B blocks) reliability (lost pointers) start System-Software WS 04/05

243 File Organization: Linked Allocation
Clusters: series of contiguous blocks. Faster (fewer jumps), less space wasted for pointers, but internal fragmentation.

244 File Organization: Linked Allocation
Pointer tables the list of pointers is stored in a separate table can be cached usually is stored twice (reliability) FAT (MS-DOS, OS/2, Windows, solid-state memory) start System-Software WS 04/05

245 File Organization: Indexed Allocation
Index with block addresses Fast access for random-access files No external fragmentation Problems high management overhead limited file size (depending on the index structure) pointer overhead file System-Software WS 04/05

246 File Organization: Indexed Allocation
Variation: linked list of indexes Advantage: no file size limitation Disadvantage: Index lookup requires sequential traversal of index list file System-Software WS 04/05

247 File Organization: Indexed Allocation
multi-level indexes (index of indexes) UNIX Advantage: fast index lookup Disadvantage: limited file size file System-Software WS 04/05

248 File Organization: Indexed Allocation
Example: blocks 2 KB, address 4 B → 512 entries per index block. First level index block: 512 entries · 2 KB = 1 MB. Second level index block: 512 · 512 entries · 2 KB = 0.5 GB.
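The arithmetic behind this example generalises to any block size, address size and index depth. A minimal sketch, with a hypothetical function name, assuming every index block is completely filled with block addresses:

```c
#include <stdint.h>

/* Bytes addressable through an index tree of the given depth.
 * levels = 1: one index block pointing to data blocks;
 * levels = 2: one index block pointing to index blocks; and so on. */
uint64_t indexed_capacity(uint64_t block_size, uint64_t addr_size, unsigned levels)
{
    uint64_t entries = block_size / addr_size;  /* addresses per index block */
    uint64_t blocks = 1;
    for (unsigned i = 0; i < levels; i++)
        blocks *= entries;                      /* data blocks reachable */
    return blocks * block_size;
}
```

For 2 KB blocks and 4-byte addresses this yields 1 MB for one level and 512 MB (0.5 GB) for two levels.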

249 Free Space Management Bitmap (e.g., HFS) Linked lists Grouping
Bitmap: bit vector to mark free blocks; simple; needs caching. Linked lists: list of free blocks (similar to linked allocation). Grouping: free blocks contain n addresses of free blocks (similar to multilevel indexing). Counting: list of 2-tuples of series of free blocks (start, length).

250 Case Study: Oberon File System
Disk module: controller driver block management FileDir module: maps files to locations implemented with B-trees garbage collection (files) the directory is the root set anonymous (nonregistered) files are collected Files module: allows user operations (read, create, write, …) access is performed through riders Files FileDir Disk System-Software WS 04/05

251 Case Study: Oberon File System
Characteristics Block size = 1KB File organization multilevel index: 64 direct 12 1st level indirect 672 data bytes in file header Block allocation allocation table created at boot-time (partition GC) no collection at run-time (partition fills up!) designed to optimize small files System-Software WS 04/05

252 Case Study: Oberon File System
Block = 1KB d d d d 64 blocks d 1 d d d i1 d d d d 63 i2 d d d d d d 75 i1 d d (672B) 12 index blocks with 256 data blocks each System-Software WS 04/05

253 Case Study: Oberon File System
Free block management: bitmap Garbage collection at startup 8 16 24 startup / GC 8 16 24 allocate 16,17 8 16 24 allocate 19 8 16 24 System-Software WS 04/05

254 Case Study: Oberon File System
Internals “Rider”: current read or write position Buffer (cache) for consistency (each file sees the write operations on it) File Handle Rider f R R R “Hint” Buffer f System-Software WS 04/05

255 Case Study: Oberon RAM Disk
File = POINTER TO Header; Index = POINTER TO Sector; Rider = RECORD eof: BOOLEAN; file: File; pos: LONGINT; adr: LONGINT; END; Header = RECORD mark: LONGINT; name: FileDir.Name; len, time, date: LONGINT ext: ARRAY 12 OF Index; sec: ARRAY 64 OF SectorTable; header primary sector table points to sectors ext table index sector 0 points to sectors index sector 1 points to sectors System-Software WS 04/05

256 Case Study: Oberon RAM Disk
PROCEDURE Read(VAR r: Rider; VAR x: SYSTEM.BYTE);
  VAR m: INTEGER;
BEGIN
  IF r.pos < r.file.len THEN
    SYSTEM.GET(r.adr, x); INC(r.adr); INC(r.pos);
    IF r.adr MOD SS = 0 THEN (* end of sector *)
      m := SHORT(r.pos DIV SS);
      IF m < STS THEN r.adr := r.file.sec[m]
      ELSE r.adr := r.file.ext[(m-STS) DIV XS].x[(m-STS) MOD XS]
      END
    END
  ELSE
    x := 0X; r.eof := TRUE
  END
END Read;
SS = Sector Size, STS = Sector Table Size, XS = Index Size

257 Case Study: Oberon RAM Disk
PROCEDURE Write(VAR r: Rider; x: SYSTEM.BYTE); VAR k, m, n: INTEGER; ix: LONGINT; BEGIN IF r.pos < r.file.len THEN m := SHORT(r.pos DIV SS); INC(r.pos); IF m < STS THEN r.adr := r.file.sec[m] ELSE r.adr := r.file.ext[(m-STS) DIV XS].x[(m-STS) MOD XS] END .... END; SYSTEM.PUT(r.adr, x); INC(r.adr); END Write; overwrite System-Software WS 04/05

258 Case Study: Oberon RAM Disk
IF r.pos < r.file.len THEN .... ELSE IF r.adr MOD SS = 0 THEN m := SHORT(r.pos DIV SS); IF m < STS THEN Kernel.AllocSector(0, r.adr); r.file.sec[m] := r.adr ELSE n := (m-STS) DIV XS; k := (m-STS) MOD XS; IF k = 0 THEN Kernel.AllocSector(0, ix); r.file.ext[n] := SYSTEM.VAL(Index, ix) END; Kernel.AllocSector(0, r.adr); r.file.ext[n].x[k] := r.adr INC(r.pos); r.file.len := r.pos SYSTEM.PUT(r.adr, x); INC(r.adr); expand System-Software WS 04/05

259 Case Study: UNIX, inodes
File system: files and directories (files with a special content) A file is represented by an inode Inode: file owner file type regular / directory / special access permissions access time reference count (links) table of contents file size Inode table of contents 10 (12) direct blocks 1 indirect block 1 double indirect block 1 triple indirect block System-Software WS 04/05

260 Case Study: UNIX, inodes
type access refc info 1 i1 d i1 d i1 d 10 11 12 i2 i1 d i2 i1 d i2 i1 d inode i3 i2 i1 d i3 i2 i1 d i3 i2 i1 d System-Software WS 04/05

261 Case Study: UNIX, directories
Directories are normal files with a special content. The data part contains a list of (inode, name) entries. Every directory has two special entries: "." (the directory itself) and ".." (the parent directory).

262 Case Study: UNIX, inodes
type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 type: file blocks: 42, 103 owner: root ref count: 1 inodes disk block block 132 block 406 block 42 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 .tcshrc 6 mbox data block 103 data inode # name System-Software WS 04/05

263 Case Study: UNIX, soft and hard links
Hard links: two directory entries with the same inode number; each file has a reference counter (e.g., 42 file, 42 hardlink). Soft links: the directory entry points to a special file with the path of the linked file (e.g., 42 file, 43 softlink: inode 43 points to a special file with the path of file).

264 Case Study: UNIX, hard links
inode 2 inode 3 inode 5 type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 type: file blocks: 42, 103 owner: root ref count: 2 inodes disk block block 132 block 406 block 42 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 mails 5 mbox data block 103 data System-Software WS 04/05

265 Case Study: UNIX, soft links
inode 2 inode 3 type: dir blocks: 132 owner: root ref count: 1 type: dir blocks: 406 owner: root ref count: 1 inode 5 block 42 type: file blocks: 42 owner: root ref count: 1 data block 132 block 406 / 2 . 2 .. 4 bin 3 root /root/ 3 . 2 .. 5 mbox 6 mails inode 6 block 43 type: file blocks: 43 owner: root ref count: 1 /root/mbox System-Software WS 04/05

266 Case Study: UNIX, Volume Layout
A volume (partition) contains boot block bootstrap code super block size max file free space inodes data blocks boot block super block inode list data blocks System-Software WS 04/05

267 Case Study: UNIX, Functions
Core functions bread read block bwrite write block iget get inode from disk iput put inode to disk bmap map (inode, offset) to disk block namei convert path name to inode System-Software WS 04/05

268 Case Study: UNIX, namei
namei(path)
  if (absolute path) inode = root;
  else inode = current directory inode;
  while (more path to process) {
    read directory (inode);
    if match(directory, name component) {
      inode = directory[name component];
      iget(inode);
    } else {
      return no inode;
    }
  }
  return inode;
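The lookup loop can be tried out on a tiny in-memory directory table. Everything below is a hypothetical sketch: the structures, the `ROOT_INODE` constant and the table contents are assumptions modelled on the inode example from the earlier slide (root directory is inode 2, /root is inode 3), and no disk access or `iget` is involved.

```c
#include <string.h>

#define ROOT_INODE 2

struct dir_entry { const char *name; int inode; };
struct directory { int inode; const struct dir_entry *entries; int count; };

/* Hypothetical directory contents: "/" is inode 2, "/root" is inode 3. */
static const struct dir_entry root_entries[] = {
    { ".", 2 }, { "..", 2 }, { "bin", 4 }, { "root", 3 },
};
static const struct dir_entry home_entries[] = {
    { ".", 3 }, { "..", 2 }, { ".tcshrc", 5 }, { "mbox", 6 },
};
static const struct directory directories[] = {
    { 2, root_entries, 4 },
    { 3, home_entries, 4 },
};

/* Return the directory stored under the given inode, or NULL if that
 * inode is not a known directory. */
static const struct directory *find_directory(int inode)
{
    for (size_t i = 0; i < sizeof directories / sizeof directories[0]; i++)
        if (directories[i].inode == inode) return &directories[i];
    return NULL;
}

/* Resolve a path to an inode number, or -1 if a component is missing. */
int namei(const char *path, int cwd_inode)
{
    int inode = (path[0] == '/') ? ROOT_INODE : cwd_inode;
    const char *p = path;

    while (*p != '\0') {
        while (*p == '/') p++;                 /* skip separators */
        if (*p == '\0') break;
        size_t len = strcspn(p, "/");          /* length of this component */

        const struct directory *dir = find_directory(inode);
        if (dir == NULL) return -1;            /* not a directory */

        int next = -1;
        for (int i = 0; i < dir->count; i++) {
            if (strlen(dir->entries[i].name) == len &&
                strncmp(dir->entries[i].name, p, len) == 0) {
                next = dir->entries[i].inode;
                break;
            }
        }
        if (next < 0) return -1;               /* no such entry */
        inode = next;
        p += len;
    }
    return inode;
}
```

With this table, `namei("/root/mbox", ROOT_INODE)` walks inode 2, then inode 3, and returns 6; the same loop handles relative paths by starting from the current directory's inode.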

269 Case Study: FAT FATnn: nn corresponds to the FAT size in bits
FAT12, FAT16, FAT32 used by MS-DOS and Windows for disks and floppies Volume Layout boot block FAT1 FAT2 root directory data System-Software WS 04/05

270 Case Study: FAT, Example
1 2 EOF 3 4 12 5 FREE 6 9 7 BAD 8 11 10 13 disk size 6 9 11 10 File 1: 4 12 File 2: 8 3 File 3: System-Software WS 04/05
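Reading a file under FAT means following the chain of table entries from the file's first block until an end marker. The sketch below is illustrative only: the marker values are simplified, the function name is hypothetical, and `example_fat` is a reconstruction that encodes the three chains listed on the slide (6, 9, 11, 10; 4, 12; 8, 3); entries those chains do not determine are filled arbitrarily.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_FREE 0x0000u
#define FAT_BAD  0xFFF7u
#define FAT_EOF  0xFFFFu

/* Reconstructed example table: index = block number, value = next block. */
static const uint16_t example_fat[14] = {
    FAT_FREE, FAT_FREE, FAT_FREE,   /* 0..2: not determined by the chains */
    FAT_EOF,                        /* 3: end of file 3 */
    12,                             /* 4: file 2 continues at 12 */
    FAT_FREE,                       /* 5 */
    9,                              /* 6: file 1 continues at 9 */
    FAT_BAD,                        /* 7 */
    3,                              /* 8: file 3 continues at 3 */
    11,                             /* 9 */
    FAT_EOF,                        /* 10: end of file 1 */
    10,                             /* 11 */
    FAT_EOF,                        /* 12: end of file 2 */
    FAT_FREE,                       /* 13: arbitrary */
};

/* Follow a chain starting at block first. Stores up to out_cap block
 * numbers in out and returns the chain length, or -1 if the chain is
 * broken (out of range, free or bad entry, or a cycle). */
int fat_chain(const uint16_t *fat, size_t fat_len, uint16_t first,
              uint16_t *out, size_t out_cap)
{
    size_t n = 0;
    uint16_t cur = first;

    while (cur != FAT_EOF) {
        if (cur == FAT_FREE || cur == FAT_BAD || cur >= fat_len) return -1;
        if (n >= fat_len) return -1;            /* longer than the table: cycle */
        if (n < out_cap) out[n] = cur;
        n++;
        cur = fat[cur];                         /* next block of the file */
    }
    return (int)n;
}
```

Starting at block 6, the walk visits 6, 9, 11 and 10 and then stops at the end marker. Random access to block k of a file needs k such steps, which is why the table is worth caching.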

271 Case Study: FAT, Directory
Information about files is kept in the directory File name (8) Extension (3) A D V S H R Reserved (10) Time (2) Date (2) First block (2) File size (4) System-Software WS 04/05

272 Case Study: FAT, Max. Partition Size
Block size   FAT-12   FAT-16    FAT-32
0.5 KB       2 MB     -         -
1 KB         4 MB     -         -
2 KB         8 MB     128 MB    -
4 KB         16 MB    256 MB    1 TB
8 KB         -        512 MB    2 TB
16 KB        -        1024 MB   -
32 KB        -        2048 MB   -

273 File System Mounting More than one volume mounted in the same directory tree. afs ethz.ch home corti / usr bin floppy mnt dos cd System-Software WS 04/05

274 Virtual File System Support for several file systems
disk based network special VFS: unifies the system calls Mirrors the traditional UNIX file system model Applications VFS ext3 FAT NFS AFS proc pts ext3 FAT NFS AFS proc pts System-Software WS 04/05

275 File System Mounting Each file system type has a method table
System calls are indirect function calls through the method table. Common interface (open, write, readdir, lock, …). Each file is associated with a method table.

276 File System Mounting: Special Files
Devices disks memory USB devices serial ports Kernel communication (e.g., proc) Uniform interface (open, close, read, write) Uniform protection (user, groups) System-Software WS 04/05

277 File Systems: Protection
Restrict: access (who), operations (what), management FAT: flags in the directory e.g., read only execution based on name UNIX: restrictions in inodes based on users and groups operations: read, write, execute directories: manage files not so flexible VMS: access lists list of users and rights per file System-Software WS 04/05

278 Distributed File Systems

279 Distributed File Systems (DFS)
Clients, servers and storage are dispersed among machines in a distributed system. Client Client Client Server Server Server Client Client Client Server System-Software WS 04/05

280 Overview Naming (dynamic):
location transparency: file name does not reveal the file location location independence: file name does not change when storage is moved Caching (efficiency) write-through delayed-write write-on-close Consistency client-initiated: poll server for changes server-initiated: notify clients System-Software WS 04/05

281 Naming Simple approaches: Transparent Global name structure
file is identified by a host, path pair Ibis (host:path) SMB (\\host\path) Transparent remote directory are mounted in the local file system not uniform (the mount point is not defined) NFS (/mnt/home, /home/) SMB (\\host\path mounted on Z:) Global name structure uniform and transparent naming AFS (/afs/cell/path) System-Software WS 04/05

282 Caching Reduces network and disk load Consistency problems
Granularity: How much? Big/small chunks of data? Entire files? Big: +hit ratio, +hit penalty, +consistency problems Location: memory: +diskless stations, +speed disk: +cheaper, +persistent hybrid Space consumption on the clients System-Software WS 04/05

283 Caching Policies: write-through: +reliability, -performance (cache is effective only for read operations) delayed-write: +write speed, +unnecessary writes eliminated, -reliability write when the cache is full (+performance, -long time in the cache) regular intervals write-on-close System-Software WS 04/05

284 Consistency Is my cached copy up-to-date? Client-initiated approach:
the client performs validity checks when? open/fixed intervals/every access Server-initiated approach: the server keeps track of cached files (parts) notifies the clients when conflicts are detected should the server allow conflicts? System-Software WS 04/05

285 Stateless and Stateful Servers
Stateful: the server keeps track of each accessed file session IDs (e.g., identifying an inode on the server) fast simple requests caches fewer disk accesses read ahead volatile server crash: rebuild structures (recovery protocol) client crash: orphan detection and elimination System-Software WS 04/05

286 Stateless and Stateful Servers
Stateless: each request is self-contained request: file and position complex requests need for uniform low-level naming scheme (to avoid name translations) need idempotent operations (same results if repeated) absolute byte counts No locking possible System-Software WS 04/05

287 File Replication A file can be present on failure independent machines
Naming scheme manages the mapping same high-level name different low-level names Transparency Consistency System-Software WS 04/05

288 Distributed File-Systems (mainstream)
NFS: Network File System (Sun) AFS: Andrew File System (CMU) SMB: Server Message Block (Microsoft) NCFS: Network Computer FS (Oberon) System-Software WS 04/05

289 Network File System (NFS)
UNIX - based (Sun) mount file system from another machine into local directory stateless (no open/close) uses UDP to communicate based on RPC and XDR (External Data Representation) every operation is a remote procedure call known problems: no caching no disconnected mode efficiency security: IP based System-Software WS 04/05

290 mount -t nfs server:/home /home
NFS: Example exports /home/ client(rw) etc reali / home corti server client mount -t nfs server:/home /home etc etc reali / / home home corti System-Software WS 04/05

291 NFS No special servers (each machine can act as a server and as a client) Cascading mounts are allowed mount -t nfs server1:/home /home mount -t nfs server2:/projects/corti /home/corti/projects Limited scalability (limited number of exports) System-Software WS 04/05

292 NFS: Stateless Protocol
Each request contains a unique file identifier and an absolute offset No concurrency control (locking has to be performed by the applications) Committed information is assumed to be on disk (the server cannot cache writes) System-Software WS 04/05

293 Network File System (NFS)
System call layer Virtual file system layer Virtual file system layer Local file system NFS client NFS server Local file system RPC / XDR RPC / XDR network (UDP) System-Software WS 04/05

294 Remote Procedure Invocation: Overview
Problem: send structured information from A to B. A and B may have different memory layouts: byte order problems. How is 0x1234 (2 bytes) represented in memory? Big-endian (network byte ordering): MSB before LSB, bytes 12 34; IBM, Motorola, SPARC. Little-endian ("little end first"): LSB before MSB, bytes 34 12; VAX, Intel.
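A common way around the byte-order problem is to never copy multi-byte values to the wire directly, and instead to write each byte explicitly in network order. A minimal sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Store a 16-bit value in network byte order (big-endian: MSB first). */
void put_u16_be(uint8_t *buf, uint16_t v)
{
    buf[0] = (uint8_t)(v >> 8);     /* most significant byte */
    buf[1] = (uint8_t)(v & 0xFF);   /* least significant byte */
}

/* Read a 16-bit value stored in network byte order. */
uint16_t get_u16_be(const uint8_t *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}

/* Returns 1 if this machine stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint16_t probe = 0x1234;
    const uint8_t *bytes = (const uint8_t *)&probe;
    return bytes[0] == 0x34;
}
```

Because the shifts operate on values rather than on memory, `put_u16_be` and `get_u16_be` behave identically on big-endian and little-endian hosts; only `host_is_little_endian` depends on the machine.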

295 Marshalling / Serialization
Marshalling: packing one or more data items into a buffer using a standard representation Presentation layer (OSI) RPC + XDR (Sun) RFC 1014, June 1987 RFC 1057, June 1988 IIOP / CORBA (OMG) V2.0, February 1997 V3.0, August 2002 SOAP / XML (W3C) V1.1, May 2000 XDR Type System [unsigned] integer (32-bit) [unsigned] hyper-integer (64-bit) enumeration (unsigned int) boolean (enum) float / double (IEEE 32/64-bit) opaque string array (fix + variable size) structure union void System-Software WS 04/05
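As an illustration of marshalling, the sketch below packs two XDR types into a byte buffer by hand: an unsigned integer (4 bytes, big-endian) and a string (4-byte length, the bytes, then zero padding up to a multiple of four bytes). The function names are hypothetical and are not the routines of Sun's XDR library; the caller is assumed to provide a large enough buffer.

```c
#include <stdint.h>
#include <string.h>

/* Append an XDR unsigned integer: 4 bytes, most significant byte first.
 * Returns the number of bytes written. */
size_t xdr_put_u32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
    return 4;
}

/* Append an XDR string: length word, characters, zero padding to a
 * 4-byte boundary. Returns the number of bytes written. */
size_t xdr_put_string(uint8_t *buf, const char *s)
{
    size_t len = strlen(s);
    size_t padded = (len + 3) & ~(size_t)3;    /* round up to multiple of 4 */
    size_t off = xdr_put_u32(buf, (uint32_t)len);

    memcpy(buf + off, s, len);
    memset(buf + off + len, 0, padded - len);  /* pad bytes are zero */
    return off + padded;
}
```

Encoding "hello" this way takes 12 bytes: 4 for the length, 5 for the characters and 3 for padding, so every item starts on a 4-byte boundary regardless of the sender's architecture.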

296 RPC Protocol Remote procedure call Marshalling of procedure parameters
Message format Authentication Naming Client Server procedure P(a, b, c) pack parameters send message to server await response unpack response Server unpack parameters find procedure invoke pack response send response P(a, b, c) System-Software WS 04/05

297 NFS Client RPC - protocol Server lookup lookup read read write write

298 NFS Efficiency Stateless protocols are inherently slow Caching:
e.g., directory lookup Caching: file blocks (data) file attributes (inodes) read-ahead delayed write tradeoff between speed and consistency It is possible that two machines see different data System-Software WS 04/05

299 NFS: Security Exports based on IP addresses Data is not encrypted
low security low granularity Data is not encrypted Permissions based on user and group ID uniform naming needed (e.g., NIS) System-Software WS 04/05

300 Andrew File System (AFS)
1983 CMU (later IBM, now open source). Scalable (>5000 workstations): network divided in clusters (cells). Client/user mobility (files are accessible from everywhere). Security: encrypted communication (Kerberos). Protection: access control lists. Heterogeneity: clear interface to the server.

301 Andrew File System (AFS)
Server provides a cell. World-wide addressing scheme (name → cell). Client caches a whole file. Server synchronization on file open and close. AFS is efficient: low network overhead. Stateful: consistency is implemented with callbacks; callback = client is in sync with server; on store, server changes the callbacks.

302 AFS: Logical View Private Space / Shared Space usr bin afs dir dir
Mount Point vol Volume bin f System-Software WS 04/05

303 AFS: Physical View network client server ethz.ch epfl.ch cell cmu.edu

304 AFS Client RPC - protocol Server open open Cache read write close

305 AFS: Consistency Interaction only when opening and closing files.
Writes are not visible on other machines before a close. Clients assume that cached files are up-to-date. Servers keep track of caching by the clients (callbacks) clients are notified in case of changes System-Software WS 04/05

306 AFS: Kerberos Kerberos (Cerberus: three-headed dog guarding Hades)
authentication accounting audit Needham-Schroeder shared key protocol Distributed AFS: communication is encrypted System-Software WS 04/05

307 AFS: Protection Access lists:
%> fs listacl thesis Access list for thesis is Normal rights: system:anyuser l trg rlidwk corti rlidwka It’s possible to allow (or deny) access to users or customized groups Restriction on: read, write, lookup, insert, administer, lock and delete. Supports UNIX control bits. System-Software WS 04/05

308 Network Fallacies The Eight Fallacies of Distributed Computing (Peter Deutsch) The network is reliable Latency is zero Bandwidth is infinite The network is secure The network topology doesn’t change There is one administrator Transport cost is zero The network is homogeneous System-Software WS 04/05

309 General Principles (Satyanarayanan)
From DFSs we learned the following lessons: we should try to move computations to the clients use caching whenever possible special files (e.g., temporary) can be specially treated. make scalable systems. trust the fewest possible entities batch work if possible System-Software WS 04/05

310 Kernel Structure

311 Introduction Kernel performs “dangerous” operations
page table mapping scheduling Kernel must be protected against malign user code access to other processes’ data increasing own processes’ priority Kernel must have more rights than user code Solution: distinguish between kernel mode and user mode access kernel through system calls the system calls define the interface to the kernel System-Software WS 04/05

312 Kernel Protection application application application application
system calls application application drivers memory manager file systems System-Software WS 04/05

313 Kernel Protection Means: hardware support separate address spaces
privileged instructions supervisor mode separate address spaces user process has no access to kernel structures access memory / functions through symbolic names user has no access to hardware System-Software WS 04/05

314 Kernel Protection Privileged instructions in user mode generate a trap
Mode switch: interrupts gated calls (user generated sw interrupt calls) Parameters: stack registers Examples: Intel x86: 4 protection levels (code/segment attribute), interrupt PowerPC: 2 levels (CPU attribute), special instruction System-Software WS 04/05

315 Linux System Calls (Intel)
System calls are wrapped in libraries (e.g., libc). The library function writes the parameters in registers (up to 5 parameters) or on the stack (more than 5 parameters), writes the system call number in EAX, and calls int 0x80. The kernel jumps to the corresponding function in sys_call_table.

316 Linux System Calls Examples: pid_t fork(void): creates a child process
ssize_t write(int fd, const void *buf, size_t count): writes count bytes from buf to fd int kill(pid_t pid, int sig): send signal to a process int gettimeofday(struct timeval *tv, struct timezone *tz): gets the current time int open(const char *pathname, int flags): opens a file int ioctl(int d, int request, ...): manipulates special devices System-Software WS 04/05

317 Windows System Calls Layered system: system call must be performed by a wrapper (NTDLL.DLL). The system call position in the KiSystemServiceTable is not known (depends on the build) call WriteFile() application NtWriteFile() KERNEL32.DLL int 0x2e NTDLL.DLL KiSystem Service Table System-Software WS 04/05

318 Kernel Design: API vs. System Calls
Linux system-calls are clearly specified (POSIX standard) system-calls do not change about 100 calls Windows system-calls are hidden only Win32 API is published Win32 is standard “thousands” of API calls, still growing some API calls are handled in user space More than one API: POSIX OS/2 System-Software WS 04/05

319 Protection and SMP What happens when two processes (on two CPUs) enter kernel mode at the same time? Big kernel lock: not allowed (OpenBSD, NetBSD). Fine-grained locks in the kernel (FreeBSD 5, Linux 2.6). (Diagram: a process on CPU 1 and a process on CPU 2 each issue int 0x80.)

320 Kernel Structure
monolithic kernel: big mess, no structure, one big block, fast. Examples: MS-DOS (no protection), original UNIX
micro-kernel (AIX, OS X)
layered system: layer n uses functions from layer n-1. Example: OS/2 (some degree of layering)
virtual machine: defines an artificial environment for programs
client-server: tiny communication microkernel to access various services

321 Monolithic Kernels Monolithic Micro-kernel user-level applications
scheduler signal handling file system swapping virtual memory scheduler signal handling file system swapping virtual memory terminal controllers device drivers memory controllers terminal controllers device drivers memory controllers System-Software WS 04/05

322 Layered Systems THE operating system
A layer uses only functions from below What goes where? Less efficient user programs buffering I/O console drivers memory management CPU scheduling hardware System-Software WS 04/05

323 Virtual Machines VM operating system (IBM)
slow and difficult to implement complete protection no sharing of resources useful for development and research compatibility procs procs procs virtual machine hardware System-Software WS 04/05

324 Design: Kernel or User Space?
Big monolithic kernel: fast (less switches) less protection Examples: HTTP server in the Linux kernel. graphic routines in Windows Modular and micro-kernels: structured more separation move code to user space less efficient more secure Example: user level drivers System-Software WS 04/05

325 Virtual Machines Machine specification in software
instruction set memory layout virtual devices .... JVM (Java Virtual Machine) .NET / Mono VMWare specified machine is a whole PC allows multiple PC environments on same machine IBM VM/370 System-Software WS 04/05

326 Case Study: JVM

327 Virtual Machines What is a machine? does something (...useful)
programmable concrete (hardware) What is a virtual machine? a machine that is not concrete a software emulation of a physical computing environment Reality is somewhat fuzzy! Is a Pentium-II a machine? Hardware and software are logically equivalent (A. Tanenbaum) instructions RISC Core decoder Op1 Op2 Op3 System-Software WS 04/05

328 Virtual Machine, Intermediate Language
Pascal P-Code (1975) stack-based processor strong type machine language compiler: one front end, many back ends UCSD Apple][ implementation, PDP 11, Z80 Modula M-Code (1980) high code density Lilith as microprogrammed virtual processor JVM – Java Virtual Machine (1995) Write Once – Run Everywhere interpreters, JIT compilers, Hot Spot Compiler Microsoft .NET (2000) language interoperability System-Software WS 04/05

329 JVM Case Study compiler (Java to bytecode)
interpreter, ahead-of-time compiler, JIT dynamic loading and linking exception Handling memory management, garbage collection OO model with single inheritance and interfaces system classes to provide OS-like implementation compiler class loader runtime system System-Software WS 04/05

330 JVM: Type System Primitive types Object types Single class inheritance
byte short int long float double char reference boolean mapped to int Object types classes interfaces arrays Single class inheritance Multiple interface implementation Arrays anonymous types subclasses of java.lang.Object System-Software WS 04/05

331 JVM: Java Byte-Code Memory access tload / tstore ttload / ttstore
tconst getfield / putfield getstatic / putstatic Operations tadd / tsub / tmul / tdiv tshifts Conversions f2i / i2f / i2l / .... dup / dup2 / dup_x1 / ... Control ifeq / ifne / iflt / .... if_icmpeq / if_acmpeq invokestatic invokevirtual invokeinterface athrow treturn Allocation new / newarray Casting checkcast / instanceof System-Software WS 04/05

332 JVM: Java Byte-Code Example
bipush Operation Push byte Format Forms bipush = 16 (0x10) Operand Stack ... => ..., value Description The immediate byte is sign-extended to an int value. That value is pushed onto the operand stack. bipush byte System-Software WS 04/05

333 JVM: Machine Organization
Virtual Processor stack machine no registers typed instructions no memory addresses, only symbolic names Runtime Data Areas pc register stack locals parameters return values heap method area code runtime constant pool native method stack System-Software WS 04/05

334 JVM: Execution Example
iload 5 iload 6 iadd istore 4 locals program v4 istore 4 v5 v5 iload 5 v6 v6 iload 6 iadd v5+v6 operand stack Time System-Software WS 04/05
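The execution trace above can be reproduced with a very small interpreter. The following Java sketch is an illustration only: TinyStackMachine is a hypothetical class that understands just the three instructions used on the slide (iload, iadd, istore), written as text rather than as real bytecode.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stack-machine interpreter for "iload n", "istore n" and "iadd".
class TinyStackMachine {
    // Runs the program against the given local variable array.
    static void run(String[] program, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>(); // the operand stack
        for (String instr : program) {
            String[] parts = instr.trim().split("\\s+");
            switch (parts[0]) {
                case "iload":  // push the value of local n
                    stack.push(locals[Integer.parseInt(parts[1])]);
                    break;
                case "istore": // pop the top of the stack into local n
                    locals[Integer.parseInt(parts[1])] = stack.pop();
                    break;
                case "iadd": { // pop two operands, push their sum
                    int b = stack.pop();
                    int a = stack.pop();
                    stack.push(a + b);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported: " + instr);
            }
        }
    }
}
```

Running the four instructions from the slide with v5 = 3 and v6 = 4 leaves 7 in v4 and an empty operand stack, matching the diagram.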

335 JVM: Reflection Load and manipulate unknown classes at runtime.
java.lang.Class: getFields, getMethods, getConstructors. java.lang.reflect.Field: set, get, setInt, getInt, setFloat, getFloat, ... java.lang.reflect.Method: getModifiers, invoke. java.lang.reflect.Constructor.

336 JVM: Reflection – Example
import java.lang.reflect.*;

public class ReflectionExample {
    public static void main(String args[]) {
        try {
            // Load the class named on the command line and print its declared methods.
            Class c = Class.forName(args[0]);
            Method m[] = c.getDeclaredMethods();
            for (int i = 0; i < m.length; i++) {
                System.out.println(m[i].toString());
            }
        } catch (Throwable e) {
            System.err.println(e);
        }
    }
}

337 JVM: Java Weaknesses Transitive closure of java.lang.Object contains
1.1 47 5 (1.5) 280 classpath class String { public String toUpperCase(Locale loc); .... } class Object { public String toString(); .... } public final class Locale implements Serializable, Cloneable { .... } System-Software WS 04/05

338 JVM: Java Weaknesses
Class static initialization is triggered when:
T is a class and an instance of T is created: T tmp = new T();
T is a class and a static method of T is invoked: T.staticMethod();
A nonconstant static field of T is used or assigned (that is, a static field that is not a final field initialized with a compile-time constant): T.someField = 42;
Problem: circular dependencies in static initialization code.
A: static { x = B.f(); }
B: static { y = A.f(); }
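The circular dependency can be made concrete with two small classes. The names InitA and InitB are hypothetical stand-ins for A and B on the slide, and the "+ 1" is added only so that the effect is visible in the values.

```java
// Two classes whose static initializers depend on each other.
class InitA {
    // Initializing InitA calls into InitB, which triggers InitB's initialization.
    static int x = InitB.f() + 1;

    static int f() {
        return x;
    }
}

class InitB {
    // If InitA is already being initialized by this thread, InitA.f() runs
    // immediately and returns the default value of x, which is 0.
    static int y = InitA.f() + 1;

    static int f() {
        return y;
    }
}
```

If InitA is touched first, InitB is initialized in the middle of InitA's initializer and observes x == 0, so the result is y == 1 and x == 2. If InitB is touched first, the values are swapped. The outcome therefore depends on which class happens to be used first.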

339 JVM: Java Weaknesses
interface Example { final static String labels[] = {"A", "B", "C"}; }
Hidden static initializer: labels = new String[3]; labels[0] = "A"; labels[1] = "B"; labels[2] = "C";
Warning: in Java, final means write-once! Interfaces may contain code.

340 JVM: Memory Model The JVM specs define a memory model:
defines the relationship between variables and the underlying memory meant to guarantee the same behavior on every JVM The compiler is allowed to reorder operations unless synchronized or volatile is specified. System-Software WS 04/05

341 JVM: Reordering
Reads and writes to ordinary variables can be reordered.
public class Reordering {
    int x = 0, y = 0;
    public void writer() { x = 1; y = 2; }
    public void reader() { int r1 = y; int r2 = x; }
}
Without synchronization, a thread running reader() may observe r1 == 2 and r2 == 0.

342 JVM: Memory Model
synchronized: in addition to specifying a monitor, it defines a memory barrier: acquiring the lock implies an invalidation of the caches; releasing the lock implies a write-back of the caches. synchronized blocks on the same object are ordered.
volatile: the order among accesses to volatile variables is guaranteed (but not between volatile and other variables).

343 JVM: Double Checked Lock
Singleton
public class SomeClass {
    private static Resource resource = null;
    // Correct, but every call pays for the lock.
    public synchronized Resource getResource() {
        if (resource == null) {
            resource = new Resource();
        }
        return resource;
    }
}

344 JVM: Double Checked Lock
Double checked locking
public class SomeClass {
    private static Resource resource = null;
    public Resource getResource() {
        if (resource == null) {              // first check, without the lock
            synchronized (this) {
                if (resource == null) {      // second check, with the lock held
                    resource = new Resource();
                }
            }
        }
        return resource;
    }
}

345 JVM: Double Checked Lock
Thread 1 and Thread 2 execute the same code:
public class SomeClass {
    private Resource resource = null;
    public Resource getResource() {
        if (resource == null) {
            synchronized (this) {
                if (resource == null) {
                    resource = new Resource();
                }
            }
        }
        return resource;
    }
}
While Thread 1 is still inside new Resource(), Thread 2 may already see a non-null reference in the unsynchronized check and return it: the object is instantiated but not yet initialized!

346 JVM: Immutable Objects are not Immutable
An object is meant to be immutable when all field types are primitives or references to immutable objects, and all fields are final.
Example (simplified): java.lang.String contains an array of characters, the length, and an offset. Example: s = "abcd", length = 2, offset = 2, string = "cd".
String s1 = "/usr/tmp";
String s2 = s1.substring(4); // should contain "/tmp"
Sequence: s2 is instantiated, the fields are initialized (to 0), the array is copied, the fields are written by the constructor. What happens if instructions are reordered?

347 JVM: Reordering Volatile and Nonvolatile Stores
volatile reads and writes are totally ordered among threads, but not with respect to normal variables. Example:
volatile boolean initialized = false;
SomeObject o = null;
Thread 1: o = new SomeObject(); initialized = true;
Thread 2: while (!initialized) { sleep(); } o.field = 42; // ? is the write to o already visible here?

348 JVM: JSR 133 Java Community Process Java memory model revision
Final means final Volatile fields cannot be reordered System-Software WS 04/05
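Under the revised memory model, double checked locking can be made safe by declaring the field volatile. The following Java sketch shows that variant; the class names LazyResource and Resource and the constructions counter are hypothetical and exist only for illustration.

```java
// Stand-in for the lazily created object.
class Resource {
    final int value;

    Resource(int value) {
        this.value = value;
    }
}

// Double checked locking with a volatile field (valid under the JSR 133 model).
class LazyResource {
    // volatile: a thread that reads a non-null reference also sees the
    // writes made by the constructor before the reference was published.
    private volatile Resource resource;
    int constructions = 0; // demonstration only, written under the lock

    Resource getResource() {
        Resource r = resource;          // single volatile read on the fast path
        if (r == null) {
            synchronized (this) {
                r = resource;           // second check, with the lock held
                if (r == null) {
                    r = new Resource(42);
                    constructions++;
                    resource = r;       // publish the fully constructed object
                }
            }
        }
        return r;
    }
}
```

Repeated calls return the same instance and the constructor runs once. Without volatile, this code has the same publication problem as the version on the preceding slides.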

349 Java JVM: Execution
Interpreted (e.g., Sun JVM): bytecode instructions are interpreted sequentially; the VM emulates the Java Virtual Machine; slower execution, quick startup.
Just-in-time compilers (e.g., Sun JVM, IBM JikesVM): bytecode is compiled to native code at load time (or later); code can be optimized (at compile time or later); quicker execution, slow startup.
Ahead-of-time compilers (e.g., GCJ): bytecode is compiled to native code offline; quick execution; static compilation.

350 JVM: Loader – The Classfile Format
version constant pool flags super class interfaces fields methods attributes } Constants: Values String / Integer / Float / ... References Field / Method / Class / ... Attributes: ConstantValue Code Exceptions System-Software WS 04/05

351 JVM: Class File Format
class HelloWorld {
    public static void printHello() {
        System.out.println("hello, world");
    }
    public static void main (String[] args) {
        HelloWorld myHello = new HelloWorld();
        myHello.printHello();
    }
}

352 JVM: Class File (Constant Pool)
String hello, world Class HelloWorld Class java/io/PrintStream Class java/lang/Object Class java/lang/System Methodref HelloWorld.<init>() Methodref java/lang/Object.<init>() Fieldref java/io/PrintStream java/lang/System.out Methodref HelloWorld.printHello() Methodref java/io/PrintStream.println(java/lang/String ) NameAndType <init> ()V NameAndType out Ljava/io/PrintStream; NameAndType printHello ()V NameAndType println (Ljava/lang/String;)V Unicode ()V Unicode (Ljava/lang/String;)V Unicode ([Ljava/lang/String;)V Unicode <init> Unicode Code Unicode ConstantValue Unicode Exceptions Unicode HelloWorld Unicode HelloWorld.java Unicode LineNumberTable Unicode Ljava/io/PrintStream; Unicode LocalVariables Unicode SourceFile Unicode hello, world Unicode java/io/PrintStream Unicode java/lang/Object Unicode java/lang/System Unicode main Unicode out Unicode printHello System-Software WS 04/05

353 JVM: Class File (Code) Methods 0 <init>() 0 ALOAD0
1 INVOKESPECIAL [7] java/lang/Object.<init>() 4 RETURN 1 PUBLIC STATIC main(java/lang/String []) 0 NEW [2] HelloWorld 3 DUP 4 INVOKESPECIAL [6] HelloWorld.<init>() 7 ASTORE1 8 INVOKESTATIC [9] HelloWorld.printHello() 11 RETURN 2 PUBLIC STATIC printHello() 0 GETSTATIC [8] java/io/PrintStream java/lang/System.out 3 LDC1 hello, world 5 INVOKEVIRTUAL [10] java/io/PrintStream.println(java/lang/String ) 8 RETURN System-Software WS 04/05

354 JVM: Compilation – Pattern Expansion
Each bytecode is translated according to fixed patterns: easy to implement, but each pattern has only limited knowledge of its context. Example (pseudocode):
switch (o) {
    case ICONST<n>: generate("push n"); PC++; break;
    case ILOAD<n>: generate("push off_n[FP]"); PC++; break;
    case IADD: generate("pop -> R1"); generate("pop -> R2"); generate("add R1, R2 -> R1"); generate("push R1"); PC++; break;
}
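The pseudocode above can be turned into a small runnable Java sketch. PatternExpander is a hypothetical class; it takes textual bytecode instead of real opcodes, emits the same pseudo-assembly strings as the slide, and adds an istore pattern (a single pop) as an assumption so that a complete statement can be translated.

```java
import java.util.ArrayList;
import java.util.List;

// Pattern expansion: every bytecode maps to a fixed instruction sequence.
class PatternExpander {
    static List<String> expand(String[] bytecode) {
        List<String> out = new ArrayList<>();
        for (String instr : bytecode) {
            String[] p = instr.trim().split("\\s+");
            switch (p[0]) {
                case "iconst": // push a constant
                    out.add("push " + p[1]);
                    break;
                case "iload":  // push local n, addressed relative to FP
                    out.add("push off" + p[1] + "[FP]");
                    break;
                case "istore": // assumed pattern: pop into local n
                    out.add("pop off" + p[1] + "[FP]");
                    break;
                case "iadd":   // always goes through the machine stack
                    out.add("pop -> R1");
                    out.add("pop -> R2");
                    out.add("add R1, R2 -> R1");
                    out.add("push R1");
                    break;
                default:
                    throw new IllegalArgumentException("unsupported: " + instr);
            }
        }
        return out;
    }
}
```

Translating iload 4, iload 5, iadd, istore 6 with these patterns yields seven instructions, most of which only move values to and from the stack.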

355 JVM: Optimizing Pattern Expansion
Main Idea: use internal virtual stack stack values are consts / fields / locals / array fields / registers / ... flush stack as late as possible iload 4 iload 5 iadd istore 6 emitted code MOV EAX, off4[FP] ADD EAX, off5[FP] MOV off6[FP], EAX local5 local5 virtual stack local4 local4 EAX EAX iload4 iload5 iadd istore6 System-Software WS 04/05
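The virtual stack idea can be sketched as follows. This is a deliberately simplified Java model, not the compiler described in the lecture: VirtualStackCompiler is a hypothetical class, it knows a single register (EAX), handles only iload, iadd and istore, and gives up instead of spilling when a second intermediate result would be needed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Code generation with a compile-time stack of operand descriptors.
class VirtualStackCompiler {
    static List<String> compile(String[] bytecode) {
        List<String> out = new ArrayList<>();
        Deque<String> vstack = new ArrayDeque<>(); // holds locations, not values
        for (String instr : bytecode) {
            String[] p = instr.trim().split("\\s+");
            switch (p[0]) {
                case "iload":
                    // Only remember where the value lives; emit nothing yet.
                    vstack.push("off" + p[1] + "[FP]");
                    break;
                case "iadd": {
                    String right = vstack.pop();
                    String left = vstack.pop();
                    if (right.equals("EAX")) { // addition commutes: keep EAX on the left
                        String t = left;
                        left = right;
                        right = t;
                    }
                    if (!left.equals("EAX")) {
                        claimEax(vstack);
                        out.add("MOV EAX, " + left);
                    }
                    out.add("ADD EAX, " + right);
                    vstack.push("EAX");
                    break;
                }
                case "istore": {
                    String src = vstack.pop();
                    if (!src.equals("EAX")) {
                        claimEax(vstack);
                        out.add("MOV EAX, " + src);
                    }
                    out.add("MOV off" + p[1] + "[FP], EAX");
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported: " + instr);
            }
        }
        return out;
    }

    // Simplification: with one register there is nowhere to spill to.
    private static void claimEax(Deque<String> vstack) {
        if (vstack.contains("EAX")) {
            throw new IllegalStateException("single-register sketch: spill needed");
        }
    }
}
```

For iload 4, iload 5, iadd, istore 6 the sketch emits the three instructions shown on the slide, because the loads are folded into the operands of MOV and ADD instead of being pushed.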

356 JVM: Compiler Comparison
iload_4 iload_5 iadd istore_6 5 instructions 9 memory accesses 3 instructions 3 memory accesses pattern expansion push off4[FP] push off5[FP] pop EAX add 0[SP], EAX pop off6[FP] optimized mov EAX, off4[FP] add EAX, off5[FP] mov off6[FP], EAX System-Software WS 04/05

357 Linking (General) A compiled program contains references to external code (libraries). After loading the code, the system needs to link the code to the library: identify the calls to external code; locate the callees (and load them if necessary); patch the loaded code. Two options: the code contains a list of call sites for each callee, or the calls to external code are jumps to a procedure linkage table, which is then patched (double indirection).
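The second option, the procedure linkage table, can be modelled in Java with an array of function references. This is only an analogy under stated assumptions: LinkageTable is a hypothetical class, procedures are reduced to int-to-int functions, and a "jump through slot i" becomes an indexed call.

```java
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Model of a procedure linkage table: call sites refer to slots, never to callees.
class LinkageTable {
    private final String[] names;           // external symbol of each slot
    private final IntUnaryOperator[] slots; // patched at link time

    LinkageTable(String... externalNames) {
        names = externalNames;
        slots = new IntUnaryOperator[externalNames.length];
        for (int i = 0; i < slots.length; i++) {
            final String n = names[i];
            // Until linked, every slot traps.
            slots[i] = arg -> {
                throw new IllegalStateException("unresolved symbol: " + n);
            };
        }
    }

    // Linking patches each slot once; the calling code is left untouched.
    void link(Map<String, IntUnaryOperator> library) {
        for (int i = 0; i < names.length; i++) {
            IntUnaryOperator target = library.get(names[i]);
            if (target != null) {
                slots[i] = target;
            }
        }
    }

    // A call site: jump via slot i (the double indirection).
    int call(int slot, int arg) {
        return slots[slot].applyAsInt(arg);
    }
}
```

The trade-off compared with the first option is visible here: only one table entry per callee has to be patched, regardless of the number of call sites, at the price of an extra indirection on every call.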

358 Linking (General) instr jump - instr jump 101 100 proc 0 proc 1 jump 1
instr 1 2 jump - 3 4 5 6 7 9 10 instr 1 2 jump 101 3 4 5 100 6 7 9 10 proc 0 5 proc 1 7 100 jump 101 System-Software WS 04/05

359 Linking (General) instr jump &p1 &p0 instr jump 101 100 proc 0 proc 1
instr 1 2 jump &p1 3 4 5 &p0 6 7 9 10 instr 1 2 jump 101 3 4 5 100 6 7 9 10 proc 0 5 proc 1 7 100 jump &p0 101 &p1 System-Software WS 04/05

360 JVM: Linking Bytecode interpreter Native code (ahead of time compiler)
references to other objects are made through the JVM (e.g., invokevirtual, getfield, …) Native code (ahead of time compiler) static linking classic native linking JIT compiler only some classes are compiled calls could reference classes that are not yet loaded or compiled (delayed compilation) code instrumentation System-Software WS 04/05

361 JVM: Methods and Fields Resolution
Methods and fields are accessed through special VM functions (e.g., invokevirtual, getfield, …). The parameters of the special call define the target. The parameters are indexes into the constant pool. The VM checks if the call is legal and if the target is present.

362 JVM: JIT – Linking and Instrumentation
Use code instrumentation to detect first access of static fields and methods class A { .... ...B.x } class B { int x; } B.x CheckClass(B); B.x IF ~B.initialized THEN Initialize(B) END; System-Software WS 04/05

363 Compilation and Linking Overview
C header C header C source Compiler Object File Object File Object File Object file Linker Loader Loaded Code System-Software WS 04/05

364 Compilation and Linking Overview
Oberon source Compiler Object File Object & Symbol Loader Linker Loaded Module Loaded Module Loaded Module Loaded Module System-Software WS 04/05

365 Compilation and Linking Overview
Java source Compiler Class File Reflection API Class Loader JIT Compiler Loader Linker Class Class System-Software WS 04/05

366 Jaos Jaos (Java on Active Object System) is a Java virtual machine for the Bluebottle system goals: implement a JVM for the Bluebottle system show that the Bluebottle kernel is generic enough to support more than one system interoperability between the Active Oberon and Java languages interoperability between the Oberon System and the Java APIs System-Software WS 04/05

367 Jaos (Interoperability Framework)
Oberon source Compiler Object & Symbol Metadata Loader Oberon Browser Java Reflection API Class File Loader Java Metadata Metadata JIT Compiler Loaded Class Linker Oberon Loader Linker Loaded Module Loader Linker Loaded Module Loaded Module System-Software WS 04/05

368 JVM: Verification Compiler generates “good” code....
.... that could be changed before reaching the JVM need for verification Verification makes the VM simpler (less run-time checks): no operand stack overflow load / stores are valid VM types are correct no pointer forging no violation of access restrictions access objects as they are (type) local variable initialized before load System-Software WS 04/05

369 JVM: Verification Pass1 (Loading): class file version check
class file format check class file complete Pass 2 (Linking): final classes are not subclassed every class has a superclass (but Object) constant pool references constant pool names System-Software WS 04/05

370 JVM: Verification
Pass 3 (Linking), byte-code verification. For each operation in code (independent of the path): operation stack size is the same; accessed variable types are correct; method parameters are appropriate; field assignment with correct types; opcode arguments are appropriate.
Pass 4 (RunTime), delayed for performance reasons. First time a type is referenced: load types when referenced; check access visibility; class initialization. First member access: member exists; member type same as declared; current method has right to access member.

371 JVM: Byte-Code Verification
Branch destinations must exist. Opcodes must be legal. Access only existing locals. Code does not end in the middle of an instruction. Types in byte-code must be respected. Execution cannot fall off the end of the code. Exception handler begin and end are sound.
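A few of these checks are simple enough to sketch directly. The following Java example is a toy under stated assumptions: MiniVerifier is a hypothetical class, it works on a textual instruction list with a made-up five-instruction subset, and it covers only legal opcodes, existing locals, valid branch targets and not falling off the end. Type checking and stack analysis, which need data-flow analysis, are left out.

```java
// Toy structural checks in the spirit of byte-code verification.
class MiniVerifier {
    // Returns null if the code passes, otherwise a description of the first violation.
    static String verify(String[] code, int maxLocals) {
        if (code.length == 0) {
            return "empty code";
        }
        for (int pc = 0; pc < code.length; pc++) {
            String[] p = code[pc].trim().split("\\s+");
            switch (p[0]) {
                case "iload":
                case "istore": {
                    if (p.length < 2) {
                        return "pc " + pc + ": missing operand";
                    }
                    int n = Integer.parseInt(p[1]);
                    if (n < 0 || n >= maxLocals) {
                        return "pc " + pc + ": no such local " + n;
                    }
                    break;
                }
                case "goto": {
                    if (p.length < 2) {
                        return "pc " + pc + ": missing operand";
                    }
                    int target = Integer.parseInt(p[1]);
                    if (target < 0 || target >= code.length) {
                        return "pc " + pc + ": branch destination does not exist";
                    }
                    break;
                }
                case "iadd":
                case "return":
                    break;
                default:
                    return "pc " + pc + ": illegal opcode " + p[0];
            }
        }
        // The last instruction must transfer control somewhere.
        String last = code[code.length - 1].trim().split("\\s+")[0];
        if (!last.equals("return") && !last.equals("goto")) {
            return "execution can fall off the end of the code";
        }
        return null;
    }
}
```

Because these properties are established once before execution, the interpreter or compiled code does not have to re-check them on every instruction.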

372 Addendum: Security

373 Security internal protection external protection problems:
memory protection file system accesses external protection accessibility problems: program threats System-Software WS 04/05

374 Security: Program Threats
Trojan horses: a code segment that misuses its environment mail attachments web downloads (e.g., SEXY.EXE which formats your hard disk) programs with the same name as common utilities misleading names (e.g., README.TXT.EXE) Trap door (in programs or compilers): an intentional hole in the software System-Software WS 04/05

375 Security: System Threats
worms: a standalone program that spawns other processes (copies of itself) to reduce system performance example: Morris worm (1988) exploited holes in rsh, finger and sendmail to gain access to other machines once on the other machine it was able to replicate itself used by spammers to spread and distribute spamming applications viruses: similar to worms but embedded in other programs they usually infect other programs and the boot sector System-Software WS 04/05

376 Security: System Threats
Denial of service perform many requests to steal all the available resources often distributed (using worms) Example: SYN flooding attacks the attacker tries to connect the victim answers with a synchronize and acknowledge packet and waits for acknowledgment Countermeasures active filtering request dropping cookie based protocols (requests must be authenticated) stateless protocols System-Software WS 04/05

377 Security: System Threats
badly implemented and designed software: lpr (setuid) with an option to delete the printed file mkdir (first create the inode then change the owner) it was possible to change the inode before the chown … buffer overflows password in memory or swap files insecure protocols (FTP, SMTP) missing sanity checks (syscalls, command in input, …) short keys and passwords proprietary protocols System-Software WS 04/05

378 Bad design: A very recent example
Texas Instruments produces RFID tags offering cryptographic functionality: used for cars and electronic payments; 40-bit keys; proprietary protocol. Attack from Johns Hopkins University and RSA Labs: less than 2 hours for 5 keys, at a cost of less than $3500.

379 Security: Buffer Overflows
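Overwrite a function’s return address.
void foo(int p1, int p2) {
    char array[10];
    strcpy(array, someinput); /* no length check: input longer than 9 characters overflows array */
}
Stack layout (diagram): p1 & p2, RET, FP, array. Writing past the end of array overwrites FP and RET.
Avoid strcpy and check the length, e.g., with strncpy (which does not add the terminating null byte when the input fills the buffer).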

380 Security: Monitoring check for suspicious patterns audit logs
login times audit logs periodic scans for security holes (bad passwords, set-uid programs, changes to system programs) system integrity checks (checksums for executable files) [tripwire] network services monitor network activity System-Software WS 04/05

381 Example: Firewalling Many applications use network sockets to communicate (even on a single machine) Many applications are not protected Solution: filter all the incoming connections by default and allow only the trusted ones System-Software WS 04/05

382 Security: (some) Design Principles
Open systems (programs and protocols). Default is to deny access. Check for current authority (timeouts, …). Give the least privilege possible. Simple protection mechanisms. Do not ask too much of the users (or they will avoid protecting themselves).

383 Security and Systems: Some Examples
Enhancements to memory management: Intel XD bit, AMD NX bit mark pages according to the content (data or code) an exception is generated if the PC is moved to a data address prevents some buffer overflow attacks dynamically generated code has to be generated through special system calls Windows XP SP2, Linux, BSD … System-Software WS 04/05

384 Security and Systems: Some Examples
SELinux: National Security Agency (USA). Patches to the Linux kernel to enforce mandatory access control. Open source. Independent of the traditional UNIX roles (users and groups). Configurable policies restricting what a program is able to do.

385 Security and Systems: Some Examples
OpenBSD audit process (proactive bug search) random gaps in the stack ProPolice: gcc puts a random integer on the stack in a call prologue and checks it when returning W^X: pages are writable xor executable System-Software WS 04/05

386 Security and Systems: Some Examples
OpenBSD randomized shared library order and addresses mmap() and malloc() return randomized addresses guard pages between objects privilege separation and revocation System-Software WS 04/05

387 fork unprivileged child
Privilege Separation unprivileged child process to contain and restrict the effects of programming errors e.g., openssh listen *22 time network connection monitor network processing request auth auth result key exchange authentication fork unprivileged child monitor user request processing request PTY pass PTY user network data state export fork user child System-Software WS 04/05

