CS533 Concepts of Operating Systems, Class 6: The Performance of Micro-Kernel Based Systems


Micro-kernels and Binary Compatibility
• Emulation libraries
  o Trampoline mechanism
• Single server architecture
• Multi-server architecture
  o IPC overhead proportional to number of servers (independent protection domains)

Micro-kernels must optimize IPC
• Liedtke argues Mach’s overhead is due to poor implementation!
• Optimized IPC implementation in L3
  o Architectural level: System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
  o Algorithmic level: Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
  o Interface level: Unnecessary Copies, Parameter Passing.
  o Coding level: Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

L3 IPC Performance vs Mach IPC

But Is That Enough?
• What is the impact on overall system performance?
• Haertig et al. explore the performance and extensibility of an L4-based Linux OS vs. Mach-based Linux and native Linux

L4Linux: a micro-kernel based Linux
• Fully binary compliant with Linux/x86
• No changes to the architecture-independent parts of Linux
• No Linux-specific modifications to the L4 kernel

L4Linux: Design & Implementation
• Linux implemented as a single Linux server in a μ-kernel task
• μ-kernel tasks used for Linux user processes
• A single L4 thread in the Linux server handles system calls and page faults. This thread is multiplexed (treated as a virtual CPU)
• On booting, the Linux server requests memory from its pager, which maps physical memory into the server’s address space
• The Linux server then acts as the pager for the user processes it creates
  o L4 converts user-process page faults into an RPC to the Linux server, which maps pages from its address space to the user process

Interrupt Handling
• Linux top halves are implemented as one server thread per interrupt source
  o L4 converts a hardware interrupt to a message to the appropriate thread
• Linux bottom halves all execute in a single high-priority thread
• Linux interrupt threads have a higher priority than the main thread
  o Avoids concurrent execution of Linux code on a uniprocessor

System Calls
• System calls implemented as IPC between user process and the Linux server
• A modified libc.so or libc.a avoids trap instructions and uses L4 IPC instead to call the Linux server
• User-level exception handler (trampoline) emulates the native system call ‘trap’ instruction for binary compatibility
  o L4 redirects the trap to the emulation library, which then uses L4 IPC to call the Linux server

Signals
• Each user process has a separate signal-handler thread
• The Linux server delivers a signal by sending a message to the user process’s signal-handler thread
• The signal handler causes the user process’s main thread to save its state and enter Linux by manipulating the main thread’s SP and PC

Scheduling
• All thread scheduling is done by the L4 kernel
• The Linux server’s schedule() routine is only used for multiplexing the Linux server’s main thread across concurrent Linux system calls

Experiment
• What is the penalty of using L4Linux?
  o Compare L4Linux to native Linux
• Does the performance of the underlying micro-kernel matter?
  o Compare L4Linux to MkLinux
• Does co-location improve performance?
  o Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks
• Measured system call overhead on the shortest system call, “getpid()”

  System                  Cycles
  Linux                      223
  L4Linux                    526
  L4Linux (trampoline)       753
  MkLinux (in kernel)       2050
  MkLinux (user)           14710
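To put the getpid() figures in proportion, the cycle counts can be expressed as multiples of the native Linux cost. The snippet below is only an illustrative calculation over the numbers listed on the slide; the function name is made up for this sketch.

```python
# Cycle counts for getpid() exactly as listed on the slide.
GETPID_CYCLES = {
    "Linux": 223,
    "L4Linux": 526,
    "L4Linux (trampoline)": 753,
    "MkLinux (in kernel)": 2050,
    "MkLinux (user)": 14710,
}


def slowdown_vs_native(cycles, baseline="Linux"):
    """Return each system's cost as a multiple of the baseline system's cost."""
    base = cycles[baseline]
    return {name: value / base for name, value in cycles.items()}


# Print one line per system: raw cycles and the factor relative to native Linux.
for name, factor in slowdown_vs_native(GETPID_CYCLES).items():
    print(f"{name:<22} {GETPID_CYCLES[name]:>6} cycles  {factor:5.1f}x")
```

Note that this is the cost of the cheapest possible system call, so it shows the worst-case relative overhead rather than whole-application slowdown.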

Microbenchmarks (cont.)
• Measures specific system calls to determine basic performance.

Macrobenchmarks
• Measured time to recompile the Linux server

Macrobenchmarks (cont.)
• Next, use a commercial test suite to simulate a system under full load.

Performance Analysis
• L4Linux is, on average, 8.3% slower than native Linux. Only 6.8% slower at maximum load.
• MkLinux: 49% slower on average, 60% at maximum.
• Co-located MkLinux: 29% slower on average, 37% at maximum.

Conclusion?
• Can hardware-based protection be made to work efficiently enough?
• Did these experiments explore the cost of “fine grained” protection?

Spare Slides

The IPC Dilemma
• IPC is very important in μ-kernel design
  o Increases modularity, flexibility, security and scalability.
• Past implementations have been inefficient.
  o Message transfer takes μs.

The L3 (μ-kernel based) OS
• A task consists of:
  o Threads: communicate via messages that consist of strings and/or memory objects.
  o Dataspaces: memory objects.
  o Address space: where dataspaces are mapped.

Redesign Principles
• IPC performance is the master.
• All design decisions require a performance discussion.
• If something performs poorly, look for new techniques.
• Synergetic effects have to be taken into consideration.
• The design has to cover all levels from architecture down to coding.
• The design has to be made on a concrete basis.
• The design has to aim at a concrete performance goal.

Achievable Performance
• A simple scenario
  o Thread A sends a null message to thread B
  o Minimum of 172 cycles
• Will aim at 350 cycles (7 μs)
  o Will actually achieve 250 cycles (5 μs)
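The cycle and microsecond figures on this slide are consistent with one another if the clock runs at 50 cycles per microsecond (350 cycles / 7 μs). The short check below makes that conversion explicit; the 50 MHz value is inferred from the slide's own numbers rather than stated on it, and the function name is illustrative.

```python
# Assumption: 50 MHz clock, inferred from the slide pairing 350 cycles with 7 us.
CLOCK_MHZ = 50


def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to microseconds at the given clock rate (MHz = cycles per us)."""
    return cycles / clock_mhz


# Minimum, goal, and achieved figures from the slide.
for label, cycles in [("minimum", 172), ("goal", 350), ("achieved", 250)]:
    print(f"{label:<9} {cycles:>4} cycles = {cycles_to_us(cycles):.2f} us")
```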

Levels of the Redesign
• Architectural
  o System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
• Algorithmic
  o Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
• Interface
  o Unnecessary Copies, Parameter Passing.
• Coding
  o Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

Architectural Level
• System Calls
  o Expensive! So, require as few as possible.
  o Implement two calls: “Call” and “Reply & Receive Next”
  o Each combines sending an outgoing message with waiting for an incoming message.
  o Schedulers can handle replies the same as requests.

Messages
• Complex Messages:
  o Direct String, Indirect Strings (optional)
  o Memory Objects
• Used to combine sends if no reply is needed.
• Can transfer values directly from the sender’s variables to the receiver’s variables.
(Figure: a complex message)

Direct Transfer
• Each address space has a fixed kernel-accessible part.
  o Messages transferred via the kernel part
  o User A space -> Kernel -> User B space
  o Requires 2 copies.
  o Larger messages lead to higher costs
(Figure: User A, Kernel, User B)

• Shared user-level memory (LRPC, SRC RPC)
  o Security can be penetrated.
  o Cannot check message’s legality.
  o Long messages -> address space becoming a critical resource.
  o Explicit opening of communication channels.
  o Not application friendly.

Temporary Mapping
• L3 uses a Communication Window
  o Only kernel accessible, and exists per address space.
  o Target region is temporarily mapped there.
  o Then the message is copied to the communication window and ends up in the correct place in the target address space.
(Figure: User A, Kernel, User B)

Temporary Mapping (cont.)
• Must be fast!
• With a 2-level page table, only one word has to be copied.
  o pdir A -> pdir B
• TLB must be clean of entries relating to the use of the communication window by other operations.
  o One thread: TLB is always “window clean”.
  o Multiple threads:
    - Interrupts: TLB is flushed
    - Thread switch: invalidate communication window entries.
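The "one word" claim can be pictured with a toy two-level page table: pointing the sender's communication window at the receiver's memory means copying a single page-directory entry. The sketch below is a simplified model, not L3 code. The slot index, the 4 MB region size (22-bit shift, as on 32-bit x86), and all names are assumptions made for illustration, and TLB handling is left out.

```python
PDIR_ENTRIES = 1024   # assumed: 32-bit x86-style page directory
PDIR_SHIFT = 22       # assumed: each directory entry covers 4 MB
WINDOW_SLOT = 1000    # hypothetical directory slot reserved for the window


def map_window(pdir_a, pdir_b, target_vaddr):
    """Map the 4 MB region of B that contains target_vaddr into A's window.

    Only one page-directory entry (one word) is copied. Returns the address
    inside A's window that aliases target_vaddr in B.
    """
    pdir_a[WINDOW_SLOT] = pdir_b[target_vaddr >> PDIR_SHIFT]
    offset_in_region = target_vaddr & ((1 << PDIR_SHIFT) - 1)
    return (WINDOW_SLOT << PDIR_SHIFT) | offset_in_region


# Toy directories: entries are just labels standing in for page-table pointers.
pdir_a = [None] * PDIR_ENTRIES
pdir_b = [None] * PDIR_ENTRIES
pdir_b[5] = "page table for B region 5"

window_addr = map_window(pdir_a, pdir_b, (5 << PDIR_SHIFT) + 0x1234)
print(hex(window_addr), pdir_a[WINDOW_SLOT])
```

After the copy, the kernel can write the message through the window address and it lands directly in B's buffer, which is why a single copy suffices.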

Strict Process Orientation
• Kernel mode handled in the same way as user mode
• One kernel stack per thread
• May lead to a large number of stacks
  o Minor problem if stacks are objects in virtual memory

Thread Control Blocks (tcbs)
• Hold kernel, hardware, and thread-specific data.
• Stored in a virtual array in shared kernel space.
(Figure: user area, kernel area, tcb, kernel stack)

Tcb Benefits
• Fast tcb access
• Saves 3 TLB misses per IPC
• Threads can be locked by unmapping the tcb
• Helps make threads persistent
• IPC independent from memory management

Algorithmic Level
• Thread IDs
  o L3 uses a 64-bit unique identifier (uid) containing the thread number.
  o The tcb address is easily obtained by ANDing the lower 32 bits with a bit mask and adding the tcb base address.
• Virtual Queues
  o Busy queue, present queue, polling-me queue.
  o Unmapping the tcb includes removal from queues
    - Prevents page faults from parsing/adding/deleting from the queues.
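The mask-and-add lookup can be sketched as follows. The slide gives no concrete bit layout, so the tcb size, the number of thread-number bits, the base address, and the helper make_uid are all hypothetical, chosen only so that the AND of the low 32 bits yields thread_no * tcb_size.

```python
# All constants are hypothetical; the slide does not give a concrete layout.
TCB_SHIFT = 10                 # assumed tcb size of 1 KiB (2**10 bytes)
THREAD_NO_BITS = 16            # assumed width of the thread-number field
TCB_BASE = 0xE0000000          # assumed start of the virtual tcb array
TCB_OFFSET_MASK = ((1 << THREAD_NO_BITS) - 1) << TCB_SHIFT


def make_uid(thread_no, upper32=0, low_tag=0):
    """Build a 64-bit uid whose low 32 bits hold the thread number, pre-shifted."""
    low32 = (thread_no << TCB_SHIFT) | (low_tag & ((1 << TCB_SHIFT) - 1))
    return (upper32 << 32) | low32


def tcb_address(uid):
    """AND the lower 32 bits with a mask, then add the tcb base address."""
    low32 = uid & 0xFFFFFFFF
    return (low32 & TCB_OFFSET_MASK) + TCB_BASE


uid = make_uid(thread_no=3, upper32=0xABCD, low_tag=7)
print(hex(tcb_address(uid)))
```

Because the thread number is stored pre-shifted, no multiplication or table lookup is needed: one AND and one ADD produce the address.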

Algorithmic Level
• Timeouts and Wakeups
  o Operation fails if message transfer has not started t ms after invoking it.
  o Kept in n unordered wakeup lists. A new thread’s tcb is linked into the list τ mod n.
  o Threads with wakeups far away are kept in a long-time wakeup list and reinserted into the normal lists when the time approaches.
  o Scheduler will only have to check k/n entries per clock interrupt.
  o Usually costs less than 4% of IPC time.
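The n unordered wakeup lists behave like a small hash table keyed by wakeup time. The following is a minimal sketch under simplifying assumptions: wakeup times are integer clock ticks, expire() is called on every tick, and the separate long-time wakeup list is omitted. Class and method names are invented for the example.

```python
class WakeupLists:
    """n unordered lists; a wakeup at time tau lives in list tau mod n."""

    def __init__(self, n):
        self.n = n
        self.lists = [[] for _ in range(n)]

    def add(self, thread, wakeup_time):
        # Insertion is O(1): no sorting, just append to the selected list.
        self.lists[wakeup_time % self.n].append((wakeup_time, thread))

    def expire(self, now):
        """Per clock tick, inspect only the one list that can hold due entries."""
        index = now % self.n
        due = [t for (w, t) in self.lists[index] if w <= now]
        self.lists[index] = [(w, t) for (w, t) in self.lists[index] if w > now]
        return due


wakeups = WakeupLists(n=4)
wakeups.add("A", 5)
wakeups.add("B", 9)   # shares a list with A (9 mod 4 == 5 mod 4) but is due later
wakeups.add("C", 6)
for tick in range(5, 10):
    print(tick, wakeups.expire(tick))
```

With k pending wakeups spread over n lists, each tick touches roughly k/n entries, which is the figure the slide quotes.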

Algorithmic Level
• Lazy Scheduling
  o Only a thread state variable is changed (ready/waiting).
  o Deletion from queues happens when queues are parsed.
    - Reduces delete operations.
    - Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.
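A rough model of lazy scheduling: blocking a thread only flips its state flag, and stale entries are dropped when the scheduler next walks the queue. This is a sketch with invented names and a plain round-robin policy, not the L3 scheduler.

```python
from collections import deque


class Thread:
    def __init__(self, name):
        self.name = name
        self.ready = False
        self.queued = False   # True while the thread is physically in the queue


class LazyReadyQueue:
    def __init__(self):
        self.queue = deque()

    def make_ready(self, thread):
        thread.ready = True
        if not thread.queued:          # skip the insert if a stale entry remains
            self.queue.append(thread)
            thread.queued = True

    def block(self, thread):
        thread.ready = False           # no queue operation on the IPC path

    def pick_next(self):
        """Return the next ready thread, lazily deleting non-ready entries."""
        while self.queue:
            head = self.queue[0]
            if head.ready:
                self.queue.rotate(-1)  # round-robin: move it to the back
                return head
            self.queue.popleft()       # deferred deletion happens here
            head.queued = False
        return None


rq = LazyReadyQueue()
a, b = Thread("a"), Thread("b")
rq.make_ready(a)
rq.make_ready(b)
rq.block(a)                            # a stays in the queue, only its flag changes
print(rq.pick_next().name, len(rq.queue))
```

A thread that blocks and becomes ready again before the scheduler runs costs no queue operations at all, which is the common case in a quick IPC round trip.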

Algorithmic Level
• Short messages via registers
  o Register transfers are fast
  o 50-80% of messages are ≤ 8 bytes
  o Messages of up to 8 bytes can be transferred in registers with a decent performance gain.
  o May not pay off for other processors.

Interface Level
• Unnecessary Copies
  o Message objects grouped by types
  o Send/receive buffers structured in the same way
  o Use the same variable for sending and receiving
    - Avoids unnecessary copies
• Parameter Passing
  o Use registers whenever possible.
    - Far more efficient
    - Gives compilers better opportunities to optimize code.

Coding Level
• Cache Misses
  o Cache line fill sequence should match the usual data access sequence.
• TLB Misses
  o Try to pack into one page:
    - IPC-related kernel code
    - Processor-internal tables
    - Start/end of larger tables
    - Most heavily used entries

Coding Level
• Registers
  o Segment register loading is expensive.
  o One flat segment covering the complete address space.
    - On entry, the kernel checks whether the registers contain the flat descriptor.
    - Guarantees they contain it when returning to user level.
• Jumps and Checks
  o Basic code blocks should be arranged so that as few jumps are taken as possible.
• Process switch
  o Save/restore of stack pointer and address space only invoked when really necessary.

L4 Slides

Introduction
• μ-kernels have a reputation for being too slow and inflexible
• Can a 2nd-generation μ-kernel (L4) overcome these limitations?
• Experiment:
  o Port Linux to run on L4
  o Compare to native Linux and to MkLinux (Linux on a 1st-generation, Mach 3.0-derived μ-kernel)

Introduction (cont.)
• Test speed of a standard OS personality on top of a fast μ-kernel: Linux implemented on L4
• Test extensibility of the system:
  o Pipe-based communication implemented directly on the μ-kernel
  o Mapping-related OS extensions implemented as user tasks
  o User-level real-time memory management implemented
• Test whether L4 abstractions are independent of platform

L4 Essentials
• Based on threads and address spaces
• Recursive construction of address spaces by user-level servers
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
• Owner of an address space can grant or map a page to another address space
• All address spaces maintained by user-level servers (pagers)

L4Linux: Design & Implementation
• Fully binary compliant with Linux/x86
• Restricted modifications to the architecture-dependent part of Linux
• No Linux-specific modifications to the L4 kernel

L4Linux: Design & Implementation
• Address Spaces
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
  o L4 uses “flexpages”: logical memory ranging from one physical page up to a complete address space.
  o An invoker can only map and unmap pages that have been mapped into its own address space
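The recursive map/unmap rules can be modelled in a few lines. This is a deliberately simplified sketch, not the L4 interface: pages are plain integers shared across spaces (no virtual addresses or flexpage sizes), granting is omitted, and a page is assumed to reach each space through a single mapper. All names are made up.

```python
class AddressSpace:
    def __init__(self, name, pages=()):
        self.name = name
        self.pages = set(pages)   # pages currently accessible in this space
        self.mapped_to = {}       # page -> spaces this space mapped it into

    def map_to(self, page, other):
        """Map a page into another space; only pages the invoker holds qualify."""
        if page not in self.pages:
            raise PermissionError(f"{self.name} does not hold page {page}")
        other.pages.add(page)
        self.mapped_to.setdefault(page, []).append(other)

    def unmap(self, page):
        """Revoke the page from every space that received it, recursively.

        The invoker itself keeps the page.
        """
        for receiver in self.mapped_to.pop(page, []):
            receiver.unmap(page)
            receiver.pages.discard(page)


sigma0 = AddressSpace("sigma0", pages={0, 1})   # stands in for physical memory
linux_server = AddressSpace("linux-server")
user = AddressSpace("user-process")

sigma0.map_to(0, linux_server)   # pager hands memory to the Linux server
linux_server.map_to(0, user)     # server acts as pager for its user process
sigma0.unmap(0)                  # revocation propagates down the whole chain
print(linux_server.pages, user.pages, sigma0.pages)
```

The chain sigma0 -> Linux server -> user process mirrors how the deck describes the Linux server obtaining memory from its pager and then paging its own user processes.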

L4Linux: Design & Implementation

L4Linux: Design & Implementation
• Address Spaces (cont.)
  o I/O ports are parts of address spaces.
  o Hardware interrupts are handled by user-level processes. The L4 kernel will send a message via IPC.

L4Linux: Design & Implementation
• The Linux server
  o L4Linux uses a single-server approach.
  o A single Linux server runs on top of L4, multiplexing a single thread for system calls and page faults.
  o The Linux server maps physical memory into its address space, and acts as the pager for any user processes it creates.
  o The server cannot directly access the hardware page tables, and must maintain logical pages in its own address space.

L4Linux: Design & Implementation
• Interrupt Handling
  o All interrupt handlers are mapped to messages.
  o The Linux server contains threads that do nothing but wait for interrupt messages.
  o Interrupt threads have a higher priority than the main thread.

L4Linux: Design & Implementation
• User Processes
  o Each user process is implemented as a separate L4 task with its own address space and threads.
  o The Linux server is the pager for these processes. Any fault by a user-level process is sent by RPC from the L4 kernel to the server.

L4Linux: Design & Implementation
• System Calls
  o Three system call interfaces:
    - A modified version of libc.so that uses L4 primitives.
    - A modified version of libc.a.
    - A user-level exception handler (trampoline) that calls the corresponding routine in the modified shared library.
  o The first two options are the fastest. The third is maintained for compatibility.

L4Linux: Design & Implementation
• Signalling
  o Each user-level process has an additional thread for signal handling.
  o The main server thread sends a message to the signal-handling thread, which tells the user thread to save its state and enter Linux.

L4Linux: Design & Implementation
• Scheduling
  o All thread scheduling is done by the L4 kernel.
  o The Linux server’s schedule() routine is only used for multiplexing its single thread.
  o After each system call, if no other system call is pending, it simply resumes the user process thread and sleeps.

L4Linux: Design & Implementation
• Tagged TLB & Small Spaces
  o In order to reduce TLB conflicts, L4Linux has a special library to customize code and data for communicating with the Linux server.
  o The emulation library and signal thread are mapped close to the application, instead of in the default high-memory area.

Performance
• What is the penalty of using L4Linux?
  o Compare L4Linux to native Linux
• Does the performance of the underlying micro-kernel matter?
  o Compare L4Linux to MkLinux
• Does co-location improve performance?
  o Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks
• Measured system call overhead on the shortest system call, “getpid()”

Microbenchmarks (cont.)
• Measures specific system calls to determine basic performance.

Macrobenchmarks
• Measured time to recompile the Linux server

Macrobenchmarks (cont.)
• Next, use a commercial test suite to simulate a system under full load.

Performance Analysis
• L4Linux is, on average, 8.3% slower than native Linux. Only 6.8% slower at maximum load.
• MkLinux: 49% slower on average, 60% at maximum.
• Co-located MkLinux: 29% slower on average, 37% at maximum.

Extensibility Performance
• A micro-kernel must provide more than just the features of the OS running on top of it.
• Specialization: improved implementation of OS functionality
• Extensibility: permits implementation of new services that cannot be easily added to a conventional OS.

Pipes and RPC
• (1) The first five entries use the standard pipe mechanism of the Linux kernel.
• (2) Is asynchronous and uses only L4 IPC primitives. Emulates POSIX standard pipes, without signalling. Adds a thread for buffering and cross-address-space communication.
• (3) Is synchronous and uses blocking IPC without buffering data.
• (4) Maps pages into the receiver’s address space.

Virtual Memory Operations
• “Fault”: an example of extensibility. Measures the time to resolve a page fault by a user-defined pager in a separate address space.
• “Trap”: latency between a write operation to a protected page and the invocation of the related exception handler.
• “Appel1”: time to access a random protected page. The fault handler unprotects the page, protects some other page, and resumes.
• “Appel2”: time to access a random protected page where the fault handler only unprotects the page and resumes.

Conclusion
• Using the L4 micro-kernel imposes a 5-10% slowdown relative to native Linux. Much faster than previous micro-kernels.
• Further optimizations, such as co-locating the Linux server and providing extensibility, could improve L4Linux even further.