Improving IPC by Kernel Design Jochen Liedtke Slides based on a presentation by Rebekah Leslie.

Microkernels and IPC:
- Microkernel architectures introduce a heavy reliance on IPC
  - more modularity implies more IPC
- Mach pioneered microkernel architectures, but had poor IPC performance
  - poor performance leads people to avoid microkernels entirely
  - or to architect their designs to reduce IPC (combine or co-locate modules)
- This paper explores a performance-oriented design approach and discusses specific IPC optimizations

Performance vs. Protection:
- Mach provided strong isolation between tasks
  - indirect IPC transfer via ports
  - limited access controlled by capabilities (port rights)
- L3 removes indirect transfer, capabilities, and RPC validation
  - achieves better performance
  - provides basic address space protection
- Recent L4 designs incorporate capabilities to achieve isolation in security-critical systems

L3 System Architecture:
- Mach-like design for modular operating systems
  - Minimal kernel
  - User-level servers for “traditional” OS features: page-fault handling, exception handling, device drivers
- System organized into tasks and threads
- IPC: direct data transfer between threads, addressed by thread ID

Design Philosophy:
- Focus on IPC
  - Any feature that will increase cost must be closely evaluated
  - When in doubt, design in favor of IPC
- Design for performance
  - A poorly performing technique is unacceptable
  - Evaluate feature cost compared to a concrete baseline
  - Aim for a concrete performance goal
- Comprehensive design
  - Consider synergistic effects of all methods and techniques
  - Cover all levels of implementation, from design to code

Performance Baseline:
- The cost of each feature must be evaluated relative to a concrete performance baseline
- For IPC, the theoretical minimum is an empty message: this measures the overhead without data transfer cost
  - 127 cycles without prefetching delays or cache misses
  - + 45 cycles for TLB misses
  - = 172 cycles minimum time
- GOAL: 350 cycles (7 µs) for short messages
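
The arithmetic on this slide can be checked in a few lines. This is only a sketch: the clock rate is not stated on the slide and is inferred here from the slide's own pairing of 350 cycles with 7 µs, and the name `cycles_to_us` is illustrative.

```python
# Cycle counts taken from the slide.
BASE_CYCLES = 127        # empty message, no prefetching delays or cache misses
TLB_MISS_CYCLES = 45     # added cost of TLB misses
GOAL_CYCLES = 350        # target for short messages
GOAL_MICROSECONDS = 7    # the same target expressed as time

minimum_cycles = BASE_CYCLES + TLB_MISS_CYCLES

# Assumption: clock rate inferred from "350 cycles (7 us)", not stated on the slide.
clock_mhz = GOAL_CYCLES / GOAL_MICROSECONDS


def cycles_to_us(cycles, mhz=clock_mhz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / mhz


print(minimum_cycles)                # 172
print(clock_mhz)                     # 50.0
print(cycles_to_us(minimum_cycles))  # 3.44
```

Under that inferred clock rate, the 350-cycle goal leaves roughly twice the theoretical minimum as headroom for everything beyond an empty message.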

Messages in L3:
- Message components: tag, direct string, indirect strings, memory objects
- Tag: description of message contents
- Direct string: data to be transferred directly from send buffer to receive buffer
- Indirect string: location and size of data to be transferred by reference
- Memory object: description of a region of memory to be mapped into the receiver's address space (shared memory)
- System calls: send, receive, call (send, then receive), reply/wait (send a reply, then wait to receive)
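
The four message components can be pictured as a small data structure. This is a sketch for illustration only: the class names, the field names, and the choice to derive the tag from the other parts are assumptions, not the actual L3 message format.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class Tag(NamedTuple):
    """Describes the message contents (hypothetical field layout)."""
    direct_bytes: int
    indirect_strings: int
    memory_objects: int


@dataclass
class IndirectString:
    address: int   # where the data lives in the sender's address space
    size: int      # number of bytes to transfer by reference


@dataclass
class MemoryObject:
    base: int      # start of the region to map into the receiver
    size: int      # length of the region in bytes


@dataclass
class Message:
    direct: bytes = b""
    indirect: List[IndirectString] = field(default_factory=list)
    memory_objects: List[MemoryObject] = field(default_factory=list)

    @property
    def tag(self) -> Tag:
        # The tag summarizes what the rest of the message contains.
        return Tag(len(self.direct), len(self.indirect), len(self.memory_objects))


msg = Message(direct=b"ok", indirect=[IndirectString(address=0x1000, size=4096)])
print(msg.tag)
```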

Basic Message Optimizations:
- The ability to transfer long, complex messages reduces the number of messages that need to be sent (system calls)
- Indirect strings avoid copy operations at user level
  - User specifies data location, rather than copying data to a buffer
  - Receiver specifies destination, rather than copying from a buffer
- Memory objects are transferred lazily, i.e., the page table is not modified until access is required
- Combined send/receive calls reduce the number of traps
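
The last point can be made concrete by counting kernel entries for one client/server round trip. This is a rough sketch that assumes every system call costs exactly one trap; the function name `traps_per_round_trip` is illustrative.

```python
TRAPS_PER_SYSTEM_CALL = 1  # assumption: each system call enters the kernel once


def traps_per_round_trip(client_calls, server_calls):
    """Count the traps needed for one request and its reply."""
    return (len(client_calls) + len(server_calls)) * TRAPS_PER_SYSTEM_CALL


# Separate primitives: each side issues two system calls.
separate = traps_per_round_trip(["send", "receive"], ["receive", "send"])

# Combined primitives: client uses call, server uses reply/wait.
combined = traps_per_round_trip(["call"], ["reply/wait"])

print(separate, combined)  # 4 2
```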

Optimization - Direct Transfer via Temporary Mapping:
- Two-copy message transfer costs […] n cycles
- L3 copies data once, to a special communication window in kernel space
- The window is mapped to the receiver for the duration of the call (page directory entry)
- Figure labels: address space A, kernel, address space B; copy; mapped with kernel-only permission; add mapping to space B
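
The difference between the two schemes can be modeled in user-level code. This is only an analogy under stated assumptions: a `memoryview` stands in for the temporary mapping (an alias of the receiver's buffer, created without copying), and both function names are hypothetical.

```python
def two_copy_transfer(sender_buf, receiver_buf):
    """Copy via an intermediate kernel buffer; returns bytes copied."""
    kernel_buf = bytes(sender_buf)                 # copy 1: sender -> kernel
    receiver_buf[:len(kernel_buf)] = kernel_buf    # copy 2: kernel -> receiver
    return 2 * len(kernel_buf)


def temporary_mapping_transfer(sender_buf, receiver_buf):
    """Copy once through a window that aliases the receiver's buffer."""
    window = memoryview(receiver_buf)              # "map": alias only, no data copied
    try:
        window[:len(sender_buf)] = sender_buf      # the single copy
    finally:
        window.release()                           # "unmap" when the call ends
    return len(sender_buf)


message = b"hello"
via_kernel = bytearray(8)
via_window = bytearray(8)
print(two_copy_transfer(message, via_kernel))           # 10
print(temporary_mapping_transfer(message, via_window))  # 5
```

Both receivers end up with identical contents; only the number of bytes moved differs.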

Optimization - Transfer Short Messages in Registers:
- IPC messages are often very short
  - Example: device driver ack or error replies
  - On average, between 50% and 80% of L3 messages are less than eight bytes long
- Even on the register-poor x86, 2 registers can be set aside for short message transfer
- The register transfer implementation saved 2.4 µs, even more than the overhead of temporary mapping (1.2 µs)
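
A sketch of the short-message path follows. It assumes two 32-bit registers (eight bytes in total) and little-endian packing; the helper names and the choice to return `None` for the fallback path are illustrative, not L3's interface.

```python
import struct

REGISTER_BYTES = 4      # assumption: 32-bit x86 registers
MESSAGE_REGISTERS = 2   # from the slide: two registers set aside
CAPACITY = REGISTER_BYTES * MESSAGE_REGISTERS


def pack_into_registers(payload):
    """Return two register values, or None if the message is too long."""
    if len(payload) > CAPACITY:
        return None                              # fall back to the memory path
    padded = payload.ljust(CAPACITY, b"\0")
    return struct.unpack("<2I", padded)


def unpack_from_registers(registers, length):
    """Recover the original bytes from the two register values."""
    return struct.pack("<2I", *registers)[:length]


regs = pack_into_registers(b"ack")
print(unpack_from_registers(regs, 3))            # b'ack'
print(pack_into_registers(b"a long message"))    # None
```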

Thread Scheduling in L3:
- The scheduler maintains several queues to keep track of relevant thread-state information
  - The ready queue stores threads that are able to run
  - Wakeup queues store threads that are blocked waiting for an IPC operation to complete or time out (organized by region)
  - The polling-me queue stores threads waiting to send to some thread
- Efficient representation of data structures
  - Queues are stored as doubly-linked lists distributed across TCBs
  - Scheduling never causes page faults
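
A queue "distributed across TCBs" keeps its links inside the control blocks themselves, so no separate node allocation is needed. The sketch below is a simplified model: it uses a single pair of links per TCB, whereas a TCB that can sit on several queues would need one pair per queue, and all names are illustrative.

```python
class TCB:
    """Thread control block carrying its own queue links."""

    def __init__(self, thread_id):
        self.thread_id = thread_id
        self.prev = self
        self.next = self


class Queue:
    """Circular doubly-linked list whose links live inside the TCBs."""

    def __init__(self):
        self.head = TCB(None)        # sentinel, never scheduled

    def insert(self, tcb):
        tail = self.head.prev        # append at the tail in O(1)
        tail.next = tcb
        tcb.prev = tail
        tcb.next = self.head
        self.head.prev = tcb

    def remove(self, tcb):
        tcb.prev.next = tcb.next     # unlink in O(1), no search needed
        tcb.next.prev = tcb.prev
        tcb.prev = tcb.next = tcb

    def __iter__(self):
        node = self.head.next
        while node is not self.head:
            yield node
            node = node.next


ready = Queue()
threads = [TCB(i) for i in (1, 2, 3)]
for t in threads:
    ready.insert(t)
ready.remove(threads[1])
print([t.thread_id for t in ready])  # [1, 3]
```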

Optimization - Lazy Dequeueing:
- Scheduler overhead (queue manipulation) is a significant component of IPC cost
- Threads doing IPC are often removed from a queue only to be inserted again a short while later
  - why not avoid the queue manipulation altogether and simply flag the thread control block?
  - i.e., move the overhead from the IPC path to other scheduling paths
- Invariants on scheduling queues
  - The ready queue contains at least all ready threads
  - A wakeup queue contains at least all waiting threads
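
The ready-queue invariant can be illustrated with a small model. This is a sketch under assumptions: blocking only clears a flag, stale entries are discarded when the scheduler next scans the queue, and the round-robin rotation and all names are illustrative rather than taken from L3.

```python
from collections import deque


class Thread:
    def __init__(self, name):
        self.name = name
        self.ready = False      # flag kept in the thread control block
        self.in_queue = False   # whether the ready queue still holds an entry


class LazyReadyQueue:
    """Holds at least all ready threads; may also hold stale entries."""

    def __init__(self):
        self.entries = deque()

    def make_ready(self, thread):
        thread.ready = True
        if not thread.in_queue:              # reuse a stale entry if one exists
            self.entries.append(thread)
            thread.in_queue = True

    def block(self, thread):
        thread.ready = False                 # IPC path: flag only, no queue work

    def pick_next(self):
        while self.entries:
            thread = self.entries.popleft()
            if thread.ready:
                self.entries.append(thread)  # stays queued, rotated to the back
                return thread
            thread.in_queue = False          # deferred removal of a stale entry
        return None


queue = LazyReadyQueue()
client, server = Thread("client"), Thread("server")
queue.make_ready(client)
queue.make_ready(server)
queue.block(client)        # client waits for a reply: no dequeue
queue.make_ready(client)   # reply arrives: no enqueue either
print(len(queue.entries))  # 2
```

In the usage above, the block and wake-up pair touches only the flag, which is the case the slide describes as common for threads doing IPC.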

Optimization - Store Thread Control Blocks in Virtual Arrays:
- A thread control block (TCB) stores kernel data for a particular thread
- Every operation on a thread requires lookup, and possibly modification, of that thread's TCB
- Storing TCBs in a virtual array provides fast access to TCB structures
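
With a virtual array, locating a TCB is plain address arithmetic on the thread number rather than a table search. The base address and TCB size below are hypothetical values chosen only to make the sketch concrete.

```python
TCB_AREA_BASE = 0xE000_0000   # hypothetical start of the TCB array in virtual memory
TCB_SIZE = 1024               # hypothetical size of one TCB in bytes


def tcb_address(thread_number):
    """Virtual address of a thread's TCB: one multiply and one add."""
    return TCB_AREA_BASE + thread_number * TCB_SIZE


def thread_number_of(address):
    """Inverse mapping: which thread owns the TCB containing this address."""
    return (address - TCB_AREA_BASE) // TCB_SIZE


print(hex(tcb_address(3)))               # 0xe0000c00
print(thread_number_of(tcb_address(3)))  # 3
```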

Optimization - Compact Structures with Good Locality:
- Access TCBs through a pointer to the center of the structure so that short displacements can be used
  - One-byte displacements reach twice as much TCB data as with a pointer to the start of the structure
- Group related TCB information on cache line boundaries to minimize cache misses
- Store frequently accessed kernel data in the same page as hardware tables (IDT, GDT, TSS)
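
The "twice as much" claim follows from a one-byte displacement being signed, so it covers -128 to +127. The sketch below counts the bytes reachable from a given pointer position; the 512-byte structure size is a hypothetical value.

```python
DISP8_MIN, DISP8_MAX = -128, 127   # range of a signed one-byte displacement


def reachable_bytes(pointer_offset, struct_size):
    """Bytes of the structure addressable with a one-byte displacement."""
    lowest = max(0, pointer_offset + DISP8_MIN)
    highest = min(struct_size - 1, pointer_offset + DISP8_MAX)
    return highest - lowest + 1


STRUCT_SIZE = 512                                      # hypothetical TCB size
print(reachable_bytes(0, STRUCT_SIZE))                 # 128 (pointer at the start)
print(reachable_bytes(STRUCT_SIZE // 2, STRUCT_SIZE))  # 256 (pointer at the center)
```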

Performance Impact of Specific Optimizations:
- Large messages are dominated by copy overhead
- Small messages benefit from faster context switching, fewer system calls, and fast access to kernel structures

IPC Performance Compared to Mach (Short Message):
- Measured using a ping-pong micro-benchmark that makes use of unified send/receive calls
- For an n-byte message, the cost in L3 is […] n µs

IPC Performance Compared to Mach (Long Messages):
- Same benchmark with larger messages
- For n-byte messages larger than 2k, cache misses increase and the IPC time is […] n µs
  - Slightly higher base cost
  - Higher per-byte cost
- By comparison, Mach takes […] n µs

Comparison of L3 RPC to Previous Systems:

Conclusions:
- Well-performing IPC is essential for microkernels to gain wide adoption; poor IPC performance was a major limitation of Mach
- L3 demonstrates that good performance is attainable in a microkernel system, with IPC performance that is 10 to 22 times better than Mach
- The performance-centric techniques demonstrated in the paper can be employed in any system, even if the specific optimizations cannot

Spare Slides

Optimization - Reduce Segment Register Loads:
- Loading segment registers is expensive (9 cycles per register), so many systems use a single, flat segment
- Kernel preservation of the segment registers requires 66 cycles for the naive approach (always reload registers)
- L3 instead checks whether the flat value is still intact, and only does a load if not
- Checking alone costs 10 cycles