COREY: AN OPERATING SYSTEM FOR MANY CORES

Slides:

Advertisements

Similar presentations

L.N. Bhuyan Adapted from Patterson’s slides

Advertisements

Threads, SMP, and Microkernels

System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

R2: An application-level kernel for record and replay Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, Z. Zhang, (MSR Asia, Tsinghua, MIT),

Chapter 8-1 : Multiple Processor Systems Multiple Processor Systems Multiple Processor Systems Multiprocessor Hardware Multiprocessor Hardware UMA Multiprocessors.

Operating Systems Lecture 10 Issues in Paging and Virtual Memory Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing.

MACHINE-INDEPENDENT VIRTUAL MEMORY MANAGEMENT FOR PAGED UNIPROCESSOR AND MULTIPROCESSOR ARCHITECTURES R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron,

Corey – An Operating System for Many Cores 謝政宏.

Study of Hurricane and Tornado Operating Systems By Shubhanan Bakre.

Using DSVM to Implement a Distributed File System Ramon Lawrence Dept. of Computer Science

OS2-1 Chapter 2 Computer System Structures. OS2-2 Outlines Computer System Operation I/O Structure Storage Structure Storage Hierarchy Hardware Protection.

Memory Management (II)

1 Multiprocessors. 2 Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad.

Page 1 Processes and Threads Chapter Processes 2.2 Threads 2.3 Interprocess communication 2.4 Classical IPC problems 2.5 Scheduling.

Introduction to Kernel

Inter Process Communication:  It is an essential aspect of process management. By allowing processes to communicate with each other: 1.We can synchronize.

Advanced OS Chapter 3p2 Sections 3.4 / 3.5. Interrupts These enable software to respond to signals from hardware. The set of instructions to be executed.

OPERATING SYSTEMS Introduction

Computer System Architectures Computer System Software

ITEC 325 Lecture 29 Memory(6). Review P2 assigned Exam 2 next Friday Demand paging –Page faults –TLB intro.

Microkernels, virtualization, exokernels Tutorial 1 – CSC469.

APPLICATION PERFORMANCE AND FLEXIBILITY ON EXOKERNEL SYSTEMS M. F. Kaashoek, D. R. Engler, G. R. Ganger H. M. Briceño, R. Hunt, D. Mazières, T. Pinckney,

High Performance User-Level Sockets over Gigabit Ethernet Pavan Balaji Ohio State University Piyush Shivam Ohio State University.

Threads, Thread management & Resource Management.

August 15, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 12: Multiprocessors: Non-Uniform Memory Access * Jeremy R. Johnson.

Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,

The Performance of Microkernel-Based Systems

A Measurement Based Memory Performance Evaluation of High Throughput Servers Garba Isa Yau Department of Computer Engineering King Fahd University of Petroleum.

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors THOMAS E. ANDERSON Presented by Daesung Park.

Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.

ECE200 – Computer Organization Chapter 9 – Multiprocessors.

(a) What is the output generated by this program? In fact the output is not uniquely defined, i.e., it is not always the same. So please give three examples.

Processes CSCI 4534 Chapter 4. Introduction Early computer systems allowed one program to be executed at a time –The program had complete control of the.

Processes and Virtual Memory

Full and Para Virtualization

1 Lecture 1: Computer System Structures We go over the aspects of computer architecture relevant to OS design  overview  input and output (I/O) organization.

Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.

Threads. Readings r Silberschatz et al : Chapter 4.

Lecture 27 Multiprocessor Scheduling. Last lecture: VMM Two old problems: CPU virtualization and memory virtualization I/O virtualization Today Issues.

What is a Process ? A program in execution.

Silberschatz, Galvin and Gagne ©2011 Operating System Concepts Essentials – 8 th Edition Chapter 2: The Linux System Part 2.

Introduction Contain two or more CPU share common memory and peripherals. Provide greater system throughput. Multiple processor executing simultaneous.

Contents 1.Overview 2.Multithreading Model 3.Thread Libraries 4.Threading Issues 5.Operating-system Example 2 OS Lab Sun Suk Kim.

Introduction to Kernel

Introduction to Operating Systems

Processes and threads.

Distributed Shared Memory

Lecture 21 Synchronization

The Multikernel: A New OS Architecture for Scalable Multicore Systems

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

Faster Data Structures in Transactional Memory using Three Paths

Task Scheduling for Multicore CPUs and NUMA Systems

KERNEL ARCHITECTURE.

Introduction to Operating Systems

The University of Adelaide, School of Computer Science

Presenter (Group 40) Asheen Richards / 鄧芷喬 S

Chapter 2: The Linux System Part 2

AN ANALYSIS OF LINUX SCALABILITY TO MANY CORES

Process Description and Control

Multithreaded Programming

Process Description and Control

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 23: Virtual Memory, Multiprocessors

Lecture 17 Multiprocessors and Thread-Level Parallelism

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Dirty COW Race Condition Attack

Presentation transcript:

COREY: AN OPERATING SYSTEM FOR MANY CORES S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, Z. Zhang MIT, Fudan U, Microsoft Research Asia, Xi’an Jiaotong U

Overview Paper presents a technique allowing multicore architectures to overcome memory access bottlenecks Key idea is that applications should control sharing of main memory and kernel resources Make them private by default Let each application specify which resources it want to share

What’s interesting It does increase system performance Measured by microbenchmarks and application benchmarks It lets applications tell the kernel how to manage the kernel resources they use Just the opposite of what is normally done Same approach as exokernels (xOK)

INTRODUCTION

Motivation (I) Most PCs have or will have multicore chips Cache-coherent shared memory hardware is the new standard Performance of some OS services scales very poorly with number of cores/processors Contention for OS queues, core directory lookups Can dominate performance of some applications

Motivation (II) Main source of poor scalability is concurrent updates to shared data structures Consistency overhead

An example Microbenchmark Creates multiple threads within a process Each thread creates a file descriptor then repeatedly duplicates it and close the result Shared resource is process file descriptor table Any modification to the table invalidates its cached copies

An example (II)

An example (III) Throughput of microbenchmark actually decreases with number of cores Problem caused by cache coherence protocol Each iteration results in a cache miss Resolving the miss requires access to a shared data structure protected by spin locks Increasing the number of threads attempting to update the table introduces queuing delays

Common approaches Avoiding shared data structures Allowing concurrent access to shared data structures through Fine-grain locking Wait-free primitives Can use atomic operations or transactional memory

Corey approach (I) Not all instances of a given resource type must be shared Depends of application requirements Corey lets application tell kernel which instances of a particular resource type must be shared Assumes that other instances can remain private OS does not incur unnecessary sharing costs

Corey Organized as an exokernel Corey kernel provides Address ranges Kernel cores Shares Most higher services are implemented as library operating systems

What is an exokernel? (I) In most operating systems only privileged servers and the kernel can manage system resources The exokernel architecture delegates resource management to user applications. Applications that do not want this responsibility communicate with the exokernel through a “library OS”

What is an exokernel? (II) library OS Exokernel protects but does not manage system resources User process

MULTICORE CHALLENGES

Multicore organizations Often involve multiple chips Say four chips with four cores per chip Have a cache hierarchy on each chip L1, L2, L3 Some caches are private, other are shared Accessing a cache on a chip is much faster than accessing a cache on another chip

Example (I) AMD 16-core system Sixteen cores on four chips Each core has a 64-KB L1 and a 512-KB L2 cache Each chip has a 2-MB shared L3 cache

X/Y where X is latency in cycles Y is bandwidth in bytes/cycle

Example (II) Observe that access times are non-uniform Takes more time to access L1 or L2 cache of another core than accessing shared L3 cache Takes more time to access caches in another chip than local caches Access times and bandwidths depend on chip interconnect topology

Performance issues (I) Linux spin locks Repeatedly access a shared lock variable MCS locks (Mellor-Crummey and Scott , 1991) Process requesting the lock inserts itself in a possibly empty queue Waiting processes do not interfere with each other

Performance issues (II)

Performance issues (III) Spinlocks are better at low contention rates Require three instructions to acquire and release a lock MCS locks require fifteen instructions MCS locks are much better at higher contention rates Less synchronization overhead

Motivation for address ranges Most OSes let applications chose between A single address space shared by all cores Threaded applications One private address space per core Applications forking full processes Neither of these two solutions is fully satisfactory

MapReduce applications Map phase: Processes read parts of application’s inputs Generate intermediary results and store them locally Reduce phase: Processors collate results produced by multiple map instances Produce the output

Single address space Bad for map phase, very good for reduce phase

Separate address spaces Very good for map phase, bad for reduce phase

Best of both worlds

COREY DESIGN

Address ranges (I) Let applications specify which parts of their address space are shared and which are private Private address ranges will not incur any consistency overhead Shared address ranges can share their hardware page tables Minimizes soft page faults (when page is in main memory but not mapped in the process page table)

Address ranges (II) Application A Application B Private Private Shared

Address ranges (III) Kernel-provided abstraction specifying a virtual-to-physical mapping for a range of virtual addresses If multiple cores include the same address range in their address space they will share the same mapping to the same physical pages Each core can freely manipulate and delete mapping in private address ranges w/o any consequences for other core performance

Kernel cores In most OS, system calls are executed on the core of the invoking process Bad idea if the system call needs to access large shared data structures Kernel cores let applications dedicate cores to run specific kernel functions Avoids inter-core contention over the data these functions access

Symmetric multiprocessing (I) Each core can execute both user and kernel code No execution bottleneck User or kernel code User or kernel code ...

Symmetric multiprocessing (II) Works very well unless too many kernel function instances access large shared data structures Contention User or kernel code User or kernel code Shared data

The solution Run kernel functions that access large shared data structures No inter-core contention Kernel code User or kernel code User or kernel code Shared data

Shares (I) Many kernel operations involve looking up identifiers in tables to obtain a pointer to a given kernel data structure (file descriptor entry, …) Lookup tables for kernel objects that let applications specify which object identifiers are visible to other cores Each application core has a root share that is private to that core Needs no locks since its private

Shares (II) If two cores want to “share a share,” they create one and add the share ID to either Their private root shares or A share reachable from these root shares Allows applications to restrict sharing to kernel structures that must be shared

Shares (II) A's table B's table Private: no locks Private: no locks Shared: must use locks to provide mutual exclusion

COREY KERNEL (not discussed)

SYSTEM SERVICES (not discussed)

Execution forking cfork(core_id) is an extension of UNIX fork() that creates a new process (pcore) on core core_id Application can specify multiple levels of sharing between parent and child Default is copy-on-write

Network Applications can decide to run Multiple network stacks A single shared network stack

Buffer cache Shared buffer like regular UNIX buffer cache Three modifications A lock-free tree allows multiple cores to locate cached blocks w/o contention A write scheme tries to minimize contention A scalable read/write lock

APPLICATIONS

MapReduce applications Modified the Phoenix MapReduce implementation to take advantage of Cory features (Metis) Each core has a separate address space with Private mappings for most data Address ranges to share the output of the map with other cores

Web server applications Corey web server is built from three components Web daemons: Process HTTP requests and have their own TCP/IP stack Also referred as webd cores Kernel cores (optional) Applications

IMPLEMENTATION

Current status Runs on AMD Opterons and Intel Xeons Implementation of address ranges is tailored to architectures with hardware page tables Would provide lesser benefits on architectures where the kernel manages the page tables

EVALUATION

Evaluation of address ranges (I) Two micro benchmarks memclone: Has each core allocate its own 100 MB array and modify each page of the array mempass: Allocates a single 100MB array on one of the clones, touches each buffer page and passes it to the next core which repeats the process

memclone

mempass

Evaluation of address ranges (II) Address ranges can support as well Multicore applications requiring private memory Multicore applications requiring shared memory

Evaluation of kernel cores (I) Simple TCP service Accepts incoming connection requests Writes 128 bytes to the connection then closes it Two configurations “Dedicated” uses a kernel core for all network processing “Polling” uses a kernel core only to poll for packet notifications and transmit completions

Throughput

L3 cache misses

Evaluation of kernel cores (II) “Dedicated” configuration Reaches network device upper bound of 110,000 connections per second with five cores Polling requires 11 cores Occasions much fewer L3 misses than “Polling” and regular Linux

Evaluation of shares (I) Two microbenchmarks Each core calls share_addobj()to add a per core segment to a global share then calls share_delobj()to delete that segment Same but per core segment is added to a local share

Throughput

L3 cache misses

More benchmarks Both MapReduce and Webd scale up much better with Corey than with Linux

CONCLUSIONS In order for applications to scale up on multicore architectures, they must control sharing Since this new requirement greatly complicates the design of applications, it is best suited to application frameworks such as MapReduce