Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm

Locality
What do they mean by locality?
– locality of reference?
– temporal locality?
– spatial locality?

Temporal Locality
Recently accessed data and instructions are likely to be accessed in the near future

Spatial Locality
Data and instructions close to recently accessed data and instructions are likely to be accessed in the near future

Locality of Reference
If we have good locality of reference, is that a good thing for multiprocessors?

Locality in Multiprocessors
Good performance depends on data being local to a CPU
– each CPU uses data from its own cache: the cache hit rate is high, and each CPU has good locality of reference
– once data is brought into a cache, it stays there: cache contents are not invalidated by other CPUs, and different CPUs have different locality of reference

Example: Shared Counter
[Figure sequence: a counter in shared memory and two CPUs, each with a private cache. The counter (0) is read and cached; an increment (1) updates the cached copy; a concurrent read by the other CPU is OK (the line can be shared), but the next increment (2) must invalidate the other CPU's cached copy.]
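
As a concrete illustration, here is a minimal C++ sketch of the pattern in the figure (not code from the paper; the thread count and iteration count are arbitrary):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> counter{0};  // one counter, shared by every CPU

int main() {
    std::vector<std::thread> workers;
    for (int cpu = 0; cpu < 4; ++cpu) {
        // Every thread hammers the same cache line: each fetch_add pulls
        // the line into that core's cache and invalidates the others.
        workers.emplace_back([] {
            for (int i = 0; i < 1000000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    std::printf("count = %ld\n", counter.load());
}
```

Every update migrates the counter's cache line to the updating core, so the miss rate grows with the number of CPUs.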

Performance
[Figure: performance graph not included in the transcript.]

Problems
The counter bounces between CPU caches
– the cache miss rate is high
Why not give each CPU its own piece of the counter to increment?
– takes advantage of the commutativity of addition
– counter updates can be local
– reads must add up all the pieces

Array-based Counter
[Figure sequence: an array with one slot per CPU, all starting at 0. Each CPU increments only its own slot, so updates proceed independently; reading the counter means adding all the slots (1 + 1 = 2).]
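
A minimal sketch of this array-based counter (illustrative names; `kNumCpus` and the CPU index parameter are assumptions, since in a kernel the index would come from the processor itself):

```cpp
#include <atomic>

constexpr int kNumCpus = 4;

// One slot per CPU, packed side by side in memory.
std::atomic<long> slots[kNumCpus];

// Update: touch only the local slot; addition commutes, so this is safe.
void inc(int cpu) {
    slots[cpu].fetch_add(1, std::memory_order_relaxed);
}

// Read: must visit every slot and sum them.
long read() {
    long sum = 0;
    for (int cpu = 0; cpu < kNumCpus; ++cpu)
        sum += slots[cpu].load(std::memory_order_relaxed);
    return sum;
}
```

Packed this way, though, the slots share cache lines, which is exactly the problem the next slides diagnose.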

Performance
Performs no better than the shared counter!

Problem: False Sharing
Caches operate at the granularity of cache lines
– if two pieces of the counter are in the same cache line, the line cannot be cached (for writing) on more than one CPU at a time

False Sharing
[Figure sequence: both counter slots (0,0) live in the same cache line, so the line is cached and shared by both CPUs. When one CPU increments its slot (1,0), the line is invalidated in the other CPU's cache; a read re-shares it, and the other CPU's increment (1,1) invalidates it again, ping-ponging the line back and forth.]

Solution?
Spread the counter components out in memory: pad the array

Padded Array
[Figure: the counter slots are spread out in memory so each occupies its own cache line; updates are now independent of each other.]
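
A sketch of the padded version (the 64-byte line size is an assumption; the real size is hardware-dependent):

```cpp
#include <atomic>

constexpr int kNumCpus = 4;

// Pad each slot to a full cache line so no two CPUs ever write the same line.
struct alignas(64) PaddedSlot {
    std::atomic<long> value{0};
};

PaddedSlot slots[kNumCpus];

void inc(int cpu) {
    slots[cpu].value.fetch_add(1, std::memory_order_relaxed);
}

long read() {
    long sum = 0;
    for (int cpu = 0; cpu < kNumCpus; ++cpu)
        sum += slots[cpu].value.load(std::memory_order_relaxed);
    return sum;
}
```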

Performance
Works better

Locality in the OS
Locality has a serious performance impact and is difficult to retrofit into an existing system
Tornado:
– designed from the ground up
– object-oriented approach (natural locality)

Tornado
Object-oriented approach
Clustered objects
Protected procedure calls
Semi-automatic garbage collection
– simplifies locking protocols

Object-Oriented Structure
Each resource is represented by an object
Requests to different virtual resources are handled independently
– no shared data structure accesses
– no shared locks

Why Object-Oriented?
Coarse-grain locking:
[Figure: a process table holding Process 1, Process 2, …, protected by a single lock. While one CPU holds the lock to access Process 1, another CPU's access to Process 2 must wait.]

Object-Oriented Approach
class ProcessTableEntry {
    data   // per-process state
    lock   // a fine-grain lock protecting only this entry
    code   // methods that operate on this entry
}

Object-Oriented Approach
Fine-grain, instance locking:
[Figure: the same process table, but each entry (Process 1, Process 2, …) has its own lock, so accesses to different processes proceed in parallel; a sketch follows.]
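
A hypothetical C++ rendering of the per-entry locking idea (names and fields are illustrative, not Tornado's actual classes):

```cpp
#include <mutex>
#include <string>

// Each table entry carries its own lock, so operations on different
// processes never contend with each other.
struct ProcessTableEntry {
    std::mutex lock;    // fine-grain, per-instance lock
    std::string name;   // illustrative per-process state
    int priority = 0;

    void setPriority(int p) {
        std::lock_guard<std::mutex> g(lock);  // locks only this entry
        priority = p;
    }
};
```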

Clustered Objects
Problem: how can we improve locality for widely shared objects?
A single logical object can be composed of multiple local representatives ("reps")
– the reps coordinate with each other to manage the object's state
– all reps share the object's reference

Clustered Objects
[Figure: one logical clustered object backed by a local representative per processor.]

Clustered Object References
[Figure: the same object reference, used on any processor, resolves to that processor's local rep.]

Clustered Objects: Implementation
A translation table per processor
– located at the same virtual address on every processor
– each entry holds a pointer to the local rep
A clustered object reference is just a pointer into the table
– reps are created on demand, when first accessed
– a global miss-handling object handles the first access
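
A much-simplified, user-level sketch of the mechanism (all names are illustrative; Tornado's real tables live at a fixed virtual address and misses are handled by a global miss-handling object, not a mutex):

```cpp
#include <array>
#include <atomic>
#include <mutex>

constexpr int kNumCpus = 4;

// Per-processor representative, padded to its own cache line.
struct alignas(64) CounterRep {
    std::atomic<long> value{0};
};

class ClusteredCounter {
    // Per-processor "translation table" slot: null until first use.
    std::array<std::atomic<CounterRep*>, kNumCpus> reps_{};
    std::mutex missLock_;  // stand-in for the global miss handler

    CounterRep& repFor(int cpu) {
        CounterRep* rep = reps_[cpu].load(std::memory_order_acquire);
        if (!rep) {  // miss: create the rep on demand
            std::lock_guard<std::mutex> g(missLock_);
            rep = reps_[cpu].load(std::memory_order_relaxed);
            if (!rep) {
                rep = new CounterRep;  // never freed in this sketch
                reps_[cpu].store(rep, std::memory_order_release);
            }
        }
        return *rep;
    }

public:
    void inc(int cpu) {  // local: touches only this CPU's rep
        repFor(cpu).value.fetch_add(1, std::memory_order_relaxed);
    }
    long read() {        // global: consults every rep
        long sum = 0;
        for (auto& slot : reps_)
            if (CounterRep* rep = slot.load(std::memory_order_acquire))
                sum += rep->value.load(std::memory_order_relaxed);
        return sum;
    }
};
```

With this structure, the counter from the earlier slides becomes a clustered object: inc touches only the local rep, while read visits them all, exactly as the following slides illustrate.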

Clustered Objects
Degree of clustering
Management of state
– partitioning
– distribution
– replication (how to maintain consistency?)
Coordination between reps?
– shared memory
– remote PPCs (protected procedure calls)

Counter: Clustered Object
[Figure sequence: the counter as a clustered object. Each CPU's object reference resolves to a local rep, so increments update only the local rep (1 and 1, independent of each other); reading the counter adds up all the reps (1 + 1 = 2).]

Synchronization
Two distinct locking issues:
– locking: mutually exclusive access to objects
– existence guarantees: making sure an object is not freed while still in use

Locking in Tornado
Locking is encapsulated within individual objects
Clustered objects limit contention
Spin-then-block locks are used (see the sketch below)
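
A sketch of a spin-then-block lock (illustrative, not Tornado's code; here yielding stands in for the real blocking a kernel would do by queueing the thread):

```cpp
#include <atomic>
#include <thread>

// Spin briefly in the hope the holder releases soon, then stop
// burning cycles and give up the CPU.
class SpinThenBlockLock {
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        for (int spins = 0; spins < 1000; ++spins)   // spin phase
            if (!locked_.test_and_set(std::memory_order_acquire))
                return;
        while (locked_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();               // "block" phase
    }
    void unlock() {
        locked_.clear(std::memory_order_release);
    }
};
```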

Existence Guarantees: the Problem
Use a lock to protect all references to an object?
– eliminates races where one thread is accessing the object while another is deallocating it
– but results in a complex global hierarchy of locks
Tornado instead uses semi-automatic garbage collection
– a clustered object reference can be used at any time
– eliminates the need for reference locks

Existence Guarantees in Tornado
Semi-automatic garbage collection:
– the programmer decides what to free; the system decides when to free it
– guarantees that object references can be used safely
– eliminates the need for reference locks

How Does It Work?
The programmer removes all persistent references
– normal cleanup is done manually
The system tracks all temporary references
– the kernel is event-driven
– each processor maintains an activity counter of in-flight operations
– the object is deleted only when the activity counters reach zero
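
A simplified sketch of the idea (all names are illustrative; the real scheme is per-processor and generation-based rather than this single global quiescence check):

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <vector>

constexpr int kNumCpus = 4;

// One in-flight-operation counter per processor, each on its own line.
struct alignas(64) Activity {
    std::atomic<long> inflight{0};
};
Activity activity[kNumCpus];

void enterOp(int cpu) { activity[cpu].inflight.fetch_add(1); }  // op begins
void exitOp(int cpu)  { activity[cpu].inflight.fetch_sub(1); }  // op ends

// No operation in flight anywhere => no temporary references remain.
bool quiescent() {
    for (auto& a : activity)
        if (a.inflight.load() != 0) return false;
    return true;
}

std::mutex zombieLock;
std::vector<std::function<void()>> zombies;  // deleters awaiting quiescence

// The programmer decides *what* to free...
void deferDelete(std::function<void()> deleter) {
    std::lock_guard<std::mutex> g(zombieLock);
    zombies.push_back(std::move(deleter));
}

// ...and the system decides *when*, once the counters have drained.
void reap() {
    std::lock_guard<std::mutex> g(zombieLock);
    if (!quiescent()) return;
    for (auto& d : zombies) d();
    zombies.clear();
}
```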

Performance
Scalability
[Figure: scalability graphs not included in the transcript.]

Conclusion
The object-oriented approach and clustered objects exploit locality to improve concurrency
The OO design has some overhead, but it is low compared to the performance advantages
Tornado scales extremely well and achieves high performance on shared-memory multiprocessors