Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA Babak Falsafi and David A. Wood University of Wisconsin, Madison, 1997 Presented by: Jie Xiao.

Presentation transcript:

Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA Babak Falsafi and David A. Wood University of Wisconsin, Madison, 1997 Presented by: Jie Xiao Feb 6, 2008

Outline:
• Introduction
• CC-NUMA, S-COMA, R-NUMA
• Theoretical Results
• Simulation Results
• Pros & Cons

Introduction
• DSM (distributed shared memory) clusters
• Remote-miss latency > local-miss latency
• Looking for the best remote caching strategy!

Introduction
Solutions:
• CC-NUMA: Cache-Coherent Non-Uniform Memory Access
• S-COMA: Simple Cache-Only Memory Architecture
Our approach:
• R-NUMA: Reactive NUMA

CC-NUMA block cache: small & fast

S-COMA
• Page cache: sufficiently large (part of the local node’s main memory)
• Page granularity
• OS handles allocation and migration

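CC-NUMA vs S-COMA
Which one is better? It depends on the application, specifically on its mix of:
(1) Communication pages: misses come mostly from coherence traffic, which favors CC-NUMA
(2) Reuse pages: misses come mostly from capacity and conflict evictions, which favors S-COMA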

R-NUMA
• Dynamically switches a page from CC-NUMA to S-COMA
• Refetch count: kept per node, per page (hardware counter)
• Each node independently chooses the best protocol for a particular page
• Greater performance stability
• Not much extra hardware
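The per-node, per-page refetch counter can be sketched as follows. This is a minimal illustration, not the paper's hardware design: the class and method names are hypothetical, the default threshold of 64 is the value quoted in the simulation setup, and switching once the count reaches the threshold (rather than exceeds it) is an assumption.

```python
from collections import defaultdict

RELOCATION_THRESHOLD = 64  # value quoted in the simulation setup


class NodePageCounters:
    """One node's view of remote pages (hypothetical helper, not from the paper)."""

    def __init__(self, threshold=RELOCATION_THRESHOLD):
        self.threshold = threshold
        self.refetches = defaultdict(int)  # page id -> refetch count on this node
        self.scoma_pages = set()           # pages this node has relocated to its page cache

    def protocol(self, page):
        """Protocol this node currently uses for the given remote page."""
        return "S-COMA" if page in self.scoma_pages else "CC-NUMA"

    def record_refetch(self, page):
        """Count one refetch of a block from `page`; relocate the page at the threshold."""
        if page in self.scoma_pages:
            return "S-COMA"  # already relocated, nothing left to count
        self.refetches[page] += 1
        # Assumption: relocation triggers when the count reaches the threshold.
        if self.refetches[page] >= self.threshold:
            self.scoma_pages.add(page)
            del self.refetches[page]
        return self.protocol(page)
```

Because each node would hold its own `NodePageCounters` instance, the same page can stay CC-NUMA on one node while another node has already relocated it to S-COMA.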

R-NUMA [Figure; labels: CC-NUMA, R-NUMA, S-COMA]

R-NUMA [Figure; labels: CC-NUMA, S-COMA]

Theoretical Results
Worst-case analysis: R-NUMA performs no more than 3 times worse than either CC-NUMA or S-COMA.
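Written as an inequality, the bound stated above reads as follows. The symbols are assumptions introduced here for illustration: T denotes the cost being compared (for example execution time or remote-miss overhead) under each protocol; the slide does not say which measure the analysis uses.

\[
T_{\text{R-NUMA}} \;\le\; 3 \cdot \min\left(T_{\text{CC-NUMA}},\; T_{\text{S-COMA}}\right)
\]

Since the bound holds against either protocol, it holds against whichever of the two is better for the workload at hand.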

Simulation Results
Baseline: CC-NUMA with an infinite block cache
Configurations compared:
• CC-NUMA: 32 KB block cache
• S-COMA: 320 KB page cache
• R-NUMA: 128 B block cache, 320 KB page cache, relocation threshold 64

Simulation Results

• R-NUMA is sensitive to block cache size only for applications whose reuse working set does not fit in the page cache (e.g. ocean).
• Applications with a large fraction of reuse pages favor a smaller threshold value (e.g. cholesky, fmm, lu and ocean).
• R-NUMA is not very sensitive to page-fault and TLB invalidation overheads.

Pros & Cons
Pros
+ Flexible: the protocol is chosen per page, per node
+ Exploits the best remote caching strategy without much extra work
Cons
- Threshold: is 64 the right value? Should it change with the application?