Slide 1: The NUMAchine Multiprocessor
ICPP 2000, Westin Harbour Castle, August 24, 2000
Slide 2: Presentation Overview
University of Toronto
- Architecture: system overview; key features (fast ring routing, hardware cache coherence, sequentially consistent memory model)
- Simulation studies: ring performance, Network Cache performance, coherence overhead
- Prototype performance
- Hardware status
- Conclusion
Slide 3: System Architecture
- Hierarchical ring network, based on clusters (NUMAchine's 'Stations') which are themselves bus-based SMPs
Slide 4: NUMAchine's Key Features
- Hierarchical rings: allow very fast and simple routing, and provide good support for broadcast and multicast
- Hardware cache coherence: hierarchical, directory-based, CC-NUMA system; writeback/invalidate protocol, designed to use the broadcast/ordering properties of rings
- Sequentially consistent memory model: the most intuitive model for programmers trained on uniprocessors
- Simple and low cost, yet with good flexibility, scalability and performance
Slide 5: Fast Ring Routing: Filtermasks
- Fast ring routing is achieved through filtermasks (i.e. simple bit-masks) that store cache-line location information; their imprecision reduces directory storage requirements
- The filtermasks are used directly by the routing hardware in the ring interfaces
Slide 6: Hardware Cache Coherence
- Hierarchical, directory-based, writeback/invalidate protocol
- Directory entries are stored both in the per-Station memory (the 'home' location) and cached in the network interfaces (hence the name Network Cache)
- The Network Cache stores the remotely cached directory information as well as the cache lines themselves, allowing the network interface to perform coherence operations locally (on-Station) and avoid remote accesses to the home directory
- Filtermasks indicate which Stations (i.e. clusters) may potentially have a copy of a cache line (the fuzziness is due to the imprecise nature of the filtermasks)
- Processor masks are used only within a Station, to indicate which particular caches may contain a copy (the fuzziness here is due to Shared lines that may have been silently ejected)
Slide 7: Memory Model: Sequential Consistency
- The most intuitive model for the normally trained programmer: increases the usability of the system
- Easily supported by NUMAchine's ring network: the only change necessary is to force invalidates to pass through a global 'sequencing point' on the ring, increasing the average invalidation latency by 2 ring hops (40 ns with our default 50 MHz rings)
Slide 8: Simulation Studies: Ring Performance 1
- Uses the SPLASH-2 benchmark suite and a cycle-accurate hardware simulator with full modeling of the coherence protocol
- Applications with high communication-to-computation ratios (e.g. FFT, Radix) show high utilizations, particularly on the Central Ring (indicating that a faster Central Ring would help)
Slide 9: Simulation Studies: Ring Performance 2
- Maximum and average ring-interface queue depths indicate network congestion, which correlates with bursty traffic
- A large difference between the maximum and average values indicates large variability in burst size
Slide 10: Simulation Studies: Network Cache
- Graphs measure the Network Cache's effect by its hit rate (i.e. the reduction in remote data and coherence traffic)
- Categorizing the hits by coherence directory state also shows where the benefits come from: caching shared data, or reducing invalidations and coherence traffic
Slide 11: Simulation Studies: Coherence Overhead
- Coherence overhead is measured by allowing all writes to succeed immediately, without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise from capacity effects)
- Results indicate that in many cases it is poor data locality and/or poor parallelizability that impede performance, not cache coherence
Slide 12: Prototype Performance
- Speedups from the hardware prototype, compared against estimates from the simulator
Slide 13: Hardware Prototype Status
- Fully operational, running the custom Tornado OS, with a 32-processor system shown below
Slide 14: Conclusion
- 4- and 8-way SMPs are fast becoming commodity items
- The NUMAchine project has shown that a simple, cost-effective CC-NUMA multiprocessor can be built using these SMP building blocks and a simple ring network, and still achieve good performance and scalability
- In the medium-scale range (a few tens to hundreds of processors), rings are a good choice for a multiprocessor interconnect
- We have demonstrated an efficient hardware cache coherence scheme, designed to make use of the natural ordering and broadcast capabilities of rings
- NUMAchine's architecture efficiently supports a sequentially consistent memory model, which we feel is essential for increasing the ease of use and programmability of multiprocessors
Slide 15: Acknowledgments: The NUMAchine Team
Hardware: Prof. Zvonko Vranesic, Prof. Stephen Brown, Robin Grindley (SOMA Networks), Alex Grbic, Prof. Zeljko Zilic (McGill), Steve Caranci (Altera), Derek DeVries (OANDA), Guy Lemieux, Kelvin Loveless (GNNettest), Prof. Sinisa Srbljic (Zagreb), Paul McHardy, Mitch Gusat (IBM)
Operating Systems: Prof. Michael Stumm, Orran Krieger (IBM), Ben Gamsa, Jonathon Appavoo, Robert Ho
Compilers: Prof. Tarek Abdelrahman, Prof. Naraig Manjikian (Queens)
Applications: Prof. Ken Sevcik