Slide 1: The NUMAchine Multiprocessor
ICPP 2000, Westin Harbour Castle, August 24, 2000
Slide 2: Presentation Overview
University of Toronto
- Architecture: system overview; key features (fast ring routing, hardware cache coherence, sequentially consistent memory model)
- Simulation studies: ring performance, Network Cache performance, coherence overhead
- Prototype performance
- Hardware status
- Conclusion
Slide 3: System Architecture
- Hierarchical ring network, based on clusters (NUMAchine's 'Stations') which are themselves bus-based SMPs
Slide 4: NUMAchine's Key Features
- Hierarchical rings: allow very fast and simple routing, and provide good support for broadcast and multicast
- Hardware cache coherence: hierarchical, directory-based, CC-NUMA system; writeback/invalidate protocol, designed to use the broadcast/ordering properties of rings
- Sequentially consistent memory model: the most intuitive model for programmers trained on uniprocessors
- Simple and low cost, yet with good flexibility, scalability and performance
Slide 5: Fast Ring Routing: Filtermasks
- Fast ring routing is achieved through filtermasks (i.e. simple bit-masks) that store cache-line location information; their imprecision reduces directory storage requirements
- The filtermasks are used directly by the routing hardware in the ring interfaces
Slide 6: Hardware Cache Coherence
- Hierarchical, directory-based, writeback/invalidate protocol
- Directory entries are stored both in the per-Station memory (the 'home' location) and cached in the network interfaces (hence the name Network Cache)
- The Network Cache stores the remotely cached directory information as well as the cache lines themselves, allowing the network interface to perform coherence operations locally (on-Station) and avoid remote accesses to the home directory
- Filtermasks indicate which Stations (i.e. clusters) may potentially have a copy of a cache line (the fuzziness is due to the imprecise nature of the filtermasks)
- Processor masks are used only within a Station, to indicate which particular caches may contain a copy (the fuzziness here is due to Shared lines that may have been silently ejected)
Slide 7: Memory Model: Sequential Consistency
- The most intuitive model for the normally trained programmer: increases the usability of the system
- Easily supported by NUMAchine's ring network: the only change necessary is to force invalidates to pass through a global 'sequencing point' on the ring, increasing the average invalidation latency by 2 ring hops (40 ns with our default 50 MHz rings)
Slide 8: Simulation Studies: Ring Performance 1
- Uses the SPLASH-2 benchmark suite and a cycle-accurate hardware simulator with full modeling of the coherence protocol
- Applications with high communication-to-computation ratios (e.g. FFT, Radix) show high utilizations, particularly on the Central Ring (indicating that a faster Central Ring would help)
Slide 9: Simulation Studies: Ring Performance 2
- Maximum and average ring-interface queue depths indicate network congestion, which correlates with bursty traffic
- A large difference between the maximum and average values indicates large variability in burst size
Slide 10: Simulation Studies: Network Cache
- Graphs measure the Network Cache's effect by its hit rate (i.e. the reduction in remote data and coherence traffic)
- Categorizing the hits by coherence directory state also shows where the benefits come from: caching shared data, or reducing invalidations and coherence traffic
Slide 11: Simulation Studies: Coherence Overhead
- Coherence overhead is measured by allowing all writes to succeed immediately, without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise from capacity effects)
- Results indicate that in many cases it is poor data locality and/or poor parallelizability that impede performance, not cache coherence
Slide 12: Prototype Performance
- Speedups from the hardware prototype, compared against estimates from the simulator
Slide 13: Hardware Prototype Status
- Fully operational, running the custom Tornado OS, with a 32-processor system shown below
Slide 14: Conclusion
- 4- and 8-way SMPs are fast becoming commodity items
- The NUMAchine project has shown that a simple, cost-effective CC-NUMA multiprocessor can be built using these SMP building blocks and a simple ring network, and still achieve good performance and scalability
- In the medium-scale range (a few tens to hundreds of processors), rings are a good choice for a multiprocessor interconnect
- We have demonstrated an efficient hardware cache coherence scheme, designed to make use of the natural ordering and broadcast capabilities of rings
- NUMAchine's architecture efficiently supports a sequentially consistent memory model, which we feel is essential for increasing the ease of use and programmability of multiprocessors
Slide 15: Acknowledgments: The NUMAchine Team
Hardware: Prof. Zvonko Vranesic, Prof. Stephen Brown, Robin Grindley (SOMA Networks), Alex Grbic, Prof. Zeljko Zilic (McGill), Steve Caranci (Altera), Derek DeVries (OANDA), Guy Lemieux, Kelvin Loveless (GNNettest), Prof. Sinisa Srbljic (Zagreb), Paul McHardy, Mitch Gusat (IBM)
Operating Systems: Prof. Michael Stumm, Orran Krieger (IBM), Ben Gamsa, Jonathon Appavoo, Robert Ho
Compilers: Prof. Tarek Abdelrahman, Prof. Naraig Manjikian (Queens)
Applications: Prof. Ken Sevcik