Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Per Stenstrom, Truman Joe and Anoop Gupta Presented by Colleen Lewis.


1 Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures, by Per Stenstrom, Truman Joe and Anoop Gupta. Presented by Colleen Lewis.

2 Overview: Common Features; CC-NUMA; COMA; Cache Misses; Performance Expectations; Simulation & Results; COMA-F

3 Common Features: large-scale multiprocessors; single address space; distributed main memory; directory-based cache coherence; scalable interconnection network. Examples: CC-NUMA (DASH, Alewife); COMA (DDM, KSR1).

4 CC-NUMA: Cache-Coherent Non-Uniform-Memory-Access machines. Network independent. Write-invalidate cache coherence protocol. Miss types shown: 2-hop miss, 3-hop miss.
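The slide only names the two miss types. A common reading for directory protocols, assumed here, is that a miss served by the block's home node costs two network traversals, and a miss to a block held dirty in a third node costs three. The function `miss_hops` and its parameters are hypothetical names for this sketch, not part of the presentation.

```python
from typing import Optional


def miss_hops(requester: int, home: int, dirty_owner: Optional[int]) -> int:
    """Count network traversals needed to service a read miss.

    Assumption: the request goes requester -> home, is forwarded to the
    dirty owner if another node holds the block dirty, and the data then
    returns directly to the requester.
    """
    path = [requester, home]
    if dirty_owner is not None and dirty_owner != home:
        path.append(dirty_owner)
    path.append(requester)
    # Only legs between two different nodes cross the network.
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


if __name__ == "__main__":
    print(miss_hops(requester=0, home=1, dirty_owner=None))  # clean at home
    print(miss_hops(requester=0, home=1, dirty_owner=2))     # dirty at a third node
```

Under these assumptions a miss whose home is the requesting node and whose block is clean needs no network traversal at all, which is why data placement matters for CC-NUMA.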

5 COMA: Cache-Only Memory Architectures. Attraction memory: per-node memory acts as a secondary/tertiary cache. Data is distributed and mobile. Directory is dynamically distributed in a hierarchy. Combining can optimize multiple reads –LU: 47%, Barnes-Hut: 6%, remaining applications: < 1%. Reduces the average cache latency. Increased overhead for directory structure.

6 Cache Misses: cold miss; capacity miss; coherence miss. Which architecture has lower latency: CC-NUMA or COMA?
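The three miss types can be told apart by asking why a block is absent from a processor's cache. The sketch below is illustrative only and is not the simulator used in the study: it assumes fully associative LRU caches and a write-invalidate protocol, and `MissClassifier` is a hypothetical name.

```python
from collections import OrderedDict


class MissClassifier:
    """Label each access as hit, cold, capacity, or coherence miss."""

    def __init__(self, num_procs: int, lines_per_cache: int):
        self.capacity = lines_per_cache
        # One LRU-ordered cache per processor (assumption: fully associative).
        self.caches = [OrderedDict() for _ in range(num_procs)]
        # Per processor: why a block last left the cache.
        self.left_because = [dict() for _ in range(num_procs)]

    def access(self, proc: int, block: int, is_write: bool = False) -> str:
        cache = self.caches[proc]
        if block in cache:
            cache.move_to_end(block)
            result = "hit"
        else:
            # Never seen before -> cold; otherwise use the recorded reason.
            result = self.left_because[proc].pop(block, "cold")
            if len(cache) >= self.capacity:
                victim, _ = cache.popitem(last=False)  # evict the LRU block
                self.left_because[proc][victim] = "capacity"
            cache[block] = True
        if is_write:
            # Write-invalidate: remove copies held by other processors.
            for other, other_cache in enumerate(self.caches):
                if other != proc and block in other_cache:
                    del other_cache[block]
                    self.left_because[other][block] = "coherence"
        return result
```

A short trace: with two-line caches, processor 0 touching blocks 0, 1, 2 and then 0 again sees three cold misses and one capacity miss; if processor 1 then writes block 0, processor 0's next access to it is a coherence miss.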

7 Figure 1

8 Performance Expectations, by application characteristics. CC-NUMA: low miss rates; mostly coherence misses; coarse-grained data access. COMA: high miss rates; mostly capacity misses; fine-grained data access.
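One way to see why the miss mix matters is a weighted-average miss penalty. This is a minimal sketch, not the model used in the study: it ignores cold misses, and the cycle counts below are hypothetical placeholders chosen only to show the crossover, assuming COMA serves many capacity misses from the local attraction memory but pays more for coherence misses because of its hierarchical directory.

```python
def avg_miss_penalty(capacity_frac: float,
                     capacity_latency: float,
                     coherence_latency: float) -> float:
    """Average penalty per miss for a mix of capacity and coherence misses."""
    coherence_frac = 1.0 - capacity_frac
    return capacity_frac * capacity_latency + coherence_frac * coherence_latency


# Hypothetical latencies in cycles, for illustration only (not measured data).
CC_NUMA = {"capacity_latency": 100, "coherence_latency": 100}
COMA = {"capacity_latency": 30, "coherence_latency": 150}

if __name__ == "__main__":
    for frac in (0.9, 0.1):
        numa = avg_miss_penalty(frac, **CC_NUMA)
        coma = avg_miss_penalty(frac, **COMA)
        print(f"capacity fraction {frac}: CC-NUMA {numa:.0f}, COMA {coma:.0f}")
```

With these placeholder numbers the COMA average is lower when capacity misses dominate and higher when coherence misses dominate; real values depend on the machine and the application.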

9 Simulation: 16 processors; cache line size = 16 bytes; cache size = 4 Kbytes (small, to force capacity misses)
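The cache parameters above imply 4096 / 16 = 256 lines per cache. The sketch below works that out and shows how an address would split into tag, index, and offset. The slide does not state the associativity, so a direct-mapped cache is assumed here, and `split_address` is a hypothetical helper.

```python
CACHE_BYTES = 4 * 1024   # cache size from the slide
LINE_BYTES = 16          # cache line size from the slide
NUM_LINES = CACHE_BYTES // LINE_BYTES  # 256 lines


def split_address(addr: int) -> tuple:
    """Split a byte address into (tag, index, offset).

    Assumption: direct-mapped cache, so the index selects exactly one line.
    """
    offset = addr % LINE_BYTES   # byte within the line (low 4 bits)
    block = addr // LINE_BYTES   # line-sized block number
    index = block % NUM_LINES    # which cache line (next 8 bits)
    tag = block // NUM_LINES     # remaining high bits
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_LINES)
    print(split_address(0x1234))
```

With only 256 lines, two blocks that are 4 Kbytes apart map to the same line under this assumption, which is one reason such a small cache produces many capacity-type misses.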

10 Results

11 Applications: MP3D – particle-based wind tunnel simulation; PTHOR – distributed-time logic simulation; LocusRoute – VLSI standard cell router; Water – molecular dynamics code (water); Cholesky – Cholesky factorization of sparse matrix; LU – LU decomposition of dense matrix; Barnes-Hut – N-body problem solver, O(N log N); Ocean – ocean basin simulation. Compared on CC-NUMA and COMA.

12 Page Migration – Page Size: Introduces additional overhead. Node hit rate increases as page size decreases –reduces false sharing –fewer pages accessed by multiple processors. Likely won’t work if data chunks are much smaller than pages (example: LU). NUMA-M performs better for Cholesky.

13 Initial Placement: Implemented as page migration in which each page can be migrated at most once. LU does significantly better. Ocean does the same for single vs. multiple migrations. Requires increased work for compiler and programmer.
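The relationship between page migration and initial placement can be sketched as one policy with a cap on migrations per page. This is an illustration under stated assumptions, not the policy evaluated in the study: the counter-based trigger, the `threshold` parameter, the class name `PageMigrator`, and starting every page on node 0 are all hypothetical.

```python
from collections import defaultdict
from typing import Optional


class PageMigrator:
    """Move a page to a remote node after enough remote accesses."""

    def __init__(self, page_bytes: int, threshold: int,
                 max_migrations: Optional[int] = None):
        self.page_bytes = page_bytes
        self.threshold = threshold            # hypothetical trigger count
        self.max_migrations = max_migrations  # 1 models initial placement
        self.home = {}                        # page -> current home node
        self.remote_counts = defaultdict(lambda: defaultdict(int))
        self.migrations = defaultdict(int)    # page -> times migrated

    def access(self, node: int, addr: int) -> str:
        page = addr // self.page_bytes
        # Assumption: every page starts on node 0.
        home = self.home.setdefault(page, 0)
        if node == home:
            return "local"
        counts = self.remote_counts[page]
        counts[node] += 1
        may_move = (self.max_migrations is None
                    or self.migrations[page] < self.max_migrations)
        if may_move and counts[node] >= self.threshold:
            self.home[page] = node
            self.migrations[page] += 1
            counts.clear()  # restart counting at the new home
            return "migrated"
        return "remote"
```

Setting `max_migrations=1` gives the initial-placement behavior described on the slide: a page settles on the first node that uses it heavily and then stays there, while `max_migrations=None` lets it keep following its users.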

14 Cache Size/Network Variations. Cache Size Variations –Increasing the cache size causes coherence misses to dominate. –With a 64KB cache, CC-NUMA (without migration) is better for everything except Ocean. Network Latency Variations –Even with aggressive implementations of the directory structure, COMA can’t compensate in applications with a significant coherence miss rate.

15 COMA-F: Data directory information has a home node (as in CC-NUMA). Supports replication and migration of data blocks (as in COMA-H). Attempts to reduce the coherence miss penalty.

16 Conclusion: CC-NUMA and COMA perform well for different application characteristics. CC-NUMA: low miss rates; mostly coherence misses; coarse-grained data access. COMA: high miss rates; mostly capacity misses; fine-grained data access.

