Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures
Per Stenstrom, Truman Joe and Anoop Gupta
Presented by Colleen Lewis

Overview
Common Features
CC-NUMA
COMA
Cache Misses
Performance Expectations
Simulation & Results
COMA-F

Common Features
Large-scale multiprocessors
Single address space
Distributed main memory
Directory-based cache coherence
Scalable interconnection network
Examples:
  CC-NUMA: DASH, Alewife
  COMA: DDM, KSR1

CC-NUMA
Cache-Coherent Non-Uniform Memory Access machines
Network independent
Write-invalidate cache coherence protocol
2-hop miss
3-hop miss
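The difference between a 2-hop and a 3-hop miss can be sketched with a small hop counter. This is a simplified model, not the protocol of any specific machine: it assumes a read miss goes first to the block's home node and, if another node holds the block dirty, is forwarded to that owner, which replies directly to the requester. The function name and parameters are illustrative.

```python
from typing import Optional


def read_miss_hops(requester: int, home: int, owner: Optional[int]) -> int:
    """Count network hops for a read miss in a simplified directory protocol.

    owner is the node holding the block dirty, or None if the copy in the
    home node's memory is valid. The requester is assumed not to be the owner
    (otherwise there would be no miss).
    """
    if owner is None or owner == home:
        # The home node supplies the data: request + reply, unless local.
        return 0 if requester == home else 2
    if requester == home:
        # Home must fetch the dirty copy from the owner: request + reply.
        return 2
    # Requester -> home -> owner -> requester.
    return 3


print(read_miss_hops(requester=0, home=1, owner=None))  # clean at home
print(read_miss_hops(requester=0, home=1, owner=2))     # dirty at a third node
```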

COMA
Cache-Only Memory Architectures
Attraction memory: per-node memory acts as a secondary/tertiary cache
Data is distributed and mobile
Directory is dynamically distributed in a hierarchy
Combining: can optimize multiple reads
  – LU: 47%, Barnes-Hut: 6%, remaining applications: < 1%
Reduces the average cache latency
Increased overhead for directory structure

Cache Misses
Cold miss
Capacity miss
Coherence miss
Which architecture has lower latency: CC-NUMA or COMA?
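The three miss types can be told apart with a small classifier. The sketch below is a simplification and not the simulator used in the paper: it assumes a single fully associative LRU cache, and it treats any miss on a block that was removed by an invalidation as a coherence miss. The class and method names are hypothetical.

```python
from collections import OrderedDict


class MissClassifier:
    """Classify accesses of one processor as hit, cold, capacity or coherence."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.cache = OrderedDict()  # block -> None, kept in LRU order
        self.seen = set()           # blocks this processor has referenced
        self.invalidated = set()    # blocks removed by a remote write

    def access(self, block: int) -> str:
        if block in self.cache:
            self.cache.move_to_end(block)  # mark as most recently used
            return "hit"
        if block not in self.seen:
            kind = "cold"
        elif block in self.invalidated:
            kind = "coherence"
        else:
            kind = "capacity"
        self.seen.add(block)
        self.invalidated.discard(block)
        self.cache[block] = None
        if len(self.cache) > self.num_lines:
            self.cache.popitem(last=False)  # evict the least recently used block
        return kind

    def invalidate(self, block: int) -> None:
        """Model a write by another processor to a block cached here."""
        if block in self.cache:
            del self.cache[block]
            self.invalidated.add(block)


c = MissClassifier(num_lines=2)
trace = [c.access(1), c.access(2), c.access(3), c.access(1)]
c.invalidate(3)
trace.append(c.access(3))
print(trace)
```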

Figure 1

Performance Expectations
Application characteristics compared across CC-NUMA and COMA:
Low miss rates
High miss rates
Mostly coherence misses
Mostly capacity misses
Coarse-grained data access
Fine-grained data access

Simulation
16 processors
Cache lines = 16 bytes
Cache size of 4 Kbytes
  – Small, to force capacity misses
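As a quick check of the simulated cache geometry, the sketch below computes how many lines such a cache holds. The function name is illustrative and does not come from the paper.

```python
def num_cache_lines(cache_bytes: int, line_bytes: int) -> int:
    """Number of lines in a cache of cache_bytes with line_bytes per line."""
    return cache_bytes // line_bytes


# The simulated configuration: 4 Kbyte cache, 16-byte lines.
small = num_cache_lines(4 * 1024, 16)
print(small)
```

With so few lines, any working set larger than 4 Kbytes cannot stay resident, which is what pushes the miss mix toward capacity misses.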

Results

MP3D: particle-based wind tunnel simulation
PTHOR: distributed-time logic simulation
LocusRoute: VLSI standard cell router
Water: molecular dynamics code (water)
Cholesky: Cholesky factorization of a sparse matrix
LU: LU decomposition of a dense matrix
Barnes-Hut: N-body problem solver, O(N log N)
Ocean: ocean basin simulation
Architectures compared: CC-NUMA, COMA

Page Migration: Page Size
Introduces additional overhead
Node hit rate increases as page size decreases
  – Reduces false sharing
  – Fewer pages accessed by multiple processors
Likely won’t work if data chunks are much smaller than pages (example: LU)
NUMA-M performs better for Cholesky
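The effect of page size on sharing can be illustrated by counting pages touched by more than one processor. The sketch below uses a made-up access pattern (two processors working on adjacent 1 Kbyte regions) and hypothetical page sizes; it is not data from the paper.

```python
from collections import defaultdict


def shared_pages(accesses, page_size: int) -> int:
    """Count pages touched by more than one processor.

    accesses is an iterable of (processor_id, byte_address) pairs.
    """
    touched = defaultdict(set)
    for proc, addr in accesses:
        touched[addr // page_size].add(proc)
    return sum(1 for procs in touched.values() if len(procs) > 1)


# Hypothetical trace: processor 0 uses bytes 0..1023, processor 1 uses 1024..2047.
trace = [(0, a) for a in range(0, 1024, 16)] + [(1, a) for a in range(1024, 2048, 16)]

for size in (4096, 1024):
    print(size, shared_pages(trace, size))
```

With the larger page both regions fall in the same page, so migrating it helps only one processor; with the smaller page each region has its own page.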

Initial Placement
Implemented as page migration in which a page can be migrated at most once
LU does significantly better
Ocean performs the same with single and multiple migrations
Requires increased work for compiler and programmer
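Initial placement as migration capped at one move can be sketched as follows. The migration trigger here (a remote node's miss count exceeding the home node's by a threshold) is an assumption made for illustration, not necessarily the policy evaluated in the paper; the class, its fields and the threshold value are hypothetical.

```python
class Page:
    """A page with per-node miss counters and an optional migration cap."""

    def __init__(self, home, num_nodes, threshold, max_migrations=None):
        self.home = home
        self.misses = [0] * num_nodes
        self.threshold = threshold            # assumed trigger, see lead-in
        self.max_migrations = max_migrations  # None means unlimited
        self.migrations = 0

    def record_miss(self, node: int) -> bool:
        """Record a miss by node; return True if the page migrates to it."""
        self.misses[node] += 1
        if node == self.home:
            return False
        if self.max_migrations is not None and self.migrations >= self.max_migrations:
            return False
        if self.misses[node] - self.misses[self.home] >= self.threshold:
            self.home = node
            self.migrations += 1
            self.misses = [0] * len(self.misses)  # restart counting at the new home
            return True
        return False


def run(page, nodes):
    for n in nodes:
        page.record_miss(n)
    return page.home


pattern = [2, 2, 2, 1, 1, 1]
print(run(Page(0, 4, threshold=3, max_migrations=1), pattern))  # initial placement
print(run(Page(0, 4, threshold=3), pattern))                    # unrestricted migration
```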

Cache Size/Network Variations
Cache size variations
  – Increasing the cache size causes coherence misses to dominate
  – With a 64 KB cache, CC-NUMA (without migration) is better for everything except Ocean
Network latency variations
  – Even with aggressive implementations of the directory structure, COMA can’t compensate in applications with a significant coherence miss rate
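Why the miss mix matters can be seen with a simple weighted-average latency model. All latency values below are hypothetical placeholders chosen only to show the shape of the trade-off (cheap capacity misses but expensive coherence misses for COMA); they are not measurements from the paper, and cold misses are ignored.

```python
def avg_miss_latency(coherence_fraction, capacity_latency, coherence_latency):
    """Average miss latency when misses are either capacity or coherence."""
    capacity_fraction = 1.0 - coherence_fraction
    return capacity_fraction * capacity_latency + coherence_fraction * coherence_latency


# Hypothetical per-miss latencies in cycles (assumptions, not paper data).
CC_NUMA = {"capacity_latency": 100, "coherence_latency": 150}
COMA = {"capacity_latency": 30, "coherence_latency": 250}

for f in (0.1, 0.5, 0.9):
    numa = avg_miss_latency(f, **CC_NUMA)
    coma = avg_miss_latency(f, **COMA)
    print(f, numa, coma, "COMA" if coma < numa else "CC-NUMA")
```

Under these assumed numbers COMA wins while capacity misses dominate, and CC-NUMA wins once coherence misses make up a large share, which is the situation a larger cache tends to produce.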

COMA-F
Directory information for each data block has a home node (as in CC-NUMA)
Supports replication and migration of data blocks (as in COMA-H)
Attempts to reduce the coherence miss penalty

Conclusion
Application characteristics compared across CC-NUMA and COMA:
Low miss rates
High miss rates
Mostly coherence misses
Mostly capacity misses
Coarse-grained data access
Fine-grained data access
CC-NUMA and COMA perform well for different application characteristics
