Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures
Per Stenström, Truman Joe and Anoop Gupta
Presented by Colleen Lewis


Overview
Common Features
CC-NUMA
COMA
Cache Misses
Performance Expectations
Simulation & Results
COMA-F

Common Features
Large-scale multiprocessors
Single address space
Distributed main memory
Directory-based cache coherence
Scalable interconnection network
Examples:
–CC-NUMA: DASH, Alewife
–COMA: DDM, KSR1

CC-NUMA
Cache-Coherent Non-Uniform-Memory-Access Machines
Network independent
Write-invalidate cache coherence protocol
2-hop miss
3-hop miss
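The 2-hop and 3-hop misses named on this slide can be illustrated with a minimal sketch. It assumes a simplified home-based directory model that is not spelled out in the slides: a read request goes to the block's home node, and if another node holds the block dirty, the home forwards the request to that owner, which replies directly to the requester. The function name `read_miss_hops` and its parameters are hypothetical.

```python
def read_miss_hops(requester, home, dirty_owner=None):
    """Count network traversals for a read miss in a simplified directory model.

    Assumed path: requester -> home -> supplier -> requester, where the
    supplier is the dirty owner if one exists, otherwise the home node.
    A step between identical nodes costs no network hop.
    """
    if dirty_owner == requester:
        raise ValueError("requester already owns the block; this is not a miss")
    supplier = home if dirty_owner is None else dirty_owner
    path = [requester, home, supplier, requester]
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Clean block at a remote home: request + reply.
print(read_miss_hops(requester=0, home=1))                  # 2
# Block dirty in a third node: request + forward + reply.
print(read_miss_hops(requester=0, home=1, dirty_owner=2))   # 3
```

In this model a miss served from the requester's own memory costs zero network hops, which is the case CC-NUMA tries to make common through data placement.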

COMA
Cache-Only Memory Architectures
Attraction memory: per-node memory acts as a secondary/tertiary cache
Data is distributed and mobile
Directory is dynamically distributed in a hierarchy
Combining: can optimize multiple reads
–LU: 47%, Barnes-Hut: 6%, remaining < 1%
Reduces the average cache latency
Increased overhead for directory structure

Cache Misses
Cold miss
Capacity miss
Coherence miss
Which architecture has lower latency: CC-NUMA or COMA?
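The three miss types can be told apart in a small trace-driven sketch. This is an illustration under stated assumptions, not the simulator used in the paper: each processor has a fully associative LRU cache, writes invalidate copies in other caches, a miss is "cold" on a processor's first reference to a block, "coherence" if the processor's copy was last removed by an invalidation, and "capacity" otherwise. The function name `classify_misses` and the trace format are hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_procs, cache_lines):
    """Classify misses in a trace of (proc, op, block) with op in {'r', 'w'}."""
    caches = [OrderedDict() for _ in range(num_procs)]   # LRU order per processor
    seen = [set() for _ in range(num_procs)]             # blocks ever referenced
    invalidated = [set() for _ in range(num_procs)]      # blocks lost to invalidation
    counts = {"cold": 0, "capacity": 0, "coherence": 0}

    for proc, op, block in trace:
        cache = caches[proc]
        if block in cache:
            cache.move_to_end(block)                     # hit: refresh LRU position
        else:
            if block not in seen[proc]:
                counts["cold"] += 1
            elif block in invalidated[proc]:
                counts["coherence"] += 1
            else:
                counts["capacity"] += 1
            seen[proc].add(block)
            invalidated[proc].discard(block)
            if len(cache) >= cache_lines:
                cache.popitem(last=False)                # evict least recently used
            cache[block] = True
        if op == "w":                                    # write-invalidate other copies
            for other in range(num_procs):
                if other != proc and block in caches[other]:
                    del caches[other][block]
                    invalidated[other].add(block)
    return counts


trace = [(0, "r", "A"), (1, "w", "A"), (0, "r", "A"), (0, "r", "B"), (0, "r", "A")]
print(classify_misses(trace, num_procs=2, cache_lines=1))
# {'cold': 3, 'capacity': 1, 'coherence': 1}
```

Shrinking `cache_lines` pushes the mix toward capacity misses, while more write sharing pushes it toward coherence misses.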

Figure 1

Performance Expectations
Application characteristics compared for CC-NUMA and COMA:
Low miss rates vs. high miss rates
Mostly coherence misses vs. mostly capacity misses
Coarse-grained data access vs. fine-grained data access

Simulation
16 processors
Cache lines = 16 bytes
Cache size of 4 Kbytes
–(Small, to force capacity misses)
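To give a sense of how small this configuration is, the line count follows from simple arithmetic. The sketch below assumes 1 Kbyte = 1024 bytes and that the 4 Kbyte figure is per processor; neither is stated on the slide, and the variable names are illustrative.

```python
LINE_BYTES = 16          # cache line size from the slide
CACHE_BYTES = 4 * 1024   # assumed: 4 Kbytes = 4096 bytes, per processor
PROCESSORS = 16

lines_per_cache = CACHE_BYTES // LINE_BYTES
total_lines = lines_per_cache * PROCESSORS

print(lines_per_cache)   # 256
print(total_lines)       # 4096
```

With only 256 lines per cache, any working set larger than 4 Kbytes per processor will produce capacity misses.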

Results

Benchmarks compared on CC-NUMA and COMA:
MP3D: Particle-based wind tunnel simulation
PTHOR: Distributed-time logic simulation
LocusRoute: VLSI standard cell router
Water: Molecular dynamics code (water)
Cholesky: Cholesky factorization of sparse matrix
LU: LU decomposition of dense matrix
Barnes-Hut: N-body problem solver, O(N log N)
Ocean: Ocean basin simulation

Page Migration: Page Size
Introduces additional overhead
Node hit rate increases as page size decreases
–Reduces false sharing
–Fewer pages accessed by multiple processors
Likely won’t work if data chunks are much smaller than pages (example: LU)
NUMA-M performs better for Cholesky
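The point that smaller pages leave fewer pages accessed by multiple processors can be checked with a short sketch. The access pattern is made up for illustration, and the function name `shared_page_count` is hypothetical.

```python
from collections import defaultdict


def shared_page_count(accesses, page_size):
    """Count pages touched by more than one processor.

    accesses: iterable of (proc, byte_address) pairs.
    """
    procs_by_page = defaultdict(set)
    for proc, addr in accesses:
        procs_by_page[addr // page_size].add(proc)
    return sum(1 for procs in procs_by_page.values() if len(procs) > 1)


# Illustrative pattern: processor 0 uses the first 1 KB, processor 1 the next 1 KB.
accesses = [(0, a) for a in range(0, 1024, 16)] + [(1, a) for a in range(1024, 2048, 16)]

print(shared_page_count(accesses, page_size=4096))  # 1
print(shared_page_count(accesses, page_size=1024))  # 0
```

With 4096-byte pages both processors fall on the same page, so no single home node suits both; with 1024-byte pages each page has one user and can migrate to it. When each processor's chunk is much smaller than any practical page, as the slide notes for LU, no page size separates them.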

Initial Placement
Implemented as page migration in which each page can be migrated at most once
LU does significantly better
Ocean performs the same with single and multiple migrations
Requires increased work for compiler and programmer
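Treating initial placement as page migration with a cap of one move can be sketched as below. The class name `MigratingPage` and its fields are hypothetical, and the policy that decides when to request a migration is left out because the slides do not describe it.

```python
class MigratingPage:
    """A page whose home node may change a limited number of times."""

    def __init__(self, home, max_migrations=None):
        self.home = home
        self.max_migrations = max_migrations   # None means unlimited migration
        self.migrations = 0

    def try_migrate(self, new_home):
        """Move the page to new_home if the migration budget allows it."""
        if new_home == self.home:
            return False
        if self.max_migrations is not None and self.migrations >= self.max_migrations:
            return False
        self.home = new_home
        self.migrations += 1
        return True


# Initial placement: the first migration fixes the page's home for good.
page = MigratingPage(home=0, max_migrations=1)
print(page.try_migrate(3))   # True
print(page.try_migrate(5))   # False
print(page.home)             # 3
```

Setting `max_migrations=None` gives the unrestricted page-migration variant for comparison.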

Cache Size/Network Variations
Cache Size Variations
–Increasing the cache size causes coherence misses to dominate
–With a 64 KB cache, CC-NUMA (without migration) is better for everything except Ocean
Network Latency Variations
–Even with aggressive implementations of the directory structure, COMA can’t compensate in applications with a significant coherence miss rate

COMA-F
Directory information for data has a home node (as in CC-NUMA)
Supports replication and migration of data blocks (as in COMA-H)
Attempts to reduce the coherence miss penalty

Conclusion
Application characteristics compared for CC-NUMA and COMA:
Low miss rates vs. high miss rates
Mostly coherence misses vs. mostly capacity misses
Coarse-grained data access vs. fine-grained data access
CC-NUMA and COMA perform well for different application characteristics