DDM – A Cache-Only Memory Architecture

Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides

Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.
To Include or Not to Include? Natalie Enright Dana Vantrease.
Lecture 8: Memory Hierarchy Cache Performance Kai Bu
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.
Cache Optimization Summary
CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Per Stenstrom, Truman Joe and Anoop Gupta Presented by Colleen Lewis.
DDM – A Cache Only Memory Architecture Hagersten, Landin, and Haridi (1991) Presented by Patrick Eibl.
1 Lecture 16: Virtual Memory Today: DRAM innovations, virtual memory (Sections )
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.
ECE669 L18: Scalable Parallel Caches April 6, 2004 ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches.
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Manchester.
Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.
DDM - A Cache-Only Memory Architecture Erik Hagersten, Anders Landlin and Seif Haridi Presented by Narayanan Sundaram 03/31/2008 1CS258 - Parallel Computer.
Introduction to Symmetric Multiprocessors Süha Tuna, Informatics Institute, UHeM Summer Workshop
MIMD Shared Memory Multiprocessors. MIMD -- Shared Memory u Each processor has a full CPU u Each processors runs its own code –can be the same program.
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
Lecture 13: Multiprocessors Kai Bu
Lecture 08: Memory Hierarchy Cache Performance Kai Bu
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Lecture 13: Multiprocessors Kai Bu
1 Lecture 8: Snooping and Directory Protocols Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations.
Lecture 8: Snooping and Directory Protocols
Parallel Architecture
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
Scalable Cache Coherent Systems
Reactive NUMA A Design for Unifying S-COMA and CC-NUMA
Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA
A Study on Snoop-Based Cache Coherence Protocols
Multiprocessor Cache Coherency
Cache Memory Presentation I
Scalable Cache Coherent Systems
CMSC 611: Advanced Computer Architecture
Example Cache Coherence Problem
Lecture 12: Cache Innovations
Parallel and Multiprocessor Architectures – Shared Memory
Cache Coherence (controllers snoop on bus transactions)
Lecture 2: Snooping-Based Coherence
Scalable Cache Coherent Systems
Lecture 1: Parallel Architecture Intro
Kai Bu 13 Multiprocessors So today, we’ll finish the last part of our lecture sessions, multiprocessors.
Lecture 08: Memory Hierarchy Cache Performance
Death Match ’92: NUMA v. COMA
CMSC 611: Advanced Computer Architecture
Multiprocessors - Flynn’s taxonomy (1966)
Lecture 24: Memory, VM, Multiproc
Lecture 8: Directory-Based Cache Coherence
Lecture 7: Directory-Based Cache Coherence
11 – Snooping Cache and Directory Based Multiprocessors
Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
/ Computer Architecture and Design
Scalable Cache Coherent Systems
High Performance Computing
Lecture: Cache Hierarchies
Lecture 24: Virtual Memory, Multiprocessors
Scalable Cache Coherent Systems
Lecture 23: Virtual Memory, Multiprocessors
Scalable Cache Coherent Systems
Coherent caches Adapted from a lecture by Ian Watson, University of Manchester.
CPE 631 Lecture 20: Multiprocessors
CSE 486/586 Distributed Systems Cache Coherence
Scalable Cache Coherent Systems
Scalable Cache Coherent Systems
Scalable Cache Coherent Systems
Presentation transcript:

DDM – A Cache-Only Memory Architecture Erik Hagersten et al. Summary by Victor Wen 4/3/2002 This paper builds a hierarchical COMA machine called the Data Diffusion Machine (DDM).

Glossary
- UMA: single-bus-based, shared-memory organization with uniform access time to memory; examples: Sequent and Encore. Problem: bus bandwidth limits scalability, and memory bandwidth becomes critical.
- NUMA: basically distributed shared memory over an arbitrary interconnect network; each processing node contains a portion of the shared address space, so access times to different addresses vary; examples: BBN Butterfly, IBM RP3.
- COMA: cache-only memory architecture; non-static partitioning of the memory address space; the distributed memory acts like a cache, called attraction memory.
- DDM: a COMA machine with a hierarchy.

Motivation
Why COMA?
- good locality; adapts to dynamic scheduling
- runs both NUMA- and UMA-optimized programs well
- useful when communication latency is long
Why the hierarchy?
- for scalability and to enable use of a snoopy protocol

Now, details…
- processors + attraction memories at the leaves; directories in the interior nodes
- asynchronous split-transaction bus to increase throughput; outstanding requests are encoded in the coherence-protocol states
- a fat tree alleviates bandwidth contention upstream
- directories contain only meta-data; attraction memories contain the actual data
- inclusion property: each directory contains a superset of all meta-data below it
- transient coherence states: Reading (R), Waiting (W), Reading-and-Waiting (RW), Answering (A)
  - R: waiting for data to return
  - W: waiting to become exclusive owner of the item
  - RW: used to handle write races
  - A: promised to answer a read request (cannot do so immediately because of the split-transaction bus)

More details…
- minimal data unit: the item
- virtual addresses are translated to item identifiers; a coherence state is associated with each item
- each item has a statically allocated home bus, used during replacement
- state information includes a tag and the coherence state
- smaller item sizes mean higher meta-data overhead (6-16% for 32-256 processors), but larger items increase false sharing
- address translation here is simply a VM-to-PM mapping using the MMU
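The tag-plus-state overhead trend can be illustrated with a rough back-of-the-envelope calculation: the meta-data cost per item is fixed, so halving the item size roughly doubles the relative overhead. The bit widths below are assumptions for illustration only, not the paper's actual numbers.

```python
# Rough, illustrative meta-data overhead per item: a fixed number of
# tag + coherence-state bits amortized over the item's data bits.
# tag_bits and state_bits are assumed values, not taken from the paper.

def overhead_percent(item_bytes, tag_bits=20, state_bits=4):
    """Meta-data bits per item as a percentage of the item's data bits."""
    return 100.0 * (tag_bits + state_bits) / (item_bytes * 8)

for size in (16, 64, 256):
    print(f"{size:4d}-byte item: {overhead_percent(size):.1f}% overhead")
```

This only shows the shape of the trade-off the slide describes: shrinking items inflates overhead, while growing them invites false sharing.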

Coherence Protocol
Simplified transition diagram, without replacement.
Why RW? To resolve write races.
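The role of the transient states can be sketched as a tiny table-driven state machine. This is a minimal sketch, not the full DDM protocol: the state and event names follow the paper, but the transition set is simplified (no replacement, no Answering state) and the event model is an assumption for illustration.

```python
# A minimal sketch of per-item coherence states at an attraction memory,
# assuming a split-transaction bus where requests and replies arrive as
# separate events. Simplified: replacement and the A state are omitted.

I, S, E, R, W, RW = "I", "S", "E", "R", "W", "RW"

# (state, event) -> (next_state, bus_request or None)
TRANSITIONS = {
    (I,  "cpu_read"):  (R,  "read"),    # read miss: request data, wait in R
    (R,  "data"):      (S,  None),      # reply arrives: item now shared
    (S,  "cpu_write"): (W,  "erase"),   # write: ask other copies to invalidate
    (W,  "acked"):     (E,  None),      # all other copies erased: exclusive
    (R,  "cpu_write"): (RW, "erase"),   # write race: data still outstanding
    (RW, "data"):      (W,  None),      # data arrives, still need exclusivity
    (S,  "erase"):     (I,  None),      # another node's erase invalidates us
    (S,  "cpu_read"):  (S,  None),      # hits cause no bus traffic
    (E,  "cpu_read"):  (E,  None),
    (E,  "cpu_write"): (E,  None),
}

def step(state, event):
    """Return (next_state, bus_request) for one protocol event."""
    return TRANSITIONS[(state, event)]
```

The RW path shows why the extra state exists: a write that arrives while a read is still outstanding cannot simply go to W, because the data has not returned yet.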

Multilevel Read
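The multilevel read (whose slide figure is not reproduced here) relies on the inclusion property from the earlier slide: a read climbs the tree until some directory's meta-data shows a copy in its subtree, then descends to a node that can answer. The tree and node classes below are illustrative assumptions, not the paper's design.

```python
# A minimal sketch of the "climb" half of a multilevel read, assuming each
# directory stores the set of items present somewhere in its subtree
# (the inclusion property). Returns the lowest ancestor whose subtree
# holds a copy; a real DDM would then route the request down to a holder.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.items = set()   # directory meta-data: items known below this node

def multilevel_read(leaf, item):
    node = leaf
    while node is not None and item not in node.items:
        node = node.parent   # request propagates one level up the hierarchy
    return node              # None would mean the item exists nowhere

root = Node()
left, right = Node(root), Node(root)
root.items = {"x"}           # inclusion: root knows "x" is somewhere below
right.items = {"x"}          # a copy of "x" actually lives under right
```

With this setup, a read issued at `left` climbs to `root`, the lowest directory that covers the item, and would then be directed down into `right`'s subtree.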

Multilevel Write Races

Replacement
- happens when the attraction memory is full and space is needed
- if the item is in state S, it results in an Out transaction, which terminates on finding another copy in state S, R, W, or A, and turns into an Inject if this was the last copy
- if the item is in state E, it results in an Inject transaction directly
- an Inject tries the local DDM bus first, then the item's home bus
- if the home bus is also full, foreign items there are evicted
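The replacement rules above can be condensed into two small decision functions. This is a hedged sketch of the decision logic only; the function names, the boolean inputs, and the space-availability model are illustrative assumptions, not the paper's interfaces.

```python
# A minimal sketch of DDM replacement decisions, following the slide:
# shared victims leave via Out (escalating to Inject if they were the last
# copy); exclusive victims are injected directly. Inject tries the local
# DDM bus, then the item's home bus, and as a last resort a foreign item
# at the home bus is evicted.

def replace(state, other_copy_exists):
    """Return the ordered bus transactions a victim item causes."""
    if state == "S":
        # Out terminates if another node holds the item in S, R, W, or A;
        # otherwise this was the last copy and it turns into an Inject.
        return ["out"] if other_copy_exists else ["out", "inject"]
    if state == "E":
        return ["inject"]   # sole copy: its data must be preserved somewhere
    return []               # invalid items need no data movement

def place_inject(local_bus_has_space, home_bus_has_space):
    """Where an injected item lands."""
    if local_bus_has_space:
        return "local"
    if home_bus_has_space:
        return "home"
    return "evict_foreigner_at_home"  # home bus full: displace a foreign item
```

The statically allocated home bus from the earlier slide is what guarantees Inject always has a place of last resort.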

Prototype
- based on the TP881V system: four Motorola MC88100 20 MHz processors, each with two 16 KB caches and an MMU, and 8 or 32 MB of DRAM per node
- custom-designed DDM node controller
- DDM bus runs at 20 MHz with four phases: transaction code, snoop, selection, data
- the resulting memory system is 16 percent slower than the original TP881V system

Performance
- benchmarks were written for UMA machines
- speedups are normalized to a single DDM node with a 100% hit rate in the attraction memory

Summary and Discussion
- 6-16% memory overhead, with 16% slower access time to memory
- COMA works well for programs with high locality (e.g., Water)
- introduced transient states into the coherence protocol to handle replacement and the split-transaction bus
- introduced a hierarchy to allow scalability
- the simulated performance evaluation is sketchy: why not compare against UMA or NUMA machines with the same hardware parameters? Are network + cache access latencies longer than memory latency?