DDM – A Cache-Only Memory Architecture
Erik Hagersten et al.
Summary by Victor Wen, 4/3/2002

This paper builds a hierarchical COMA machine called the Data Diffusion Machine (DDM).
Glossary
- UMA: single-bus-based shared-memory organization; uniform access time to memory; bus bandwidth limits scalability, so memory bandwidth becomes critical. Examples: Sequent, Encore.
- NUMA: basically distributed shared memory over an arbitrary interconnect network; each processor node contains a portion of the shared address space, so access time varies by address. Examples: BBN Butterfly, IBM RP3.
- COMA: cache-only memory architecture; non-static partitioning of the memory address space; the distributed memory acts like a cache, called attraction memory.
- DDM: a COMA machine with a hierarchy.
Motivation
Why COMA?
- good locality; adapts to dynamic scheduling
- runs NUMA- and UMA-optimized programs well
- useful when communication latency is long
Why the hierarchy?
- for scalability and to allow use of a snoopy protocol
Now, details…
- Processor + attraction memory at the leaves; directories at the interior nodes
- Uses an asynchronous split-transaction bus to increase throughput; outstanding requests are encoded as coherence-protocol states
- Uses a fat tree to alleviate upstream bandwidth contention
- A directory contains only metadata; the attraction memories contain the actual data
- Inclusion property: each directory contains a superset of all metadata below it
- Transient coherence states: Reading (R), Waiting (W), Reading+Waiting (RW), Answering (A)
  - R: waiting for data to return
  - W: waiting to become exclusive owner of the item
  - RW: used to handle write races
  - A: promised to answer a read request (can't do it immediately because of the split-transaction bus)
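The transient states above can be sketched as a small state machine. This is a hypothetical illustration of how a directory might react to a read miss under a split-transaction bus, not the paper's actual protocol tables; the state names follow the slide, but the transition rules here are simplified assumptions.

```python
from enum import Enum, auto

class State(Enum):
    # Stable states (simplified)
    I = auto()   # Invalid: no copy of the item below this directory
    E = auto()   # Exclusive: the only copy is below this directory
    S = auto()   # Shared: copies exist below and elsewhere
    # Transient states: these encode outstanding split-transaction requests
    R = auto()   # Reading: a read is in flight, waiting for data to return
    W = auto()   # Waiting: waiting to become exclusive owner of the item
    RW = auto()  # Reading+Waiting: both outstanding (handles write races)
    A = auto()   # Answering: promised to answer a read request later

def on_local_read_miss(state):
    """Illustrative directory reaction to a read miss arriving from below.
    The real DDM protocol has many more transitions; this only shows why
    transient states are needed when the bus is split-transaction."""
    if state == State.I:
        # Must fetch from above; remember the outstanding read as state R
        return State.R, "send read up the hierarchy"
    if state in (State.E, State.S):
        # Data is below, but the bus reply slot comes later: promise it
        return State.A, "answer when the bus grants a data slot"
    # A read is already outstanding; the new request piggybacks on it
    return state, "piggyback on the outstanding read"
```

The key point the sketch shows: because request and reply are decoupled on the bus, every "I owe someone data" or "I am waiting for data" situation must be a named state rather than a bus being held.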
More details…
- Minimal data unit: the item
- A VM address is translated to an item identifier; a coherence state is associated with each item
- Each item has a statically allocated home bus, used for replacement
- State info includes a tag and a coherence state
- Smaller item sizes mean higher memory overhead (6-16% for 32-256 processors), but large item sizes cause high false sharing
- Address translation in this case is simply the VM-to-PM mapping done by the MMU
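The item-size trade-off can be seen with a back-of-the-envelope calculation. The per-item bit count below is an illustrative assumption, not the paper's actual tag/state width; it only shows why overhead shrinks as the item grows.

```python
def overhead_fraction(item_bytes, tag_state_bits=20):
    """Per-item bookkeeping (tag + coherence state) as a fraction of the
    item's data bits. tag_state_bits=20 is an assumed, illustrative width."""
    return tag_state_bits / (item_bytes * 8)

# Smaller items pay proportionally more overhead:
small = overhead_fraction(16)    # 20 / 128  ~ 15.6%
large = overhead_fraction(128)   # 20 / 1024 ~ 2.0%
```

This mirrors the slide's 6-16% range: halving the item size roughly doubles the metadata overhead, while growing the item trades overhead for false sharing.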
Coherence Protocol
- Simplified transition diagram (without replacement)
- Why RW? To solve write races
Multilevel Read
ML Write Races
Replacement
- Happens when the attraction memory is full and needs space
- A shared (S) item results in an Out transaction, which terminates when it finds another copy in S, R, W, or A state; it turns into an Inject if this was the last copy
- An exclusive (E) item results in an Inject transaction directly
- An Inject tries the local DDM bus first, then the item's home bus; if the home bus is also full, foreign items there are evicted
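The replacement rules above can be sketched as a small decision function. This is a simplified illustration under the stated assumptions (S vs. E, last-copy check, local-then-home injection); the real protocol resolves these conditions via bus transactions, not a single local decision.

```python
def replace(state, is_last_copy, local_bus_has_space, home_bus_has_space):
    """Illustrative replacement decision for an item being evicted from a
    full attraction memory (sketch of the slide's rules, not the protocol)."""
    if state == "S" and not is_last_copy:
        # Out transaction: terminates once another copy in S/R/W/A is found
        return "Out"
    # Exclusive item, or last shared copy: the data must live somewhere
    if local_bus_has_space:
        return "Inject on local DDM bus"
    if home_bus_has_space:
        return "Inject on home bus"
    # Home bus full too: make room by evicting a foreign item there
    return "Evict a foreign item at home, then inject"
```

The statically allocated home bus matters here: it guarantees a place of last resort for the final copy of an item, so data is never silently lost.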
Prototype
- Based on the TP881V system: 4 Motorola MC88100 20 MHz processors, each with two 16 KB caches and an MMU; 8 or 32 MB DRAM per node
- Custom-designed DDM node controller
- DDM bus runs 4 phases at 20 MHz: transaction code, snoop, selection, data
- The memory system is 16% slower than the original TP881V system
Performance
- Benchmarks were written for UMA machines
- Speedup is normalized to a single DDM node with a 100% hit rate in attraction memory
Summary and Discussion
- 6-16% memory overhead, with 16% slower access time to memory
- COMA works well for programs with high locality (e.g., Water)
- Introduced transient states in the coherence protocol to handle replacement and the split-transaction bus
- Introduced a hierarchy to allow scalability
- The simulated performance evaluation is sketchy: why not compare against UMA or NUMA machines using the same hardware parameters? Are the network + cache access latencies actually longer than a memory access?