DDM – A Cache-Only Memory Architecture


1 DDM – A Cache-Only Memory Architecture
Erik Hagersten et al.
Summary by Victor Wen, 4/3/2002
This paper builds a hierarchical COMA machine called the Data Diffusion Machine (DDM).

2 Glossary
UMA
  single-bus-based, shared-memory organization
  uniform access time to memory
  examples: Sequent and Encore
NUMA
  basically distributed shared memory
  arbitrary interconnect network
  examples: BBN Butterfly, IBM RP3
COMA
  non-static partition of the memory address space
  distributed memory acts like a cache, called attraction memory
DDM: a COMA machine with a hierarchy
Notes:
  UMA architecture problem: bus bandwidth limits scalability, and memory bandwidth becomes critical.
  NUMA: each processor node contains a portion of the shared address space, so access time varies between addresses.
  COMA stands for cache-only memory architecture.

3 Motivation
Why COMA?
  good locality
  adapts to dynamic scheduling
  runs NUMA- and UMA-optimized programs well
  useful when communication latency is long
Why the hierarchy?
  for scalability and use of a snoopy protocol

4 Now, details…
processor + attraction memory at the leaves
directory in interior nodes
use an asynchronous split-transaction bus to increase throughput
outstanding requests encoded in coherence protocol states
Notes:
  Use a fat tree to alleviate bandwidth contention upstream.
  A directory contains only meta-data; the attraction memory contains the actual data.
  A directory has the inclusion property: it contains a superset of all meta-data below it.
  Transient coherence states: Reading, Waiting, Reading+Waiting, Answering.
    R: waiting for data to return
    W: waiting to become exclusive owner of the item
    RW: used to handle write races
    A: promised to answer a read request (can't do so right now because of the split-transaction bus)
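
The leaf/directory arrangement and the inclusion property can be sketched as a small tree walk. This is a minimal illustration, not the DDM implementation: the `Node` class, the `read` function and the node names are hypothetical, and `holds` recomputes inclusion by recursion, whereas a real directory would store per-item state.

```python
class Node:
    """Tree node: a leaf holds data, an interior node acts only as a directory."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.parent = None
        self.data = {}  # used by leaves only: item id -> value (attraction memory)
        for child in self.children:
            child.parent = self

    def holds(self, item):
        """Inclusion property: a directory covers every item in its subtree."""
        if not self.children:
            return item in self.data
        return any(child.holds(item) for child in self.children)


def read(leaf, item):
    """Return (value, path), where path lists the nodes the request visits."""
    path = [leaf.name]
    if item in leaf.data:
        return leaf.data[item], path
    # Climb until a directory covers a subtree that contains the item.
    node = leaf
    while node.parent is not None:
        node = node.parent
        path.append(node.name)
        if node.holds(item):
            break
    else:
        raise KeyError(item)
    # Descend toward a leaf that actually has the data.
    while node.children:
        node = next(c for c in node.children if c.holds(item))
        path.append(node.name)
    value = node.data[item]
    leaf.data[item] = value  # the reader's attraction memory now has a copy
    return value, path


p0, p1, p2, p3 = (Node(f"p{i}") for i in range(4))
dir0, dir1 = Node("dir0", [p0, p1]), Node("dir1", [p2, p3])
top = Node("top", [dir0, dir1])
p3.data["x"] = 7
print(read(p0, "x"))  # request climbs to the top directory, then down to p3
print(read(p1, "x"))  # now satisfied below dir0, since p0 holds a copy
```

The second read stays inside the `dir0` subtree, which is the locality effect the hierarchy is meant to exploit.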

5 More details…
Minimal data unit: item
VM address translated to item identifier
coherence state associated with each item
home bus for an item is statically allocated and used for replacement
Notes:
  State info includes a tag and the coherence state.
  Overhead is higher for smaller item sizes (6-16% for Ps), but false sharing is high if the item size gets large.
  Address translation in this case is simply VM-PM mapping using the MMU.
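
The item-size trade-off can be made concrete with a little arithmetic. The sketch below counts only per-item tag and state bits relative to the item's data bits; the function name and the 10-bit figure are made-up illustration values, not numbers from the paper, and directory storage is ignored.

```python
def state_overhead(item_bytes, tag_and_state_bits):
    """Fraction of extra storage spent on tag + coherence state per item."""
    return tag_and_state_bits / (item_bytes * 8)


# Hypothetical: 10 bits of tag and state per item, for several item sizes.
for size in (8, 16, 32, 64):
    print(f"{size:3d}-byte item: {state_overhead(size, 10):.1%} overhead")
```

Halving the item size doubles the relative overhead, while larger items reduce it at the cost of more false sharing.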

6 Coherence Protocol Simplified transition diagram without replacement
Why RW? - to solve write races
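
A rough sketch of how the transient states fit together, written as a transition table. This is a simplification built from the state descriptions in these slides, not the paper's full diagram: the Invalid state `I`, the event names and the exact set of transitions are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"  # assumed stable state, not named on the slides
    E = "Exclusive"
    S = "Shared"
    R = "Reading"
    W = "Waiting"
    RW = "Reading+Waiting"
    A = "Answering"


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.I, "read"): State.R,          # read miss: request data
    (State.R, "data"): State.S,          # data returned
    (State.S, "write"): State.W,         # erase sent, waiting for exclusivity
    (State.W, "exclusive_ack"): State.E, # won the race
    (State.W, "erase"): State.RW,        # lost a write race: copy is gone, re-read
    (State.I, "write"): State.RW,        # need the data and exclusivity
    (State.RW, "data"): State.W,         # data arrived, still waiting for exclusivity
    (State.S, "erase"): State.I,         # another writer invalidated this copy
    (State.E, "remote_read"): State.A,   # promised to answer later
    (State.A, "data_sent"): State.S,     # answered, now sharing
}


def step(state, event):
    return TRANSITIONS[(state, event)]


def run(state, events):
    for event in events:
        state = step(state, event)
    return state


# A writer that loses a race still ends up exclusive after re-reading.
print(run(State.S, ["write", "erase", "data", "exclusive_ack"]))
```

In this sketch, RW is where a losing writer lands: it has to fetch the data again and is still waiting to become exclusive.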

7 Multilevel Read

8 ML Write Races

9 Replacement
happens when the attraction memory is full and needs space
results in an Out transaction (if the item is in S)
  terminates when it finds another copy of the item in S, R, W or A state
  turns into an Inject if this was the last copy
results in an Inject transaction (if the item is in E)
  Inject tries the local DDM bus first, then the home bus
  if the home bus is also full, evict foreigners
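
The replacement rules above can be read as a small decision procedure. The following is a sketch under that reading only: the function name, its boolean parameters and the action strings are hypothetical, and real hardware would discover these conditions through bus transactions rather than receive them as arguments.

```python
def replace(state, other_copy_found, local_space, home_space):
    """Return the list of actions taken to replace an item in state 'S' or 'E'."""
    actions = []
    if state == "S":
        actions.append("out")
        if other_copy_found:
            # Another copy in S, R, W or A exists, so this one can be dropped.
            return actions
        # Last copy: the Out turns into an Inject.
    elif state != "E":
        raise ValueError(f"unexpected state for replacement: {state}")
    actions.append("inject")
    if local_space:
        actions.append("placed on local bus")
    elif home_space:
        actions.append("placed on home bus")
    else:
        # Home bus is full too: make room by evicting a foreign item.
        actions.append("evict foreigner on home bus")
        actions.append("placed on home bus")
    return actions


print(replace("S", other_copy_found=True, local_space=False, home_space=False))
print(replace("E", other_copy_found=False, local_space=False, home_space=False))
```

The statically allocated home bus is what guarantees that the last copy of an item always has somewhere to go.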

10 Prototype
TP881V system
  4 Motorola MC MHz proc.
  2 16 KB caches and MMU
  8 or 32 MB DRAM
Custom designed DDM node controller
4 phases in DDM bus, 20 MHz
  Transaction code
  Snoop
  Selection
  Data
Memory system 16 percent slower than the original TP881V system

11 Performance
Benchmarks written for UMA
Speedup normalized to a single DDM node with a 100% hit rate in attraction memory

12 Summary and Discussion
6-16% memory overhead, with 16% slower access time to memory
COMA works well for programs with high locality (e.g., Water)
Introduced transient states in the coherence protocol to handle replacement and the split-transaction bus
Introduced hierarchy to allow scalability
Simulated performance is sketchy. Why not compare against UMA or NUMA machines using the same hardware parameters?
Are network + cache access latencies longer than memory access latency?

