CS 258 Parallel Computer Architecture, Lecture 15.1. DASH: Directory Architecture for Shared Memory. Implementation, cost, performance. Daniel Lenoski et al.


CS 258 Parallel Computer Architecture, Lecture 15.1. DASH: Directory Architecture for Shared Memory: implementation, cost, performance. Based on Daniel Lenoski et al., "The DASH Prototype: Implementation and Performance", Proceedings of the International Symposium on Computer Architecture. Presented March 17, 2008 by Rhishikesh Limaye.

Lec 15.1, 3/17/08, Kubiatowicz CS258 ©UCB Spring 2008
DASH objectives
Demonstrates a large-scale shared-memory multiprocessor using directory-based cache coherence.
Prototype with up to 64 processors (16 clusters of 4).
Argument: for performance and programmability, a parallel architecture should:
– Scale to 100s-1000s of processors
– Have high-performance individual processors
– Have a single shared address space

Two-level architecture
Cluster:
– Bus-based shared memory with snoopy cache coherence
– 4 processors per cluster
Inter-cluster:
– Scalable interconnection network
– Directory-based cache coherence

Cluster level
Minor modifications to an off-the-shelf SGI 4D/340 cluster:
4 MIPS R3000 processors + 4 R3010 floating-point coprocessors.
L1 write-through, L2 write-back.
Cache coherence: MESI (i.e. Illinois)
» Cache-to-cache transfers are good for cached remote locations
» L1 is write-through => the inclusion property holds
Pipelined bus with a maximum bandwidth of 64 MB/s.
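A toy model of the inclusion argument (illustrative code only, not the DASH hardware): because the L1 is write-through, the L2 always holds the latest value, so a bus snoop can be answered from the L2 tags alone without probing the processor's L1.

```python
# Minimal two-level cache model. The write-through L1 keeps the L2 copy
# current, so the snooper consults only the L2, as in a DASH cluster.

class L2Cache:
    def __init__(self):
        self.lines = {}                  # addr -> value

    def snoop(self, addr):
        # Write-through L1 + inclusion guarantee this is the latest value.
        return self.lines.get(addr)

class L1Cache:
    def __init__(self, l2):
        self.lines = {}
        self.l2 = l2

    def write(self, addr, value):
        self.lines[addr] = value
        self.l2.lines[addr] = value      # write-through: L2 always current

    def read(self, addr):
        if addr not in self.lines:       # miss: fill from L2 (inclusion)
            self.lines[addr] = self.l2.lines[addr]
        return self.lines[addr]

l2 = L2Cache()
l1 = L1Cache(l2)
l1.write(0x40, 7)
assert l2.snoop(0x40) == 7   # snooper sees the write without touching L1
```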

Inter-cluster directory protocol
Three states per 16-byte memory block: invalid, shared, dirty.
Memory is distributed across the clusters.
Directory bits:
– Simple scheme of 1 presence bit per cluster + 1 dirty bit.
» Good for the prototype, which has at most 16 clusters; should be replaced by a limited-pointer/sparse directory for more clusters.
Replies are sent directly between clusters, not through the home cluster:
– i.e., invalidation acks are collected at the requesting node, not at the home node of the memory location.
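The presence-bit scheme above can be sketched as follows. This is a hypothetical simplification (the class and method names are mine, not the DASH implementation): it shows the three states and how a write miss yields the set of sharers whose invalidation acks the requester must collect.

```python
# Sketch of one DASH-style directory entry: one presence bit per cluster
# plus one dirty bit, covering a single 16-byte memory block.

class DirEntry:
    def __init__(self, n_clusters=16):
        self.presence = [False] * n_clusters   # 1 bit per cluster
        self.dirty = False                     # 1 dirty bit

    def state(self):
        if self.dirty:
            return "dirty"                     # one cluster owns the block
        return "shared" if any(self.presence) else "invalid"

    def read_miss(self, cluster):
        """Home-cluster handling of a read miss: who supplies the data?"""
        if self.dirty:
            owner = self.presence.index(True)
            # Forward to the owner; the owner replies DIRECTLY to the
            # requester, not back through the home cluster.
            self.dirty = False
            self.presence[cluster] = True
            return ("forward_to_owner", owner)
        self.presence[cluster] = True
        return ("reply_from_memory", None)

    def write_miss(self, cluster):
        """Return the clusters to invalidate; their acks go to the requester."""
        to_invalidate = [c for c, p in enumerate(self.presence)
                         if p and c != cluster]
        self.presence = [False] * len(self.presence)
        self.presence[cluster] = True
        self.dirty = True
        return to_invalidate

d = DirEntry()
d.read_miss(0); d.read_miss(3)
assert d.state() == "shared"
assert d.write_miss(1) == [0, 3]   # requester collects these two acks
assert d.state() == "dirty"
```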

Extra hardware for directory
Each cluster adds the following:
Directory bits: DRAM, 17 bits per 16-byte cache line.
Directory controller: snoops every bus transaction within the cluster, accesses the directory bits, and takes action.
Reply controller.
Remote access cache: 128KB SRAM, 16B lines.
– Snoops remote accesses on the local bus
– Stores the state of outstanding remote accesses made by local processors
– Lockup-free: handles multiple outstanding requests
– QUESTION: what happens if two remote requests collide in this direct-mapped cache?
Pseudo-CPU: issues requests to local memory on behalf of remote nodes.
Performance monitor.

Memory performance
[Table: best-case and worst-case bandwidth (MB/s) and latency (clocks/word) at each level. Rows: read from L1, fill from L2, fill from local bus, fill from remote, fill from dirty-remote; write to cache, write to local bus, write to remote, write to dirty-remote.]
4-level memory hierarchy: (L1, L2), (local L2s + memory), directory home, remote cluster.
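To see how such a hierarchy is used analytically, here is an average-latency calculation over the four levels. The hit fractions and per-level latencies below are hypothetical round numbers chosen for illustration, not the measured values from the paper's table.

```python
# Illustrative average-latency calculation for a 4-level DASH-style
# hierarchy. All fractions and latencies are hypothetical round numbers.

levels = [
    # (name, fraction of references satisfied here, latency in pclks)
    ("L1/L2 cache",           0.90,   1),
    ("local cluster",         0.06,  30),
    ("remote (home) cluster", 0.03, 100),
    ("dirty-remote cluster",  0.01, 130),
]
assert abs(sum(f for _, f, _ in levels) - 1.0) < 1e-9

amat = sum(f * lat for _, f, lat in levels)
print(f"average latency: {amat:.1f} pclks")  # 0.9 + 1.8 + 3.0 + 1.3 = 7.0
```

Even with 90% of references hitting in cache, the long remote latencies dominate the average, which is why the slides keep returning to locality.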

Hardware cost of directory [Table 2 in the paper]
13.7% DRAM – directory bits
– For larger systems, a sparse representation is needed.
10% SRAM – remote access cache
20% logic gates – controllers and network interfaces
Clustering is important:
– With uniprocessor nodes, the directory logic would be 44%.
Compared to message passing:
– Message passing costs about 10% logic and ~0 memory.
– Thus hardware coherence costs roughly 10% more logic and 10% more memory.
– The authors later argue that the performance improvement is much greater than 10%.
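The DRAM figure can be sanity-checked: 17 directory bits guard each 128-bit (16-byte) block, giving about 13.3%, roughly the 13.7% quoted above. The quick calculation below (my arithmetic, not from the paper) also shows why the full bit vector stops scaling and a limited-pointer or sparse directory becomes necessary.

```python
# Directory DRAM overhead of the full-bit-vector scheme: P presence bits
# plus one dirty bit per 16-byte (128-bit) memory block.

def bitvec_overhead(n_clusters, block_bits=128):
    return (n_clusters + 1) / block_bits

for n in (16, 64, 256, 1024):
    print(f"{n:5d} clusters: {100 * bitvec_overhead(n):6.1f}% of DRAM")
```

At 16 clusters the overhead is tolerable (~13.3%), but it grows linearly with the cluster count, exceeding 100% of memory well before 256 clusters.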

Performance monitor
Configurable events.
SRAM-based counters:
– 2 banks of 16K x 32 SRAM.
– Addressed by event signals (events 0, 1, 2, … drive address bits 0, 1, 2, …).
» Thus each bank can track log2(16K) = 14 events.
Trace buffer made of DRAM:
– Can store 2M memory operations.
– With software support, can log all memory operations.
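A sketch of the event-addressed counting idea (hypothetical code, not the monitor's actual design): each of the 14 one-bit event signals drives one SRAM address bit, so each of the 16K counters tallies one particular combination of simultaneous events.

```python
# Event-addressed counter bank: 14 event bits form a 14-bit address into
# a 16K-entry counter array, so each cycle increments the counter for
# the exact combination of events observed that cycle.

N_EVENTS = 14                     # log2(16K) address bits
bank = [0] * (1 << N_EVENTS)      # one 16K x 32 SRAM bank

def tick(events):
    """events: tuple of 14 booleans sampled this cycle."""
    addr = 0
    for bit, on in enumerate(events):
        if on:
            addr |= 1 << bit      # event i drives address bit i
    bank[addr] += 1

tick((True,) + (False,) * 13)     # only event 0 fired this cycle
tick((True,) + (False,) * 13)
assert bank[0b1] == 2
```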

Performance results
9 applications.
Good speed-ups on 5 of them, without specially optimizing for DASH.
MP3D has bad locality; PSIM4 is an enhanced version of MP3D.
Cholesky: more processors => too fine a granularity, unless the problem size is increased unreasonably.
Note the dip after P = 4: beyond one cluster, misses start going remote.

Detailed study of 3 applications
What to read from Tables 4, 5, and 6:
– Busy pclks between processor stalls: more => remote caching works, i.e. memory latencies are being hidden.
– Fraction of reads that are local / fraction of remote reads that are dirty-remote: locality.
– Bus and network utilization: higher utilization => congestion => higher latencies.
Water and LocusRoute have an equal fraction of local reads, but Water scales well and LocusRoute doesn't.
Remote caching works: Water and LocusRoute issue a remote reference every 20 and 11 instructions respectively, yet run 506 and 181 busy pclks between processor stalls.
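A quick derived check of the remote-caching claim (my arithmetic on the numbers above, assuming roughly one instruction per pclk): if the processor runs hundreds of busy pclks between stalls while issuing a remote reference every 10-20 instructions, then most remote references must be hitting in the remote access cache rather than stalling.

```python
# Busy pclks between stalls vs. remote-reference rate, from the slide's
# numbers, assuming ~1 instruction per pclk (an approximation).

apps = {
    # name: (instructions per remote reference, busy pclks between stalls)
    "Water":      (20, 506),
    "LocusRoute": (11, 181),
}

for name, (instrs_per_remote, busy_pclks) in apps.items():
    refs_per_run = busy_pclks / instrs_per_remote
    print(f"{name}: ~{refs_per_run:.0f} remote refs between stalls")
```

Both ratios are well above 1, so most remote references complete without stalling the processor; Water's higher ratio is consistent with its better scaling.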

Conclusions
Locality is still important, because remote latencies are higher.
However, for many applications, natural locality is enough (Barnes-Hut, Radiosity, Water).
Thus good speed-ups can be achieved without a more difficult programming model (i.e. message passing).
For higher performance, one has to worry about the extended memory hierarchy, but only for critical data structures.
Analogous to the argument in the uniprocessor world: caches vs. scratchpad memories/stream buffers.