Slide 1: CS 258 Parallel Computer Architecture, Lecture 15.1
DASH: Directory Architecture for Shared Memory (implementation, cost, performance).
Daniel Lenoski et al., "The DASH Prototype: Implementation and Performance," Proceedings of the International Symposium on Computer Architecture, 1992.
March 17, 2008. Presented by Rhishikesh Limaye.

Slide 2: DASH objectives
Demonstrate a large-scale shared-memory multiprocessor using directory-based cache coherence; the prototype has 16-64 processors. The argument: for both performance and programmability, a parallel architecture should
– Scale to 100s-1000s of processors
– Have high-performance individual processors
– Have a single shared address space

Slide 3: Two-level architecture
Cluster level:
– Bus-based shared memory with snoopy cache coherence
– 4 processors per cluster
Inter-cluster level:
– Scalable interconnection network
– Directory-based cache coherence

Slide 4: Cluster level
Minor modifications to an off-the-shelf SGI 4D/340 cluster: 4 MIPS R3000 processors + 4 R3010 floating-point coprocessors. L1 caches are write-through, L2 caches are write-back.
Cache coherence:
– MESI, i.e. the Illinois protocol
» Cache-to-cache transfers are good for remotely cached locations
– The write-through L1 => the inclusion property holds between L1 and L2
Pipelined bus with a maximum bandwidth of 64 MB/s.
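A minimal sketch of the Illinois/MESI snooping behavior described above, for one cache line; this is illustrative pseudocode, not code from the DASH prototype, and the function name is an assumption:

    /* Illustrative MESI (Illinois) snoop transitions for a single cache line. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

    /* New state when THIS cache observes another processor's bus request. */
    mesi_state_t snoop(mesi_state_t s, int bus_read_exclusive) {
        if (bus_read_exclusive)          /* another CPU intends to write */
            return INVALID;              /* M/E/S all invalidate their copy */
        switch (s) {                     /* plain bus read                  */
        case MODIFIED:   return SHARED;  /* supply data, write block back   */
        case EXCLUSIVE:  return SHARED;  /* Illinois: supply data cache-to-cache */
        case SHARED:     return SHARED;  /* Illinois: supply data cache-to-cache */
        default:         return INVALID;
        }
    }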

Slide 5: Inter-cluster directory protocol
Three states per 16-byte memory block: invalid (uncached), shared, dirty. Memory is distributed across the clusters.
Directory bits:
– Simple full-bit-vector scheme: 1 presence bit per cluster + 1 dirty bit
» Fine for the prototype, which has at most 16 clusters; should be replaced by a limited-pointer or sparse directory for more clusters
Replies are sent directly between clusters rather than through the home cluster (see the sketch below):
– i.e. invalidation acks are collected at the requesting node, not at the home node of the memory location
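A minimal sketch of the directory organization above, assuming the 16-cluster full-bit-vector scheme; the structure layout and function names are illustrative, not the prototype's actual hardware tables:

    #include <stdint.h>
    #include <stdbool.h>

    /* One directory entry per 16-byte memory block:
       16 presence bits + 1 dirty bit = 17 bits of DRAM per block. */
    typedef struct {
        uint16_t presence;   /* bit i set => cluster i may hold a copy           */
        bool     dirty;      /* set => exactly one cluster holds a modified copy */
    } dir_entry_t;

    static void send_invalidate(int target, int ack_to) {
        (void)target; (void)ack_to;      /* network send elided in this sketch */
    }

    /* Home cluster handling a read-exclusive request from cluster `req`:
       invalidations go to the sharers, but their acks are returned to the
       requester, not to the home, as the slide notes. */
    void read_exclusive(dir_entry_t *e, int req) {
        uint16_t sharers = e->presence & (uint16_t)~(1u << req);
        for (int c = 0; c < 16; c++)
            if (sharers & (1u << c))
                send_invalidate(c, req); /* ack goes directly to `req` */
        e->presence = (uint16_t)(1u << req);
        e->dirty = true;
    }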

Slide 6: Extra hardware for the directory
Each cluster adds the following:
Directory bits (DRAM):
– 17 bits per 16-byte cache line
Directory controller: snoops every bus transaction within the cluster, accesses the directory bits, and takes action.
Reply controller with a remote access cache (SRAM, 128 KB, 16-byte lines):
– Snoops remote accesses on the local bus
– Stores the state of ongoing remote accesses made by local processors
– Lockup-free: handles multiple outstanding requests
– QUESTION: what happens if two remote requests collide in this direct-mapped cache? (see the index sketch below)
Pseudo-CPU:
– Issues requests for local memory on behalf of remote nodes
Performance monitor
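An illustrative calculation behind that question (the function name and constants are assumptions derived only from the sizes quoted above):

    /* 128 KB direct-mapped remote access cache with 16-byte lines => 8192 entries. */
    #define RAC_LINES (128 * 1024 / 16)              /* 8192 */

    unsigned rac_index(unsigned long paddr) {
        return (unsigned)((paddr >> 4) % RAC_LINES); /* drop 16 B offset, mod #lines */
    }

Two outstanding remote requests whose addresses map to the same index compete for a single RAC entry, so one of them has to be stalled or retried until the entry is free; the slide leaves the exact policy as an open question.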

Slide 7: Memory performance

                             Best case            Worst case
                           MB/s  Clock/word     MB/s  Clock/word
    Read from L1            133       1                     1
    Fill from L2             30       4.5          9       15
    Fill from local bus      17       8            5       29
    Fill from remote          5      26            1.3    101
    Fill from dirty remote    4      34            1      132
    Write to cache           32       4                     4
    Write to local bus       18       7            8       17
    Write to remote           5      25            1.5     89
    Write to dirty-remote     4      33            1      120

4-level memory hierarchy: (L1, L2), (local L2s + memory), directory home, remote cluster.
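A quick consistency check (assuming the prototype's 33 MHz R3000 processor clock): the bandwidth and latency columns are two views of the same number, since one 4-byte word per processor clock corresponds to 33 MHz x 4 B ≈ 133 MB/s, so MB/s ≈ 133 / (clocks per word). For example, 133 / 4.5 ≈ 30 MB/s for a best-case L2 fill, and 133 / 101 ≈ 1.3 MB/s for a worst-case remote fill.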

Slide 8: Hardware cost of the directory [Table 2 in the paper]
13.7% of DRAM: directory bits
– For larger systems, a sparse representation is needed
10% of SRAM: remote access cache
20% of logic gates: controllers and network interfaces
Clustering is important:
– With uniprocessor nodes, the directory logic would be 44%
Compare to message passing:
– Message passing needs about 10% extra logic and ~0 extra memory
– Thus hardware cache coherence costs roughly 10% more logic and 10% more memory
– The authors later argue that the performance improvement is much greater than 10%: about 3-4x
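The DRAM figure follows from the directory organization on slide 6: 17 directory bits per 16-byte (128-bit) line is 17/128 ≈ 13.3% extra DRAM, in line with the ~13.7% quoted (the small difference presumably depends on exactly what the paper counts as overhead).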

Slide 9: Performance monitor
Configurable events.
SRAM-based counters:
– 2 banks of 16K x 32 SRAM
– Addressed by events (event0, event1, event2, ... drive address bits 0, 1, 2, ...); see the sketch below
» Thus each bank can track log2(16K) = 14 event signals
Trace buffer made of DRAM:
– Can store 2M memory operations
– With software support, can log all memory operations
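A minimal sketch of what "addressed by events" means: the 14 monitored event signals are concatenated into an SRAM address, and the word at that address is incremented each cycle, so each counter records how many cycles that particular combination of events was active. The names and the per-cycle update function are illustrative assumptions, not the monitor's actual interface:

    #include <stdint.h>

    #define BANK_WORDS (16 * 1024)          /* 16K x 32-bit counters per bank */
    static uint32_t bank0[BANK_WORDS];

    /* Called once per clock with the 14 monitored event signals (each 0 or 1). */
    void monitor_tick(const int event[14]) {
        unsigned addr = 0;
        for (int i = 0; i < 14; i++)
            addr |= (unsigned)(event[i] & 1) << i;   /* event i -> address bit i   */
        bank0[addr]++;                               /* count this event combination */
    }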

Slide 10: Performance results
9 applications; good speedup on 5 of them without any DASH-specific optimization.
MP3D has poor locality; PSIM4 is an enhanced version of MP3D.
Cholesky: more processors => granularity becomes too fine, unless the problem size is increased unreasonably.
Note the dip in speedup after P = 4, i.e. once execution spans more than one cluster.

Slide 11: Detailed study of 3 applications
What to read from Tables 4, 5, and 6:
– Busy pclks between processor stalls: more => remote caching works and memory latencies can be hidden
– Fraction of reads that are local, and fraction of remote reads that are dirty-remote: locality
– Bus and network utilization: higher utilization => congestion => higher latencies
Water and LocusRoute have an equal fraction of local reads, but Water scales well and LocusRoute does not.
Remote caching works: Water and LocusRoute issue a remote reference every 20 and 11 instructions respectively, yet their busy pclks between processor stalls are 506 and 181.

Slide 12: Conclusions
Locality is still important, because remote latencies are much higher.
However, the natural locality of many applications can be enough (Barnes-Hut, Radiosity, Water). Thus good speedups can be achieved without a difficult programming model (i.e. message passing).
For higher performance, one has to worry about the extended memory hierarchy, but only for critical data structures. This is analogous to the uniprocessor argument of caches vs. scratchpad memories and stream buffers.

