CALTECH cs184c Spring2001 -- DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

The University of Adelaide, School of Computer Science
Cache Coherence Mechanisms (Research project) CSCI-5593
1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
SE-292 High Performance Computing
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.
Cache Optimization Summary
CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.
The Stanford Directory Architecture for Shared Memory (DASH)* Presented by: Michael Bauer ECE 259/CPS 221 Spring Semester 2008 Dr. Lebeck * Based on “The.
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
The University of Adelaide, School of Computer Science
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
1 Multiprocessors. 2 Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
1 Lecture 1: Introduction Course organization:  4 lectures on cache coherence and consistency  2 lectures on transactional memory  2 lectures on interconnection.
ECE669 L18: Scalable Parallel Caches April 6, 2004 ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches.
NUMA coherence CSE 471 Aut 011 Cache Coherence in NUMA Machines Snooping is not possible on media other than bus/ring Broadcast / multicast is not that.
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Multiprocessor Cache Coherency
Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.
1 Cache coherence CEG 4131 Computer Architecture III Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.
Lecture 3. Directory-based Cache Coherence Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture & Programming.
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous Multi-Threading (SMT)
ECE200 – Computer Organization Chapter 9 – Multiprocessors.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 15: May 9, 2003 Distributed Shared Memory.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory.
Distributed Shared Memory Based on Reference paper: Distributed Shared Memory, Concepts and Systems.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 10: May 8, 2001 Synchronization.
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 5, 2005 Session 22.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 22: May 20, 2005 Synchronization.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
“An Evaluation of Directory Schemes for Cache Coherence” Presented by Scott Weber.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
COMP8330/7330/7336 Advanced Parallel and Distributed Computing Tree-Based Networks Cache Coherence Dr. Xiao Qin Auburn University
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Lecture 18: Coherence and Synchronization
Multiprocessor Cache Coherency
Ivy Eva Wu.
Lecture 13: Large Cache Design I
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
Example Cache Coherence Problem
The University of Adelaide, School of Computer Science
Lecture 8: Directory-Based Cache Coherence
Lecture 7: Directory-Based Cache Coherence
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
Lecture 25: Multiprocessors
High Performance Computing
Lecture 25: Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Multiprocessors
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 19: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory

CALTECH cs184c Spring DeHon Reading Tuesday: Synchronization –HP 8.5 –Alewife paper (if haven’t already read) Thursday: SIMD (SPMD) –Hillis and Steele (definitely) –Bolotski et. al. (scan, concrete)

CALTECH cs184c Spring DeHon Last Time Shared Memory –Programming Model –Architectural Model –Shared-Bus Implementation –Caching Possible w/ Care for Coherence Memory P$P$P$P$

CALTECH cs184c Spring DeHon Today Distributed Shared Memory –No broadcast –Memory distributed among nodes –Directory Schemes –Built on Message Passing Primitives

CALTECH cs184c Spring DeHon Snoop Cache Review Why did we need broadcast in Snoop- Bus protocol?

CALTECH cs184c Spring DeHon Snoop Cache Review Why did we need broadcast in Snoop- Bus protocol? –Detect sharing –Get authoritative answer when dirty

CALTECH cs184c Spring DeHon Scalability Problem? Why can’t we use Snoop protocol with more general/scalable network? –Mesh –fat-tree –multistage network Single memory bottleneck?

CALTECH cs184c Spring DeHon Misses #s are cache line size [Culler/Singh/Gupta 5.23]

CALTECH cs184c Spring DeHon Sub Problems Exclusive owner know when sharing created Know every user –know who needs invalidation Find authoritative copy –when dirty and cached

CALTECH cs184c Spring DeHon Distributed Memory Could use Banking to provide memory bandwidth –have network between processor nodes and memory banks Already need network connecting processors Unify interconnect and modules –each node gets piece of “main” memory

CALTECH cs184c Spring DeHon Distributed Memory P$ MemCC P$ MemCC P$ MemCC Network

CALTECH cs184c Spring DeHon “Directory” Solution Main memory keeps track of users of memory location Main memory acts as rendezvous point On write, –inform all users only need to inform users, not everyone On dirty read, –forward to owner

CALTECH cs184c Spring DeHon Directory Initial Ideal –main memory/home location knows state (shared, exclusive, unused) all sharers

CALTECH cs184c Spring DeHon Directory Behavior On read: –unused give (exclusive) copy to requester record owner –(exclusive) shared (send share message to current exclusive owner) record owner return value

CALTECH cs184c Spring DeHon Directory Behavior On read: –exclusive dirty forward read request to exclusive owner

CALTECH cs184c Spring DeHon Directory Behavior On Write –send invalidate messages to all hosts caching values On Write-Thru/Write-back –update value

CALTECH cs184c Spring DeHon Directory [HP 8.24 and 8.25]

CALTECH cs184c Spring DeHon Representation How do we keep track of readers (owner) ? –Represent –Manage in Memory

CALTECH cs184c Spring DeHon Directory Representation Simple: –bit vector of readers –scalability? State requirements scale as square of number of processors Have to pick maximum number of processors when committing hardware design

CALTECH cs184c Spring DeHon Directory Representation Limited: –Only allow a small (constant) number of readers –Force invalidation to keep down –Common case: little sharing –weakness: yield thrashing/excessive traffic on heavily shared locations –e.g. synchronization variables

CALTECH cs184c Spring DeHon Directory Representation LimitLESS –Common case: small number sharing in hardware –Overflow bit –Store additional sharers in central memory –Trap to software to handle –TLB-like solution common case in hardware software trap/assist for rest

CALTECH cs184c Spring DeHon Alewife Directory Entry [Agarwal et. al. ISCA’95]

CALTECH cs184c Spring DeHon Alewife Timings [Agarwal et. al. ISCA’95]

CALTECH cs184c Spring DeHon Alewife Nearest Neighbor Remote Access Cycles [Agarwal et. al. ISCA’95]

CALTECH cs184c Spring DeHon Alewife Performance [Agarwal et. al. ISCA’95]

CALTECH cs184c Spring DeHon Alewife “Software” Directory Claim: Alewife performance only 2-3x worse with pure software directory management Only on memory side –still have cache mechanism on requesting processor side

CALTECH cs184c Spring DeHon Alewife Primitive Op Performance [Chaiken+Agarwal, ISCA’94]

CALTECH cs184c Spring DeHon Alewife Software Data [y: speedup x: hardware pointers] [Chaiken+Agarwal, ISCA’94]

CALTECH cs184c Spring DeHon Caveat We’re looking at simplified version Additional care needed –write (non) atomicity what if two things start a write at same time? –Avoid thrashing/livelock/deadlock –Network blocking? –…–… Real protocol states more involved –see HP, Chaiken, Culler and Singh...

CALTECH cs184c Spring DeHon Common Case Fast Common case –data local and in cache –satisfied like any cache hit Only go to messaging on miss –minority of accesses (few percent)

CALTECH cs184c Spring DeHon Model Benefits Contrast with completely software “Uniform Addressable Memory” in pure MP –must form/send message in all cases Here: –shared memory captured in model –allows hardware to support efficiently –minimize cost of “potential” parallelism incl. “potential” sharing

CALTECH cs184c Spring DeHon Apply to Other things? I-structure read/write Frame allocation Pass result (inlet) Data following computation

CALTECH cs184c Spring DeHon General Alternative? This requires including the semantics of the operation deeply in the model Very specific hardware support Can we generalize? Provide more broadly useful mechanism? Allows software/system to decide? –(idea of Active Messages)

CALTECH cs184c Spring DeHon Maybe... Expose cache (local) misses to processor Selective thread spawn on miss General non-common-case redirect? –Full/empty data … How use w/ AM for SM?

CALTECH cs184c Spring DeHon Big Ideas Model –importance of strong model –capture semantic intent –provides opportunity to satisfy in various ways Common case –handle common case efficiently –locality

CALTECH cs184c Spring DeHon Big Ideas Hardware/Software tradeoff –perform common case fast in hardware –handoff uncommon case to software