ECE669 L17: Memory Systems April 1, 2004 ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems.

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

The University of Adelaide, School of Computer Science
CSE 502: Computer Architecture
1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
SE-292 High Performance Computing
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 25: Distributed Shared Memory All slides © IG.
1 Multiprocessors. 2 Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
1 Lecture 23: Multiprocessors Today’s topics:  RAID  Multiprocessor taxonomy  Snooping-based cache coherence protocol.
©UCB CS 162 Ch 7: Virtual Memory LECTURE 13 Instructor: L.N. Bhuyan
ECE669 L18: Scalable Parallel Caches April 6, 2004 ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
ECE669 L23: Parallel Compilation April 29, 2004 ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
ECE669 L19: Processor Design April 8, 2004 ECE 669 Parallel Computer Architecture Lecture 19 Processor Design.
CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:
Shared Memory Consistency Models: A Tutorial By Sarita V Adve and Kourosh Gharachorloo Presenter: Sunita Marathe.
Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.
Computer System Architectures Computer System Software
Distributed Shared Memory: A Survey of Issues and Algorithms B,. Nitzberg and V. Lo University of Oregon.
Lecture 13: Multiprocessors Kai Bu
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
Ch 10 Shared memory via message passing Problems –Explicit user action needed –Address spaces are distinct –Small Granularity of Transfer Distributed Shared.
Cache Coherence Protocols 1 Cache Coherence Protocols in Shared Memory Multiprocessors Mehmet Şenvar.
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
Page 1 Distributed Shared Memory Paul Krzyzanowski Distributed Systems Except as otherwise noted, the content of this presentation.
Memory Coherence in Shared Virtual Memory System ACM Transactions on Computer Science(TOCS), 1989 KAI LI Princeton University PAUL HUDAK Yale University.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
Distributed shared memory u motivation and the main idea u consistency models F strict and sequential F causal F PRAM and processor F weak and release.
August 13, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 11: Multiprocessors: Uniform Memory Access * Jeremy R. Johnson Monday,
The University of Adelaide, School of Computer Science
CS267 Lecture 61 Shared Memory Hardware and Memory Consistency Modified from J. Demmel and K. Yelick
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
The University of Adelaide, School of Computer Science
CS 704 Advanced Computer Architecture
CS 704 Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture
Distributed Shared Memory
Memory Consistency Models
CS 704 Advanced Computer Architecture
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Memory Consistency Models
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
Outline Midterm results summary Distributed file systems – continued
Shared Memory Consistency Models: A Tutorial
Multiprocessors - Flynn’s taxonomy (1966)
Multiprocessor Highlights
/ Computer Architecture and Design
Lecture 25: Multiprocessors
High Performance Computing
Lecture 25: Multiprocessors
Lecture 26: Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Lecture 24: Multiprocessors
Lecture 17 Multiprocessors and Thread-Level Parallelism
CPE 631 Lecture 20: Multiprocessors
Lecture: Coherence and Synchronization
Lecture 19: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

ECE669 L17: Memory Systems April 1, 2004 ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems

ECE669 L17: Memory Systems April 1, 2004 Memory Characteristics °Caching performance important for system performance °Caching tightly integrated with networking °Physcial properties Consider topology and distribution of memory °Develop an effective coherency strategy °Limitless approach to caching Allow scalable caching

ECE669 L17: Memory Systems April 1, 2004 Perspectives °Programming model and caching. or: the meaning of shared memory Sequential consistency: Final state (of memory) is as if all RDs and WRTs were executed in some given serial order (per processor order maintained) [This notion borrows from similar notions of sequential consistency in transaction processing systems.] Memory w1r1r1w1r1r1 r3r3w3w3r3r3w3w3 w2w3r2w2w3r2 r 1 r 2 r 1 w 2 w 2 w Lamport

ECE669 L17: Memory Systems April 1, 2004 Coherent Cache Implementation °Twist: On write to shared location -Invalidation sent in background -Processor proceeds M A = O C A = 0 P A = 1 1 Proceed

ECE669 L17: Memory Systems April 1, 2004 Does caching violate this model? C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0

ECE669 L17: Memory Systems April 1, 2004 Does caching violate this model? If b = = 0 at the end, sequential consistency is violated A= 1 x= 1 LOOP:If (x= = 0) GOTO LOOP; b=A C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0

ECE669 L17: Memory Systems April 1, A= 1 x= LOOP:If (x= = 0) GOTO LOOP; b=A C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 Does caching violate this model? If b = = 0 at the end, sequential consistency is violated

ECE669 L17: Memory Systems April 1, 2004 If b = = 0 at the end, sequential consistency is violated A= 1 x= A= 1 x= b=0!VIOLATION! 5 3 inv x 4 x delay LOOP:If (x= = 0) GOTO LOOP; b=A C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 Does caching violate this model?

ECE669 L17: Memory Systems April 1, 2004 Does caching violate this model? LOOP:If (x= = 0) GOTO LOOP; b=A b= 1 ! o.k. C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 x= 1 3 delay... A= 1 ACK A= 1 x= 1 A= 1 x= fence inv x 3

ECE669 L17: Memory Systems April 1, 2004 Does caching violate this model? °Not if we are careful. Ensure that at time instant t, no two processors see different values of a given variable. On a write: -Lock datum -Invalidate all copies of datum -Update central copy of datum -Release lock on datum Do not proceed till write completes (ack got) How do we implement an update protocol? Hard! -Lock central copy of datum -Mark all copies as unreadable -Update all copies --- release read lock on each copy after each update -Unlock central copy

ECE669 L17: Memory Systems April 1, 2004 Writes are looooong -- latency ops. °Solutions - 1. Build latency tolerant processors - Alewife 2. Change shared-memory semantics [solve a different problem!] 3. Notion of weaker memory semantics Basic idea - Guarantee completion of write only on “fence” operations Typical fence is synchronization point (or programmer puts fences in) Use: Modify shared data only within critical sections Propagate changes at end of critical section, before releasing lock Higher level locking protocols must guarantee that others do not try to read/write an object that has been modified and read by someone else. For most parallel programs -- no problem see later

ECE669 L17: Memory Systems April 1, 2004 Memory Systems Memory storage Communication Processing °Programmer’s view °Physically,... Memory wrt read PPP P M M M Memory Network M M M M Monolithic Distributed Distributed - local P P P P P P P P P P P...

ECE669 L17: Memory Systems April 1, 2004 Addressing °I. Like uniprocessors Could include a translation phase for virtual memory systems °II. Object-oriented models Address Loc ID Object-ID, Address Table Node ID Address Offset M MM...

ECE669 L17: Memory Systems April 1, 2004 Issues in virtual memory (also naming) °Goals: Illusion of a lot more memory than physically exists. Protection - allows multiprogramming Mobility of data: indirection allows ease of migration °Premise: Want a large, virtualized, single address space But, physically distributed, local

ECE669 L17: Memory Systems April 1, 2004 Memory Performance Parameters Size (per node) Bandwidth (accesses per second) Latency (access time) °Size: Issue of cost. Uniprocessors 1 MByte per MIPS Multiprossors? Raging debate Eg. Alewife 1/8 MByte memory per MIPS Firefly 2 MByte per MIPS What affects memory size decision? Key issues Communication bandwidth memory size tradeoffs Balanced design --- All components roughly equally utilized

ECE669 L17: Memory Systems April 1, 2004 No VM PM... PPP Relatively small address space Address: VA = PA Processor # Offset

ECE669 L17: Memory Systems April 1, 2004 Virtual Memory PM... PPP Large address space Straightforward extension from uniprocessors Xlate in software, in cache, or TLBs At source translation... VA xlate PA

ECE669 L17: Memory Systems April 1, 2004 VM – At Destination Translation PM... PPP °On page fault at destination Fetch page/obj from a local disk Send msg to appropriate disk node VA xlate (or miss) PA node # memory address

ECE669 L17: Memory Systems April 1, 2004 Next, bandwidth and latency °In the interests of keeping the memory system as simple as possible, and because distributed memory provides high peak bandwidth, we will not consider interleaved memories as in vector processors °Instead, look at Reducing bandwidth demand of processors Reducing latency of memory Exploit locality Property of reuse °Caches

ECE669 L17: Memory Systems April 1, 2004 M M M C C C P P P Caching Techniques for multiprocessors How are caches different from local memory? -Fine-grain relocation of blocks -HW support for management, esp. for coherence -Smaller, faster, integrable Otherwise have similiar properties as local memory Network

ECE669 L17: Memory Systems April 1, 2004 M M M C C C P P P Caching Techniques for multiprocessors How are caches different from local memory? –Fine-grain relocation of blocks –HW support for managment, esp. for coherence –Smaller, faster, integrable Otherwise have similiar properties as local memory Network Say, no caching rd

ECE669 L17: Memory Systems April 1, 2004 M M M C C C P P P Caching Techniques for multiprocessors How are caches different from local memory? –Fine-grain relocation of blocks –HW support for managment, esp. for coherence –Smaller, faster, integrable Otherwise have similiar properties as local memory Network with caches

ECE669 L17: Memory Systems April 1, 2004 M M M C C C P P P Caching Techniques for multiprocessors How are caches different from local memory? –Fine-grain relocation of blocks –HW support for managment, esp. for coherence –Smaller, faster, integrable Otherwise have similiar properties as local memory Network wrt ? Network req on: - wrt to clean - read of remote dirty Coherence problem

ECE669 L17: Memory Systems April 1, 2004 Summary °Understand how delay affects cache performance °Maintain sequential consistency °Physcial properties Consider topology and distribution of memory °Develop an effective coherency strategy °Simplicity and software maintenance are keys