
1 Lecture 1: Introduction
Course organization:
 - 13 lectures on parallel architectures:
   - ~5 lectures on cache coherence, consistency
   - ~3 lectures on TM
   - ~2 lectures on interconnection networks
   - ~2 lectures on large cache hierarchies
   - ~1 lecture on parallel algorithms
 - 10 lectures on memory systems (taught by Al)
 - 5 lectures: student presentations related to course project

2 Logistics
Texts:
 - Parallel Computer Architecture, Culler, Singh, Gupta
 - Principles and Practices of Interconnection Networks, Dally & Towles
 - Introduction to Parallel Algorithms and Architectures, Leighton
 - Transactional Memory, Larus & Rajwar
 - Memory Systems: Cache, DRAM, Disk, Jacob et al.
Multi-threaded programming assignment due in Feb
Final project report due in early May (will undergo conference-style peer reviewing)

3 More Logistics
Sign up for 7810 (3 credits) or for 7960 (2 credits)
Projects: simulation-based and creative; be prepared to spend time towards the end of the semester – more details on simulators in a few weeks
Grading:
 - 50% project
 - 20% multi-thread programming assignment
 - 10% paper presentation
 - 20% take-home final

4 Parallel Architecture Trends
(Figure: parallel architecture trends; source: Mark Hill, Ravi Rajwar)

5 CMP/SMT Papers
CMP/SMT/Multiprocessor papers in recent conferences:
 - ISCA:
 - HPCA:

6 Bottomline
 - Can't escape multi-cores today: it is the baseline architecture
 - Performance stagnates unless we learn to transform traditional applications into parallel threads
 - It's all about the data! Data management: distribution, coherence, consistency
 - It's also about the programming model: onus on application writer / compiler / hardware
 - It's also about managing on-chip communication

7 Multi-Core Cache Organizations
(Diagram: eight processors, each with a private L1 cache, connected over a bus to a banked shared L2)
 - Private L1 caches
 - Shared L2 cache
 - Bus between L1s and single L2 cache controller
 - Snooping-based coherence between L1s

8 Multi-Core Cache Organizations
(Diagram: eight processors with private L1 caches connected by a scalable network to a physically distributed shared L2)
 - Private L1 caches
 - Shared L2 cache, but physically distributed
 - Scalable network
 - Directory-based coherence between L1s

9 Multi-Core Cache Organizations
(Diagram: four processors with private L1 caches and four L2 banks on a shared bus)
 - Private L1 caches
 - Shared L2 cache, but physically distributed
 - Bus connecting the four L1s and four L2 banks
 - Snooping-based coherence between L1s

10 Multi-Core Cache Organizations
(Diagram: eight processors with private L1 and L2 caches connected by a scalable network to a separate directory D)
 - Private L1 caches
 - Private L2 caches
 - Scalable network
 - Directory-based coherence between L2s (through a separate directory)

11 Shared-Memory Vs. Message Passing
Shared-memory:
 - single copy of (shared) data in memory
 - threads communicate by reading/writing to a shared location
Message-passing:
 - each thread has a copy of data in its own private memory that other threads cannot access
 - threads communicate by passing values with SEND/RECEIVE message pairs

12 Shared Memory Architectures
Key differentiating feature: the address space is shared, i.e., any processor can directly address any memory location and access it with load/store instructions
Cooperation is similar to a bulletin board – a processor writes to a location and that location is visible to reads by other threads
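
As an illustration of this bulletin-board style of cooperation, here is a minimal pthreads sketch (not from the slides; the variable names and the use of a mutex plus a ready flag are illustrative assumptions):

    #include <pthread.h>
    #include <stdio.h>

    /* Shared "bulletin board": one thread posts a value, another reads it. */
    static int board_value;                 /* location visible to all threads  */
    static int board_ready = 0;             /* flag indicating the value is set */
    static pthread_mutex_t board_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  board_cond = PTHREAD_COND_INITIALIZER;

    static void *writer(void *arg)
    {
        pthread_mutex_lock(&board_lock);
        board_value = 42;                   /* ordinary store to shared memory  */
        board_ready = 1;
        pthread_cond_signal(&board_cond);
        pthread_mutex_unlock(&board_lock);
        return NULL;
    }

    static void *reader(void *arg)
    {
        pthread_mutex_lock(&board_lock);
        while (!board_ready)                /* wait until the writer has posted */
            pthread_cond_wait(&board_cond, &board_lock);
        printf("read %d from shared memory\n", board_value);  /* ordinary load */
        pthread_mutex_unlock(&board_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, reader, NULL);
        pthread_create(&t2, NULL, writer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Note that the communication itself is just a load and a store to a shared address; the mutex and condition variable only order the two accesses.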

13 Shared Address Space
(Diagram: the virtual address space of each process P1, P2, P3 contains a shared region and a private region; the shared regions all map to one shared portion of the physical address space, while the private regions map to distinct physical regions Pvt P1, Pvt P2, Pvt P3)

14 Message Passing
 - Programming model that can apply to clusters of workstations, SMPs, and even a uniprocessor
 - Sends and receives are used for effecting the data transfer – usually, each process ends up making a copy of data that is relevant to it
 - Each process can only name local addresses, other processes, and a tag to help distinguish between multiple messages
 - A send-receive match is a synchronization event – hence, we no longer need locks or barriers to co-ordinate

15 Models for SEND and RECEIVE
 - Synchronous: SEND returns control back to the program only when the RECEIVE has completed
 - Blocking Asynchronous: SEND returns control back to the program after the OS has copied the message into its space -- the program can now modify the sent data structure
 - Nonblocking Asynchronous: SEND and RECEIVE return control immediately – the message will get copied at some point, so the process must overlap some other computation with the communication – other primitives are used to probe if the communication has finished or not
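
These three flavors map roughly onto MPI primitives; the sketch below is an illustration and not part of the slides (MPI itself, the tag values, and the two-rank setup are assumptions): MPI_Ssend is synchronous, MPI_Send blocks only until the send buffer may be reused (often after an internal copy), and MPI_Isend is nonblocking, with MPI_Wait as the completion probe.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 123, recv = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Synchronous: returns only after the matching receive has started. */
            MPI_Ssend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Blocking asynchronous style: returns once the send buffer may be
               reused, possibly after the library has copied the message. */
            MPI_Send(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);

            /* Nonblocking: returns immediately; buf must not be modified until
               MPI_Wait reports completion, so other work can overlap here. */
            MPI_Isend(&buf, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req);
            /* ... overlap other computation ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int tag = 0; tag < 3; tag++)
                MPI_Recv(&recv, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", recv);
        }

        MPI_Finalize();
        return 0;
    }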

16 Deterministic Execution
 - Need synch after every anti-diagonal
 - Potential load imbalance
 - Shared-memory vs. message passing
 - Function of the model for SEND-RECEIVE
 - Function of the algorithm: diagonal, red-black ordering

17 Ocean Kernel
Procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] – temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure
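
For concreteness, a plain C rendering of this sequential kernel might look as follows. This is a sketch under assumptions not stated on the slide: the grid is (n+2) x (n+2) with a fixed outer boundary, "neighbors" means the four nearest neighbors, and TOL is picked arbitrarily.

    #include <math.h>
    #include <stdlib.h>

    #define TOL 1e-3f

    /* Gauss-Seidel-style sweep over an (n+2) x (n+2) grid; the outer ring of
       cells is a fixed boundary, so i and j run from 1 to n. */
    void solve(float **A, int n)
    {
        int done = 0;
        while (!done) {
            float diff = 0.0f;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= n; j++) {
                    float temp = A[i][j];
                    A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j] +
                                      A[i][j - 1] + A[i][j + 1]);
                    diff += fabsf(A[i][j] - temp);
                }
            }
            if (diff < TOL)
                done = 1;
        }
    }

    int main(void)
    {
        int n = 32;
        /* allocate an (n+2) x (n+2) grid of zeros, with one boundary cell set
           so the sweep has something to propagate */
        float **A = malloc((n + 2) * sizeof *A);
        for (int i = 0; i < n + 2; i++)
            A[i] = calloc(n + 2, sizeof **A);
        A[0][n / 2] = 1.0f;
        solve(A, n);
        for (int i = 0; i < n + 2; i++)
            free(A[i]);
        free(A);
        return 0;
    }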

18 Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
    BARRIER(bar1, nprocs);
  endwhile
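
The upper-case primitives (CREATE, LOCK/UNLOCK, BARRIER, G_MALLOC) are the textbook's portable macros rather than a real API. A rough pthreads rendering, offered only as an illustrative assumption (it presumes POSIX barriers are available and omits the actual sweep), might look like:

    #include <pthread.h>

    static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER; /* LOCKDEC */
    static pthread_barrier_t bar1;                                   /* BARDEC  */
    static float diff;
    static int nprocs = 4;

    static void *solve(void *arg)              /* body of procedure Solve */
    {
        float mydiff = 0.0f;
        /* ... sweep this thread's block of rows (mymin..mymax), accumulating
               changes into mydiff ... */

        pthread_mutex_lock(&diff_lock);        /* LOCK(diff_lock)       */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);      /* UNLOCK(diff_lock)     */
        pthread_barrier_wait(&bar1);           /* BARRIER(bar1, nprocs) */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        pthread_barrier_init(&bar1, NULL, nprocs);
        for (long p = 0; p < nprocs; p++)      /* CREATE(nprocs, Solve, A) */
            pthread_create(&tid[p], NULL, solve, (void *)p);
        for (int p = 0; p < nprocs; p++)       /* WAIT_FOR_END(nprocs)     */
            pthread_join(tid[p], NULL);
        return 0;
    }

G_MALLOC has no special pthreads counterpart: all heap memory is already shared among the threads of a process, which is exactly the point of the shared address space model.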

19 Message Passing Model
main()
  read(n); read(nprocs);
  CREATE(nprocs-1, Solve);
  Solve();
  WAIT_FOR_END(nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done = 0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc(…);
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        …
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile
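
The SEND/RECEIVE calls on the slide are generic primitives. As one hedged illustration of the ghost-row exchange at the top of the while loop, here is an MPI sketch (MPI, MPI_Sendrecv, the tag value, and the row layout of myA as nn+2 rows of n+2 floats are all assumptions made for this example):

    #include <mpi.h>

    #define ROW_TAG 0

    /* Exchange boundary ("ghost") rows with the neighbors above and below.
       myA holds nn+2 rows of n+2 floats; row 0 and row nn+1 are ghost rows. */
    static void exchange_rows(float *myA, int n, int nn, int pid, int nprocs)
    {
        int up   = (pid != 0)          ? pid - 1 : MPI_PROC_NULL;
        int down = (pid != nprocs - 1) ? pid + 1 : MPI_PROC_NULL;

        /* send my first real row up; receive my bottom ghost row from below */
        MPI_Sendrecv(&myA[1 * (n + 2)],        n + 2, MPI_FLOAT, up,   ROW_TAG,
                     &myA[(nn + 1) * (n + 2)], n + 2, MPI_FLOAT, down, ROW_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send my last real row down; receive my top ghost row from above */
        MPI_Sendrecv(&myA[nn * (n + 2)],       n + 2, MPI_FLOAT, down, ROW_TAG,
                     &myA[0 * (n + 2)],        n + 2, MPI_FLOAT, up,   ROW_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Using MPI_Sendrecv with MPI_PROC_NULL at the edges avoids both the boundary if-tests and the deadlock risk of pairing blocking SENDs; this is one place where the choice among the SEND/RECEIVE models from slide 15 matters in practice.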
