
Multiprocessors
Example: Oracle SPARC M7. 32 cores; 64 MB L3 cache (8 x 8 MB), 1.6 TB/s; a 256 KB 4-way set-associative L2 I-cache per cluster, 0.5 TB/s; 2 cores share a 256 KB 8-way set-associative L2 D-cache, 0.5 TB/s.

Outline
- Flynn's Classification
- Symmetric Multiprocessors, Distributed Memory Machines
- Shared Memory and Message Passing programming models
- Cache coherence: Snooping, Directory based

Flynn's Classification
- Single instruction stream, single data stream (SISD): conventional uniprocessor.
- Single instruction stream, multiple data streams (SIMD): data-level parallelism; the same operation is applied to multiple data items in parallel. Examples: multimedia extensions, vector architectures. Applications: gaming, 3-dimensional real-time virtual environments.
- Multiple instruction streams, single data stream (MISD): no commercial machines of this type have been built.
- Multiple instruction streams, multiple data streams (MIMD): thread-level parallelism.

SIMD Instructions
ADDVV V2, V0, V1: element-wise vector addition, V2[i] = V0[i] + V1[i] for i = 0 to VLR-1, where VLR is the Vector Length Register.
[Figure: per-element adders combining V0 and V1 into V2]
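As a rough scalar model of what ADDVV computes (a sketch; the function name and the use of double elements are illustrative, not a real ISA interface):

    #include <stddef.h>

    /* Scalar model of ADDVV V2, V0, V1: one add per element, gated by
     * the vector length register (vlr). In hardware, the per-element
     * adds proceed in parallel across the vector lanes. */
    void addvv(const double *v0, const double *v1, double *v2, size_t vlr)
    {
        for (size_t i = 0; i < vlr; i++)
            v2[i] = v0[i] + v1[i];
    }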

Multithreading

Motivation for Multiprocessing
- Performance
- Clusters, Software as a Service
- Data-intensive applications
- Natural parallelism in large scientific applications
- More return on investment from replicating current designs

Symmetric Multiprocessor (SMP)
Each processor has one or more levels of private cache; all processors access a shared cache, main memory, and the I/O system. Also called symmetric shared memory or centralized shared memory; memory latency is the same for every processor (Uniform Memory Access).
[Figure: processors, each with one or more levels of cache, connected to a shared cache, main memory, and the I/O system]

Distributed Shared Memory
Multicore multiprocessor nodes, each with local memory and I/O, connected by an interconnection network. Memory latency depends on where the data resides (Non-Uniform Memory Access).
[Figure: multicore MP nodes, each with memory and I/O, joined by an interconnection network]

Distributed Shared Memory
- Scalability
- Handles high memory bandwidth demands
- Low access latency to local memory
- Communication infrastructure is complex

Example
CPI = Base CPI + Remote request rate x Remote request cost
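A worked instance with illustrative numbers (assumed here, not from the slide): if the base CPI is 0.5, 0.2% of references are remote requests, and a remote request costs 400 cycles, then CPI = 0.5 + 0.002 x 400 = 1.3, so remote traffic alone more than doubles the effective CPI.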

Shared Memory vs. Message Passing
Shared Memory Machine:
- Processors share the same physical address space
- Implicit communication via loads and stores
- Synchronization must be made explicit via fences, locks, and flags
- No need to know the destination when data is generated (store to memory; the consumer picks it up later)
- Easy for multiple threads to access a shared table, but locks and critical sections are needed to synchronize access
Message Passing Machine:
- Memory is private
- Explicit send/receive to communicate
- A message carries both data and synchronization
- Need to know the destination when data is generated (send)
- Natural fit for producer-consumer patterns

Shared Memory vs. Message Passing
[Figure: processors (P) and memories (M) on an interconnect; on Read A, a copy of A moves across the interconnect to the reader]

Ocean Kernel
procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure
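One plausible C rendering of this kernel (a sketch: it assumes "neighbors" means the four-point sum of the adjacent elements, an (N+2) x (N+2) grid with boundary rows/columns, and an arbitrary TOL):

    #include <math.h>

    #define N   256
    #define TOL 1e-3

    /* Gauss-Seidel sweeps over the interior of an (N+2) x (N+2) grid,
     * iterating until the total absolute change in one sweep < TOL. */
    void solve(double A[N + 2][N + 2])
    {
        int done = 0;
        while (!done) {
            double diff = 0.0;
            for (int i = 1; i <= N; i++) {
                for (int j = 1; j <= N; j++) {
                    double temp = A[i][j];
                    A[i][j] = 0.2 * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                     + A[i][j - 1] + A[i][j + 1]);
                    diff += fabs(A[i][j] - temp);
                }
            }
            if (diff < TOL)
                done = 1;
        }
    }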

Shared Address Space Model
int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax do
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
  endwhile
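A compilable C/pthreads sketch of the same synchronization skeleton (the grid sweep is stubbed out, the helper name sweep_my_rows is mine, and a third barrier is added so no thread resets diff or re-tests done too early; real code would also make the shared flags atomic):

    #include <pthread.h>

    #define NPROCS 4
    #define TOL    1e-3

    static double diff;   /* shared accumulated change */
    static int    done;   /* shared convergence flag   */
    static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar1;

    static double sweep_my_rows(int pid)
    {
        (void)pid;        /* grid update elided; would return this thread's mydiff */
        return 0.0;
    }

    static void *solve(void *arg)
    {
        int pid = (int)(long)arg;
        while (!done) {
            if (pid == 0) diff = 0.0;         /* one thread resets the sum */
            pthread_barrier_wait(&bar1);
            double mydiff = sweep_my_rows(pid);
            pthread_mutex_lock(&diff_lock);   /* serialized reduction      */
            diff += mydiff;
            pthread_mutex_unlock(&diff_lock);
            pthread_barrier_wait(&bar1);      /* all contributions are in  */
            if (pid == 0 && diff < TOL) done = 1;
            pthread_barrier_wait(&bar1);      /* everyone sees updated done */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROCS];
        pthread_barrier_init(&bar1, NULL, NPROCS);
        for (long i = 0; i < NPROCS; i++)
            pthread_create(&t[i], NULL, solve, (void *)i);
        for (int i = 0; i < NPROCS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&bar1);
        return 0;
    }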

Message Passing Model
Each thread owns a private slice myA[...] of the grid; boundary rows are exchanged with neighboring threads.
[Figure: thread m sends its boundary rows to threads m-1 and m+1 via MPI_Send() and receives theirs via MPI_Receive()]

Message Passing Model
main()
  read(n); read(nprocs);
  CREATE(nprocs-1, Solve);
  Solve();
  WAIT_FOR_END(nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done = 0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc(…);
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        …
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile
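In real MPI the hand-rolled DIFF/DONE exchange collapses into one collective, and MPI_Sendrecv avoids the deadlock risk of paired blocking sends. A C sketch under those substitutions (grid sweep elided; constants are mine):

    #include <mpi.h>
    #include <stdlib.h>

    #define N   256
    #define TOL 1e-3

    int main(int argc, char **argv)
    {
        int pid, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nn = N / nprocs;                              /* rows owned by this rank */
        double *myA = calloc((size_t)(nn + 2) * N, sizeof *myA); /* +2 halo rows */
        int done = 0;

        while (!done) {
            /* Halo exchange with both neighbors (tag 0 plays the role of ROW). */
            if (pid != 0)
                MPI_Sendrecv(&myA[1 * N], N, MPI_DOUBLE, pid - 1, 0,
                             &myA[0 * N], N, MPI_DOUBLE, pid - 1, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (pid != nprocs - 1)
                MPI_Sendrecv(&myA[nn * N], N, MPI_DOUBLE, pid + 1, 0,
                             &myA[(nn + 1) * N], N, MPI_DOUBLE, pid + 1, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            double mydiff = 0.0;  /* grid sweep elided; would accumulate |change| */

            /* One collective replaces the explicit DIFF gather and DONE broadcast. */
            double diff;
            MPI_Allreduce(&mydiff, &diff, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            done = (diff < TOL);
        }

        free(myA);
        MPI_Finalize();
        return 0;
    }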

Multiprocessor Cache Coherence

Multiprocessor Cache Coherence
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors.
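A concrete instance of the problem these conditions rule out (an illustrative scenario, not from the slide): X = 0 in memory, and CPUs A and B both hold X in their write-back caches. A writes X = 1. With no coherence mechanism, B's subsequent reads keep hitting its stale cached 0 indefinitely, violating condition 2.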

Cache Coherence
- Consistency: when should a written value become visible to a read? (Memory Consistency Models)
- Coherence: which value should a read return?
A memory system is coherent if:
- Write propagation: a write becomes visible after a sufficient time lapse
- Write serialization: all writes to a location are seen by every processor in the same order

Cache Coherence
- Directory-based protocols: sharing status is maintained in a directory
- Snooping protocols: sharing status is stored in the cache controller, which snoops the broadcast medium
- Write-invalidate protocols: invalidate other processors' copies on a write
- Write-update protocols: update all data copies on a write
Sharing status: Invalid (I), Shared (S) (or Clean), Modified (M) (or Dirty)
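These three states form a small finite-state machine per cache line. A minimal C sketch of MSI next-state logic (the event encoding and function names are mine, for illustration only):

    enum msi_state { MSI_I, MSI_S, MSI_M };

    /* Next state of a line in this cache for a local processor access. */
    enum msi_state on_cpu(enum msi_state s, int is_write)
    {
        if (is_write)
            return MSI_M;                    /* write: gain exclusive ownership */
        return (s == MSI_M) ? MSI_M : MSI_S; /* read: keep M if dirty, else S   */
    }

    /* Next state when a bus transaction for this line is snooped. */
    enum msi_state on_snoop(enum msi_state s, int other_is_write)
    {
        if (other_is_write)
            return MSI_I;                    /* another core writes: invalidate */
        return (s == MSI_M) ? MSI_S : s;     /* another core reads: M writes back
                                                and downgrades; S/I unchanged   */
    }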

SMP - Write Invalidate
CPU A reads X: cache miss; X is fetched from memory and cached at A in the Shared state.

SMP - Write Invalidate
CPU B reads X: cache miss; X is fetched from memory; A and B now both hold X in the Shared state.

SMP - Write Invalidate
CPU A writes X: a Write Invalidate for X is broadcast on the bus; B's copy becomes Invalid; A's copy becomes Modified.

SMP - Write Invalidate
CPU B reads X: cache miss; A holds the only valid copy (Modified), B's copy is Invalid.

SMP - Write Invalidate
CPU B reads X (continued): A writes the block back to memory and supplies it; both A and B end up in the Shared state.

Write Invalidate Coherence Protocol
Implementation issues:
- Write-back vs. write-through caches
- Contention for cache tags (between processor accesses and bus snoops)
- Enforcing write serialization via bus arbitration

SMP Example
Access sequence: A: Rd X, B: Rd X, C: Rd X, A: Wr X, C: Wr X, A: Rd Y, B: Wr X, B: Rd Y, B: Wr Y
[Figure: processors A-D, each with private caches, on a shared bus to main memory and the I/O system]
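One way this trace plays out under an MSI write-invalidate snooping protocol (assuming X and Y are distinct, non-conflicting cache lines):
1. A: Rd X. Miss; A holds X in S.
2. B: Rd X. Miss; A, B hold X in S.
3. C: Rd X. Miss; A, B, C hold X in S.
4. A: Wr X. Invalidate on the bus; A: M; B, C: I.
5. C: Wr X. A (owner) writes back and is invalidated; C: M.
6. A: Rd Y. Miss; A holds Y in S.
7. B: Wr X. C writes back and is invalidated; B: M for X.
8. B: Rd Y. Miss; A, B hold Y in S.
9. B: Wr Y. Invalidate; A's copy of Y goes I; B: M for Y.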

SMP Cache Coherence
- MSI protocol
- MESI protocol. The Exclusive state avoids invalidate messages on writes to a line no other cache holds. Intel i7 uses MESIF.
- MOESI protocol. The Owned state marks the only valid copy in the system; the main memory copy is stale, and the owner supplies the data on a miss.

Directory Based Cache Coherence
- Physical memory is distributed among all processors
- The directory is also distributed; it keeps track of the sharing status of each block
- The physical address determines where the data (and its directory entry) lives
- Point-to-point messages between nodes are sent over an interconnection network (ICN)
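A common representation of a directory entry is a state plus a bit vector of sharers; a C sketch (field names and the 64-node limit are my choices, not from the slide):

    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

    /* One directory entry per memory block, kept at the block's home node. */
    struct dir_entry {
        enum dir_state state;
        uint64_t sharers;   /* bit i set => node i holds a copy (up to 64 nodes) */
    };

    /* On a write request, every current sharer except the writer is invalidated. */
    uint64_t sharers_to_invalidate(const struct dir_entry *e, int writer)
    {
        return e->sharers & ~(1ULL << writer);
    }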

Directory Based Cache Coherence
A: Read X: the request goes to X's home node over the interconnection network; the directory records the sharer set S: A; A caches X in the Shared state.
[Figure: CPUs A, B, C with private caches; each node has memory (M) and a directory (D), connected by an interconnection network]

Directory Based Cache Coherence
B: Read X: the home directory supplies X and updates the sharer set from S: A to S: A, B; B caches X in the Shared state.

Directory Based Cache Coherence
A: Write X: the directory sends an invalidate for X to B; B acknowledges (ACK); the entry changes from S: A, B to M: A; A's copy becomes Modified.

Directory Based Cache Coherence
Later requests follow the same pattern: on B: Read X, the directory forwards the request to owner A, which supplies the data and downgrades; on C: Write X, the directory invalidates the current sharers and records M: C.

Multiprocessor Performance
- Amdahl's Law
- Coherence miss (a fourth category beyond the 3Cs): would not have occurred if another processor had not written to the same cache line; not a miss in a uniprocessor
- False coherence miss: another word in the same cache line is written by another processor; would not be a miss if the cache line were 1 word
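A small C illustration of the false-sharing case (a sketch; the 64-byte figure assumes a typical cache line size):

    /* Two counters in one cache line: a write by one thread invalidates
     * the line in the other thread's cache even though the data is
     * disjoint, producing false coherence misses. */
    struct shared_bad {
        long a;   /* written by thread 0 */
        long b;   /* written by thread 1 */
    };

    /* Giving each counter its own 64-byte line removes those misses
     * (C11 _Alignas; assumes 64-byte cache lines). */
    struct shared_good {
        _Alignas(64) long a;
        _Alignas(64) long b;
    };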

References
- D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.
- R. Balasubramonian, N. P. Jouppi, and N. Muralimanohar. Multi-Core Cache Hierarchies. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.

Slide Contents
- Rajeev Balasubramonian, CS6810, University of Utah
- David Wentzlaff, ELE475, Princeton University
- Matthew T. Jacob, "High Performance Computing", IISc/NPTEL
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed.

Shared Memory vs. Message Passing
Shared Memory Machine:
- Processors share the same physical address space
- Implicit communication; hardware-controlled cache coherence
Message Passing Machine:
- Explicit, programmed communication
- No cache coherence (simpler hardware)
- Message passing libraries: MPI
[Figure: shared-memory organization (processor/cache pairs over one main memory) vs. message-passing organization (processor/cache/memory nodes on an interconnect)]