
1 Multiprocessors
Oracle SPARC M7: 32 cores, 64 MB L3 cache (8 x 8 MB), 1.6 TB/s. Per cluster: 256 KB 4-way set-associative L2 I-cache, 0.5 TB/s; 2 cores share a 256 KB 8-way set-associative L2 D-cache, 0.5 TB/s.

2 Outline
Flynn's Classification
Symmetric Multiprocessors, Distributed Memory Machines
Shared Memory and Message Passing programming models
Cache coherence: Snooping, Directory based

3 Flynn's Classification
Single instruction stream, single data stream (SISD): uniprocessor.
Single instruction stream, multiple data streams (SIMD): data-level parallelism; the same operation is applied to multiple data items in parallel. E.g., multimedia extensions, vector architectures. Applications: gaming, 3-dimensional real-time virtual environments.
Multiple instruction streams, single data stream (MISD).
Multiple instruction streams, multiple data streams (MIMD): thread-level parallelism.

4 SIMD Instructions
ADDVV V2, V0, V1
[Diagram: vectors V0 and V1 are added element by element into V2, for elements [0], [1], [2], ..., [VLR-1]; the Vector Length Register (VLR) sets how many elements are operated on.]
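As an illustrative sketch (not on the slide), the effect of ADDVV can be written as a scalar loop bounded by the vector length register:

    /* Element-wise vector add: what ADDVV V2, V0, V1 computes.              */
    /* VLR is the value held in the Vector Length Register; V0, V1 and V2    */
    /* are vector registers, modeled here as plain arrays.                   */
    void addvv(double V2[], const double V0[], const double V1[], int VLR) {
        for (int i = 0; i < VLR; i++)
            V2[i] = V0[i] + V1[i];   /* all VLR additions are independent, so
                                        hardware can perform them in parallel */
    }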

5 Multithreading

6 Motivation for Multiprocessing
Performance: clusters, Software as a Service, data-intensive applications
Natural parallelism in large scientific applications
Better return on investment by replicating current designs

7 Symmetric Multiprocessor (SMP)
[Diagram: several processors, each with one or more levels of private cache, connected to a shared cache, main memory, and I/O system.]
Also called symmetric shared memory or centralized shared memory; memory access time is the same for all processors (Uniform Memory Access).

8 Distributed Shared Memory
[Diagram: several multicore nodes, each with its own memory and I/O, connected by an interconnection network.]
Non-Uniform Memory Access (NUMA).

9 Distributed Shared Memory
Scalability to larger processor counts
Scales memory bandwidth to meet high demands
Low access latency to local memory
Communication infrastructure is more complex

10 Example
CPI = Base CPI + Remote request rate × Remote request cost
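A worked illustration with assumed numbers (not from the slide): suppose the base CPI is 0.5, 0.2% of instructions make a remote request, and a remote request costs 400 clock cycles. Then

CPI = 0.5 + 0.002 × 400 = 0.5 + 0.8 = 1.3

so in this hypothetical machine remote requests alone more than double the effective CPI, which is why data placement and locality matter on NUMA machines.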

11 Shared Memory vs. Message Passing
Shared Memory Machine: processors share the same physical address space.
Implicit communication via loads and stores.
Separate, explicit synchronization needed via fences, locks, and flags.
No need to know the destination when data is generated: store it in memory and the consumer picks it up later.
Easy for multiple threads accessing a shared table; needs locks and critical sections to synchronize access.
Message Passing Machine: memory is private.
Explicit send/receive to communicate.
A message carries both data and synchronization.
Need to know the destination when data is generated (send).
Easy for producer-consumer patterns.

12 Shared Memory vs. Message Passing
[Diagram: processors (P) and memories (M) connected by an interconnect; as each processor reads A, a copy of A is brought close to that processor.]

13 Ocean Kernel
procedure Solve(A)
begin
    diff = done = 0;
    while (!done) do
        diff = 0;
        for i ← 1 to n do
            for j ← 1 to n do
                temp = A[i,j];
                A[i,j] ← 0.2 * (A[i,j] + neighbors);   /* average with the neighboring elements */
                diff += abs(A[i,j] - temp);
            end for
        end for
        if (diff < TOL) then done = 1;
    end while
end procedure

14 Shared Address Space Model
int n, nprocs;                 /* grid size and number of processes        */
float **A, diff;               /* shared grid and shared convergence value */
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
    read(n); read(nprocs);
    A ← G_MALLOC();            /* allocate the grid in shared memory */
    initialize(A);
    CREATE(nprocs, Solve, A);  /* spawn nprocs workers running Solve  */
    WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
    int i, j, pid, done = 0;
    float temp, mydiff = 0;
    int mymin = 1 + (pid * n/nprocs);   /* each process owns a band of rows */
    int mymax = mymin + n/nprocs - 1;
    while (!done) do
        mydiff = diff = 0;
        BARRIER(bar1, nprocs);
        for i ← mymin to mymax do
            for j ← 1 to n do
                /* same stencil update as the sequential kernel, accumulating into mydiff */
            endfor
        endfor
        LOCK(diff_lock);
        diff += mydiff;                 /* add the local contribution under the lock */
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);
        if (diff < TOL) then done = 1;
    endwhile

15 Shared Address Space Model
(The Solve procedure from the previous slide, shown again.)

16 Message Passing Model
[Diagram: each thread owns a private copy of its rows (myA[1..4]); thread m sends a boundary row to its neighbor with MPI_Send(), and the neighbor obtains it with MPI_Receive().]

17 Message Passing Model
main()
    read(n); read(nprocs);
    CREATE(nprocs-1, Solve);
    Solve();
    WAIT_FOR_END(nprocs-1);

procedure Solve()
    int i, j, pid, nn = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0;
    myA ← malloc(...);                 /* each process allocates only its own band of rows */
    initialize(myA);
    while (!done) do
        mydiff = 0;
        /* exchange boundary rows with the neighboring processes */
        if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
        if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
        if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
        for i ← 1 to nn do
            for j ← 1 to n do
                /* stencil update on the local rows, accumulating into mydiff */
            endfor
        endfor
        /* reduce the partial diffs at process 0, then broadcast done */
        if (pid != 0)
            SEND(mydiff, 1, 0, DIFF);
            RECEIVE(done, 1, 0, DONE);
        else
            for i ← 1 to nprocs-1 do
                RECEIVE(tempdiff, 1, *, DIFF);
                mydiff += tempdiff;
            endfor
            if (mydiff < TOL) done = 1;
            for i ← 1 to nprocs-1 do
                SEND(done, 1, i, DONE);
            endfor
        endif
    endwhile

18 Multiprocessor Cache Coherence

19 Multiprocessor Cache Coherence
A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors.

20 Cache Coherence / Consistency
Consistency: when should a written value be available to a read? (Memory consistency models)
Coherence: which value should a read return?
A memory system is coherent if:
Write propagation: a write becomes visible after a sufficient time lapse.
Write serialization: all writes to a location are seen by every processor in the same order.
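A small illustration (not from the slides; the variable names are made up) of why these are two different questions:

    /* Shared variables, both initially 0 */
    int A = 0, flag = 0;

    /* Processor P1 */                /* Processor P2 */
    A = 1;                            while (flag == 0) { /* spin */ }
    flag = 1;                         print(A);

    /* Coherence is a per-location guarantee: P1's write A = 1 eventually
     * becomes visible to P2 (write propagation), and all processors see the
     * writes to A in the same order (write serialization).
     * Consistency is about ordering across locations: whether P2 is guaranteed
     * to print 1 depends on the memory consistency model, i.e. whether the
     * write to A is ordered before the write to flag from P2's point of view. */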

21 Cache Coherence
Directory-based protocols: sharing status is maintained in a directory.
Snooping protocols: sharing status is stored in each cache controller; the cache controllers snoop a broadcast medium.
Write-invalidate protocols: invalidate other processors' copies on a write.
Write-update protocols: update all data copies on a write.
Sharing status: Invalid (I), Shared (S) (or Clean), Modified (M) (or Dirty).
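A minimal sketch (assumed structure, not from the slides) of the per-line state machine of a snooping write-invalidate protocol using the I/S/M states listed above:

    typedef enum { INVALID, SHARED, MODIFIED } LineState;

    /* New state when this processor itself loads or stores the line. */
    LineState on_cpu_access(LineState s, int is_write) {
        if (is_write)
            return MODIFIED;                 /* write miss or upgrade: broadcast an invalidate, become M */
        return (s == INVALID) ? SHARED : s;  /* read miss fills the line in S; read hits keep the state  */
    }

    /* New state when a request from another processor is snooped on the bus. */
    LineState on_snoop(LineState s, int remote_is_write) {
        if (remote_is_write)
            return INVALID;                  /* a remote writer invalidates our copy (write back if M)    */
        if (s == MODIFIED)
            return SHARED;                   /* a remote reader: supply/write back the dirty data, keep S */
        return s;
    }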

22 SMP - Write Invalidate
CPU A reads X: cache miss; the block is fetched from memory and A's copy enters the Shared state.

23 SMP - Write Invalidate
CPU B reads X: cache miss; the block is fetched from memory and both A and B now hold X in the Shared state.

24 SMP - Write Invalidate
CPU A writes X: an invalidate for X is broadcast; B's copy becomes Invalid and A's copy becomes Modified.

25 SMP - Write Invalidate
CPU B reads X: cache miss; A's copy is Modified, so memory is stale and the up-to-date data must come from A.

26 SMP - Write Invalidate
CPU A writes the Modified block back to memory; the read then completes and both A and B hold X in the Shared state.

27 Write Invalidate Coherence Protocol
Implementation issues:
Writeback vs. writethrough caches
Contention for the cache tags between the processor and the snoop logic
Enforcing write serialization through bus arbitration

28 SMP Example
Access sequence: A: Rd X, B: Rd X, C: Rd X, A: Wr X, C: Wr X, A: Rd Y, B: Wr X, B: Rd Y, B: Wr Y
[Diagram: processors A, B, C, and D, each with its own caches, sharing main memory and an I/O system.]
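One possible walk-through, assuming the basic write-invalidate (MSI) behavior from the preceding slides; states are listed only for the caches that hold the block:

    A: Rd X  -> read miss; A holds X in Shared
    B: Rd X  -> read miss; A and B hold X in Shared
    C: Rd X  -> read miss; A, B, and C hold X in Shared
    A: Wr X  -> invalidate broadcast; A holds X in Modified, B and C become Invalid
    C: Wr X  -> A writes back and is invalidated; C holds X in Modified
    A: Rd Y  -> read miss; A holds Y in Shared
    B: Wr X  -> C writes back and is invalidated; B holds X in Modified
    B: Rd Y  -> read miss; A and B hold Y in Shared
    B: Wr Y  -> invalidate broadcast; B holds Y in Modified, A's copy of Y becomes Invalid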

29 SMP Cache Coherence
MSI Protocol
MESI Protocol: adds an Exclusive state; a write to an Exclusive line needs no invalidate message. Intel i7 uses MESIF.
MOESI Protocol: adds an Owned state; the owning cache holds the most recent copy (the main memory copy is stale) and supplies the data on a miss.
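A rough sketch (assumed names, not from the slides) of where the Exclusive state pays off: the fill state after a read miss depends on whether another cache already holds the line, and only Shared or Invalid lines force bus traffic on a write.

    typedef enum { I, S, E, M } MesiState;

    /* Fill state chosen when a read miss completes. */
    MesiState on_read_miss(int another_cache_has_copy) {
        return another_cache_has_copy ? S : E;   /* sole copy -> Exclusive */
    }

    /* Does a store need a bus transaction first? */
    int write_needs_bus_transaction(MesiState s) {
        return (s == S) || (s == I);   /* S: broadcast invalidate; I: write miss.
                                          E or M: write silently, no bus traffic. */
    }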

30 Directory Based Cache Coherence
Physical memory is distributed among all the processors.
The directory is distributed with the memory and keeps track of the sharing status of each block.
The physical address determines where the data (its home node) is located.
Point-to-point messages between nodes are sent over an interconnection network (ICN).
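A minimal sketch (an assumed layout, not from the slides) of what the home node might keep per memory block: a state plus a bit vector of sharers, so invalidations can be sent point to point only to the nodes whose bits are set.

    #define MAX_NODES 64

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } DirState;

    /* One directory entry per memory block, stored at the block's home node. */
    typedef struct {
        DirState      state;                    /* uncached, shared, or modified (single owner) */
        unsigned char sharers[MAX_NODES / 8];   /* bit vector: which nodes hold a copy          */
    } DirEntry;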

31 Directory Based Cache Coherence
A: Read X. The home node supplies X over the interconnection network; the directory records state Shared with sharer set {A}, and A's private cache holds X in the Shared state.

32 Directory Based Cache Coherence
B: Read X. The home node supplies X; the directory's sharer set becomes {A, B}, and both A's and B's caches hold X in the Shared state.

33 Directory Based Cache Coherence
A: Write X. The directory sends an invalidation for X to B, B acknowledges, and the directory records X as Modified with owner A; A's copy is now Modified.

34 Directory Based Cache Coherence
Further cases to work through from this state: C: Write X; B: Read X; C and A: Write X.

35 Multiprocessor Performance
Amdahl's Law limits the speedup.
Coherence miss (a fourth category beyond the 3 Cs): the miss would not have occurred if another processor had not written to the same cache line; it would not be a miss on a uniprocessor.
False coherence (false sharing) miss: another processor wrote to a different word in the same cache line; it would not be a miss if the cache line were one word long.
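A minimal sketch of false sharing using POSIX threads (the counts and layout are illustrative): the two threads never touch the same word, yet each increment invalidates the other core's copy of the line, so both suffer coherence misses.

    #include <pthread.h>
    #include <stdio.h>

    /* counters[0] and counters[1] almost certainly fall in the same cache line,
     * so the two threads' writes ping-pong the line between their caches.      */
    long counters[2];

    /* Padding each counter out to its own 64-byte line would remove the misses:
     *   struct { long v; char pad[56]; } counters[2];                          */

    void *worker(void *arg) {
        long idx = (long)arg;
        for (long i = 0; i < 100000000; i++)
            counters[idx]++;   /* writes our own word, but the whole line is invalidated remotely */
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0);
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }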

36 References
D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.
R. Balasubramonian, N. Jouppi, and N. Muralimanohar. Multi-Core Cache Hierarchies. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2011.

37 Slides Contents
Rajeev Balasubramonian, CS6810, University of Utah.
David Wentzlaff, ELE475, Princeton University.
Matthew T. Jacob, "High Performance Computing", IISc/NPTEL.
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed.

38 Shared Memory vs. Message Passing
Shared Memory Machine: processors share the same physical address space; implicit communication; hardware-controlled cache coherence.
Message Passing Machine: explicit, programmed communication; no cache coherence (simpler hardware); message passing libraries such as MPI.
[Diagram: a shared-memory machine with processors (P) and caches (C) in front of a common main memory, contrasted with a message-passing machine in which each processor has its own memory (M) and nodes communicate over an interconnect.]

