1 Chapter 5 Multiprocessor and Thread-Level Parallelism
Introduction and Taxonomy; SMP Architectures and Snooping Protocols; Distributed Shared-Memory Architectures; Performance Evaluations; Synchronization Issues; Memory Consistency

2 Introduction
Thread-level parallelism: multiple program counters, using the MIMD model. Targeted at tightly coupled shared-memory multiprocessors. For n processors, we need n threads. The amount of computation assigned to each thread is the grain size. Threads can also be used for data-level parallelism, but the overheads may outweigh the benefit.
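As a concrete illustration of thread-level parallelism (not from the slides), here is a minimal POSIX-threads sketch in C: each thread has its own program counter and works on its own contiguous chunk of the data, so the chunk size plays the role of the grain size. The thread count, array size, and reduction are illustrative assumptions.

```c
/* Minimal thread-level parallelism sketch (POSIX threads).
 * Illustrative only: N_THREADS, the array, and the work split
 * are assumptions, not values from the slides. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N         1000000

static double a[N];

typedef struct { int lo, hi; double partial; } work_t;

/* Each thread has its own program counter and sums its own
 * contiguous chunk of the array (the "grain"). */
static void *sum_chunk(void *arg) {
    work_t *w = (work_t *)arg;
    double s = 0.0;
    for (int i = w->lo; i < w->hi; i++)
        s += a[i];
    w->partial = s;
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    work_t w[N_THREADS];
    int grain = N / N_THREADS;          /* amount of work per thread */

    for (int i = 0; i < N; i++) a[i] = 1.0;

    for (int t = 0; t < N_THREADS; t++) {
        w[t].lo = t * grain;
        w[t].hi = (t == N_THREADS - 1) ? N : (t + 1) * grain;
        pthread_create(&tid[t], NULL, sum_chunk, &w[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += w[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}
```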

3 Types
Symmetric multiprocessors (SMP): a small number of cores sharing a single memory with uniform memory latency. Distributed shared memory (DSM): memory is distributed among the processors, giving non-uniform memory access/latency (NUMA); processors are connected via direct (switched) or indirect (multi-hop) interconnection networks.

4 Parallel Computers Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989.) Questions about parallel computers: How large a collection? How powerful are the processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are the HW and SW primitives for the programmer? Does it translate into performance?

5 What Level of Parallelism?
Bit-level parallelism: 1970 to ~1985 (4-bit, 8-bit, 16-bit, 32-bit microprocessors). Instruction-level parallelism (ILP): ~1985 to today (pipelining; superscalar, out-of-order execution; VLIW). Are there limits to the benefits of ILP? Process-level or thread-level parallelism: mainstream for general-purpose computing? Servers are parallel, and high-end desktops are dual-core PCs (more cores today!). What about future CMPs?

6 Flynn’s Taxonomy Flynn classified machines by their data and control streams in 1966.
SIMD → data-level parallelism; MIMD → thread-level parallelism. MIMD is popular because it is flexible (it can run N separate programs or 1 multithreaded program) and cost-effective (it uses the same MPUs as desktops). The four classes: Single Instruction, Single Data (SISD): the uniprocessor. Single Instruction, Multiple Data (SIMD): a single PC, e.g. vector machines, CM-2. Multiple Instruction, Single Data (MISD): no real examples. Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers.

7 Back to Basics “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Parallel architecture = computer architecture + communication architecture. Two classes of multiprocessors with respect to memory: 1. Centralized-memory multiprocessor (SMP): fewer than a few dozen processor chips (and fewer than ~100 cores) in 2006, small enough to share a single, centralized memory. 2. Physically distributed-memory multiprocessor: a larger number of chips and cores than class 1; bandwidth demands force the memory to be distributed among the processors.

8 Centralized vs. Distributed Memory
[Figure: processors P1..Pn, each with a cache ($), attached to an interconnection network. In the centralized-memory organization a single memory sits behind the network; in the distributed-memory organization a memory is attached to each processor. Scale grows from the first organization to the second.]

9 Two Models for Communication and Memory Architecture
Communication by explicitly passing messages among the processors: message-passing multiprocessors. Communication through a shared address space (via loads and stores): shared-memory multiprocessors, either UMA (Uniform Memory Access time) for a shared address space with centralized memory, or NUMA (Non-Uniform Memory Access time) for a shared address space with distributed memory. In the past there was confusion over whether “sharing” means sharing physical memory (symmetric MP) or sharing the address space (SVM).
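A minimal sketch of the message-passing model, assuming MPI is available (the ranks, tag, and payload are illustrative): communication happens only through explicit send/receive calls, in contrast to the shared-address-space model, where ordinary loads and stores communicate through memory.

```c
/* Message-passing sketch with MPI. Rank 0 produces a value that rank 1
 * can obtain only through an explicit receive; there is no shared
 * address space. Tag, count, and payload are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                       /* produced locally */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                      /* explicit communication */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```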

10 Challenges of Parallel Processing
The first challenge is the fraction of the program that is inherently sequential. Suppose we want an 80× speedup from 100 processors. What fraction of the original program can be sequential? (Amdahl's Law) 10%? 5%? 1%? <1%?

11 Amdahl’s Law Answers
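The slide's answer is not captured in the transcript; a worked derivation from Amdahl's Law with the numbers above (assuming "80× speedup" means overall speedup of 80 on 100 processors):

```latex
\[
\text{Speedup} \;=\; \frac{1}{(1 - f_{\text{par}}) + \frac{f_{\text{par}}}{100}} \;=\; 80
\quad\Longrightarrow\quad
(1 - f_{\text{par}}) + \frac{f_{\text{par}}}{100} \;=\; \frac{1}{80} \;=\; 0.0125
\]
\[
1 - 0.99\,f_{\text{par}} = 0.0125
\quad\Longrightarrow\quad
f_{\text{par}} = \frac{0.9875}{0.99} \approx 0.9975
\quad\Longrightarrow\quad
1 - f_{\text{par}} \approx 0.0025
\]
```

So at most about 0.25% of the original program can be sequential: of the choices listed, the answer is <1%.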

12 Challenges of Parallel Processing
The second challenge is the long latency to remote memory (a remote access takes much longer than a local one). Suppose a 32-CPU MP at 2 GHz with 200 ns remote memory access, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. (A remote access costs 200 ns / 0.5 ns per cycle = 400 clock cycles.) What is the performance impact if 0.2% of instructions involve a remote access? 1.5×? 2.0×? 2.5×?

13 CPI Equation CPI = Base CPI + Remote request rate × Remote request cost = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3. The machine with no communication is 1.3/0.5 = 2.6× faster than the one in which only 0.2% of instructions involve a slow remote access. Out-of-order execution helps somewhat by overlapping part of this latency.

14 Challenges of Parallel Processing
Application parallelism → primarily via new algorithms that have better parallel performance. The long remote-latency impact is addressed both by the architect and by the programmer, for example by reducing the frequency of remote accesses: caching shared data (HW), restructuring the data layout to make more accesses local (SW), or data prefetching. → Today's lecture covers the HW solution: using caches to help hide latency in MP systems.

15 Cache Coherence
Centralized shared-memory architectures: processors may see different values for the same location through their private caches, because one of the copies has become stale.

16 Example Cache Coherence Problem
[Figure: processors P1, P2, P3, each with a cache; memory initially holds u = 5. Events: (1) P1 reads u (gets 5), (2) P3 reads u (gets 5), (3) P3 writes u = 7, (4) P1 reads u (= ?), (5) P2 reads u (= ?).] Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its copy, so processes accessing main memory may see a very stale value. This is unacceptable for programming, and it happens frequently!

17 Cache Coherence and Consistency
Coherence: all reads by any processor must return the most recently written value, and writes to the same location by any two processors are seen in the same order by all processors. Coherence concerns accesses to the same memory location. Consistency: determines when a written value will be returned by a read; if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A. Consistency involves more than one memory location.
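A small C11 sketch (my own illustration, not from the slides) of why consistency involves more than one location: the producer writes location A (data) and then location B (flag); on a machine that preserves this order for all observers, a consumer that sees the new flag must also see the new data. The seq_cst atomics are an assumption used to make the example well-defined C; the slides discuss the hardware model, not the C language.

```c
/* Producer/consumer flag idiom. Under sequential consistency, a consumer
 * that observes flag == 1 must also observe data == 42. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int data;                     /* "location A" */
static atomic_int flag;              /* "location B", zero-initialized */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                               /* write A ... */
    atomic_store_explicit(&flag, 1, memory_order_seq_cst);   /* ... then B */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_seq_cst) == 0)
        ;                                                    /* wait for new B */
    printf("data = %d\n", data);                             /* must print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```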

18 What Does Coherency Mean?
Informally: “any read must return the most recent write.” This is too strict and too difficult to implement (there is no global clock). Better: “any write must eventually be seen by a read,” and all writes are seen in proper order (“serialization”). Two rules ensure this: (1) if P writes x and P1 later reads it, P's write will be seen by P1 provided the read and write are sufficiently far apart in time; (2) writes to a single location are serialized, i.e. seen in the same order by all processors, so the latest write will be seen. Otherwise a processor could see writes in an illogical order (an older value after a newer value).

19 Potential HW Coherency Solutions
Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see if they have a copy and respond accordingly. Requires broadcast, since the caching information lives at the processors. Works well with a bus (a natural broadcast medium) and dominates for small-scale machines (most of the market). Directory-based schemes (discussed later): keep track of what is being shared in one, logically centralized place; with distributed memory, the directory is distributed as well for scalability (avoiding bottlenecks); requests are sent point-to-point to processors via the network. Scales better than snooping, and actually existed BEFORE snooping-based schemes.

20 Basic Snoopy Protocols
Write-invalidate protocol: multiple readers, single writer. On a write to shared data, an invalidate is sent to all caches, which snoop and invalidate any copies. On a read miss: with write-through, memory is always up to date; with write-back, snoop in the caches to find the most recent copy. Write-broadcast (update) protocol, typically write-through: on a write to shared data, broadcast on the bus; processors snoop and update any copies. On a read miss, memory is always up to date. Write serialization: the shared bus serializes requests! The bus is the single point of arbitration.

21 Snoopy Cache-Coherence Protocols
[Figure: each cache line holds State, Address (tag), and Data fields, forming the cache directory.] The cache controller “snoops” all transactions on the shared medium (bus or switch). A transaction is relevant if it is for a block this cache contains; if so, the controller takes action to ensure coherence (invalidate, update, or supply the value), depending on the state of the block and the protocol. Either get exclusive access before a write (write invalidate) or update all copies on a write (write update).

22 Example: Write-thru Invalidate
[Figure: the same u = 5 example with a write-through invalidate protocol; P3's write of u = 7 goes through to memory, and the other cached copies of u must be invalidated no later than step 3 (the write), so subsequent reads return 7.] Must invalidate before step 3! Write update uses more of the broadcast medium's bandwidth, so all recent MPUs use write invalidate.

23 Snoopy Coherence Protocols
Write invalidate: on a write, invalidate all other copies; use the bus itself to serialize writes, so a write cannot complete until bus access is obtained. Write update: on a write, update all copies. [Figure: shows where a cache-to-cache transfer supplies the data and where memory reflection is used.]

24 An Example Snoopy Protocol
Write-invalidation protocol with write-back caches. Each block of memory is in one of three states: clean in all caches and up to date in memory (Shared), OR dirty in exactly one cache (Exclusive), OR not in any cache. Each cache block is in one of three states (the cache tracks these): Shared, the block can be read and may be in multiple caches; Exclusive, this cache has the only copy, it is writeable, and it is dirty; Invalid, the block contains no valid data. Read misses cause all caches to snoop the bus. Writes to a clean line are treated as misses. The cache must serve requests both from the CPU and from the bus!

25 Snoopy-Cache State Machine-I
State machine for CPU requests, for each cache block (states: Invalid, Shared, Exclusive):
Invalid: CPU read → place read miss on bus, go to Shared. CPU write → place write miss on bus, go to Exclusive.
Shared (read only): CPU read hit → no action. CPU read miss (the block in this frame is replaced; note the original state of the replaced block!) → place read miss on bus, remain Shared. CPU write hit → place invalidate on bus, go to Exclusive. CPU write miss (replacing a Shared block) → place write miss on bus, go to Exclusive.
Exclusive (read/write): CPU read hit or CPU write hit → no action. CPU read miss → write back the block, place read miss on bus, go to Shared. CPU write miss (replacing an Exclusive block) → write back the cache block, place write miss on bus, go to Exclusive.
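A hedged C sketch of the CPU-request side of this state machine; the enum and function names are my own, and tag checks, data movement, and bus arbitration are omitted.

```c
/* CPU-side MSI transitions for one cache block (sketch only).
 * Enum names and the bus_action type are illustrative assumptions. */
typedef enum { INVALID, SHARED, EXCLUSIVE } blk_state_t;
typedef enum { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS,
               BUS_INVALIDATE, BUS_WRITE_BACK } bus_action_t;
typedef enum { CPU_READ_HIT, CPU_READ_MISS,
               CPU_WRITE_HIT, CPU_WRITE_MISS } cpu_event_t;

/* Returns the next state and reports up to two bus actions in act[]. */
static blk_state_t cpu_side(blk_state_t s, cpu_event_t ev, bus_action_t act[2]) {
    act[0] = act[1] = BUS_NONE;
    switch (s) {
    case INVALID:
        if (ev == CPU_READ_MISS)  { act[0] = BUS_READ_MISS;  return SHARED; }
        if (ev == CPU_WRITE_MISS) { act[0] = BUS_WRITE_MISS; return EXCLUSIVE; }
        break;
    case SHARED:
        if (ev == CPU_READ_HIT)   return SHARED;
        if (ev == CPU_READ_MISS)  { act[0] = BUS_READ_MISS;  return SHARED; }
        if (ev == CPU_WRITE_HIT)  { act[0] = BUS_INVALIDATE; return EXCLUSIVE; }
        if (ev == CPU_WRITE_MISS) { act[0] = BUS_WRITE_MISS; return EXCLUSIVE; }
        break;
    case EXCLUSIVE:
        if (ev == CPU_READ_HIT || ev == CPU_WRITE_HIT) return EXCLUSIVE;
        if (ev == CPU_READ_MISS)  { act[0] = BUS_WRITE_BACK;      /* dirty block */
                                    act[1] = BUS_READ_MISS;  return SHARED; }
        if (ev == CPU_WRITE_MISS) { act[0] = BUS_WRITE_BACK;
                                    act[1] = BUS_WRITE_MISS; return EXCLUSIVE; }
        break;
    }
    return s;  /* events not possible in a state leave it unchanged */
}
```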

26 Snoopy-Cache State Machine-II
State machine for bus requests, for each cache block:
Shared (read only): write miss for this block → go to Invalid.
Exclusive (read/write): write miss for this block → write back the block (aborting the memory access), go to Invalid. Read miss for this block → write back the block (aborting the memory access), go to Shared.
Invalid: no action.
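Continuing the previous sketch (and reusing its blk_state_t and bus_action_t types), the bus-snoop side might look like this; again the names are illustrative, and supplying the block while the memory access is aborted is only noted in comments.

```c
/* Bus-snoop side of the same MSI protocol (sketch, same caveats as above):
 * how this cache reacts to misses placed on the bus by other processors. */
typedef enum { SNOOP_READ_MISS, SNOOP_WRITE_MISS } snoop_event_t;

static blk_state_t snoop_side(blk_state_t s, snoop_event_t ev, bus_action_t act[1]) {
    act[0] = BUS_NONE;
    switch (s) {
    case SHARED:
        if (ev == SNOOP_WRITE_MISS) return INVALID;   /* another CPU is writing */
        return SHARED;                                /* a remote read miss: no change */
    case EXCLUSIVE:
        /* This cache holds the dirty copy: write it back (the memory access
         * is aborted so the requester gets the fresh data). */
        if (ev == SNOOP_READ_MISS)  { act[0] = BUS_WRITE_BACK; return SHARED; }
        if (ev == SNOOP_WRITE_MISS) { act[0] = BUS_WRITE_BACK; return INVALID; }
        break;
    case INVALID:
        break;                                        /* nothing to do */
    }
    return s;
}
```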

27 Snoopy-Cache State Machine-III
State machine for both CPU requests and bus requests, for each cache block; this combines the two previous diagrams:
Invalid: CPU read → place read miss on bus, go to Shared. CPU write → place write miss on bus, go to Exclusive.
Shared (read only): CPU read hit → no action. CPU read miss → place read miss on bus, remain Shared. CPU write → place write miss or invalidate on bus, go to Exclusive. Write miss for this block (from the bus) → go to Invalid.
Exclusive (read/write): CPU read hit or CPU write hit → no action. CPU read miss → write back the block, place read miss on bus, go to Shared. CPU write miss → write back the cache block, place write miss on bus, go to Exclusive. Write miss for this block (from the bus) → write back the block (abort memory access), go to Invalid. Read miss for this block (from the bus) → write back the block (abort memory access), go to Shared.

28 Example
[Table with columns Processor 1, Processor 2, Bus, Memory; the step-by-step entries are not recoverable from the transcript.] Assumes the initial cache state is Invalid and that A1 and A2 map to the same cache block (frame), but A1 != A2.

29-33 Example: Steps 1-5
[Each of these slides highlights one transition in the table and state diagram above; the per-step actions are not recoverable from the transcript. All steps assume the initial cache state is Invalid and that A1 and A2 map to the same cache frame, but A1 != A2; in step 5, A2 replaces A1.]

34 Snoopy Coherence Protocols
Complications for the basic MSI protocol: operations are not atomic (e.g. detect the miss, acquire the bus, receive a response), which creates the possibility of deadlock and races. One solution: the processor that sends an invalidate holds the bus until the other processors receive the invalidate, which hurts performance. Extensions: add an Exclusive state (splitting the old state into M and E) to indicate a clean block held in only one cache (the MESI protocol); this avoids having to send an invalidate when writing such a block. An Owned state (O) can also be added, in which one cache owns the block (MOESI). → Can you extend the MSI state transitions to MESI?
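A hedged sketch of how the extra clean-exclusive state changes behavior (my own illustration; the helper names and the "another cache has a copy" signal are assumptions): a read miss with no other sharers installs the block in E, and a later write hit in E needs no bus invalidate.

```c
/* MESI sketch: the clean-exclusive state avoids invalidate traffic when a
 * block is written by the only cache that holds it. Names are illustrative. */
typedef enum { M_INVALID, M_SHARED, M_EXCLUSIVE_CLEAN, M_MODIFIED } mesi_state_t;

/* On a read miss: if no other cache asserted "shared" during the bus
 * transaction, install the block in the clean-exclusive state. */
static mesi_state_t mesi_read_miss(int other_cache_has_copy) {
    return other_cache_has_copy ? M_SHARED : M_EXCLUSIVE_CLEAN;
}

/* On a write hit: only a block in S needs a bus invalidate; a block in the
 * clean-exclusive state can move to M silently. */
static mesi_state_t mesi_write_hit(mesi_state_t s, int *need_bus_invalidate) {
    *need_bus_invalidate = (s == M_SHARED);
    return M_MODIFIED;
}
```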

35 MESI Coherence Protocols
[Figure: MESI state-transition diagram, annotated with where a bus request is needed, where there is no global traffic, and where no writeback is required.] Advantages?

36 Implementation Complications
Write races: a cache cannot update its block until the bus is obtained, otherwise another processor might get the bus first and write the same cache block! Two-step process: (1) arbitrate for the bus, (2) place the miss on the bus and complete the operation. If a miss occurs to the block while waiting for the bus, handle that miss (an invalidate may be needed) and then restart. Split-transaction bus: a bus transaction is not atomic, so there can be multiple outstanding transactions for a block; multiple misses can interleave, allowing two caches to grab the block in the Exclusive state. Must track and prevent multiple outstanding misses for one block, and must support interventions and invalidations. Many protocol options exist to reduce coherence activity (e.g. MESI, MOESI).

37 Implementing Snooping Caches
Multiple processors must share the bus, with access to both addresses and data. A few new bus commands are added to perform coherency, in addition to read and write. Processors continuously snoop on the address bus; if an address matches a tag, either invalidate or update the copy. Since every bus transaction checks the cache tags, snooping could interfere with CPU cache accesses. Solution 1: duplicate the set of tags for the L1 caches just to allow checks in parallel with the CPU (a shadow of the L1 tags). Solution 2: use the L2 cache as the duplicate (unless the hierarchy is exclusive), provided L2 obeys inclusion with L1 (i.e. every L1 block MUST also reside in L2); the block size and associativity of L2 then constrain L1.

38 Review: Inclusive vs. Exclusive L1/L2
Inclusive: all L1 blocks must also be in L2. This simplifies cache coherence: a snoop that misses in L2 can bypass L1. Effective capacity = L2. Requires back-invalidation to remove the L1 copy when an L2 block is replaced, which hurts the L1 hit rate. Exclusive: L1 and L2 blocks are disjoint. Must search both L1 and L2 for coherence. Effective capacity = L1 + L2.
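A hedged sketch of the back-invalidation needed to maintain inclusion (the tiny direct-mapped L1, the indexing scheme, and the function name are illustrative assumptions, not a real cache implementation):

```c
/* Inclusion maintenance sketch: when an L2 block is replaced, any copy of
 * it in L1 must be invalidated so that every valid L1 block still resides
 * in L2. Addresses are block addresses; data movement is omitted. */
#define L1_SETS 64
typedef struct { unsigned long tag; int valid; } line_t;
static line_t l1[L1_SETS];

static void l2_evict(unsigned long block_addr) {
    line_t *l1_line = &l1[block_addr % L1_SETS];
    if (l1_line->valid && l1_line->tag == block_addr / L1_SETS)
        l1_line->valid = 0;   /* back-invalidation: may hurt the L1 hit rate */
    /* ... write the L2 block back to memory if dirty, then reuse the frame ... */
}
```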

