
Slide 1: COMP 206: Computer Architecture and Implementation
Montek Singh
Mon, Dec 5, 2005
Topic: Intro to Multiprocessors and Thread-Level Parallelism

Slide 2: Outline
- Motivation
- Multiprocessors
  - SISD, SIMD, MIMD, and MISD
  - Memory organization
  - Communication mechanisms
- Multithreading
Reading: HP3 6.1, 6.3 (snooping), and 6.9

Slide 3: Motivation
Instruction-Level Parallelism (ILP): what we have covered so far:
- simple pipelining
- dynamic scheduling: scoreboarding and Tomasulo's algorithm
- dynamic branch prediction
- multiple-issue architectures: superscalar, VLIW
- hardware-based speculation
- compiler techniques and software approaches
Bottom line: there just aren't enough instructions that can actually be executed in parallel!
- instruction issue: limit on maximum issue count
- branch prediction: imperfect
- number of registers: finite
- functional units: limited in number
- data dependencies: hard to detect dependencies via memory
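A tiny illustration (my example, not from the slides) of why ILP alone runs out: in the first loop below every iteration depends on the result of the previous one, so even a wide-issue, speculating core must execute the additions one after another, while the second loop's iterations are independent and could overlap freely.

```cpp
#include <vector>

// Serial dependence chain: each iteration needs the previous value of s,
// so extra issue slots and functional units cannot overlap the additions.
double running_sum(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x)
        s += v;              // loop-carried dependence on s
    return s;
}

// Independent iterations: in principle every element could be scaled in
// parallel, limited only by functional units, registers, and issue width.
void scale(std::vector<double>& x, double k) {
    for (double& v : x)
        v *= k;              // no dependence across iterations
}
```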

Slide 4: So, What Do We Do?
Key Idea: increase the number of running processes
- multiple processes at a given "point" in time
  - i.e., at the granularity of one (or a few) clock cycles
  - not sufficient to have multiple processes at the OS level!
Two Approaches:
- multiple CPUs, each executing a distinct process
  - "Multiprocessors" or "Parallel Architectures"
- a single CPU executing multiple processes ("threads")
  - "Multithreading" or "Thread-Level Parallelism"

Slide 5: Taxonomy of Parallel Architectures
Flynn's Classification:
- SISD: Single instruction stream, single data stream
  - uniprocessor
- SIMD: Single instruction stream, multiple data streams
  - same instruction executed by multiple processors
  - each processor has its own data memory
  - Examples: multimedia processors, vector architectures
- MISD: Multiple instruction streams, single data stream
  - successive functional units operate on the same stream of data
  - rarely found in general-purpose commercial designs
  - special-purpose stream processors (digital filters, etc.)
- MIMD: Multiple instruction streams, multiple data streams
  - each processor has its own instruction and data streams
  - most popular form of parallel processing
    - single-user: high performance for one application
    - multiprogrammed: running many tasks simultaneously (e.g., servers)
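To make the SIMD/MIMD distinction concrete from the software side, here is a minimal C++ sketch (my illustration, not part of the lecture): the first loop applies one operation across many data elements, the pattern a SIMD or vector unit exploits, while the second part launches threads that each run their own instruction stream, which is the MIMD model.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // SIMD flavor: one instruction stream, many data elements.
    // A vector/multimedia unit would apply the multiply to several
    // elements per instruction.
    std::vector<float> a(8, 2.0f), b(8, 3.0f), c(8);
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] * b[i];                  // same operation, different data

    // MIMD flavor: multiple independent instruction streams, each with
    // its own data (here, two threads doing unrelated work).
    std::thread t1([] { std::cout << "thread 1: task A\n"; });
    std::thread t2([] { std::cout << "thread 2: task B\n"; });
    t1.join();
    t2.join();

    std::cout << "c[0] = " << c[0] << "\n";
    return 0;
}
```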

Slide 6: Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
- usually few processors
- share a single memory and bus
- use large caches

Slide 7: Multiprocessor: Memory Organization
Distributed-memory multiprocessor:
- can support large processor counts
  - cost-effective way to scale memory bandwidth
  - works well if most accesses are to the local memory node
- requires an interconnection network
  - communication between processors becomes more complicated and slower

Slide 8: Multiprocessor: Hybrid Organization
- Use a distributed-memory organization at the top level
- Each node itself may be a shared-memory multiprocessor (2-8 processors)

Slide 9: Communication Mechanisms
Shared-Memory Communication:
- around for a long time, so well understood and standardized
  - memory-mapped
- ease of programming when communication patterns are complex or dynamically varying
- better use of bandwidth when items are small
- problem: cache coherence becomes harder
  - use "snoopy" and other coherence protocols
Message-Passing Communication:
- simpler hardware, because keeping caches coherent is easier
- communication is explicit, so it is simpler to understand
  - focuses programmer attention on communication
- synchronization is naturally associated with communication
  - fewer errors due to incorrect synchronization
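To contrast the two communication styles in code, here is a minimal C++ sketch (my illustration; the `Channel` type is a hypothetical name, not a standard API): in the shared-memory version the two threads communicate simply by reading and writing the same variable, relying on the hardware to keep caches coherent, while in the message-passing version all data moves through explicit send/receive operations.

```cpp
#include <atomic>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Shared-memory style: communicate by writing a location both threads can see.
std::atomic<int> shared_value{0};

// Message-passing style: communicate only through explicit send/receive.
// "Channel" is an illustrative name, not part of the C++ standard library.
struct Channel {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    void send(int v) {
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        cv.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        int v = q.front(); q.pop();
        return v;
    }
};

int main() {
    // Shared memory: the consumer sees the producer's write directly.
    std::thread producer([] { shared_value.store(42); });
    producer.join();
    std::cout << "shared-memory result: " << shared_value.load() << "\n";

    // Message passing: the value travels inside an explicit message.
    Channel ch;
    std::thread sender([&ch] { ch.send(7); });
    std::cout << "message-passing result: " << ch.receive() << "\n";
    sender.join();
    return 0;
}
```

Note that the shared-memory version still needs synchronization (here, joining the producer and using an atomic) to be correct, echoing the slide's point that in that model synchronization is the programmer's responsibility.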

Slide 10: Multithreading
Threads: multiple processes that share code and data (and much of their address space)
- recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
- fine-grain multithreading
  - switch between threads on each instruction!
- coarse-grain multithreading
  - switch to a different thread only if the current thread has a costly stall
    - e.g., switch only on a level-2 cache miss
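As a concrete software-level picture of threads sharing code and data (a hypothetical example of mine, not from the slides), the following C++ sketch launches four threads that run the same function and update one counter in a common address space:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> counter{0};   // data shared by all threads

void work(int n) {              // code shared by all threads
    for (int i = 0; i < n; ++i)
        counter.fetch_add(1);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(work, 100000);   // same code, same address space
    for (auto& th : threads)
        th.join();
    std::cout << "counter = " << counter.load() << "\n";   // prints 400000
    return 0;
}
```

Whether these software threads are interleaved on one core (fine- or coarse-grain multithreading) or spread across processors is the hardware's choice; the sharing of code and data is what makes them threads.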

Slide 11: Multithreading
Fine-grain multithreading:
- switch between threads on each instruction!
- multiple threads executed in an interleaved manner
- interleaving is usually round-robin
- CPU must be capable of switching threads on every cycle!
  - fast, frequent switches
- main disadvantage:
  - slows down the execution of individual threads
  - that is, latency is traded off for better throughput

Slide 12: Multithreading
Coarse-grain multithreading:
- switch only if the current thread has a costly stall
  - e.g., a level-2 cache miss
- can accommodate slightly costlier switches
- less likely to slow down an individual thread
  - a thread is switched "off" only when it has a costly stall
- main disadvantage:
  - limited ability to overcome throughput losses
    - shorter stalls are ignored, and there may be plenty of those
  - issues instructions from only a single thread at a time
    - every switch involves emptying and restarting the instruction pipeline

Slide 13: Simultaneous Multithreading (SMT)
Example: new Pentium with "Hyperthreading"
Key Idea: Exploit ILP across multiple threads!
- i.e., convert thread-level parallelism into more ILP
- exploit the following features of modern processors:
  - multiple functional units
    - modern processors typically have more functional units available than a single thread can utilize
  - register renaming and dynamic scheduling
    - multiple instructions from independent threads can co-exist and co-execute!
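From software's point of view, an SMT core simply exposes extra hardware thread contexts. A hypothetical C++ sketch of sizing a worker pool to whatever count the hardware reports (e.g., twice the physical core count on a Hyper-Threading Pentium) might look like this; the worker bodies are placeholders:

```cpp
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Hardware thread contexts (physical cores x SMT threads per core).
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;   // the call may return 0 if the count is unknown

    std::cout << "hardware threads: " << hw << "\n";

    // One software thread per hardware context gives the SMT core
    // independent instruction streams to issue from in the same cycle.
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < hw; ++i)
        pool.emplace_back([i] { (void)i; /* placeholder for per-thread work */ });
    for (auto& t : pool)
        t.join();
    return 0;
}
```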

Slide 14: SMT: Illustration (Fig. 6.44 HP3)
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)
[Figure not reproduced: issue-slot diagrams for cases (a)-(d)]

Slide 15: SMT: Design Challenges
- Dealing with a large register file
  - needed to hold multiple contexts
- Maintaining low overhead on the clock cycle
  - fast instruction issue: choosing what to issue
  - instruction commit: choosing what to commit
  - keeping cache conflicts within acceptable bounds

