Thread Level Parallelism


Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading?
–here, a thread is defined as a separate stream of execution with its own instructions and data
this is unlike the traditional (OS) definition of a thread, in which the threads of one process share the same code and global data but each have their own stack and registers
–a thread may therefore be a traditional thread, a separate process, or one of several copies of a single program executing in parallel
the point is that each thread supplies independent instructions and data, so that when the processor would otherwise stall it can switch to another thread and continue executing, avoiding time-consuming stalls
–TLP thus exploits a different kind of parallelism than ILP: parallelism across instruction streams rather than within a single one

Unit III Introduction to Multithreading CS2354 Advanced Computer Architecture

Approaches to TLP We want to enhance our current processor (a superscalar with dynamic scheduling) with multithreading
Fine-grained multi-threading
–switches between threads at each clock cycle, so threads execute in an interleaved fashion
–as the processor rotates from one thread to the next, any thread that is currently stalled is skipped over
–because the CPU must be able to switch threads every clock cycle, it needs extra hardware support (at a minimum, separate thread state such as a PC per thread)
Coarse-grained multi-threading
–switches between threads only when the current thread is likely to stall for some time (e.g., on a level 2 cache miss)
–since switches are far less frequent, the switch itself can afford to be more time consuming, so little extra hardware support is needed
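The two switching policies can be contrasted with a toy simulation. The sketch below is illustrative only: the thread encoding, the `MISS_LATENCY`, and the `SWITCH_COST` values are assumptions, not measurements of any real pipeline.

```python
# Toy comparison of fine-grained vs. coarse-grained thread switching.
# Each thread is a list of tokens: 'I' = ordinary instruction,
# 'M' = an instruction that misses in the cache and stalls its thread.

MISS_LATENCY = 3   # assumed cycles a thread waits after a miss
SWITCH_COST = 2    # assumed pipeline-refill cost per coarse-grained switch

def fine_grained(threads):
    """Rotate among threads every cycle; stalled threads are skipped."""
    work = [list(t) for t in threads]
    ready_at = [0] * len(work)       # cycle at which each thread may issue again
    cycle, start = 0, 0
    while any(work):
        for k in range(len(work)):   # round-robin scan for a ready thread
            i = (start + k) % len(work)
            if work[i] and ready_at[i] <= cycle:
                if work[i].pop(0) == 'M':
                    ready_at[i] = cycle + 1 + MISS_LATENCY
                start = (i + 1) % len(work)
                break
        cycle += 1                   # a cycle passes whether or not we issued
    return cycle

def coarse_grained(threads):
    """Run one thread until it misses, then pay a switch penalty and move on."""
    work = [list(t) for t in threads]
    cycle, i = 0, 0
    while any(work):
        if not work[i]:
            i = (i + 1) % len(work)
            continue
        op = work[i].pop(0)
        cycle += 1
        if op == 'M':                # long stall: switch and refill the pipeline
            i = (i + 1) % len(work)
            cycle += SWITCH_COST
    return cycle
```

With `threads = [['I','M','I'], ['I','I','I']]`, fine-grained finishes in 7 cycles by hiding most of the miss behind the other thread, while coarse-grained pays the refill penalty and takes 8; the gap grows as misses become more frequent.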

Advantages/Disadvantages Fine-grained
–Adv: less susceptible to stalling situations
–Adv: throughput losses from stalls are largely hidden, since stalled threads are simply skipped
–Disadv: slows down the execution of each individual thread
–Disadv: requires a switching mechanism that costs no cycles – this is achievable only at the expense of more hardware (at a minimum, a PC for every thread)
Coarse-grained
–Adv: a more natural flow of execution for any given thread
–Adv: an easier switching mechanism to implement
–Adv: current processors can be adapted to implement coarse-grained MT, but not fine-grained
–Disadv: limited in its ability to overcome throughput losses from short stalls, because the cost of restarting the pipeline with a new thread is expensive (in comparison to fine-grained)

Simultaneous Multi-threading (SMT) SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multi-threading
–(a) is the traditional approach, with idle issue slots caused by stalls and a lack of ILP
–(b) and (c) are fine-grained and coarse-grained MT respectively
–(d) shows the potential payoff for SMT
–(e) goes one step further to illustrate multiprocessing

Four Approaches Superscalar on a single thread (a)
–we are limited to ILP; if we instead switch threads when one is about to stall, the switch amounts to a context switch, which takes many (dozens or hundreds of) cycles
Superscalar + coarse-grained MT (c)
–fairly easy to implement and a performance increase over no MT support, but empty issue slots remain during short stalls (as opposed to the lengthier stalls, such as cache misses, that trigger a switch)
Superscalar + fine-grained MT (b)
–requires switching between threads every cycle, which demands more complex and expensive hardware, but eliminates most stalls; the remaining problem is that a thread lacking ILP cannot fill all of the issue slots in its cycle, so the hardware is still not fully utilized
Superscalar + SMT (d)
–the most efficient use of the hardware: instructions from multiple threads share each cycle's issue slots, keeping as many functional units as possible occupied
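The difference between (a) and (d) is essentially issue-slot utilization. A minimal sketch, assuming a 4-wide machine and hypothetical per-cycle instruction "offers" limited by each thread's ILP:

```python
# Toy issue-slot model for a 4-wide superscalar. Each thread "offers"
# some number of issuable instructions per cycle, capped by its own ILP;
# the offer sequences below are hypothetical.

WIDTH = 4

def single_thread(offers):
    """(a) one thread: unused slots in a cycle are simply wasted."""
    used = sum(min(o, WIDTH) for o in offers)
    return used / (len(offers) * WIDTH)   # fraction of slots filled

def smt(per_thread_offers):
    """(d) SMT: slots left empty by one thread are filled by the others."""
    cycles = len(per_thread_offers[0])
    used = 0
    for c in range(cycles):
        avail = WIDTH
        for offers in per_thread_offers:
            take = min(offers[c], avail)   # fill leftover slots in order
            used += take
            avail -= take
    return used / (cycles * WIDTH)
```

For example, a thread offering `[2, 0, 3, 1]` fills only 6 of 16 slots alone, but paired with a second thread offering `[1, 4, 0, 2]` the machine fills 13 of 16.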

Superscalar Limitations for SMT In spite of the performance increase from combining our superscalar hardware with SMT, there are still inherent limitations
–how many active threads can be considered at one time?
we are limited by resources such as the number of PCs available to track each thread, the bus bandwidth needed for multiple threads to fetch instructions at the same time, how many threads can be resident in main memory, etc.
–finite sizes of the buffers that support the superscalar: reorder buffer, instruction queue, issue buffer
–limited bandwidth between CPU and cache/memory
–limits on the combination of instructions that can be issued at the same time
consider four threads, each containing an abnormally large number of FP multiplies but no FP adds: the multiplier functional unit(s) will be very busy while the adder remains idle
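The functional-unit mix problem in the last bullet can be made concrete with a little arithmetic. In the sketch below the op names, the unit counts, and the thread mixes are all hypothetical:

```python
from collections import Counter

def finish_time(threads, units):
    """Cycles to drain all queued ops if each unit type can start one op
    per cycle per unit (no other pipeline constraints modeled)."""
    demand = Counter(op for t in threads for op in t)
    return max(-(-demand[u] // n) for u, n in units.items())  # ceil division

units = {'fpmul': 1, 'fpadd': 1}

# Four threads of nothing but FP multiplies: the adder never gets work,
# so adding more threads cannot raise its utilization.
skewed = [['fpmul'] * 8 for _ in range(4)]

# The same total instruction count with a balanced mix drains twice as fast.
balanced = [['fpmul'] * 4 + ['fpadd'] * 4 for _ in range(4)]
```

Here the skewed mix takes 32 cycles with the adder idle throughout, while the balanced mix finishes in 16: instruction mix, not thread count, is the limiter.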

SMT Design Challenges Superscalars perform best with long pipelines, and SMT adds further demands. We will implement SMT only on top of fine-grained MT, so we need
–a large register file to accommodate multiple threads
–a per-thread renaming table and more registers for renaming
–separate PCs for each thread
–the ability to commit instructions from multiple threads in the same cycle
–added logic that does not force an increase in clock cycle time
–cache and TLB designs that can handle simultaneous access by multiple threads without degrading their performance (miss rate, hit time)
In spite of the design challenges, we will find that performance on each individual thread decreases (this is natural, since every thread is interrupted as the CPU switches among threads cycle by cycle)
One alternative strategy is to designate a "preferred" thread whose instructions are issued every cycle whenever possible
–the issue slots it does not use are filled by the other threads
–if the preferred thread reaches a substantial stall, the other threads fill in until the stall ends
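The "preferred thread" policy above can be sketched as a per-cycle slot-filling rule. The 4-slot width and the argument convention (each value is how many instructions that thread can issue this cycle, 0 if stalled) are assumptions for illustration:

```python
WIDTH = 4   # assumed issue width

def issue_cycle(preferred, others):
    """One cycle of a 'preferred thread' policy: the preferred thread takes
    as many issue slots as it can use; leftover slots go to the other
    threads in order. Returns (label, slots-granted) pairs."""
    slots = WIDTH
    take = min(preferred, slots)
    schedule = [('preferred', take)]
    slots -= take
    for i, avail in enumerate(others):
        take = min(avail, slots)     # fill remaining slots from alternates
        if take:
            schedule.append((f'thread{i}', take))
        slots -= take
    return schedule
```

For instance, `issue_cycle(2, [3, 1])` grants 2 slots to the preferred thread and the 2 leftovers to thread 0, while `issue_cycle(0, [3, 3])` (preferred thread stalled) lets the alternates fill all 4 slots.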

SMT Example Design The IBM Power5 was built on top of the Power4 pipeline, but the Power5 implements SMT, making simple design choices wherever possible:
–increase the associativity of the L1 instruction cache and the TLB to offset the impact of multiple threads accessing them
–add per-thread load/store queues
–increase the size of the L2 and L3 caches so that more threads can be represented in those caches
–add separate instruction prefetch and buffering hardware
–increase the number of virtual registers available for renaming
–increase the size of the instruction issue queues
The cost of these enhancements is not extreme (although it does take up more space on the chip) – are the performance payoffs worthwhile?

Performance Improvement of SMT As it turns out, the gains of SMT over a single-threaded processor are only modest
–in part this is because multi-issue processors have not increased their issue width in recent years – to best take advantage of SMT, issue width should grow from perhaps 4 to 8 or more, but this is not practical
The Pentium IV Extreme (= Pentium IV + SMT support) showed improvements of
–1.01 and 1.07 on the SPEC int and SPEC FP benchmarks respectively over the Pentium IV
When running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from
–0.90 to 1.58, with an average improvement of 1.20
Conclusions
–SMT has benefits, but the costs do not necessarily pay for the improvement
–another option: use multiple CPU cores on a single processor (see (e) from the figure on slide 4)
–another factor discussed in the text (but skipped here) is the increasing power consumption as we continue to add support for ILP/TLP/SMT
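To put the reported ratios in perspective: a speedup of 1.07 means the SMT run takes roughly 93% of the original time, and a set of per-pair speedups is conventionally summarized with a geometric mean. The list of pair speedups below is hypothetical, chosen only to span the reported 0.90-1.58 range:

```python
import math

def geometric_mean(xs):
    """The conventional way to average a set of speedup ratios."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# A speedup of 1.07 shrinks run time to 1/1.07 of the original:
time_fraction = 1 / 1.07                       # ~0.935, a ~6.5% reduction

# Hypothetical per-pair SMT speedups spanning the reported range:
pair_speedups = [0.90, 1.05, 1.15, 1.25, 1.40, 1.58]
mean_speedup = geometric_mean(pair_speedups)   # ~1.20
```

Note the 0.90 entry: some benchmark pairs actually run slower under SMT, which is part of why the averaged gain stays modest.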

Advanced Multi-Issue Processors Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors

Processor           Architecture                                                  Fetch/Issue/Execute   Functional Units   Clock Rate (GHz)
Pentium 4 Extreme   speculative, dynamically scheduled, deeply pipelined, SMT     3/3/4                 7 int, 1 FP        3.8
AMD Athlon 64       speculative, dynamically scheduled                            3/3/4                 6 int, 3 FP        2.8
IBM Power 5         speculative, dynamically scheduled, SMT, 2 CPU cores/chip     8/4/8                 6 int, 2 FP        1.9
Itanium 2           EPIC style (see appendix G), primarily statically scheduled   6/5/11                9 int, 2 FP        1.6

Comparison on Integer Benchmarks

Comparison on FP Benchmarks