Thread Level Parallelism — Presentation transcript


1 Thread Level Parallelism
Since ILP has inherent limitations, can we exploit multithreading?
– a thread here is a separate process with its own instructions and data
– this is unlike the traditional (OS) definition of a thread, which shares instructions with other threads while each has its own stack and data (in that case, threads are multiple versions of the same process)
– a thread may be a traditional thread, a separate process, or a single program executing in parallel
– the idea is that each thread offers different instructions and data, so when the processor would otherwise stall, it can switch to another thread and continue execution, avoiding time-consuming stalls
– TLP exploits a different kind of parallelism than ILP
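As a rough illustration (not from the slides), the OS-style definition of a thread can be seen in any threading library: the threads below run the same function (shared code and shared globals), but each gets its own stack, so `local_count` is private per thread. All names here are hypothetical.

```python
import threading

shared_total = 0           # shared data: visible to all threads
lock = threading.Lock()

def worker(n):
    global shared_total
    local_count = 0        # lives on this thread's own stack
    for _ in range(n):
        local_count += 1
    with lock:             # shared data needs synchronization
        shared_total += local_count

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_total)  # 4000: all four threads updated the shared variable
```

A hardware thread in the TLP sense is heavier: it carries its own PC, register state, and (in the slides' definition) even its own instructions and data.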

2 Unit III: Introduction to Multithreading (CS2354 Advanced Computer Architecture)

3 Approaches to TLP
We want to enhance our current processor – a superscalar with dynamic scheduling
Fine-grained multithreading
– switches between threads at each clock cycle, so threads execute in an interleaved fashion
– as the processor rotates from one thread to the next, a thread that is currently stalled is skipped over
– the CPU must be able to switch threads every clock cycle, which requires extra hardware support
Coarse-grained multithreading
– switches between threads only when the current thread is likely to stall for some time (e.g., a level 2 cache miss)
– the switching process can be more time consuming since we switch far less often, and therefore it does not need extra hardware support
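The two policies can be contrasted with a toy cycle-by-cycle model (my own sketch, not from the slides; the stall patterns and the 2-cycle switch penalty are invented for illustration):

```python
def fine_grained(threads, cycles):
    """Round-robin every cycle; a stalled thread is skipped over."""
    issued, i = 0, 0
    for c in range(cycles):
        for _ in range(len(threads)):       # try each thread once this cycle
            t = threads[i % len(threads)]
            i += 1                          # rotate regardless of outcome
            if not t(c):                    # t(c) -> True if t stalls in cycle c
                issued += 1
                break
    return issued

def coarse_grained(threads, cycles, switch_penalty=2):
    """Stay on one thread until it stalls; pay a pipeline-refill penalty to switch."""
    issued, cur, penalty = 0, 0, 0
    for c in range(cycles):
        if penalty:
            penalty -= 1                    # refilling the pipeline after a switch
            continue
        if threads[cur](c):                 # current thread stalls: switch away
            cur = (cur + 1) % len(threads)
            penalty = switch_penalty
        else:
            issued += 1
    return issued

# Example: thread A stalls every 4th cycle, thread B never stalls
tA = lambda c: c % 4 == 0
tB = lambda c: False
print(fine_grained([tA, tB], 20), coarse_grained([tA, tB], 20))  # 20 17
```

Fine-grained hides every short stall by issuing from the other thread; coarse-grained loses cycles to the switch penalty each time the running thread stalls, matching the trade-off described above.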

4 Advantages/Disadvantages
Fine-grained
– Adv: less susceptible to stalling situations
– Adv: throughput losses from short stalls are largely hidden, since stalled threads are simply skipped
– Disadv: slows down the execution of each individual thread
– Disadv: requires a switching process that costs no cycles – this is achieved at the expense of extra hardware (at a minimum, a PC for every thread)
Coarse-grained
– Adv: more natural flow for any given thread
– Adv: easier to implement the switching process
– Adv: can be built on current processors, whereas fine-grained cannot
– Disadv: limited ability to overcome throughput losses from short stalls, because the cost of restarting the pipeline on a new thread is expensive (compared to fine-grained)

5 Simultaneous Multi-threading (SMT)
SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multithreading
– (a) is the traditional approach, with idle slots caused by stalls and a lack of ILP
– (b) and (c) are fine-grained and coarse-grained MT respectively
– (d) shows the potential payoff for SMT
– (e) goes one step further to illustrate multiprocessing

6 Four Approaches
Superscalar on a single thread (a)
– we are limited to ILP; if we instead switch threads when one is about to stall, the switch amounts to a context switch, which takes many (dozens or hundreds of) cycles
Superscalar + coarse-grained MT (c)
– fairly easy to implement and a performance increase over no MT support, but still contains empty issue slots due to short stalls (as opposed to the lengthier stalls associated with cache misses)
Superscalar + fine-grained MT (b)
– requires switching between threads each cycle, which demands more complex and expensive hardware, but eliminates most stalls; the remaining problem is that a thread lacking ILP, or unable to fill all issue slots, will not take full advantage of the hardware
Superscalar + SMT (d)
– the most efficient way to combine the hardware and multithreading, keeping as many functional units as possible occupied

7 Superscalar Limitations for SMT
In spite of the performance increase from combining superscalar hardware with SMT, there are still inherent limitations
– how many active threads can be considered at one time?
– we are limited by resources such as the number of PCs available to track each thread, the bus width needed to accommodate simultaneous instruction fetches from multiple threads, how many threads can be stored in main memory, etc.
– finite buffers that support the superscalar: the reorder buffer, instruction queue, and issue buffer
– limited bandwidth between the CPU and cache/memory
– limits on the combination of instructions that can be issued at the same time
– consider four threads, each with an abnormally large number of FP multiplies but no FP adds: the multiplier functional unit(s) will be very busy while the adder sits idle
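The FP-multiply example can be quantified with a small utilization calculation (a hypothetical model; the instruction mix and unit counts below are invented for illustration):

```python
def unit_utilization(instr_mix, units, cycles):
    """instr_mix: dict mapping unit -> instructions demanding it per cycle.
       units: dict mapping unit -> number of copies of that functional unit.
       Returns the fraction of each unit's capacity that gets used."""
    util = {}
    for u, demand in instr_mix.items():
        capacity = units[u] * cycles
        util[u] = min(demand * cycles, capacity) / capacity
    return util

mix = {"fp_mul": 3, "fp_add": 0}   # four threads, almost all FP multiplies
hw  = {"fp_mul": 1, "fp_add": 1}   # one multiplier, one adder
print(unit_utilization(mix, hw, 100))  # fp_mul saturated (1.0), fp_add idle (0.0)
```

No matter how many threads SMT can choose from, a skewed instruction mix leaves some functional units idle while others become the bottleneck.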

8 SMT Design Challenges
Superscalars perform best with lengthier pipelines
We will only implement SMT using fine-grained MT, so we need
– a large register file to accommodate multiple threads
– a per-thread renaming table and more registers for renaming
– separate PCs for each thread
– the ability to commit instructions from multiple threads in the same cycle
– added logic that does not force an increase in clock cycle time
– cache and TLB setups that can handle simultaneous thread access without degrading their performance (miss rate, hit time)
In spite of the design challenges, we will find
– performance on each individual thread decreases (naturally, since every thread is interrupted as the CPU switches among threads cycle by cycle)
One alternative strategy is to have a "preferred" thread from which instructions are issued every cycle whenever possible
– the functional unit slots it does not use are filled by alternate threads
– if the preferred thread reaches a substantial stall, other threads fill in until the stall ends
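The "preferred thread" policy amounts to a priority fill of the issue slots. A minimal sketch (my own, not from the slides; the slot counts and readiness values are hypothetical):

```python
def issue_cycle(slots, ready, preferred=0):
    """ready[i] = number of issuable instructions thread i has this cycle.
       Fill up to `slots` issue slots from the preferred thread first,
       then let the other threads take the leftovers.
       Returns a list of (thread, count) pairs."""
    order = [preferred] + [t for t in range(len(ready)) if t != preferred]
    picks, left = [], slots
    for t in order:
        take = min(ready[t], left)
        if take:
            picks.append((t, take))
            left -= take
        if left == 0:
            break
    return picks

# 4-wide issue: preferred thread 0 supplies 2 instructions, thread 1 fills the rest
print(issue_cycle(4, [2, 3, 1]))   # [(0, 2), (1, 2)]
# if the preferred thread is stalled (0 ready), the others take over entirely
print(issue_cycle(4, [0, 3, 1]))   # [(1, 3), (2, 1)]
```

The second call shows the fallback behavior described above: when the preferred thread stalls, alternate threads fill the issue slots until it recovers.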

9 SMT Example Design
The IBM Power5 was built on top of the Power4 pipeline, but the Power5 implements SMT
– simple design choices wherever possible:
– increase the associativity of the L1 instruction cache and TLB to offset the impact of multithreaded access to the cache and TLB
– add per-thread load/store queues
– increase the size of the L2 and L3 caches to permit more threads to be represented in these caches
– add separate instruction prefetch and buffering hardware
– increase the number of virtual registers for renaming
– increase the size of the instruction issue queues
– the cost of these enhancements is not extreme (although they take up more space on the chip) – are the performance payoffs worthwhile?

10 Performance Improvement of SMT
As it turns out, the gains of SMT over a single-threaded processor are only modest
– in part this is because multi-issue processors have not increased their issue width in recent years; to best exploit SMT, issue width should grow from around 4 to 8 or more, but this is not practical
The Pentium 4 Extreme (Pentium 4 + SMT support) showed improvements of
– 1.01 and 1.07 over the Pentium 4 on the SPEC int and SPEC FP benchmarks respectively
When running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from
– 0.90 to 1.58, with an average improvement of 1.20
Conclusions
– SMT has benefits, but the costs do not necessarily pay for the improvement
– another option: use multiple CPU cores on a single processor (see (e) in the figure on slide 5)
– another factor discussed in the text (but skipped here) is the growing demand for power as we continue to add support for ILP/TLP/SMT

11 Advanced Multi-Issue Processors
Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors

Processor         | Architecture                                                | Fetch/Issue/Execute | Functional Units | Clock Rate (GHz)
Pentium 4 Extreme | speculative, dynamically scheduled, deeply pipelined, SMT   | 3/3/4               | 7 int, 1 FP      | 3.8
AMD Athlon 64     | speculative, dynamically scheduled                          | 3/3/4               | 6 int, 3 FP      | 2.8
IBM Power 5       | speculative, dynamically scheduled, SMT, 2 CPU cores/chip   | 8/4/8               | 6 int, 2 FP      | 1.9
Itanium 2         | EPIC style (see appendix G), primarily statically scheduled | 6/5/11              | 9 int, 2 FP      | 1.6

12 Comparison on Integer Benchmarks

13 Comparison on FP Benchmarks

