
Simultaneous Multithreading: Maximizing On-Chip Parallelism
Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of Washington, Seattle


1 Simultaneous Multithreading: Maximizing On-Chip Parallelism
Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of Washington, Seattle
Proceedings of ISCA '95, Italy
Presented by: Amit Gaur

2 Overview
- Instruction-level parallelism vs. thread-level parallelism
- Motivation
- Simulation environment and workload
- Simultaneous multithreading models
- Performance analysis
- Extensions in design
- Single-chip multiprocessing
- Summary
- Current implementations
- Retrospective

3 Instruction Level Parallelism
- Exploited by superscalar processors
- Shortcomings: (a) instruction dependencies; (b) long latencies within a single thread

4 Thread Level Parallelism
- Traditional multithreaded architectures exploit parallelism at the application level
- Multiple threads provide inherent parallelism
- Attacks vertical waste: memory and functional-unit latencies
- E.g. server applications, online transaction processing, web services

5 Need for Simultaneous Multithreading
- Attack vertical as well as horizontal waste
- Fetch instructions from multiple threads each cycle
- Exploit all available parallelism for full utilization of execution resources
- Decrease in wasted issue slots
- Compared against a superscalar, a fine-grain multithreaded processor, and single-chip multiple-issue multiprocessors
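The vertical/horizontal distinction can be made concrete with a small sketch: vertical waste is whole cycles in which nothing issues, horizontal waste is unused slots in partially filled cycles. The issue width and per-cycle issue counts below are illustrative, not the paper's measured data.

```python
# Classify wasted issue slots in a short, made-up issue trace.
ISSUE_WIDTH = 4
issued_per_cycle = [0, 2, 4, 0, 1, 0, 3, 4]  # instructions issued each cycle

# Vertical waste: all slots of a cycle with zero issues.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# Horizontal waste: leftover slots in partially filled cycles.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)
total_slots = ISSUE_WIDTH * len(issued_per_cycle)

print(f"vertical waste:   {vertical / total_slots:.1%}")
print(f"horizontal waste: {horizontal / total_slots:.1%}")
```

Coarse-grain and fine-grain multithreading can only recover the vertical component; only issuing from several threads in the same cycle attacks the horizontal component as well.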

6 Simulation Environment
- Emulation-based instruction-level simulation
- Modeled on the Alpha AXP 21164, extended for wide superscalar and multithreaded execution
- Support for increased single-stream parallelism: more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches
- Code generated with the Multiflow trace-scheduling compiler (static scheduling)

7 Simulation Environment (Continued)
- 10 functional units (4 integer, 2 floating point, 3 load/store, 1 branch), all pipelined
- In-order issue of dependence-free instructions from an 8-instruction-per-thread window
- L1 and L2 caches are on-chip
- 2048-entry branch prediction table with 2-bit history counters
- Support for up to 8 hardware contexts
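The 2-bit history counters behind a prediction table like the one modeled here follow a standard saturating-counter scheme, sketched below. The table indexing (a simple modulo of the PC) is a simplification for illustration; the simulator's actual hashing is not described on the slide.

```python
# Sketch of a 2048-entry branch history table with 2-bit saturating
# counters: states 0-1 predict not-taken, states 2-3 predict taken.
TABLE_SIZE = 2048
table = [1] * TABLE_SIZE  # initialize every counter to weakly not-taken

def predict(pc: int) -> bool:
    """Predict taken iff the counter is in a 'taken' state (2 or 3)."""
    return table[pc % TABLE_SIZE] >= 2

def update(pc: int, taken: bool) -> None:
    """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
    i = pc % TABLE_SIZE
    if taken:
        table[i] = min(3, table[i] + 1)
    else:
        table[i] = max(0, table[i] - 1)
```

The 2-bit hysteresis means a loop-closing branch that is almost always taken survives one not-taken outcome (the loop exit) without flipping its prediction.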

8 Workload Specifications
- SPEC92 benchmark suite simulated
- To obtain TLP, a distinct program is allocated to each thread: a parallel workload based on multiprogramming
- For each benchmark, the executable with the lowest single-thread execution time is used

9 Limitations of Superscalar Processors

10 Superscalar Performance Degradation
- Many delaying causes overlap
- Completely eliminating any one cause will not yield a large performance increase
- 61% of wasted issue slots are vertical waste, 39% are horizontal waste
- Simultaneous multithreading tackles both

11 Simultaneous Multithreading Models
- Fine-Grain Multithreading: one thread issues instructions each cycle
- SM: Full Simultaneous Issue: all eight threads compete for every issue slot, every cycle (maximum flexibility)
- SM: Single Issue, SM: Dual Issue, SM: Four Issue: limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle
- SM: Limited Connection: each hardware context is connected to exactly one of each type of functional unit (the least dynamic of the models)
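The difference between the two extremes, fine-grain multithreading and full simultaneous issue, can be sketched as an issue-slot allocation policy. The per-thread ready-instruction counts below are made up for illustration.

```python
# Toy issue-slot allocation for an 8-wide machine with 8 threads.
ISSUE_WIDTH = 8

def fine_grain(ready: list[int], cycle: int) -> int:
    """Fine-grain MT: only one (round-robin-selected) thread issues per cycle."""
    t = cycle % len(ready)
    return min(ready[t], ISSUE_WIDTH)

def full_simultaneous(ready: list[int]) -> int:
    """Full simultaneous issue: all threads fill slots until width is exhausted."""
    slots = ISSUE_WIDTH
    issued = 0
    for r in ready:
        take = min(r, slots)
        issued += take
        slots -= take
    return issued

ready = [3, 1, 0, 2, 4, 0, 1, 2]  # ready instructions in each of 8 threads
print(fine_grain(ready, cycle=0))   # issues 3: thread 0's parallelism only
print(full_simultaneous(ready))     # issues 8: slots filled across threads
```

Fine-grain MT is bounded by the ILP of whichever thread owns the cycle (vertical waste is gone, horizontal waste remains), while simultaneous issue fills the remaining slots from other threads.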

12 Hardware Complexities of Models

13 Design Challenges in SMT Processors
- Issue-slot usage is limited by imbalances between resource needs and resource availability: the number of active threads, limits on buffer sizes, and the instruction mix from the threads
- Hardware complexity: must implement superscalar issue alongside thread-level parallelism
- Giving priority to particular threads can reduce throughput, since the pipeline is then less likely to contain an instruction mix from different threads
- Mixing many threads also compromises the performance of individual threads
- Tradeoff: a small number of active threads, and an even smaller number of preferred threads

14 From Superscalar to SMT
- SMT is an out-of-order superscalar extended with hardware to support multiple threads
- Multiple-thread support: (a) per-thread program counters; (b) per-thread return stacks; (c) per-thread bookkeeping for instruction retirement, traps, and instruction dispatch from the prefetch queue; (d) thread identifiers, e.g. on BTB and TLB entries
- Should SMT processors speculate? Determine the role of instruction speculation in SMT

15 Instruction Speculation
- Speculation executes 'probable' instructions to hide branch latencies
- The processor fetches along a hardware-based prediction: keep going on a correct prediction, roll back on an incorrect one
- SMT has two ways to deal with branch-delay stalls: (a) speculation; (b) fetching/issuing from other threads
- SMT and speculation: speculation can be wasteful on SMT, since one thread's speculative instructions compete with, and can displace, another thread's non-speculative instructions

16 Performance Evaluation of SMT

17 Performance Evaluation (Contd.)
- Fine-grain MT: maximum speedup is 2.1; no further gain from vertical-waste reduction beyond 4 threads
- SMT models: speedup ranges from 3.5 to 4.2, with the issue rate reaching 6.3 IPC
- The four-issue model gets nearly the same performance as full issue; dual issue reaches 94% of full issue at 8 threads
- As the ratio of threads to issue slots grows, the restricted models close the gap: a tradeoff between the number of hardware contexts and hardware complexity
- Competition for shared resources has adverse effects: the lowest-priority thread runs slowest
- More strain on the caches due to reduced locality: I- and D-cache misses increase
- Overall, instruction throughput increases
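A quick sanity check on the figures above: if the best SMT model reaches 6.3 IPC at a 4.2x speedup, the implied single-threaded baseline is about 1.5 IPC, consistent with the heavy vertical and horizontal waste reported for the superscalar.

```python
# Back-of-the-envelope check using the slide's own numbers.
smt_ipc = 6.3      # peak issue rate of the best SMT model
speedup = 4.2      # its speedup over the single-threaded baseline
baseline_ipc = smt_ipc / speedup
print(f"implied single-thread baseline: {baseline_ipc:.1f} IPC")
```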

18 Extensions: Alternative Cache Designs for SMT
- Comparison of private per-thread L1 caches with L1 caches shared across threads, for both instructions and data
- Shared caches are optimized for a small number of threads
- The shared D-cache outperforms private D-caches in all configurations; private I-caches perform better at high thread counts

19 Speculation in SMT

20 SMT vs. Single-Chip Multiprocessing
- Similarities: multiple register sets, multiple functional units, and the need for high issue bandwidth on a single chip
- Differences: a multiprocessor partitions its resources statically, while an SM processor allows resource allocation to change every cycle
- The same memory configuration is used for both: (a) 8 KB private I-cache and D-cache; (b) 256 KB 4-way set-associative L2 cache; (c) 2 MB direct-mapped L3 cache
- The tests are deliberately biased in favor of the MP
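The static-vs-dynamic distinction is the crux of the comparison, and a toy model shows why it matters. Assume, purely for illustration, a 2-processor MP that statically splits 8 issue slots (4 per core) against an SMT core sharing all 8 dynamically between the same two threads.

```python
# Toy contrast: static partitioning (MP) vs. dynamic sharing (SMT)
# of 8 issue slots between two threads with uneven parallelism.

def mp_issue(ready_a: int, ready_b: int, slots_per_core: int = 4) -> int:
    """Each core can only spend its own fixed share of the slots."""
    return min(ready_a, slots_per_core) + min(ready_b, slots_per_core)

def smt_issue(ready_a: int, ready_b: int, total_slots: int = 8) -> int:
    """Slots flow to whichever thread has ready work this cycle."""
    issued = min(ready_a, total_slots)
    issued += min(ready_b, total_slots - issued)
    return issued

print(mp_issue(7, 1))   # thread A's surplus parallelism goes unused
print(smt_issue(7, 1))  # the same slots absorb all of A's parallelism
```

When the threads' instantaneous parallelism is uneven, which is the common case, the statically partitioned machine strands slots that the SMT machine reallocates every cycle.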

21 Test Results

22 Test Results (Contd.)
- Tests A, B, C: a high ratio of functional units and threads to issue bandwidth gives greater opportunity to utilize the issue bandwidth
- Test D repeats A, but the SMT processor has only 10 FUs; it still outperforms the multiprocessor
- Tests E and F: the MP is allowed greater issue bandwidth, yet the SMT processor still performs better
- Test G: both have 8 FUs and 8 issues per cycle, but the SMT processor has 8 contexts while the multiprocessor has 2 processors (2 register sets); the SMT processor delivers 2.5x the performance

23 Summary
- Simultaneous multithreading combines the facilities of superscalar and multithreaded architectures
- It boosts resource utilization by dynamically scheduling functional units among multiple threads
- Several SMT models were compared with wide superscalar, fine-grain multithreaded, and single-chip multiple-issue multiprocessing architectures
- The simulation results show that: (a) a properly configured simultaneous multithreaded architecture can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width; (b) simultaneous multithreading outperforms fine-grain multithreading by a factor of 2; (c) a simultaneous multithreaded processor outperforms a multiple-issue multiprocessor given the same hardware resources

24 Commercial Machines
- MemoryLogix: an SMT processor for mobile devices
- Sun Microsystems has announced a CMP of four SMT processors
- Hyper-Threading Technology (Intel Xeon architecture)
- Clearwater Networks, a Los Gatos-based startup, was building an 8-context SMT network processor
- Compaq Computer Corp. designed a 4-context SMT processor, the Alpha 21464 (EV-8)

25 In Retrospect
- The design of the SMT architecture was influenced by earlier projects such as the Tera, MIT Alewife, and the M-Machine
- SMT differed from those projects in addressing a more complete goal: exploiting thread-level parallelism to compensate for the lack of instruction-level parallelism
- The aim was to target mainstream processor designs like the Alpha 21164


