Simultaneous Multithreading CMPE 511 BOĞAZİÇİ UNIVERSITY
AGENDA INTRODUCTION Motivation Types of Parallelism Vertical and Horizontal Wasted Slots Superscalar Processors Multithreading Simultaneous Multithreading The Idea SMT Model Issues: What to Fetch and What to Issue? Caching Performance Analysis Simulation Results Comparison Drawbacks Commercial Examples IBM POWER5 Future Tendencies
INTRODUCTION: Motivation Microprocessor Design Optimization Some Focus Areas: 1. Memory latency Increased processor speeds make memory appear further away Longer stalls possible 2. Branch Processing Mispredictions become more costly as pipeline depth increases, resulting in stalls and wasted power Predication drives increased power and larger chip area 3. Execution Unit Utilization 20-25% execution unit utilization common SMT addresses these areas!
INTRODUCTION: Motivation Improving the memory subsystem or increasing system integration is not sufficient for significant performance improvement. Solution: increase parallelism in all its available forms Combine the multiple-issue-per-cycle features of modern superscalar processors With the latency-hiding ability of multithreaded architectures
INTRODUCTION: Types of Parallelism Bit-level Wider processor datapaths (8, 16, 32, 64…) Word-level (SIMD) Vector processors Multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.) Instruction-level Pipelining Superscalar VLIW and EPIC Task and Application-levels Explicit parallel programming Multiple threads Multiple applications
INTRODUCTION: Vertical & Horizontal Wasted Slots Vertical waste is introduced when the processor issues no instructions in a cycle Horizontal waste is introduced when not all issue slots can be filled in a cycle. 61% of the wasted cycles are vertical waste.
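The two categories can be illustrated with a toy cycle-accounting sketch (the per-cycle issue counts below are made up for illustration, not measured data):

```python
# Classify wasted issue slots as vertical (empty cycle) or horizontal
# (partially filled cycle) for a hypothetical 4-issue processor.
ISSUE_WIDTH = 4
issued_per_cycle = [4, 2, 0, 0, 3, 1, 0, 4]  # toy trace, not real data

# An all-zero cycle wastes the whole issue width vertically.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# A partially filled cycle wastes only the unfilled slots horizontally.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
print(vertical, horizontal, total_slots)  # 12 12 slots vertical, 6 horizontal, of 32
```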
INTRODUCTION: Superscalar Issues multiple instructions in each cycle. Typically 4. Several functional units of the same type, e.g. ALUs Dispatcher reads instructions, decides which can run in parallel Limited by instruction dependencies and long-latency operations Exhibits both horizontal & vertical waste Low utilization even with higher-issue machines; an 8-issue machine with only 20% utilization
INTRODUCTION: Superscalar Many slots in the execution core are unused.
MULTITHREADING The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock. Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading. SMT uses both types of multithreading
MULTITHREADING
What does a processor need for Multithreading? 1. Processor must be aware of several independent states, one per thread: Program Counter Register File (and Flags) Memory 2. Either multiple resources in the processor or a fast way to switch across states
MULTITHREADING: Coarse-Grain Multithreading Switch between threads only on costly stalls This form of multithreading only hides long-latency events. Easy to implement, but operates at a coarse grain
MULTITHREADING: Coarse-Grain
MULTITHREADING: Fine-Grain Multithreading Context switches between threads on every clock cycle. Occupancy of the execution core is now much higher Hides both long- and short-latency events Vertical waste is eliminated but horizontal waste is not. If a thread has few or no operations to execute, issue slots will be wasted.
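Fine-grain switching can be sketched as a per-cycle round-robin pick among the threads that are not stalled (a minimal model, assuming a single issue slot per cycle):

```python
# Minimal sketch of fine-grain multithreading: every cycle the processor
# switches to the next thread that is not stalled (hypothetical model).
def fine_grain_schedule(ready, cycles):
    """ready[t][c] is True if thread t can issue in cycle c."""
    order = []
    n = len(ready)
    last = -1
    for c in range(cycles):
        for off in range(1, n + 1):   # round-robin starting after the last thread
            t = (last + off) % n
            if ready[t][c]:
                order.append(t)
                last = t
                break
        else:
            order.append(None)        # vertical waste: every thread is stalled

    return order

ready = [[True, False, True, True],
         [True, True, False, True]]
print(fine_grain_schedule(ready, 4))  # [0, 1, 0, 1]
```

With two threads whose stalls interleave, the core stays busy every cycle, which is exactly how vertical waste disappears.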
MULTITHREADING: Fine-Grain
Simultaneous Multithreading: Idea Combine Superscalar and Multithreading such that: 1. Issue multiple instructions per cycle – Superscalar 2. Hardware state for several programs/threads – Multithreading So: issue multiple instructions from multiple threads in each cycle
Simultaneous Multithreading: Idea
Simultaneous Multithreading: Model Extend, replicate and redesign some units of the superscalar to achieve multithreading Resources replicated State for hardware contexts (registers, PCs) Per-thread mechanisms for pipeline flushing and subroutine returns Per-thread identifiers for the branch target buffer and translation lookaside buffer
Simultaneous Multithreading: Model Resources redesigned Instruction fetch unit Processor pipeline Instruction scheduling Register renaming does not require additional hardware (same as superscalar)
Simultaneous Multithreading: Model SuperScalar Architecture
Simultaneous Multithreading: Model Block Diagram
Simultaneous Multithreading: Model Instruction Fetch Unit Takes advantage of inter-thread competition by partitioning bandwidth and fetching the threads that give the maximum local benefit 2.8 fetching Fetch 1 inst. per logical processor, for 2 threads Decode 1 thread until a branch/end of cache line, then jump to the other ICOUNT feedback Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages Small hardware addition to track queue lengths
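The ICOUNT feedback heuristic reduces to sorting threads by their front-end occupancy; a minimal sketch (thread names and counts are illustrative):

```python
# Sketch of the ICOUNT fetch policy: give fetch priority to the threads
# with the fewest instructions in the decode/rename/queue stages.
def icount_priority(in_flight, fetch_threads=2):
    """in_flight: dict thread -> instruction count in the front end.
    Returns the threads to fetch from this cycle, fewest-first."""
    ranked = sorted(in_flight, key=lambda t: in_flight[t])
    return ranked[:fetch_threads]

counts = {"T0": 12, "T1": 3, "T2": 7, "T3": 9}
print(icount_priority(counts))  # ['T1', 'T2']
```

The only hardware this implies is a small per-thread counter, incremented on fetch and decremented on issue, which matches the "small hardware addition" on the slide.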
Simultaneous Multithreading: Model Register File Each thread has 32 registers Register File: 32 * #threads + rename registers So a larger register file means a longer access time
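The sizing formula above is simple arithmetic; a sketch (the rename-pool size of 100 is an illustrative assumption, not a figure from the slide):

```python
# Register file sizing: 32 architectural registers per thread plus a
# shared pool of rename registers (pool size here is illustrative).
ARCH_REGS_PER_THREAD = 32

def regfile_size(threads, rename_regs):
    return ARCH_REGS_PER_THREAD * threads + rename_regs

print(regfile_size(1, 100))  # 132: single-threaded superscalar baseline
print(regfile_size(8, 100))  # 356: 8 hardware contexts, hence slower access
```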
Simultaneous Multithreading: Model Pipeline Format Superscalar SMT
Simultaneous Multithreading: Model Pipeline Format To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes 2-cycle reads/writes increase the branch misprediction penalty
Simultaneous Multithreading: Where to Fetch Static solutions: Round-robin Each cycle 8 instructions from 1 thread Each cycle 4 instructions from 2 threads, 2 from 4,… Each cycle 8 instructions from 2 threads, and forward as many as possible from #1, then when a long-latency instruction appears in #1 pick the rest from #2 Dynamic solutions: Check execution queues! Favour threads with minimal # of in-flight branches Favour threads with minimal # of outstanding misses Favour threads with minimal # of in-flight instructions Favour threads with instructions far from queue head
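The static round-robin schemes (8 from 1 thread, 4 from 2, 2 from 4, …) all split a fixed fetch bandwidth evenly across the threads chosen each cycle; a hypothetical sketch:

```python
# Static round-robin fetch partitioning: split an 8-slot fetch bandwidth
# evenly across a chosen number of threads each cycle (hypothetical sketch).
FETCH_WIDTH = 8

def rr_partition(thread_ids, cycle, threads_per_cycle):
    """Return (thread, slots) pairs for this cycle's fetch."""
    n = len(thread_ids)
    start = (cycle * threads_per_cycle) % n          # rotate every cycle
    chosen = [thread_ids[(start + i) % n] for i in range(threads_per_cycle)]
    slots = FETCH_WIDTH // threads_per_cycle
    return [(t, slots) for t in chosen]

threads = [0, 1, 2, 3]
print(rr_partition(threads, cycle=0, threads_per_cycle=2))  # [(0, 4), (1, 4)]
print(rr_partition(threads, cycle=1, threads_per_cycle=2))  # [(2, 4), (3, 4)]
```

The dynamic solutions differ only in replacing the rotation with a ranking over queue state, as in the ICOUNT sketch earlier.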
Simultaneous Multithreading: What to Issue Not exactly the same as in superscalars… In a superscalar: oldest is best (least speculation, more dependent ones waiting, etc.) In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads Candidate selection strategies: Oldest first Cache-hit speculated last Branch speculated last Branches first… Important result: it doesn’t matter too much!
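Two of the strategies can be written as sort orders over the ready instructions (the record fields are illustrative, not a real issue-queue format):

```python
# Issue-selection sketch: order ready instructions by two of the slide's
# strategies. Field names are illustrative; age = cycles since dispatch.
instrs = [
    {"age": 5, "branch_spec": True,  "cache_spec": False},
    {"age": 9, "branch_spec": False, "cache_spec": False},
    {"age": 2, "branch_spec": False, "cache_spec": True},
]

# Oldest first: highest age wins.
oldest_first = sorted(instrs, key=lambda i: -i["age"])
# Speculated last: non-speculative work first, then oldest-first as a tiebreak.
spec_last = sorted(instrs, key=lambda i: (i["branch_spec"], i["cache_spec"], -i["age"]))

print([i["age"] for i in oldest_first])  # [9, 5, 2]
print([i["age"] for i in spec_last])     # [9, 2, 5]
```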
Simultaneous Multithreading: Compiler Optimizations Should try to minimize cache interference Latency-hiding techniques like speculation should be enhanced Sharing optimization techniques from multiprocessors change: data sharing is now beneficial
Simultaneous Multithreading: Caching Same cache shared among threads Performance degradation due to cache sharing Possibility of cache thrashing
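Cache thrashing between threads can be demonstrated with a toy direct-mapped cache model (addresses and geometry are made up for illustration):

```python
# Toy demonstration of inter-thread cache conflict: two threads whose hot
# lines map to the same set of a direct-mapped cache evict each other.
NUM_SETS = 64

def run(addresses):
    cache = {}                           # set index -> cached address
    hits = 0
    for a in addresses:
        s = (a // 64) % NUM_SETS         # 64-byte lines, direct-mapped
        if cache.get(s) == a:
            hits += 1
        cache[s] = a                     # fill/evict on miss
    return hits

thread_a = [0x10000] * 8                  # one thread reuses a single line
thread_b = [0x10000 + NUM_SETS * 64] * 8  # same set, different line
alone = run(thread_a)
interleaved = run([x for pair in zip(thread_a, thread_b) for x in pair])
print(alone, interleaved)  # 7 hits alone, 0 hits once the threads thrash
```

A thread that hits 7 of 8 accesses alone gets zero hits once the conflicting thread shares the cache, which is the degradation the slide warns about.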
PERFORMANCE ANALYSIS Four models are selected Base machine has 10 FUs, 8-issue 1. Fine-Grain Multithreading 2. SM:Full Simultaneous Issue: Eight threads compete for each of the issue slots each cycle. 3. SM:Single Issue, SM:Dual Issue, SM:Four Issue: Limit the number of instructions each thread can issue e.g.: each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle. 4. SM:Limited Connection: Each hardware context is directly connected to exactly one of each type of functional unit.
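The per-thread issue limits of the SM models reduce to a cap applied while filling the 8 slots; a hypothetical sketch (the ready counts are invented):

```python
# Sketch of the issue-limit models: cap how many instructions each thread
# may issue per cycle, then fill up to 8 slots total (hypothetical model).
ISSUE_SLOTS = 8

def issued(ready_per_thread, per_thread_cap):
    total = 0
    for ready in ready_per_thread:
        take = min(ready, per_thread_cap, ISSUE_SLOTS - total)
        total += take
        if total == ISSUE_SLOTS:
            break
    return total

ready = [6, 3, 1, 4]        # ready instructions per thread this cycle
print(issued(ready, 8))     # SM:Full Simultaneous Issue -> 8
print(issued(ready, 2))     # SM:Dual Issue -> 2+2+1+2 = 7
print(issued(ready, 1))     # SM:Single Issue -> 4
```

Tighter caps leave slots unfilled when few threads have work, which is why the single- and dual-issue variants need more threads to reach full utilization.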
PERFORMANCE ANALYSIS
PERFORMANCE ANALYSIS: H/W COMPLEXITY
COMPARISON SMT vs. Multiprocessing Multiprocessing statically assigns functional units to threads SMT allows threads to expand into available resources
COMPARISON
DRAWBACKS Two main drawbacks 1. Single-thread performance decreases due to architectural constraints 2. Additional contexts will increase power consumption
Commercial Examples Compaq Alpha (EV8) 4T SMT Project killed June 2001 Intel Pentium IV (Xeon) 2T SMT Availability in 2002 (already there before, but not enabled) 10-30% gains expected Also known as Hyper-Threading SUN Ultra IV 2-core CMP, 2T SMT IBM POWER5 Dual processor core 8-way superscalar Simultaneous multithreaded (SMT) core: Up to 2 virtual processors per real processor 24% area growth per core for SMT
Commercial Examples: IBM POWER5
SMT added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPR/FPR rename mapper expanded to map a second set of registers (high-order address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most address/tag buses
Commercial Examples: IBM POWER5
Includes: 1. Thread Priority Mechanism: power efficiency, 8 levels 2. Dynamic Thread Switching Used if no task is ready for the second thread to run Allocates all machine resources to one thread Initiated by SW
Commercial Examples: IBM POWER5 Dormant thread wakes up on: 1. External interrupt 2. Decrementer interrupt 3. Special instruction from active thread
Future Tendencies Simultaneous & Redundantly Threaded Processors (SRT) Increase reliability with fault detection and correction. Run multiple copies of the same program simultaneously Software Pre-Execution in SMT: In some cases the data address is extremely hard to predict. Prefetching is useless Use an idle SMT thread for pre-execution. A complete software solution Speculation More techniques on speculation, e.g. Speculative Data-Driven Multithreading, Threaded Multiple Path Execution, Simultaneous Subordinate Microthreading and Thread-Level Speculation
REFERENCES “Simultaneous Multithreading: Maximizing On-Chip Parallelism” by Tullsen, Eggers and Levy, ISCA ’95. “Simultaneous Multithreading: Present Developments and Future Directions” by Miquel Peric, June 2003. “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor” by IBM, Aug 2004. “Simultaneous Multithreading: A Platform for Next-Generation Processors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen, IEEE Micro, October 1997.
Q&A THANKS!