A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors Ayose Falcón Alex Ramirez Mateo Valero HPCA-10 February 18, 2004.

Slide 2: Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
  - Instructions from different threads coexist in each processor stage
  - Resources are shared among the different threads
- But...
  - Sharing implies competition: in caches, queues, FUs, ...
  - The fetch policy decides!

Slide 3: Motivation
- SMT performance is limited by fetch performance
  - A superscalar fetch unit is not enough to feed an aggressive SMT core
  - SMT fetch is a bottleneck [Tullsen96] [Burns99]
- Straightforward solution: fetch from several threads each cycle
  a) Multiple fetch units (one per thread) -> EXPENSIVE!
  b) Shared fetch + fetch policy [Tullsen96]
     - Multiple PCs
     - Multiple branch predictions per cycle
     - Multiple I-cache accesses per cycle
- Does the performance of this fetch organization justify its complexity?

Slide 4: Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

HPCA-10A Low-Complexity, High-Performance Fetch Unit for SMT Processors 5 Branch Predictor Instruction Cache Fetching from a Single Thread (1.X)  Fine-grained, non-simultaneous sharing  Simple  similar to a superscalar fetch unit  No additional HW needed  A fetch policy is needed  Decides fetch priority among threads  Several proposals in the literature SHIFT&MASK

Slide 6: Fetching from a Single Thread (1.X)
- But... a single thread is not enough to fill the fetch BW
  - A gshare/hybrid branch predictor + BTB limits the fetch width to one basic block per cycle (6-8 instructions)
- Fetch BW is heavily underused
  - Avg 40% wasted with 1.8
  - Avg 60% wasted with 1.16
- Cycles that fully use the fetch BW
  - 31% of fetch cycles with 1.8
  - 6% of fetch cycles with 1.16
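
The wasted-bandwidth figures follow from simple arithmetic: if fetch delivers one basic block per cycle, utilization is capped by the average block size. A quick sanity check, where the average block sizes are assumptions for illustration, not measurements from the paper:

```python
# Single-thread fetch brings one basic block per cycle, so any fetch slots
# beyond the block are wasted. Block sizes of 5 and 6 instructions are
# assumed figures in the 6-8 range the slide quotes.

def wasted_fraction(fetch_width, avg_block_size):
    fetched = min(fetch_width, avg_block_size)
    return (fetch_width - fetched) / fetch_width

print(f"1.8:  {wasted_fraction(8, 5):.1%} wasted")   # in the ballpark of the slide's ~40%
print(f"1.16: {wasted_fraction(16, 6):.1%} wasted")  # in the ballpark of the slide's ~60%
```

The wider the fetch, the worse the waste: doubling the width without raising the per-cycle block supply mostly doubles the idle slots.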

Slide 7: Fetching from Multiple Threads (2.X)
- Increases fetch throughput
  - More threads -> more possibilities to fill the fetch BW
  - More fetch BW use than 1.X
- Cycles that fully use the fetch BW
  - 54% of cycles with 2.8
  - 16% of cycles with 2.16

Slide 8: Fetching from Multiple Threads (2.X)
[Figure: two-banked branch predictor and instruction cache, with replicated shift&mask logic and a merge stage]
- But... what is the additional HW cost of a 2.X fetch?
  - 2 predictions per cycle -> 2 predictor ports
  - Multibanked + multiported instruction cache
  - Replication of the SHIFT & MASK logic
  - New HW to realign and merge cache lines
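
To make the realign-and-merge step concrete, here is an illustrative software model (not the paper's hardware) of a 2.X fetch: each thread reads its own cache bank, each line is shifted/masked to start at that thread's PC offset, and the two fragments are merged into one fetch group. The function names and the even 50/50 slot split are assumptions of this sketch:

```python
# Model of the extra 2.X fetch logic: per-bank shift&mask followed by a
# merge of the two aligned fragments into a single fetch group.

def shift_and_mask(cache_line, start_offset, limit):
    """Drop instructions before the PC's offset in the line and keep
    at most `limit` of the rest (the 'mask')."""
    return cache_line[start_offset:start_offset + limit]

def merge_two_threads(line_a, off_a, line_b, off_b, fetch_width=8):
    half = fetch_width // 2  # assumed: each thread gets half the slots
    return (shift_and_mask(line_a, off_a, half)
            + shift_and_mask(line_b, off_b, half))

line0 = [f"t0_i{i}" for i in range(8)]  # 8-instruction line from bank 0
line1 = [f"t1_i{i}" for i in range(8)]  # 8-instruction line from bank 1
print(merge_two_threads(line0, 2, line1, 5))
```

Even this toy version shows where the cost comes from: two predictor ports, two bank reads, duplicated shift&mask, and an extra merge stage, all on the fetch critical path.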

Slide 9: Our Goal
- Can we take the best of both worlds?
  - The low complexity of a 1.X fetch architecture
  - plus the high performance of a 2.X fetch architecture
- That is... can a single thread provide sufficient instructions to fill the available fetch bandwidth?

Slide 10: Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

Slide 11: High-Performance Fetch Engines (I)
- We look for high performance
  - Gshare / hybrid branch predictor + BTB -> low performance
    - Limits the fetch BW to one basic block per cycle (6-8 instructions)
- We look for low complexity
  - Trace cache, Branch Target Address Cache, Collapsing Buffer, etc. -> high complexity
    - Fetch multiple basic blocks per cycle

Slide 12: High-Performance Fetch Engines (II)
- Our alternatives
  - Gskew [Michaud97] + FTB [Reinman99]
    - FTB fetch blocks are larger than basic blocks
    - 5% speedup over gshare+BTB in superscalars
  - Stream Predictor [Ramirez02]
    - Streams are larger than FTB fetch blocks
    - 11% speedup over gskew+FTB in superscalars
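
The key idea behind the stream predictor is that one prediction covers a whole stream: a sequential run of instructions spanning several basic blocks, ended by a taken branch. A heavily simplified sketch of that interface; the table layout, fields, and miss policy here are illustrative assumptions, not the actual design from [Ramirez02]:

```python
# Toy next-stream predictor: indexed by the current stream's start PC, it
# predicts the stream's length and the start PC of the next stream, so a
# single lookup can feed a wide fetch for several basic blocks.

class StreamPredictor:
    def __init__(self):
        self.table = {}  # start_pc -> (stream_length, next_start_pc)

    def update(self, start_pc, length, next_start_pc):
        """Train on a completed stream (called at branch resolution)."""
        self.table[start_pc] = (length, next_start_pc)

    def predict(self, start_pc):
        # On a table miss, guess a 1-instruction sequential step
        # (an assumed fallback policy for this sketch).
        return self.table.get(start_pc, (1, start_pc + 1))

sp = StreamPredictor()
sp.update(0x100, 12, 0x200)  # a 12-instruction stream, then jump to 0x200
print(sp.predict(0x100))
```

Because one entry describes many instructions, a stream predictor can sustain a 16-wide fetch from a single thread, which is exactly the property the 1.16 configuration relies on.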

Slide 13: Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

Slide 14: Simulation Setup
- Modified version of SMTSIM [Tullsen96]
  - Trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45KB
- Decode & rename width limited to 8 instructions
- Fetch width 8/16 inst.; fetch buffer 32 inst.

Fetch policy: ICOUNT
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2W, 8 banks
L2 cache: 1MB, 2W, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.

Slide 15: Workloads
- SPECint2000
  - Code layout optimized: Spike [Cohn97] + profile data using the train input
  - Most representative 300M-instruction trace, using the ref input
- Workloads including 2, 4, 6, and 8 threads
- Classified according to thread characteristics:
  - ILP: only ILP benchmarks
  - MEM: memory-bounded benchmarks
  - MIX: mix of ILP and MEM benchmarks

Slide 16: Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
  - ILP workloads
  - MEM & MIX workloads
  - Only for 2 & 4 threads (see paper for the rest)
- Summary & Conclusions

Slide 17: ILP Workloads - Fetch Throughput
[Figure: fetch throughput]
- With a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16
  - Stream predictor: better fetch performance than 2.8
  - Gshare+BTB / gskew+FTB: worse fetch performance than 2.8

Slide 18: ILP Workloads - 1.X (1.8) vs 2.X (2.8)
[Figure: commit throughput]
- ILP benchmarks have few memory problems and high parallelism
  - The fetch unit is the real limiting factor
  - The higher the fetch throughput, the higher the IPC

Slide 19: ILP Workloads
- So... 2.X is better than 1.X in ILP workloads
- But what about 1.2X instead of 2.X?
  - That is, 1.16 instead of 2.8
- Maintain single-thread fetch
  - Cache lines and buses are already 16 instructions wide
  - We only have to modify the HW to select 16 instead of 8 instructions

Slide 20: ILP Workloads - 2.X (2.8) vs 1.2X (1.16)
[Figure: commit throughput]
- With 1.16, the stream predictor increases throughput (9% avg)
  - Streams are long enough for a 16-wide fetch
  - Similar or better performance than 2.16!
- Fetching a single basic block per cycle is not enough
  - Gshare+BTB: 10% slowdown
  - Gskew+FTB: 4% slowdown

Slide 21: MEM & MIX Workloads - Fetch Throughput
[Figure: fetch throughput]
- Same trend as the ILP fetch throughput
  - For a given fetch BW, fetching from two threads is better
  - Stream > gskew+FTB > gshare+BTB

Slide 22: MEM & MIX Workloads - 1.X (1.8) vs 2.X (2.8)
[Figure: commit throughput]
- With memory-bounded benchmarks... overall performance actually decreases!
  - Memory-bounded threads monopolize resources for many cycles
- Previously identified -> new fetch policies
  - Flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads
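
A hedged sketch of how a stall-style policy composes with ICOUNT: threads with an outstanding L2 miss are gated out of fetch so they stop allocating shared resources, and the remaining threads are ordered by their instruction count. This is my illustration of the general idea, not the exact mechanism from [Tullsen01] or [Luo01]:

```python
# Stall-style fetch gating on top of ICOUNT: a thread waiting on an L2
# miss would only pile up instructions behind the miss, so it is removed
# from the fetch candidates until the miss resolves.

def fetch_priority(threads):
    """threads: list of (icount, has_l2_miss) per thread id.
    Returns fetchable thread ids, lowest ICOUNT first, with
    L2-missing threads gated out entirely."""
    eligible = [t for t, (_, miss) in enumerate(threads) if not miss]
    return sorted(eligible, key=lambda t: threads[t][0])

# Thread 1 is gated by its L2 miss; threads 2 and 0 are ordered by ICOUNT.
print(fetch_priority([(12, False), (3, True), (7, False)]))  # -> [2, 0]
```

Flush-style policies [Tullsen01] go further: besides gating fetch, they squash the stalled thread's already-allocated instructions to free queue slots and registers immediately.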

Slide 23: MEM & MIX Workloads
- Fetching from only one thread means fetching only from the highest-priority thread
  - Allows the highest-priority thread to proceed with more resources
  - Prevents low-quality (lower-priority) threads from monopolizing more and more resources (registers, IQ slots, etc.) on cache misses
- When a cache miss is resolved, instructions from the second thread will be consumed
  - ICOUNT will give it more priority after the cache miss resolution
- A powerful fetch unit can be harmful if not well used

Slide 24: MEM & MIX Workloads - 1.X (1.8) vs 1.2X (1.16)
[Figure: commit throughput]
- Even 2.16 has worse commit performance than 1.8
  - More interference is introduced by low-quality threads
- Overall, 1.16 is the best combination
  - Low complexity: fetching from one thread
  - High performance: wide fetch

Slide 25: Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

Slide 26: Summary
- The fetch unit is the most significant obstacle to obtaining high SMT performance
- However, researchers usually don't care about SMT fetch performance
  - They care about how to combine threads to keep the available fetch throughput used
  - A simple gshare/hybrid + BTB is commonly used
  - Everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive
  - Sharing implies competing
  - Low-quality threads monopolize more and more resources

Slide 27: Conclusions
- 1.16 (1.2X) is the best fetch option
  - Uses a high-width fetch architecture
    - It's not the prediction accuracy, it's the fetch width
  - Beneficial for both ILP and MEM workloads
    - 1.X is bad for ILP; 2.X is bad for MEM
  - Fetches only from the most promising thread (according to the fetch policy), and as much as possible
  - Offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies

Thanks! Questions & Answers

Backup Slides

Slide 30: SMT Workloads
Workload: Threads
2_ILP: eon, gcc
2_MEM: mcf, twolf
2_MIX: gzip, twolf
4_ILP: eon, gcc, gzip, bzip2
4_MEM: mcf, twolf, vpr, perlbmk
4_MIX: gzip, twolf, bzip2, mcf
6_ILP: eon, gcc, gzip, bzip2, crafty, vortex
6_MIX: gzip, twolf, bzip2, mcf, vpr, eon
8_ILP: eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
8_MIX: gzip, twolf, bzip2, mcf, vpr, eon, gap, parser

Slide 31: Simulation Setup
Fetch policy: ICOUNT
Gshare predictor: 64K-entry, 16-bit history
Gskew predictor: 3x32K-entry, 15-bit history
BTB/FTB: 2K-entry, 4-way assoc.
Stream predictor: 1K-entry, 4-way + 4K-entry, 4-way
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2W, 8 banks
L2 cache: 1MB, 2W, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.