A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors Ayose Falcón Alex Ramirez Mateo Valero HPCA-10 February 18, 2004.


1 A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors Ayose Falcón Alex Ramirez Mateo Valero HPCA-10 February 18, 2004

2 Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
- Instructions from different threads coexist in each processor stage
- Resources are shared among the different threads
- But... sharing implies competition: in caches, queues, FUs, ...
- The fetch policy decides!

3 Motivation
- SMT performance is limited by fetch performance
  - A superscalar fetch unit is not enough to feed an aggressive SMT core
  - SMT fetch is a bottleneck [Tullsen96] [Burns99]
- Straightforward solution: fetch from several threads each cycle
  a) Multiple fetch units (1 per thread) -> EXPENSIVE!
  b) Shared fetch + fetch policy [Tullsen96]:
     - Multiple PCs
     - Multiple branch predictions per cycle
     - Multiple I-cache accesses per cycle
- Does the performance of this fetch organization compensate for its complexity?

4 Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

5 Fetching from a Single Thread (1.X)
[figure: branch predictor + instruction cache + SHIFT&MASK logic]
- Fine-grained, non-simultaneous sharing
- Simple -> similar to a superscalar fetch unit; no additional HW needed
- A fetch policy is needed: it decides fetch priority among threads
- Several proposals in the literature

6 Fetching from a Single Thread (1.X)
- But... a single thread is not enough to fill the fetch BW
- A gshare/hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
- Fetch BW is heavily underused:
  - avg. 40% wasted with 1.8
  - avg. 60% wasted with 1.16
- The fetch BW is fully used in:
  - 31% of fetch cycles with 1.8
  - 6% of fetch cycles with 1.16
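The underuse on this slide follows from simple arithmetic. A minimal sketch (assuming, per the slide, an average basic block of roughly 7 instructions; the measured waste above is larger because many blocks are shorter than average and are also cut by cache-line boundaries):

```python
def wasted_fraction(block_size, fetch_width):
    # A block-limited fetch delivers at most one basic block per cycle,
    # so at best min(block_size, fetch_width) of the fetch_width slots fill.
    fetched = min(block_size, fetch_width)
    return 1.0 - fetched / fetch_width

# Optimistic lower bounds on wasted bandwidth with a 7-instruction block:
print(wasted_fraction(7, 8))    # 1.8  fetch (1 thread, 8-wide)
print(wasted_fraction(7, 16))   # 1.16 fetch (1 thread, 16-wide)
```

Doubling the width from 8 to 16 without raising the per-cycle block supply roughly quadruples the wasted fraction, which is why the slide calls 1.16 "heavily underused".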

7 Fetching from Multiple Threads (2.X)
- Increases fetch throughput: more threads -> more possibilities to fill the fetch BW
- More fetch BW used than with 1.X
- The fetch BW is fully used in:
  - 54% of cycles with 2.8
  - 16% of cycles with 2.16

8 Fetching from Multiple Threads (2.X)
[figure: branch predictor + 2-bank instruction cache + SHIFT&MASK + MERGE logic]
- 2 predictions per cycle + 2 predictor ports
- Multibanked + multiported instruction cache
- Replication of the SHIFT & MASK logic
- New HW to realign and merge cache lines
- But... what is the additional HW cost of a 2.X fetch?
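The extra 2.X datapath can be pictured with a toy model. This is only an illustration of the SHIFT&MASK and MERGE stages named on the slide; the function names, block boundaries, and instruction labels are hypothetical, not the paper's hardware:

```python
def shift_and_mask(cache_line, start, length, width):
    # Extract up to `length` sequential instructions starting at `start`,
    # clipped to the fetch width (a taken branch or line end stops the block).
    return cache_line[start:start + min(length, width)]

def merge(block_a, block_b, width):
    # 2.X fetch: fill the bundle from the primary thread's block first,
    # then top it up with the secondary thread's block (the MERGE logic).
    bundle = list(block_a[:width])
    bundle += block_b[:width - len(bundle)]
    return bundle

line0 = [f"T0.i{i}" for i in range(16)]   # thread 0's cache line (bank 1)
line1 = [f"T1.i{i}" for i in range(16)]   # thread 1's cache line (bank 2)
blk0 = shift_and_mask(line0, 4, 5, 8)     # a 5-instruction block from thread 0
blk1 = shift_and_mask(line1, 0, 8, 8)
bundle = merge(blk0, blk1, 8)             # one 8-wide fetch bundle
```

In hardware, the realignment and merge are wide multiplexer networks in the fetch path; a 1.X design needs only one SHIFT&MASK and no MERGE at all, which is the complexity gap the slide is asking about.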

9 Our Goal
- Can we take the best of both worlds?
  - the low complexity of a 1.X fetch architecture, plus
  - the high performance of a 2.X fetch architecture
- That is... can a single thread provide enough instructions to fill the available fetch bandwidth?

10 Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

11 High-Performance Fetch Engines (I)
- We look for high performance
  - A gshare / hybrid branch predictor + BTB gives low performance: it limits fetch BW to one basic block per cycle (6-8 instructions)
- We look for low complexity
  - Trace cache, Branch Target Address Cache, Collapsing Buffer, etc. fetch multiple basic blocks per cycle (12-16 instructions), but at high complexity
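The gshare scheme named above can be sketched in a few lines. This is a toy model for intuition only: table sizes and the 2-bit counter initialization are illustrative, not the 45KB configuration evaluated in the paper.

```python
class Gshare:
    """Toy gshare: 2-bit saturating counters indexed by PC XOR global history."""

    def __init__(self, bits=12):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)   # start weakly taken
        self.history = 0                 # global branch history register

    def index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Whatever its accuracy, such a predictor redirects fetch at every predicted-taken branch, so it can never supply more than one basic block per cycle; that structural limit, not misprediction rate, is the bottleneck the slide points at.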

12 High-Performance Fetch Engines (II)
- Our alternatives:
  - Gskew [Michaud97] + FTB [Reinman99]
    - FTB fetch blocks are larger than basic blocks
    - 5% speedup over gshare+BTB in superscalars
  - Stream predictor [Ramirez02]
    - Streams are larger than FTB fetch blocks
    - 11% speedup over gskew+FTB in superscalars

13 Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

14 Simulation Setup
- Modified version of SMTSIM [Tullsen96]
  - Trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45KB
- Decode & rename width limited to 8 instructions
- Fetch width: 8/16 instructions; fetch buffer: 32 instructions

Fetch policy: ICOUNT
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2-way, 8 banks
L2 cache: 1MB, 2-way, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.
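The ICOUNT fetch policy [Tullsen96] used throughout the evaluation can be sketched in one function. A minimal model, not SMTSIM code; the thread names and in-flight counts below are hypothetical:

```python
def icount_priority(in_flight):
    # ICOUNT: give fetch priority to the thread with the fewest
    # instructions in the pre-execute stages (decode/rename/queues),
    # so fast-moving threads get fetch slots and stalled threads
    # do not clog the shared queues.
    return sorted(in_flight, key=in_flight.get)

# In-flight instruction counts per thread this cycle (hypothetical):
order = icount_priority({"t0": 25, "t1": 3, "t2": 11})
# A 1.X fetch takes only order[0] this cycle; a 2.X fetch takes order[:2].
```

The distinction in the last comment is exactly the 1.X-vs-2.X axis of the paper: the policy ranks the threads either way, the fetch architecture decides how many of them actually fetch.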

15 Workloads
- SPECint2000
- Code layout optimized: Spike [Cohn97] + profile data using the train input
- Most representative 300M-instruction trace, using the ref input
- Workloads including 2, 4, 6, and 8 threads
- Classified according to thread characteristics:
  - ILP -> only ILP benchmarks
  - MEM -> memory-bounded benchmarks
  - MIX -> mix of ILP and MEM benchmarks

16 Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
  - ILP workloads
  - MEM & MIX workloads
  - Only for 2 & 4 threads (see paper for the rest)
- Summary & Conclusions

17 ILP Workloads - Fetch Throughput
[chart: fetch throughput]
- With a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16:
  - Stream predictor -> better fetch performance than 2.8
  - Gshare+BTB / gskew+FTB -> worse fetch performance than 2.8

18 ILP Workloads - 1.X (1.8) vs 2.X (2.8)
[chart: commit throughput]
- ILP benchmarks have few memory problems and high parallelism
- The fetch unit is the real limiting factor
- The higher the fetch throughput, the higher the IPC

19 ILP Workloads
- So... 2.X is better than 1.X in ILP workloads...
- But what about 1.2X instead of 2.X? That is, 1.16 instead of 2.8
  - Maintains single-thread fetch
  - Cache lines and buses are already 16 instructions wide
  - We only have to modify the HW to select 16 instead of 8 instructions

20 ILP Workloads - 2.X (2.8) vs 1.2X (1.16)
[chart: commit throughput]
- With 1.16, the stream predictor increases throughput (9% avg.)
  - Streams are long enough for a 16-wide fetch
  - Similar or better performance than 2.16!
- Fetching a single block per cycle is not enough:
  - Gshare+BTB -> 10% slowdown
  - Gskew+FTB -> 4% slowdown

21 MEM & MIX Workloads - Fetch Throughput
[chart: fetch throughput]
- Same trend as the ILP fetch throughput
- For a given fetch BW, fetching from two threads is better
- Stream > gskew+FTB > gshare+BTB

22 MEM & MIX Workloads - 1.X (1.8) vs 2.X (2.8)
[chart: commit throughput]
- With memory-bounded benchmarks... overall performance actually decreases!
- Memory-bounded threads monopolize resources for many cycles
- Previously identified problem -> new fetch policies: flush [Tullsen01] or stall [Luo01, El-Moursy03] the problematic threads

23 MEM & MIX Workloads
- Fetching from only one thread means fetching only from the first, highest-priority thread
  - Allows the highest-priority thread to proceed with more resources
  - Prevents low-quality (lower-priority) threads from monopolizing more and more resources on cache misses (registers, IQ slots, etc.)
- When the cache miss is resolved, instructions from the second thread will be consumed
  - ICOUNT will give it more priority after the cache miss resolution
- A powerful fetch unit can be harmful if not well used
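The interaction described above can be sketched as a stall-style selection step in front of ICOUNT. This is an illustrative model in the spirit of the stall policies cited on the previous slide, not the paper's mechanism; the function name, thread names, and counts are hypothetical:

```python
def select_fetch_thread(in_flight, in_l2_miss):
    # Skip threads waiting on an L2 miss, then apply ICOUNT (fewest
    # in-flight instructions wins) among the remaining threads. With a
    # 1.X fetch, only the winner fetches this cycle, so a stalled
    # memory-bound thread cannot keep grabbing registers and IQ slots.
    ready = [t for t in in_flight if not in_l2_miss[t]]
    candidates = ready or list(in_flight)   # fall back if all are stalled
    return min(candidates, key=in_flight.get)

# mcf is blocked on an L2 miss, so gzip fetches despite its higher count:
winner = select_fetch_thread({"mcf": 4, "gzip": 9},
                             {"mcf": True, "gzip": False})
```

With a 1.X fetch the policy's choice is absolute, which is why a narrower fetch can outperform a "more powerful" 2.X one on MEM workloads.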

24 MEM & MIX Workloads - 1.X (1.8) vs 1.2X (1.16)
[chart: commit throughput]
- Even 2.16 has worse commit performance than 1.8
  - More interference introduced by low-quality threads
- Overall, 1.16 is the best combination:
  - Low complexity -> fetching from one thread
  - High performance -> wide fetch

25 Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions

26 Summary
- The fetch unit is the most significant obstacle to obtaining high SMT performance
- However, researchers usually do not focus on SMT fetch performance itself
  - They care about how to combine threads to sustain the available fetch throughput
  - A simple gshare/hybrid + BTB is commonly used
  - Everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive
  - Sharing implies competing
  - Low-quality threads monopolize more and more resources

27 Conclusions
- 1.16 (1.2X) is the best fetch option
- Use a high-width fetch architecture
  - It's not the prediction accuracy, it's the fetch width
- Beneficial for both ILP and MEM workloads
  - 1.X is bad for ILP; 2.X is bad for MEM
- Fetches only from the most promising thread (according to the fetch policy), and as much as possible
- Offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies

28 Thanks. Questions & Answers

29 Backup Slides

30 SMT Workloads

Workload: Threads
2_ILP: eon, gcc
2_MEM: mcf, twolf
2_MIX: gzip, twolf
4_ILP: eon, gcc, gzip, bzip2
4_MEM: mcf, twolf, vpr, perlbmk
4_MIX: gzip, twolf, bzip2, mcf
6_ILP: eon, gcc, gzip, bzip2, crafty, vortex
6_MIX: gzip, twolf, bzip2, mcf, vpr, eon
8_ILP: eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
8_MIX: gzip, twolf, bzip2, mcf, vpr, eon, gap, parser

31 Simulation Setup

Fetch policy: ICOUNT
Gshare predictor: 64K-entry, 16-bit history
Gskew predictor: 3x32K-entry, 15-bit history
BTB/FTB: 2K-entry, 4-way assoc.
Stream predictor: 1K-entry, 4-way + 4K-entry, 4-way
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2-way, 8 banks
L2 cache: 1MB, 2-way, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.

