1 Exploring Efficient SMT Branch Predictor Design
Matt Ramsay, Chris Feucht & Mikko H. Lipasti
University of Wisconsin-Madison, PHARM Team (www.ece.wisc.edu/~pharm)
WCED: June 7, 2003

2 Introduction & Motivation
Two main performance limitations: memory stalls and pipeline flushes due to incorrect speculation.
In SMTs, multiple threads help hide these problems. However, multiple threads also make speculation harder because they interfere in shared prediction resources. This interference can cause more branch mispredicts and thus limit potential performance.

3 Introduction & Motivation
We study: providing each thread with its own pieces of the branch predictor to eliminate interference between threads, and applying these changes to different branch prediction schemes to evaluate their performance.
We hypothesize: eliminating thread interference in the branch predictor will improve prediction accuracy, and thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor.

4 Talk Outline
Introduction & Motivation, SMT Overview, Branch Prediction Overview, Test Methodology, Results, Conclusions

5 SMT Overview
Simultaneous Multithreading: machines often have more resources than one thread can use. SMT allows TLP along with ILP. 4-wide example (diagram).

6 Tested Predictors
Static predictors (in paper): Always Taken; Backward-Taken-Forward-Not-Taken.
2-Bit predictor: Branch History Table (BHT) indexed by the PC of the branch instruction. Allows significant aliasing between branches that share low PC bits, and does not take advantage of global branch history.
Gshare predictor: BHT indexed by the XOR of the branch PC and the global branch history. The hashing reduces aliasing and correlates the prediction with global branch behavior.
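The 2-bit and Gshare indexing schemes above can be sketched as follows. This is an illustrative sketch, not code from the paper; the function names are invented, and the table size matches the simulated configuration (12 history bits, 4096 entries).

```python
TABLE_BITS = 12
TABLE_SIZE = 1 << TABLE_BITS  # 4096 BHT entries, as in the simulated machine

def twobit_index(pc):
    # 2-bit scheme: index by low PC bits only, so branches that
    # share low PC bits alias into the same counter.
    return (pc >> 2) & (TABLE_SIZE - 1)

def gshare_index(pc, global_history):
    # Gshare: XOR the PC with the global branch history to spread
    # branches across the table and correlate with recent outcomes.
    return ((pc >> 2) ^ global_history) & (TABLE_SIZE - 1)

def predict(counter):
    # 2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken.
    return counter >= 2

def update(counter, taken):
    # Saturate at 0 and 3 so one anomalous branch outcome
    # does not immediately flip a strongly biased prediction.
    return min(counter + 1, 3) if taken else max(counter - 1, 0)
```

Both schemes read the same kind of counter table; only the index function differs, which is why the sharing/splitting variants studied later apply to either one.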

7 YAGS Predictor (diagram)

8 Indirect Branch Predictor
Predicts the target of jump-register (JR) instructions. The prediction table holds full target addresses, so its larger table entries lead to more aliasing. It is indexed like the Gshare branch predictor. A split indirect predictor caused little change in branch prediction accuracy and overall performance (in paper).
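A minimal sketch of such an indirect target predictor, under the assumptions stated on the slide (a table of full target addresses, Gshare-style indexing, and the simulated sizes of 10 history bits and 1024 entries). The class and method names are illustrative, not from the paper.

```python
IT_BITS = 10
IT_SIZE = 1 << IT_BITS  # 1024 indirect-target entries, as simulated

class IndirectPredictor:
    def __init__(self):
        # Each entry is a full target address (not a 2-bit counter),
        # which is why entries are larger and alias more for a given budget.
        self.targets = [0] * IT_SIZE

    def _index(self, pc, history):
        # Same XOR hash as Gshare, over the smaller indirect table.
        return ((pc >> 2) ^ history) & (IT_SIZE - 1)

    def predict(self, pc, history):
        return self.targets[self._index(pc, history)]

    def update(self, pc, history, actual_target):
        # On a resolved JR, overwrite the entry with the real target.
        self.targets[self._index(pc, history)] = actual_target
```

Usage: predict at fetch with the current history, then call update once the JR resolves, using the same PC/history pair.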

9 Talk Outline
Introduction & Motivation, SMT Overview, Branch Prediction Overview, Test Methodology, Results, Conclusions

10 Simulation Environment
Multithreaded version of SimpleScalar developed by Craig Zilles at UW.
Machine configuration:
  # of threads = 4; # of address spaces = 4
  Machine width = 4; pipeline depth = 15; max issue window = 64; # of physical registers = 512
  Branch history bits = 12; # of branch table entries = 4096
  Indirect history bits = 10; # of indirect table entries = 1024
  L1: 32 KB, direct-mapped, 64 B blocks, 1-cycle latency
  L2: 1 MB, 4-way, 128 B blocks, 10-cycle latency
  Memory latency = 200 cycles
  # of instructions simulated = ~40M

11 Benchmarks Tested
From SPEC CPU2000: INT: crafty, gcc; FP: ammp, equake.
Benchmark configurations:
Heterogeneous threads: each thread runs one of the listed benchmarks, to simulate a multi-tasking environment.
Homogeneous threads: each thread runs a separate copy of the same benchmark (crafty), to simulate a multithreaded server environment.

12 Shared Configuration (diagram: Threads 0-3 all share a single history register and a single predictor table.)

13 Split Branch Configuration (diagram: each of Threads 0-3 has its own history register and its own predictor table.) Each predictor block retains the original size when duplicated.

14 Split Branch Table Configuration (diagram: Threads 0-3 share one history register; the thread ID selects among four per-thread predictor tables.)

15 Split History Configuration (diagram: each of Threads 0-3 has its own history register, selected by thread ID; all threads share one predictor table.)
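The split history configuration on slide 15 (the cheapest of the three splits) can be sketched for a Gshare-style predictor as follows. This is an assumed reconstruction from the diagram descriptions, not the authors' code; sizes match the simulated machine (4 threads, 12 history bits, 4096 entries).

```python
NUM_THREADS = 4
HIST_BITS = 12
TABLE_SIZE = 1 << HIST_BITS  # 4096 shared 2-bit counters

class SplitHistoryGshare:
    """One shared counter table, but a private history register per
    thread, so threads cannot pollute each other's branch history."""

    def __init__(self):
        self.table = [1] * TABLE_SIZE        # shared, init weakly not-taken
        self.history = [0] * NUM_THREADS     # per-thread history registers

    def _index(self, tid, pc):
        # Gshare hash, but using this thread's private history.
        return ((pc >> 2) ^ self.history[tid]) & (TABLE_SIZE - 1)

    def predict(self, tid, pc):
        return self.table[self._index(tid, pc)] >= 2

    def update(self, tid, pc, taken):
        i = self._index(tid, pc)             # index before shifting history
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into this thread's history only.
        self.history[tid] = ((self.history[tid] << 1) | int(taken)) & (TABLE_SIZE - 1)
```

The shared configuration (slide 12) would use one history register for all threads; the split branch configuration (slide 13) would duplicate both arrays per thread. Only the history registers are duplicated here, which is why this variant uses far fewer resources than a full split.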

16 Talk Outline
Introduction & Motivation, SMT Overview, Branch Prediction Overview, Test Methodology, Results, Conclusions

17 Split Branch Predictor Accuracy
Full predictor split: the predictors act as expected, as they would in a single-threaded environment.

18 Shared Branch Predictor Accuracy
Shared predictor: accuracy suffers because of interference from other threads (especially Gshare).

19 Prediction Accuracy: Heterogeneous Threads
YAGS & Gshare: sharing the history register performs very poorly. The split history configuration performs almost as well as the split branch configuration while using significantly fewer resources.
2-Bit: splitting the predictor performs better; mispredicts drop from 9.52% to 8.35%.

20 Prediction Accuracy: Homogeneous Threads
YAGS & Gshare: the configurations perform similarly to the heterogeneous thread case. The split history configuration comes even closer to the split branch configuration because of positive aliasing in the BHT.
Surprisingly, splitting portions of the predictor still performs better even when each thread runs the same program.

21 Per-Thread CPI: Heterogeneous Threads
Sharing the history register with Gshare has a significant negative effect on performance (near 50% mispredicts). The split history configuration produces almost the same performance as the split branch configuration while using significantly fewer resources.

22 Per-Thread CPI: Homogeneous Threads
Per-thread performance is worse in the homogeneous thread configuration because the crafty benchmark has the highest number of cache misses.

23 Performance Across Predictors
The branch prediction scheme has little effect on performance: only 2.75% and 5% CPI increases when the Gshare and 2-bit predictors are used instead of the much more expensive YAGS. The increases are 6% and 11% in a single-threaded machine. The heterogeneous thread configuration performs similarly.

24 Performance Across Predictors
The split history configuration still allows performance to hold for the simpler schemes: 4% and 6.25% CPI increases for the Gshare and 2-bit schemes compared to YAGS. Simpler schemes allow for reduced cycle time and power consumption. The CPI numbers are only close estimates because the simulations are not deterministic.

25 Talk Outline
Introduction & Motivation, SMT Overview, Branch Prediction Overview, Test Methodology, Results, Conclusions

26 Conclusions
Multithreaded execution interferes with branch prediction accuracy. Prediction accuracy trends are similar across both homogeneous and heterogeneous thread test cases. Splitting only the branch history gives the best branch prediction accuracy and performance per resource. Performance (CPI) is relatively stable, even when the branch prediction structure is simplified.
