
1 Efficient Runahead Execution Processors: A Power-Efficient Processing Paradigm for Tolerating Long Main Memory Latencies
Onur Mutlu, PhD Defense, 4/28/2006

2 Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
  - Efficient Runahead Execution
  - Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

3 Motivation
- Memory latency is very long in today's processors:
  - CDC 6600: 10 cycles [Thornton, 1970]
  - Alpha 21264: 120+ cycles [Wilkes, 2001]
  - Intel Pentium 4: 300+ cycles [Sprangle & Carmean, 2002]
- And it continues to increase (in terms of processor cycles), because DRAM latency is not decreasing as fast as processor cycle time.
- Conventional techniques to tolerate memory latency do not work well enough.

4 Conventional Latency Tolerance Techniques
- Caching [initially by Wilkes, 1965]
  - Widely used, simple, effective; but inefficient and passive: not all applications/phases exhibit temporal or spatial locality
- Prefetching [initially in IBM 360/91, 1967]
  - Works well for regular memory access patterns; prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
- Multithreading [initially in CDC 6600, 1964]
  - Works well if there are multiple threads; improving single-thread performance using multithreading hardware is an ongoing research effort
- Out-of-order execution [initially by Tomasulo, 1967]
  - Tolerates cache misses that cannot be prefetched, but requires extensive hardware resources for tolerating long latencies

5 Out-of-order Execution
- Instructions are executed out of sequential program order to tolerate latency.
- Instructions are retired in program order to support precise exceptions/interrupts.
- Not-yet-retired instructions and their results are buffered in hardware structures, collectively called the instruction window.
- The size of the instruction window determines how much latency the processor can tolerate.

6 Small Windows: Full-window Stalls
- When a long-latency instruction is not complete, it blocks retirement.
- Incoming instructions fill the instruction window.
- Once the window is full, the processor cannot place new instructions into the window. This is called a full-window stall.
- A full-window stall prevents the processor from making progress in the execution of the program.

7 Small Windows: Full-window Stalls (example)
8-entry instruction window, oldest instruction at the bottom:

    ADD  R2 ← R2, 64
    STOR mem[R2] ← R4
    ADD  R4 ← R4, R5
    MUL  R4 ← R4, R3
    LOAD R3 ← mem[R2]
    ADD  R2 ← R2, 8
    BEQ  R1, R0, target
    LOAD R1 ← mem[R5]    (oldest: L2 miss! Takes 100s of cycles)

- Instructions independent of the L2 miss are executed out of program order, but cannot be retired.
- Younger instructions cannot be executed because there is no space in the instruction window.
- The processor stalls until the L2 miss is serviced.
- L2 cache misses are responsible for most full-window stalls.

8 Impact of L2 Cache Misses
[Chart: L2 misses. 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

9 Impact of L2 Cache Misses
[Chart: L2 misses. 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

10 The Problem
- Out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
- As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is a challenging task if we would like to achieve:
  - Low power/energy consumption
  - Short cycle time
  - Low design and verification complexity

11 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

12 Overview of Runahead Execution [HPCA'03]
- A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!)
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to discover other L2 misses
  - The processor does not stall due to L2 misses
- Runahead mode ends when the original L2 miss returns:
  - Checkpoint is restored and normal execution resumes

13 Runahead Example

    Small Window:   Compute | Load 1 Miss ... stall (Miss 1) | Compute | Load 2 Miss ... stall (Miss 2) | Compute
    Perfect Caches: Compute | Load 1 Hit | Compute | Load 2 Hit | Compute
    Runahead:       Compute | Load 1 Miss -> runahead: Load 2 Miss issued (Miss 1 and Miss 2 overlap) | Compute | Load 2 Hit | Compute

Runahead overlaps the two miss latencies instead of serializing the stalls: Saved Cycles relative to the small window.

14 Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
- Pre-executed loads and stores independent of the L2-miss instructions generate very accurate data prefetches, for both regular and irregular access patterns.
- Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
- Hardware prefetcher and branch predictor tables are trained using future access information.

15 Runahead Execution Mechanism
- Entry into runahead mode: checkpoint architectural register state
- Instruction processing in runahead mode
- Exit from runahead mode: restore architectural register state from checkpoint

16 Instruction Processing in Runahead Mode
Runahead mode processing is the same as normal instruction processing, EXCEPT:
- It is purely speculative: architectural (software-visible) register/memory state is NOT updated in runahead mode.
- L2-miss dependent instructions are identified and treated specially:
  - They are quickly removed from the instruction window.
  - Their results are not trusted.

17 L2-Miss Dependent Instructions
- Two types of results are produced: INV and VALID
  - INV = dependent on an L2 miss
- INV results are marked using INV bits in the register file and store buffer.
- INV values are not used for prefetching or branch resolution.

18 Removal of Instructions from Window
- The oldest instruction is examined for pseudo-retirement:
  - An INV instruction is removed from the window immediately.
  - A VALID instruction is removed when it completes execution.
- Pseudo-retired instructions free their allocated resources, which allows the processing of later instructions.
- Pseudo-retired stores communicate their data to dependent loads.

19 Store/Load Handling in Runahead Mode
- A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache (see the sketch below).
- Purpose: data communication through memory in runahead mode.
- A dependent load reads its data from the runahead cache.
- This communication does not always need to be correct (runahead mode is speculative), so the runahead cache can be very small.
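
The following is a minimal software sketch of such a runahead cache, in C. The size, field layout, and word granularity are illustrative assumptions, not the actual hardware organization; it only shows how pseudo-retired stores forward data and INV status to later loads.

    #include <stdint.h>
    #include <stdbool.h>

    #define RA_ENTRIES 64              /* 64 x 8-byte entries = 512 bytes */

    typedef struct {
        bool     valid;                /* written during this runahead period */
        bool     inv;                  /* data came from an L2-miss dependent store */
        uint64_t tag;                  /* full address used as tag, for simplicity */
        uint64_t data;
    } ra_entry_t;

    static ra_entry_t ra_cache[RA_ENTRIES];

    static unsigned ra_index(uint64_t addr) { return (addr >> 3) % RA_ENTRIES; }

    /* Called when a store pseudo-retires in runahead mode. */
    void ra_store(uint64_t addr, uint64_t data, bool inv)
    {
        ra_entry_t *e = &ra_cache[ra_index(addr)];
        e->valid = true;
        e->inv   = inv;
        e->tag   = addr;
        e->data  = data;
    }

    /* Called by a load in runahead mode; returns true on a hit.
     * On a miss the load falls through to the normal memory system. */
    bool ra_load(uint64_t addr, uint64_t *data, bool *inv)
    {
        ra_entry_t *e = &ra_cache[ra_index(addr)];
        if (!e->valid || e->tag != addr)
            return false;
        *data = e->data;
        *inv  = e->inv;                /* INV data is not used for prefetching */
        return true;
    }

    /* The whole cache is flushed at runahead exit. */
    void ra_flush(void)
    {
        for (int i = 0; i < RA_ENTRIES; i++)
            ra_cache[i].valid = false;
    }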

20 Branch Handling in Runahead Mode
- INV branches cannot be resolved.
  - A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.
- VALID branches are resolved and initiate recovery if mispredicted.

21 Hardware Cost of Runahead Execution
- Checkpoint of the architectural register state: already exists in current processors
- INV bits per register and store buffer entry
- Runahead cache (512 bytes)
- Total: <0.05% area overhead

22 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

23 Baseline Processor
- 3-wide fetch, 29-stage pipeline x86 processor
- 128-entry instruction window
- 512 KB, 8-way, 16-cycle unified L2 cache
- Approximately 500-cycle L2 miss latency; bandwidth, contention, and conflicts modeled in detail
- Aggressive streaming data prefetcher (16 streams)
- Next-two-lines instruction prefetcher

24 Evaluated Benchmarks
147 Intel x86 benchmarks, each simulated for 30 million instructions. Benchmark suites:
- SPEC CPU 95 (S95): mostly scientific FP applications
- SPEC FP 2000 (FP00)
- SPEC INT 2000 (INT00)
- Internet (WEB): SPECjbb, Webmark2001
- Multimedia (MM): MPEG, speech recognition, games
- Productivity (PROD): PowerPoint, Excel, Photoshop
- Server (SERV): transaction processing, e-commerce
- Workstation (WS): engineering/CAD applications

25 Performance of Runahead Execution

26 Runahead Execution vs. Large Windows

27 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

28 Limitations of the Baseline Runahead Mechanism
- Energy inefficiency: a large number of instructions are speculatively executed
  -> Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
- Ineffectiveness for pointer-intensive applications: runahead cannot parallelize dependent L2 cache misses
  -> Address-Value Delta (AVD) Prediction [MICRO'05]
- Irresolvable branch mispredictions in runahead mode: cannot recover from a mispredicted L2-miss dependent branch
  -> Wrong Path Events [MICRO'04]

29 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

30 The Efficiency Problem [ISCA'05]
- A runahead processor pre-executes some instructions speculatively.
- Each pre-executed instruction consumes energy.
- Runahead execution significantly increases the number of executed instructions, sometimes without providing any performance improvement.

31 The Efficiency Problem
[Chart: runahead increases IPC by 22% while increasing the number of executed instructions by 27%]

32 Efficiency of Runahead Execution
Goals:
- Reduce the number of executed instructions without reducing the IPC improvement
- Increase the IPC improvement without increasing the number of executed instructions

    Efficiency = (% Increase in IPC) / (% Increase in Executed Instructions)
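
As a worked example, using the numbers reported later in this talk: baseline runahead improves IPC by 22.6% while executing 26.5% more instructions, so its efficiency is 22.6 / 26.5 ≈ 0.85. The efficient runahead techniques raise this to 22.1 / 6.2 ≈ 3.6.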

33 Causes of Inefficiency
- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods

34 Short Runahead Periods
- The processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, a wrong-path access, or a previous runahead period.
- Short periods:
  - are less likely to generate useful L2 misses
  - have high overhead due to the flush penalty at runahead exit
[Timeline: runahead is entered on a miss that is already in flight and about to return, so the period is short]

35 Eliminating Short Runahead Periods
Mechanism to eliminate short periods (a sketch follows):
- Record the number of cycles C an L2 miss has been in flight.
- If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss.
- T can be determined statically (at design time) or dynamically.
- T = 400 works well for a minimum main memory latency of 500 cycles.
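
A minimal sketch of this filter in C (the miss-tracking structure, field names, and the fixed threshold are illustrative assumptions; in hardware this is a simple counter associated with each outstanding miss):

    #include <stdint.h>
    #include <stdbool.h>

    #define RUNAHEAD_THRESHOLD_T 400   /* cycles; tuned for a ~500-cycle memory latency */

    typedef struct {
        uint64_t issue_cycle;          /* cycle at which the L2 miss request was sent */
    } l2_miss_t;

    /* Decide whether an L2 miss that just blocked retirement should trigger
     * entry into runahead mode. A miss that has already been in flight for
     * more than T cycles will return soon, so the runahead period it would
     * start is likely too short to be useful. */
    bool should_enter_runahead(const l2_miss_t *miss, uint64_t now)
    {
        uint64_t cycles_in_flight = now - miss->issue_cycle;   /* C */
        return cycles_in_flight <= RUNAHEAD_THRESHOLD_T;       /* enter only if C <= T */
    }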

36 Overlapping Runahead Periods
[Timeline: the first runahead period pre-executes Load 2, which is INV; after exit, a second period re-executes the same instructions: OVERLAP]
- Two runahead periods that execute the same instructions
- The second period is inefficient

37 Overlapping Runahead Periods (cont.)
- Overlapping periods are not necessarily useless: the availability of a new data value can result in the generation of useful L2 misses. But this does not happen often enough.
- Mechanism to eliminate overlapping periods (sketched below):
  - Keep track of the number of pseudo-retired instructions R during a runahead period.
  - Keep track of the number of fetched instructions N since the exit from the last runahead period.
  - If N < R, do not enter runahead mode.
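
A sketch of the overlap filter in C (counter names are illustrative; in hardware these are two simple counters):

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t pseudo_retired_R;  /* instructions pseudo-retired in the last runahead period */
    static uint64_t fetched_since_N;   /* instructions fetched since that period ended */

    void on_runahead_enter(void)         { pseudo_retired_R = 0; }
    void on_runahead_pseudo_retire(void) { pseudo_retired_R++; }
    void on_runahead_exit(void)          { fetched_since_N = 0; }
    void on_normal_mode_fetch(void)      { fetched_since_N++; }

    /* If fewer new instructions (N) have been fetched than the last period
     * already pre-executed (R), a new period would mostly overlap it. */
    bool overlap_filter_allows_entry(void)
    {
        return fetched_since_N >= pseudo_retired_R;   /* block entry when N < R */
    }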

38 Useless Runahead Periods
- Periods that do not result in prefetches for normal mode
- They exist due to the lack of memory-level parallelism
- Mechanism to eliminate useless periods:
  - Predict whether a period will generate useful L2 misses
  - Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
  - Useless-period predictors are trained based on this estimation
[Timeline: a runahead period that generates no new misses before Miss 1 returns]

39 Predicting Useless Runahead Periods
- Prediction based on the past usefulness of runahead periods caused by the same static load instruction (see the sketch below):
  - A 2-bit state machine records the past usefulness of each load.
- Prediction based on too many INV loads:
  - If the fraction of INV loads in a runahead period is greater than a threshold T, exit runahead mode.
- Sampling (phase-based) prediction:
  - If the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities.
- Compile-time profile-based prediction:
  - If runahead periods caused by a load were not useful in the profiling run, mark it as a non-runahead load.
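
A sketch of the first predictor in C: one 2-bit saturating counter per static load, indexed by load PC (the table size and indexing are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 256                  /* illustrative table size */

    static uint8_t useful_ctr[ENTRIES];  /* 2-bit counters: 0-1 = useless, 2-3 = useful */

    static unsigned idx(uint64_t load_pc) { return (load_pc >> 2) % ENTRIES; }

    bool predict_period_useful(uint64_t load_pc)
    {
        return useful_ctr[idx(load_pc)] >= 2;
    }

    /* Trained at runahead exit: the period is estimated useful if it
     * generated an L2 miss the instruction window could not have captured. */
    void train_usefulness(uint64_t load_pc, bool period_was_useful)
    {
        uint8_t *c = &useful_ctr[idx(load_pc)];
        if (period_was_useful) { if (*c < 3) (*c)++; }
        else                   { if (*c > 0) (*c)--; }
    }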

40 Performance Optimizations for Efficiency
- Both efficiency AND performance can be increased by increasing the usefulness of runahead periods.
- Three major optimizations:
  - Turning off the Floating Point Unit (FPU) in runahead mode: FP instructions do not contribute to the generation of load addresses.
  - Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode: improves the positive interaction between runahead and the HWP.
  - Early wake-up of INV instructions: enables the faster removal of INV instructions.

41 Overall Impact on Executed Instructions
[Chart: extra executed instructions reduced from 26.5% (baseline runahead) to 6.2% (efficient runahead)]

42 Overall Impact on IPC
[Chart: IPC improvement is 22.6% for baseline runahead and 22.1% for efficient runahead]

43 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

44 The Problem: Dependent Cache Misses
- Runahead execution cannot parallelize dependent misses:
  - wasted opportunity to improve performance
  - wasted energy (useless pre-execution)
- Runahead performance would improve by 25% if this limitation were ideally overcome.
[Timeline: Load 2 is dependent on Load 1, so runahead cannot compute Load 2's address; Load 2 is INV and Miss 2 is not prefetched]

45 The Goal of AVD Prediction
- Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism.
- How: predict the values of L2-miss address (pointer) loads.
- An address load loads an address into its destination register, which is later used to calculate the address of another load; contrast this with a data load, whose result is not used for address calculation (see the example below).
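
For example, in the following hypothetical list-walking loop, the load of p->next is an address load and the load of p->data is a data load:

    struct node { int data; struct node *next; };

    int walk(struct node *p) {
        int sum = 0;
        while (p) {
            sum += p->data;   /* data load: the value is used as data */
            p = p->next;      /* address load: the loaded value becomes a later load's address */
        }
        return sum;
    }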

46 Parallelizing Dependent Cache Misses
[Timelines:
 Without AVD: runahead cannot compute Load 2's address; Load 2 is INV and Miss 2 is not prefetched.
 With AVD: Load 1's value is predicted, so runahead can compute Load 2's address and issue Miss 2 in parallel with Miss 1: Saved Cycles and Saved Speculative Instructions.]

47 AVD Prediction [MICRO'05]
- The address-value delta (AVD) of a load instruction is defined as:

    AVD = Effective Address of Load - Data Value of Load

- For some address loads, the AVD is stable.
- An AVD predictor keeps track of the AVDs of address loads.
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted (see the sketch below).
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:

    Predicted Value = Effective Address - Predicted AVD
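
A minimal sketch of the prediction side in C (the table organization, sizes, and confidence threshold are illustrative assumptions; the training side is sketched with the implementable predictor in the backup slides):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t tag;        /* PC of the load */
        int64_t  avd;        /* last AVD seen for the load */
        uint8_t  conf;       /* confidence counter for the recorded AVD */
        bool     valid;
    } avd_entry_t;

    #define AVD_ENTRIES 16
    #define CONF_THRESH 2

    static avd_entry_t avd_table[AVD_ENTRIES];

    /* Consulted when a load misses in L2 during runahead mode. If a
     * confident, stable AVD is recorded for this static load:
     *   Predicted Value = Effective Address - Predicted AVD          */
    bool avd_predict(uint64_t load_pc, uint64_t eff_addr, uint64_t *pred_value)
    {
        avd_entry_t *e = &avd_table[load_pc % AVD_ENTRIES];
        if (!e->valid || e->tag != load_pc || e->conf < CONF_THRESH)
            return false;                    /* no confident AVD: the load stays INV */
        *pred_value = eff_addr - (uint64_t)e->avd;
        return true;                         /* dependents can now compute their addresses */
    }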

48 Why Do Stable AVDs Occur?
- Regularity in the way data structures are allocated in memory AND traversed.
- Two types of loads can have stable AVDs:
  - Traversal address loads: produce addresses consumed by address loads
  - Leaf address loads: produce addresses consumed by data loads

49 Traversal Address Loads
Regularly-allocated linked list, with nodes at A, A+k, A+2k, A+3k, ...
A traversal address load loads the pointer to the next node: node = node->next
(A small C illustration follows.)

    Effective Addr | Data Value | AVD
    A              | A+k        | -k
    A+k            | A+2k       | -k
    A+2k           | A+3k       | -k

Striding data value, stable AVD (AVD = Effective Addr - Data Value).
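
A small self-contained illustration in C; the static pool stands in for the regular, fixed-size allocation the slide assumes:

    #include <stdio.h>

    struct node { struct node *next; char pad[56]; };   /* k = 64 bytes per node */

    int main(void)
    {
        static struct node pool[4];            /* contiguous, regular allocation */
        for (int i = 0; i < 3; i++)
            pool[i].next = &pool[i + 1];
        pool[3].next = 0;

        /* For each traversal load "n->next": AVD = &n->next - n->next = -k. */
        for (struct node *n = pool; n->next; n = n->next)
            printf("AVD = %td\n", (char *)&n->next - (char *)n->next);
        return 0;   /* prints "AVD = -64" three times */
    }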

50 Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words), and each string and its node are allocated consecutively (string at A, node at A+k; string at B, node at B+k; and so on). The dictionary is looked up for an input word; a leaf address load loads the pointer to the string of each node (a C illustration follows):

    lookup(node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        // ...
    }

    Effective Addr | Data Value | AVD
    A+k            | A          | k
    C+k            | C          | k
    F+k            | F          | k

No stride in the data values, but a stable AVD (AVD = Effective Addr - Data Value).
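
A matching self-contained illustration in C for leaf address loads; the entry layout (string at A, node at A + k) mirrors the consecutive allocation the slide assumes:

    #include <stdio.h>
    #include <string.h>

    struct dict_node { const char *string; struct dict_node *left, *right; };

    /* Each word's string and its node are allocated consecutively:
     * string at A, node at A + k, with k = 32 here. */
    struct entry { char str[32]; struct dict_node node; };

    static struct entry arena[4];

    int main(void)
    {
        const char *words[4] = { "alpha", "beta", "gamma", "delta" };
        for (int i = 0; i < 4; i++) {
            strncpy(arena[i].str, words[i], sizeof arena[i].str - 1);
            arena[i].node.string = arena[i].str;  /* pointer fetched by the leaf address load */
        }
        /* For the leaf address load "node->string": the string pointers do not
         * stride, but AVD = (address of the string field) - (loaded pointer) = k
         * for every node. */
        for (int i = 0; i < 4; i++)
            printf("AVD = %td\n",
                   (char *)&arena[i].node.string - (char *)arena[i].node.string);
        return 0;   /* prints "AVD = 32" four times */
    }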

51 Performance of AVD Prediction
[Chart: performance improvement of AVD prediction over runahead; data labels: 14.3%, 15.5%]

52 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

53 Summary of Contributions
- Runahead execution provides the latency tolerance benefit of a large instruction window by parallelizing independent cache misses:
  - With a very modest increase in hardware cost and complexity
  - 128-entry window + runahead ~ 384-entry window
- Efficient runahead execution techniques improve the energy-efficiency of baseline runahead execution:
  - Only 6% extra instructions executed for 22% performance benefit
- Address-Value Delta (AVD) prediction enables the parallelization of dependent cache misses:
  - By exploiting regular memory allocation patterns
  - A 16-entry (102-byte) AVD predictor improves the performance of runahead execution by 14% on pointer-intensive applications

54 Talk Outline: Motivation: The Memory Latency Problem; Runahead Execution; Evaluation; Limitations of the Baseline Runahead Mechanism (Efficient Runahead Execution; Address-Value Delta (AVD) Prediction); Summary of Contributions; Future Work

55 Future Work in Runahead Execution
- Compilation/programming techniques for runahead processors
- Keeping runahead execution on the correct program path
- Parallelizing dependent cache misses in linked data structure traversals
- Runahead co-processors/accelerators
- Evaluation of runahead execution on multithreaded and multiprocessor systems

56 Research Summary
- Runahead execution
  - Original runahead proposal [HPCA'03, IEEE Micro Top Picks'03]
  - Efficient runahead execution [ISCA'05, IEEE Micro Top Picks'06]
  - AVD prediction [MICRO'05]
  - Result reuse in runahead execution [Comp. Arch. Letters'05]
- High-performance memory system designs
  - Pollution-aware caching [IJPP'05]
  - Parallelism-aware caching [ISCA'06]
  - Performance analysis of speculative memory references [IEEE Trans. on Computers'05]
  - Latency/bandwidth tradeoffs in memory controllers [Patent'04]
- Branch instruction handling techniques through compiler-microarchitecture cooperation
  - Wish branches [MICRO'05 & IEEE Micro Top Picks'06]
  - Wrong path events [MICRO'04]
  - Compiler-assisted dynamic predication [in progress]
- Efficient compile-time profiling techniques for detecting input-dependent program behavior
  - 2D profiling [CGO'06]
- Fault-tolerant microarchitecture design
  - Microarchitecture-based introspection [DSN'05]

57 Thank you.

58 Backup Slides

59 Thesis Statement
Efficient runahead execution is a cost- and complexity-effective microarchitectural technique that can tolerate long main memory latencies without requiring:
- unreasonably large, slow, complex, and power-hungry hardware structures
- significant increases in processor complexity and power consumption.

60 Impact of L2 Cache Misses
[Chart: L2 misses. 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

61 Entry into Runahead Mode
When an L2-miss load instruction is the oldest in the instruction window:
- The processor checkpoints the architectural register state.
- The processor records the address of the L2-miss load.
- The L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window.

62 Exit from Runahead Mode
When the runahead-causing L2 miss is serviced:
- All instructions in the machine are flushed; INV bits are reset; the runahead cache is flushed.
- The processor restores the architectural state as it was before the runahead-causing instruction was fetched.
  - Architecturally, NOTHING happened.
  - But hopefully useful prefetch requests were generated (caches warmed up).
- Mode is switched to normal mode: instructions executed in runahead mode are re-fetched and re-executed in normal mode.

63 When to Enter Runahead Mode
- Why not at the time an L2 miss happens?
  - The instruction is not guaranteed to be a valid correct-path instruction until it becomes the oldest.
  - Limited potential: an L2-miss instruction becomes the oldest instruction 10 cycles later on average.
  - The checkpoint must be taken at the oldest instruction (throw away all instructions older than the L2 miss?).
- Why not when the window becomes full?
  - Delays the removal of instructions from the window, which results in slower progress in runahead mode.
  - No significant gain: the window becomes full 98% of the time after we see an L2 miss.
- Why not on L1 cache misses?
  - >50% of L1 cache misses hit in the L2 cache
  - Many short runahead periods

64 When to Exit Runahead Mode
- Why not exit early to fill the pipeline and the window?
  - How do we determine "how early"? Memory does not have a fixed latency.
  - Exiting early reduces the progress made in runahead mode.
  - Exiting early using oracle information has lower performance than exiting when the miss returns.
- Why not exit late to make further progress in runahead?
  - Not necessarily beneficial to stay in runahead longer; on average, this policy hurts performance.
  - But more intelligent runahead period extension schemes improve performance.

65 Modifications to Pipeline
[Pipeline block diagram: trace cache fetch unit, instruction decoder, uop queue, frontend RAT, renamer, reorder buffer, INT/FP/MEM schedulers, register file, INT units, address generation, L1 data cache, store buffer (with INV bits), FP RAT and FP/INT/MEM queues, prefetcher, L2 cache, L2 access queue, front side bus access queue. Runahead additions are highlighted: checkpointed state, runahead cache, INV bits. <0.05% area overhead.]

66 Effect of a Better Front-end

67 Why is Runahead Better with a Better Front-end?
- A better front-end provides more correct-path instructions (hence more, and more accurate, L2 misses) in runahead periods.
- Average number of instructions during a runahead period: 711
  - before the first mispredicted INV branch: 431
  - with perfect TC/BP, this average increases to 909
- Average number of L2 misses during a runahead period: 2.6
  - before the first mispredicted INV branch: 2.38
  - with perfect TC/BP, this average increases to 3.18
- If all INV branches were resolved correctly during runahead, the performance gain would be 25% instead of 22%.

68 Importance of Store-Load Communication

69 In-order vs. Out-of-order

70 Sensitivity to L2 Cache Size

71 Instruction vs. Data Prefetching Benefits

72 Why Does Runahead Work?
- ~70% of instructions are VALID in runahead mode; these values show periodic behavior.
- Runahead prefetching reduces L1, L2, and TC misses during normal mode.
- Data miss reduction:
  - 18% decrease in normal-mode L1 misses (base: 13.7 per 1K uops)
  - 33% of normal-mode L2 data misses are fully or partially covered by runahead prefetching (base L2 data miss rate: 4.3 per 1K uops)
  - 15% of normal-mode L2 data misses are fully covered (these misses are never seen in normal mode)
- Instruction miss reduction:
  - 3% decrease in normal-mode TC misses
  - 14% decrease in normal-mode L2 fetch misses (some of these are only partially covered by runahead requests)
- Overall increase in data misses: L2 misses increase by 5% (due to contention and useless prefetches)

73 Correlation Between L2 Miss Reduction and Speedup

74 Runahead on a More Aggressive Processor

75 Runahead on Future Model

76 Future Model with Perfect Front-end

77 Baseline Alpha Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
- Detailed memory model

78 Runahead vs. Large Windows (Alpha)

79 In-order vs. Out-of-order Execution (Alpha)

80 Comparison to 1024-entry Window

81 Runahead vs. HWP (Alpha)

82 Effect of Memory Latency (Alpha)

83 1K & 2K Memory Latency (Alpha)

84 Efficient Runahead

85 Methods for Efficient Runahead Execution
- Eliminating inefficient runahead periods
- Increasing the usefulness of runahead periods
- Reuse of runahead results
- Value prediction of L2-miss load instructions
- Optimizing the exit policy from runahead mode

86 Impact on Efficiency
[Chart; data labels: 6.7%, 26.5%, 22.6%, 20.1%, 15.3%, 11.8%, 14.9%]

87 Extra Instructions with Efficient Runahead

88 Performance Increase with Efficient Runahead

89 Cache Sizes (Executed Instructions)

90 Cache Sizes (IPC Delta)

91 Turning Off the FPU in Runahead Mode
- FP instructions do not contribute to the generation of load addresses, so FP instructions can be dropped after decode:
  - Spares processor resources for more useful instructions
  - Increases performance by enabling faster progress
  - Enables dynamic/static energy savings
  - Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare)
- Overall: increases IPC and reduces executed instructions

92 HWP Update Policy in Runahead Mode
- Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher accuracy is low; runahead requests are more accurate than prefetcher requests.
- Three policies:
  - Do not update the prefetcher state
  - Update the prefetcher state just like in normal mode
  - Only train existing streams, but do not create new streams
- Runahead mode improves the timeliness of the prefetcher in many benchmarks.
- Only training the existing streams is the best policy.

93 Early INV Wake-up
- Keep track of the INV status of an instruction in the scheduler; the scheduler wakes up the instruction if any source is INV.
- (+) Enables faster progress during runahead mode by removing useless INV instructions faster.
- (-) Increases the number of executed instructions.
- (-) Increases the complexity of the scheduling logic.
- Not worth implementing due to the small IPC gain.

94 Short Runahead Periods

95 RCST Counter

96 Sampling for Useless Periods

97 Efficiency Techniques: Extra Instructions

98 Efficiency Techniques: IPC Increase

99 Effect of Memory Latency on Efficient Runahead

100 Usefulness of Runahead Periods

101 L2 Misses Per Useful Runahead Period

102 Other Considerations for Efficient Runahead

103 Performance Potential of Result Reuse
- Ideal reuse study: determines the upper bound on the performance gain possible by reusing the results of runahead instructions.
- Valid pseudo-retired runahead instructions magically update architectural state during normal mode; they do not consume any resources (ROB or buffer entries).
- Only invalid pseudo-retired runahead instructions are re-executed; they are fed into the renamer (the fetch/decode pipeline is skipped).

104 Ideal Reuse of All Valid Runahead Results

105 Alpha Reuse IPCs

106 Number of Reused Instructions

107 Why Does Reuse Not Work?

108 IPC Increase with Reuse

109 Extra Instructions with Reuse

110 Runahead Period Statistics

111 Memory Latency and BP Accuracy in Reuse

112 Runahead + VP: Extra Instructions

113 Runahead + VP: IPC Increase

114 Late Exit from Runahead (Extra Inst.)

115 Late Exit from Runahead (IPC Increase)

116 AVD Prediction and Optimizations

117 Identifying Address Loads in Hardware
- Observation: if the AVD is too large, the value being loaded is likely NOT an address.
- Only predict loads whose observed AVDs have satisfied: -MaxAVD < AVD < +MaxAVD
- This identification mechanism eliminates almost all data loads from consideration, which enables the AVD predictor to be small.

118 An Implementable AVD Predictor
- Set-associative prediction table; a prediction table entry consists of:
  - Tag (PC of the load)
  - Last AVD seen for the load
  - Confidence counter for the recorded AVD
- Updated when an address load is retired in normal mode (see the update sketch below)
- Accessed when a load misses in the L2 cache in runahead mode
- Recovery-free: no need to recover the state of the processor or the predictor on a misprediction, because runahead mode is purely speculative.
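
A hedged sketch of the update path in C, mirroring the entry layout from the earlier prediction sketch (MAX_AVD, the confidence saturation value, and the direct-mapped indexing are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t tag;
        int64_t  avd;
        uint8_t  conf;
        bool     valid;
    } avd_entry_t;

    #define AVD_ENTRIES 16
    #define MAX_AVD     (64 * 1024)    /* illustrative MaxAVD bound */
    #define CONF_MAX    3

    static avd_entry_t avd_table[AVD_ENTRIES];

    /* Called when an address-load candidate retires in normal mode. Loads
     * whose AVD falls outside (-MaxAVD, +MaxAVD) are treated as data loads
     * and ignored; this filter is what keeps the predictor small. */
    void avd_update(uint64_t load_pc, uint64_t eff_addr, uint64_t data_value)
    {
        int64_t avd = (int64_t)(eff_addr - data_value);
        if (avd <= -MAX_AVD || avd >= MAX_AVD)
            return;                          /* likely a data load: do not track */

        avd_entry_t *e = &avd_table[load_pc % AVD_ENTRIES];
        if (!e->valid || e->tag != load_pc) {
            e->valid = true;                 /* allocate a new entry */
            e->tag   = load_pc;
            e->avd   = avd;
            e->conf  = 0;
            return;
        }
        if (e->avd == avd) {
            if (e->conf < CONF_MAX) e->conf++;   /* stable AVD: gain confidence */
        } else {
            e->avd  = avd;                   /* AVD changed: retrain */
            e->conf = 0;
        }
    }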

119 AVD Update Logic

120 AVD Prediction Logic

121 Properties of Traversal-based AVDs
- Stable AVDs can be captured with a stride value predictor.
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting): after sorting, the distance between nodes (A, A+k, A+2k, ...) is no longer constant.
- Stability of AVDs depends on the behavior of the memory allocator: allocation of contiguous, fixed-size chunks is useful.

122 Properties of Leaf-based AVDs
- Stable AVDs cannot be captured with a stride value predictor.
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting): the distance between a node and its string stays constant even after the nodes are reordered.
- Stability of AVDs depends on the behavior of the memory allocator.

123 AVD Prediction vs. Stride Value Prediction
- Performance:
  - Both can capture traversal address loads with stable AVDs (e.g., treeadd).
  - Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser).
  - The AVD predictor cannot capture data loads with striding data values; predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches (e.g., parser).
- Complexity:
  - The AVD predictor requires far fewer entries (only address loads).
  - AVD prediction logic is simpler (no stride maintenance).

124 AVD vs. Stride VP Performance
[Chart; data labels: 12.1%, 2.5%, 13.4% (16 entries); 12.6%, 4.5%, 16% (4096 entries)]

125 AVD vs. Stride VP Performance
[Chart; data labels: 5.1%, 2.7%, 6.5% (16 entries); 5.5%, 4.7%, 8.6% (4096 entries)]

126 AVD vs. Stream Prefetching Performance
[Chart; data labels: 12.1%, 16.5%, 22.5%; 12.1%, 13.4%, 20.1%]

127 AVD vs. Stream Prefetching (L2 bandwidth)
[Chart; data labels: 5.1%, 32.8%, 35.3%; 5.1%, 24.5%, 26%]

128 AVD vs. Stream Prefetching (Mem bandwidth)
[Chart; data labels: 3.2%, 14.9%, 19.5%; 3.2%, 12.1%, 16.4%]

129 Source Code Optimization for AVD Prediction

130 Effect of Code Optimization (parser)
[Chart; data labels: 6.4%, 10.5%]

131 Accuracy/Coverage with Code Optimization

132 AVD and Efficiency Techniques

133 AVD and Efficiency Techniques

134 Effect of AVD on Runahead Periods

135 AVD Example from treeadd

136 AVD Example from parser

137 AVD Example from health

138 Motivation for NULL-value Optimization

139 Effect of NULL-value Optimization

140 Related Work

141 Methodology

142 Future Research Directions

