ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan.

Slides:



Advertisements
Similar presentations
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,
Advertisements

Increasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing Salman Khan 1, Nikolas Ioannou 2, Polychronis Xekalakis 3 and Marcelo.
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
OSDI ’10 Research Visions 3 October Epoch parallelism: One execution is not enough Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.
Instruction-Level Parallelism (ILP)
Scalable Load and Store Processing in Latency Tolerant Processors Amit Gandhi 1,2 Haitham Akkary 1 Ravi Rajwar 1 Srikanth T. Srinivasan 1 Konrad Lai 1.
CS 7810 Lecture 8 Memory Dependence Prediction using Store Sets G.Z. Chrysos and J.S. Emer Proceedings of ISCA
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
October 2003 What Does the Future Hold for Parallel Languages A Computer Architect’s Perspective Josep Torrellas University of Illinois
Efficient and Flexible Architectural Support for Dynamic Monitoring YUANYUAN ZHOU, PIN ZHOU, FENG QIN, WEI LIU, & JOSEP TORRELLAS UIUC.
UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
Multiscalar processors
Trace Processors Presented by Nitin Kumar Eric Rotenberg Quinn Jacobson, Yanos Sazeides, Jim Smith Computer Science Department University of Wisconsin-Madison.
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
EECS 470 Memory Scheduling Lecture 11 Coverage: Chapter 3.
1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)
1/25 HIPEAC 2008 TurboROB TurboROB A Low Cost Checkpoint/Restore Accelerator Patrick Akl and Andreas Moshovos AENAO Research Group Department of Electrical.
Thread-Level Speculation Karan Singh CS
Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar Presented by: Ashay Rane Published in: SIGARCH Computer Architecture.
CMPE 421 Parallel Computer Architecture
Idempotent Processor Architecture Marc de Kruijf Karthikeyan Sankaralingam Vertical Research Group UW-Madison MICRO 2011, Porto Alegre.
1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.
MULTIPLEX: UNIFYING CONVENTIONAL AND SPECULATIVE THREAD-LEVEL PARALLELISM ON A CHIP MULTIPROCESSOR Presented by: Ashok Venkatesan Chong-Liang Ooi, Seon.
CBP 2005Comp 3070 Computer Architecture1 Last Time … All instructions the same length We learned to program MIPS And a bit about Intel’s x86 Instructions.
OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
CS203 – Advanced Computer Architecture ILP and Speculation.
Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,
CS 352H: Computer Systems Architecture
Data Prefetching Smruti R. Sarangi.
Multiscalar Processors
Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1
Lecture: Out-of-order Processors
Smruti R. Sarangi Computer Science and Engineering, IIT Delhi
Out-of-Order Commit Processors
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Address-Value Delta (AVD) Prediction
Smruti R. Sarangi Computer Science and Engineering, IIT Delhi
Lecture 8: Dynamic ILP Topics: out-of-order processors
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Out-of-Order Commit Processors
Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,
* From AMD 1996 Publication #18522 Revision E
Data Prefetching Smruti R. Sarangi.
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Lecture 10: ILP Innovations
Lecture 9: ILP Innovations
Lecture 9: Dynamic ILP Topics: out-of-order processors
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
Transparent Control Independence (TCI)
Handling Stores and Loads
Presentation transcript:

ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan Zhou University of Illinois at Urbana-Champaign Presented at CARG, March 8 th, 2006 By Jason Zebchuk

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 2 Data Value Speculation  Predict the value and proceed speculatively  When correct value is known, if mispediction, squash and re-execute  Initial proposals L1 data value Store/Load Data dependences  Aggressive novel proposals  Retire speculative instructions Values of L2 misses Thread independence in Thread-Level Speculation (TLS) Speculative synchronization

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 3 Long-latency Speculation  Misprediction recovery is very wasteful  Most discarded instructions are still useful Checkpoint Prediction Resolution 210 Instructions 6.6 Instructions Speculatively retired instructions

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 4 Contributions (I)  ReSlice: Architecture to buffer forward slice and re- execute it on misprediction Checkpoint Prediction Resolution Buffered Forward Slice If succeeds If fails

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 5 Contributions (II)  A sufficient condition for correct slice re-execution and merge Guarantee to correctly repair the program state  Application to Thread-Level Speculation (TLS) on SpecInt Prediction: task live-ins Removed 61% task squashes Speedup: up to 33% over TLS (avg 12%) E×D 2 reduction: avg 20%

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 6 Outline  Motivation and Contributions  Idea of ReSlice  Architecture Design  Experimental Results  Conclusions

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 7 Idea of ReSlice  Initial execution of the task Predict value of “risky” load and continue e.g L2 missing loads, exposed loads in TLS … Buffer in HW the forward slice of the load  If a misprediction happens Re-execute the slice with the correct value If succeed: merge the register and memory state and continue If fail: roll back and re-execute (conventional recovery)

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 8 Why is this Challenging? … #1: LD R1 mem_A … #2: ST R2 (R1) … #3: LD R5 mem[0x20] … “Risky” Load R1  0x10 ST R2 mem[0x10] R1  0x20 ST R2 mem[0x20] LD R5 mem[0x20] Problem: Instruction #3 is not buffered! Buffered SliceCorrect Execution Initial Execution Initial execution is speculative  Buffered slices may not be correct

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 9 Solution: Run-time Address Check … #1: LD R1 mem_A … #2: ST R2 (R1) … #3: LD R5 mem[0x20] … “Inhibiting” Store R1  0x10 ST R2 mem[0x10] R1  0x20 ST R2 mem[0x20] Buffered SliceSlice Re-ExecutionInitial Execution Different store addresses; And mem[0x20] was loaded in initial execution “Risky” Load

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 10 Problems Buffered Slice Missing Insts Extra Insts Data Corruption Slice Re-execution Initial Speculative Execution (Example just shown) “Dangling” Load“Inhibiting” Load

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 11 A Sufficient Condition  Guarantee correct Slice re-execution State merge with program  Two sub-conditions Branch takes same direction in both initial execution and slice re-execution No presence of any previous three problems  Easy for HW to check at run-time  Details please see our paper (Briefly: compare addresses accessed in old and new run, also look for references to same address in first run)

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 12 How does ReSlice Work? Timeline Checkpoint Prediction Resolution Slice Collection Compare the correct and predicted value Correct prediction Slice Re-Execution Misprediction Incorrect State Merging Correct

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 13 Architecture Design CPU Live Ins Buffered Instr. Re-Execution Unit (REU) Slice Descriptor Slice Descriptor Slice Descriptors Slice Info Slice Buffer Cache Register State Memory State

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 14 Other Architectural Requirements  Tag physical registers with “SliceTag”  Tag Cache for storing addresses accessed by slice instructions  Undo Log records first store to each memory location in each Slice  High-coverage Predictor to predict which instructions to use as seeds. Uses Dependence and Value Predictor used by TLS  Uses TLS information to identify Speculatively Read/Written locations in cache

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 15 Step 1: Slice Collection Fill up the Slice Buffer when a prediction is made Track both register and memory dependences Save live-in operands and slice instructions Slices are buffered when instructions are retired CPU Live Ins Buffered Instr. Re-Execution Unit (REU) Slice Descriptor Slice Info Slice Buffer Cache Register State Memory State

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 16 An Example Slice Desc. Live InsBuffered Insts ST R3, mem[R4] R4 = 0x10 ADD R3, R1, R2 R4 = 0x10 ST R3, mem[R4]

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 17 Step 2: Slice Re-Execution REU takes over after a misprediction In-order execution Sufficient condition is checked during slice re-execution If success, merge the register and memory state; otherwise, squash the task and restart CPU Live Ins Buffered Instr. Re-Execution Unit (REU) Slice Descriptor Slice Info Slice Buffer Cache Register State Memory State

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 18 Step 3: State Merging Copy live registers back to CPU register file Merge memory state (details in paper) CPU Live Ins Buffered Instr. Re-Execution Unit (REU) Slice Descriptor Slice Info Slice Buffer Cache Register State Memory State

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 19 Multiple Overlapping Slices  One slice may change live-ins of the other slice  Detect overlapping slices during slice collection  Combine the instructions of the overlapping slices and re- execute them in program order misprediction

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 20 Outline  Motivation and Contributions  Idea of ReSlice  Architecture Design  Experimental Results  Conclusions

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 21 Methodology  Simulated CMP architecture 5 GHz Private 16k L1 per core; 1 MB Shared L2  SpecInt 2000 applications TLS+ReSlice Four 3-issue cores with TLS and ReSlice Baseline: TLS Four 3-issue cores with TLS Serial One 3-issue core

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 22 Slice Characterization (Averages) 10.4 instructions 4.5 Reg live-ins 1.0 Mem live-ins 2.2 Reg live-outs 1.9 Mem live-outs 1.6 slices per task 15% tasks with overlapping slices

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 23 Number of Task Squashes 61% squashes are avoided because of ReSlice

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 24 Performance  Speedup TLS+ReSlice Over TLS: up to 33% (avg 12%) Over Serial: avg 45%

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 25 Energy × Delay 2 E×D 2 reduction: 20% over TLS

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing Wei Liu, University of Illinois 26 Conclusions  Generic architecture for forward slice re-execution  Sufficient condition for correct re-execution and merge  Improve TLS on SpecInt Speedups: up to 33% over TLS (avg 12%) E×D 2 reduction: avg 20% over TLS  Recovering wasted work is a promising approach Boost performance Energy efficiency

ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan Zhou University of Illinois at Urbana-Champaign