UPC Dynamic Removal of Redundant Computations Carlos Molina, Antonio González and Jordi Tubella Universitat Politècnica de Catalunya - Barcelona

Presentation transcript:

UPC Dynamic Removal of Redundant Computations Carlos Molina, Antonio González and Jordi Tubella Universitat Politècnica de Catalunya - Barcelona ICS'99, Rhodes (Greece) - June 20-25, 1999

UPC Motivation
for (i=0; i<N; i++) A[i] = B[i]+C[i]; ... R = S / T; ... X = S / U; ...
Quasi-invariant
Quasi-common subexpression

UPC Outline
- Instruction Reuse
- Related Work
- Redundant Computation Buffer
- Performance Results
- Conclusions

UPC Instruction Reuse [Figure: pipeline with Fetch, Decode & Rename, OOO Execution and Commit stages, plus an indexed Reuse Mechanism]

UPC Related Work
Instruction Reuse
- Value Cache for the Tree Machine (Harbison 82)
- Result Cache (Richardson 92, Oberman et al. 95)
- Reuse Buffer (Sodani and Sohi 97)
- Physical Register Reuse (Jourdan et al. 98)
Trace Reuse
- Basic blocks (Huang and Lilja 99)
- General traces (González et al. 99)

UPC Related Work
Result Cache (Richardson 92, Oberman & Flynn 95)
- Special purpose (long latency operations)
- Indexed by operand values
- No reuse chaining
- Can reuse dynamic instances of other static instructions
Reuse Buffer (Sodani & Sohi 97)
- General purpose
- Indexed by PC
- Reuse chaining
- Only reuses dynamic instances of the same static instruction
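The difference between the two indexing schemes above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the hardware from the cited papers: the tables are unbounded dictionaries, only integer division is modelled, reuse chaining is left out, and the names `ResultCache`, `ReuseBuffer`, `count_reuse` and `TRACE` are hypothetical.

```python
class ResultCache:
    """Indexed by (opcode, operand values); any static instruction can hit."""

    def __init__(self):
        self.table = {}

    def lookup(self, pc, opcode, a, b):
        # The PC plays no role in the lookup.
        return self.table.get((opcode, a, b))

    def insert(self, pc, opcode, a, b, result):
        self.table[(opcode, a, b)] = result


class ReuseBuffer:
    """Indexed by PC; the reuse test compares the stored operand values."""

    def __init__(self):
        self.table = {}

    def lookup(self, pc, opcode, a, b):
        entry = self.table.get(pc)
        if entry is not None and entry[:3] == (opcode, a, b):
            return entry[3]
        return None

    def insert(self, pc, opcode, a, b, result):
        self.table[pc] = (opcode, a, b, result)


def count_reuse(buffer, trace):
    """Count dynamic instructions whose result came from the buffer."""
    hits = 0
    for pc, opcode, a, b in trace:
        if buffer.lookup(pc, opcode, a, b) is not None:
            hits += 1
        else:
            # Assumption: every instruction in the trace is an integer divide.
            buffer.insert(pc, opcode, a, b, a // b)
    return hits


# Two static divides (PCs 10 and 20) that see the same operands twice.
TRACE = [(10, "div", 8, 2), (20, "div", 8, 2),
         (10, "div", 8, 2), (20, "div", 8, 2)]
```

On this trace the operand-indexed table already hits on the first instance at PC 20, because PC 10 computed the same values, while the PC-indexed table has to see each static instruction once before it can reuse anything.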

UPC Redundant Computation Buffer [Figure: Vtable (pointer); Atable (opcode, result/address, opnd1, opnd2, pointer); Mtable (address tag, result); a Reuse Test producing the Reused Value or the Reused Memory Value]

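UPC RCB (Working Example)
while (cond) { r = s / t ; x = s / u ; }
I1: 8 / 2 = 4
[Figure: Vtable entry 10; Atable entry for div with operands 8 and 2, result 4 and a nil pointer]

UPC RCB (Working Example)
while (cond) { r = s / t ; x = s / u ; }
I2: 8 / 2 = 4
[Figure: Vtable entry 10; lookup for instruction 20 (div, 8, 2) against the Atable entry for div with operands 8 and 2, result 4 and a nil pointer]

UPC RCB (Working Example)
while (cond) { r = s / t ; x = s / u ; }
I2: 8 / 2 = 4
[Figure: Vtable entries 10 and 20; Atable entry for div with operands 8 and 2, result 4 and a nil pointer]

UPC RCB (Working Example)
while (cond) { r = s / t ; x = s / u ; }
I1: 9 / 3 = 3
I2: 9 / 3 = 3
[Figure: Vtable entries 10 and 20; Atable entries for div with operands 8 and 2 (result 4, nil pointer) and for div with operands 9 and 3 (result 3, nil pointer)]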
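The working example can be replayed with a small Python model. This is a sketch under stated assumptions rather than the authors' design: it assumes the Vtable is indexed by PC and holds a pointer to an Atable entry, that a failed reuse test falls back to an Atable lookup by operand values, that tables are unbounded, and that only integer division occurs. The Atable pointer field and the Mtable are not modelled, and the class name `RCB` and method `access` are hypothetical.

```python
class RCB:
    """Toy Redundant Computation Buffer: Vtable (by PC) pointing into the Atable."""

    def __init__(self):
        self.atable = {}  # (opcode, opnd1, opnd2) -> result
        self.vtable = {}  # pc -> Atable key, standing in for the pointer

    def access(self, pc, opcode, a, b):
        key = (opcode, a, b)
        # First Atable lookup: follow the Vtable pointer for this PC,
        # then run the reuse test against the current operand values.
        if self.vtable.get(pc) == key and key in self.atable:
            return self.atable[key], "reused via Vtable"
        # Second Atable lookup: index by operand values, so a result
        # produced by a different static instruction can be reused.
        if key in self.atable:
            self.vtable[pc] = key
            return self.atable[key], "reused via operands"
        # Miss in both lookups: execute and record the result.
        result = a // b  # assumption: opcode is always an integer divide
        self.atable[key] = result
        self.vtable[pc] = key
        return result, "executed"


rcb = RCB()
first_iteration = [rcb.access(10, "div", 8, 2), rcb.access(20, "div", 8, 2)]
second_iteration = [rcb.access(10, "div", 9, 3), rcb.access(20, "div", 9, 3)]
third_iteration = [rcb.access(10, "div", 9, 3), rcb.access(20, "div", 9, 3)]
```

In the first two iterations I2 reuses the result that I1 has just produced (the quasi-common subexpression case), and in the third iteration both instructions hit through their own Vtable entries because the operands did not change (the quasi-invariant case).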

UPC Enhancements to Other Schemes
Enhanced Result Cache [Figure: Atable (opcode, result/address, opnd1, opnd2) indexed by operands; Mtable (address tag, result)]
Enhanced Reuse Buffer [Figure: Atable (opcode, result/address, opnd1, opnd2) indexed by PC; Mtable (address tag, result)]

UPC Timing Considerations
Pipeline stages shown: fetch, decode & rename, opnd read & dispatch, issue, execute, write back, commit
Latency of the Reuse Buffer: Atable lookup, reuse test
Latency of the RCB: 1st Atable lookup, reuse test, 2nd Atable lookup
Latency of the Result Cache: Atable lookup, reuse test

UPC Experimental Framework
Simulator
- Alpha version of the SimpleScalar Toolset
Benchmarks
- Spec95
Maximum Optimization Level
- DEC C & F77 compilers with -non_shared -O5
Statistics
- Collected for 125 million instructions
- Skipping initializations

UPC Basic Reuse Statistics
We evaluate different schemes
- Enhanced Result Cache (ERC)
- Enhanced Reuse Buffer (ERB)
- Redundant Computation Buffer (RCB)
We find the best configuration for each scheme
- Number of entries
- History depth
Best configurations will be evaluated
- Percentage of reuse
- Speedup

UPC Quasi-Common Subexpressions 32 KB

UPC Study of Reuse (ERB) [Chart; x-axis: size in Kbytes]

UPC Study of Reuse (RCB) [Chart; x-axis: size in Kbytes]

UPC Study of Reuse (Comparative) [Chart; x-axis: size in Kbytes]

UPC Performance Evaluation
Two different capacities are evaluated
- 32 KB
- 200 KB
The best configuration has been chosen for each reuse scheme
We present a performance evaluation for a superscalar processor
- Speedup
- Percentage of reuse

UPC Base Microarchitecture

UPC Speedup (32 KB)

UPC Speedup (200 KB)

UPC Reuse (32 KB) Ops ready

UPC Reuse (200 KB) Ops ready

UPC Reuse by Instruction Category
- Load Value
- Memory Address
- Arithmetic
- Cond Branch

UPC Hybrid Scheme [Figure: Atable organizations with fields opcode, result/address, opnd1, opnd2 and pointer, shown with PC and operand (Opnds) indexing]

UPC Speedup (Hybrid Scheme)

UPC Reuse (Hybrid Scheme)

UPC Speedup (Perfect Reuse Engine)

UPC Conclusions
Redundant Computation Buffer
- Quasi-invariants
- Quasi-common subexpressions
High reuse coverage and low latency
- 30% reuse
- 10% speedup
Outperforms previous schemes