
UPC Trace-Level Reuse
A. González, J. Tubella and C. Molina
Dpt. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya
1999 International Conference on Parallel Processing (ICPP'99)

Motivation (presented September 21, 1999, at ICPP'99)
Increase performance by overcoming the dataflow limitation:
- DATA SPECULATION: exploits the predictability of values
- DATA REUSE: exploits the redundancy of computations

Motivation (cont.)
Redundant computations are rather frequent:
- code: loops, recursive subroutines
- data: finite domain of values
The results could be reused instead of recomputed: OUT = f(IN). (Figure: redundant computations recurring in the dynamic execution stream.)
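The OUT = f(IN) idea maps directly onto classic software memoization: record each result the first time it is computed and reuse it when the same input recurs. A minimal sketch (the decorator and names here are illustrative, not part of the talk):

```python
# Data reuse in software: cache OUT = f(IN) and return the stored
# result when the same input shows up again.
def memoize(f):
    table = {}                    # maps IN -> OUT, analogous to a reuse table
    def wrapper(x):
        if x not in table:
            table[x] = f(x)       # compute and record on a reuse miss
        return table[x]           # reuse the stored result on a hit
    wrapper.table = table         # expose the table for inspection
    return wrapper

@memoize
def fib(n):
    # Recursive subroutines over a finite domain are highly redundant,
    # which is exactly the case the slide highlights.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Without the cache, `fib(20)` recomputes the same subproblems exponentially many times; with it, each distinct input is evaluated once.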

Motivation (cont.)
Reuse granularity:
- an instruction
- a sequence of instructions: TRACE-LEVEL REUSE
This work studies the performance potential of data reuse at instruction level and at trace level.

Outline
- Trace-level reuse
- Performance potential
- A first approach
- Related work
- Conclusions

Trace-Level Reuse
Trace: any dynamic sequence of instructions.
Goal: avoid the execution of a trace by reusing its results, provided that the same trace with the same inputs has already been executed.
Advantages:
- Reduces the utilization of other machine resources
- Reduces the time to compute results
- Allows the processor to exceed the dataflow limit

Trace-Level Reuse: Hardware Scheme
Main issues:
- Reuse Trace Memory (RTM)
- Dynamic trace collection
- Reuse test
- State update

Reuse Trace Memory (RTM)
The RTM stores candidate traces to be reused. Each entry holds:
- Initial address
- Trace input: input register identifiers & contents, input memory addresses & contents
- Trace output: output register identifiers & contents, output memory addresses & contents
- Next address
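The entry layout above can be sketched as a plain record. This is a software model for illustration only; the field names are mine, not the hardware's actual encoding:

```python
# Software model of one RTM entry as described on the slide.
from dataclasses import dataclass, field

@dataclass
class RTMEntry:
    initial_address: int                            # PC where the trace starts
    next_address: int                               # PC to resume at after reuse
    in_regs: dict = field(default_factory=dict)     # reg id -> expected contents
    in_mem: dict = field(default_factory=dict)      # address -> expected contents
    out_regs: dict = field(default_factory=dict)    # reg id -> result contents
    out_mem: dict = field(default_factory=dict)     # address -> result contents
```

The input side records what the execution state must look like for the trace to be reusable; the output side records the results to install when it is.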

Dynamic Trace Collection
Chooses candidate traces (initial address, next address). Input and output trace locations are computed at execution time and stored, along with their values, in the RTM.

Reuse Test & State Update
Reuse test: at some points of the execution, the reuse test is performed. It checks whether a trace input stored in the RTM matches the current execution state.
State update: writes the output trace values to the output trace locations.
REUSE LATENCY = reuse test + state update.
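A minimal sketch of these two steps, assuming a software model where registers and memory are dictionaries and an RTM entry is a plain dict (both are illustrative simplifications, not the hardware mechanism):

```python
def reuse_test(entry, regs, mem):
    # A trace is reusable only if every recorded input register and
    # input memory value matches the current execution state.
    return (all(regs.get(r) == v for r, v in entry["in_regs"].items()) and
            all(mem.get(a) == v for a, v in entry["in_mem"].items()))

def state_update(entry, regs, mem):
    # Install the trace outputs into the output locations and return
    # the next address so fetch can skip past the reused trace.
    regs.update(entry["out_regs"])
    mem.update(entry["out_mem"])
    return entry["next_address"]
```

The reuse latency in this model is the cost of the input comparisons plus the output writes, which is why the talk later models latency as a function of #inputs + #outputs.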


Performance Potential
Baseline machine:
- ISA: Alpha
- Constrained only by: data dependences, or data dependences + a finite instruction window
Reuse engine:
- Perfect trace reuse
- Maximum-length traces
- Minimum number of traces

Performance Potential: Instruction-Level Reuse (ILR)
Perfect instruction reuse engine: all previously executed instances of each instruction are checked for possible reuse. Maximum reusability: almost 90%.

ILR Performance Limits
Baseline machine constrained by data dependences; reuse engine with 1-cycle latency.

ILR Performance Limits (cont.)
Baseline machine constrained by data dependences alone, and by data dependences plus the instruction window. Reuse latency: 1 to 4 cycles.

ILR Performance Limits (cont.)
Only moderate potential, even with a perfect reuse engine:
- Instruction latency is reduced
- But reusing a chain of dependent instructions is still a sequential process: source operands must be ready

Performance Potential: Trace-Level Reuse (TLR)
Perfect reuse engine: traces consist of maximum-length dynamic sequences of reusable instructions:
- Upper bound of the maximum reusability
- Lower bound of the minimum number of traces
(Figure: instructions I1 through I6 grouped into a single TRACE.)

TLR Trace Size
Average trace size: 15.0 instructions overall; 11.7 for FP benchmarks.

TLR Performance Limits
Baseline machine constrained by data dependences and a 256-entry instruction window. Reuse engine latency modeled two ways:
- Constant
- Linear: f(#inputs + #outputs)


A First Approach
Reuse Trace Memory (RTM):
- Indexed by the trace initial address (4-way and 8-way)
- Maximum number of input and output values: 8 register values, 4 memory values
- Sizes:
  - 512 entries (4 different entries per initial address)
  - 4K entries (8 entries per initial address)
  - 32K entries (16 entries per initial address)
  - 256K entries (16 entries per initial address)
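A sketch of this set-associative lookup at the 4K-entry, 8-way configuration listed above. The modulo index function and the dict-based entries are assumptions for illustration, not the paper's actual hashing scheme:

```python
# Set-associative RTM lookup indexed by the trace's initial address.
NUM_ENTRIES = 4096            # 4K-entry configuration from the slide
WAYS = 8                      # 8 candidate traces per initial address
NUM_SETS = NUM_ENTRIES // WAYS

def rtm_set(initial_address):
    # Low-order bits of the initial address select the set (assumed index fn).
    return initial_address % NUM_SETS

def rtm_lookup(rtm, initial_address):
    # rtm: list of NUM_SETS ways-lists; each entry is a dict that carries
    # at least an "initial_address" key. Returns all candidate traces
    # whose recorded initial address matches the current PC.
    return [e for e in rtm[rtm_set(initial_address)]
            if e["initial_address"] == initial_address]
```

With several ways per set, multiple traces that start at the same address (but with different inputs) can coexist, which matches the "entries per initial address" figures above.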

A First Approach (cont.)
In-order execution; the reuse test is performed for every fetch operation. (Pipeline diagram: the PC probes both the instruction cache and the RTM at Fetch; an RTM entry then feeds the Reuse Test stage, followed by Decode, Execute and Commit.)

A First Approach (cont.)
Dynamic trace collection:
- Built traces contain only reusable instructions; an additional memory is needed to check instruction reusability
- Fixed-length traces starting at any address
- Trace expansion on a reuse hit
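The collection policy can be sketched as follows, assuming a stream of dynamic instructions and a predicate `is_reusable` standing in for the additional reusability-check memory the slide mentions (the fixed length of 8 is an arbitrary illustrative choice):

```python
# Group consecutive reusable instructions from the dynamic stream into
# traces of at most MAX_LEN instructions; a non-reusable instruction
# terminates the current trace.
MAX_LEN = 8

def collect_traces(stream, is_reusable):
    traces, current = [], []
    for insn in stream:
        if is_reusable(insn):
            current.append(insn)
            if len(current) == MAX_LEN:   # fixed-length trace complete
                traces.append(current)
                current = []
        else:
            if current:                   # non-reusable insn ends the trace
                traces.append(current)
            current = []
    if current:
        traces.append(current)
    return traces
```

Because a trace may start at any address, collection simply restarts after every break rather than waiting for a distinguished entry point.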

Reusable Instructions
25% reusability for a 4K-entry RTM.

Trace Size
6 instructions for a 4K-entry RTM.

Related Work
Data reuse:
- Software implementation: memoization [Richardson, 92]
- Hardware implementation:
  - Tree Machine [Harbison, 82]
  - At instruction level: Reuse Buffer [Sodani and Sohi, 97], register renaming [Jourdan et al., 98], Redundant Computation Buffer [Molina, González and Tubella, 99]
  - At "trace" level: result cache [Richardson, 93] [Oberman and Flynn, 95], basic block reuse [Huang and Lilja, 99]

Conclusions
Increasing the granularity of reuse from instructions to traces gives less reusability, but is more effective:
- Fetch bandwidth demand is reduced
- Effective instruction window size is increased
- The number of operations per reused instruction is reduced
- DATA DEPENDENCES ARE BROKEN

Conclusions (cont.)
Future effort should concentrate on:
- Devising strategies to choose reusable traces: high-level structures, compiler assistance
- Reducing the reuse test overhead: boolean test, invalidate/validate RTM entries