1
UPC
Compiler Support for Trace-Level Speculative Multithreaded Architectures
Antonio González λ,ф, Carlos Molina ψ, Jordi Tubella ф
INTERACT-9, San Francisco (USA) - February 13, 2005
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain (antoniox.gonzalez@intel.com)
ф Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain ({antonio,jordit}@ac.upc.edu)
ψ Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain (carlos.molina@urv.net)
2
Trace Level Speculation
Avoids serialization caused by data dependences
Skips multiple instructions in a row
Predicts values based on past behaviour
Introduces penalties due to misspeculations
Two variants: Trace Level Speculation with Live Output Test and Trace Level Speculation with Live Input Test
3
Trace Level Speculation with Live Output Test
[Diagram: instruction execution flows through a buffer shared by the NST and the ST; labels include Live Output Update & Trace Speculation, Live Output Validation, Not Executed, and Trace Misspeculation Detection & Recovery Actions]
4
TSMA Block Diagram
[Diagram: I-Cache, Fetch Engine, Decode & Rename, Branch Predictor, Functional Units, Trace Speculation Engine, Verification Engine, Look Ahead Buffer; per-thread NST and ST structures: Instruction Window, Reorder Buffer, Ld/St Queue, Architectural Register File; data caches: L1NSDC, L2NSDC, L1SDC]
5
Motivation
Two orthogonal issues:
– microarchitecture support for trace speculation
– control and data speculation techniques: prediction of initial and final points, prediction of live-output values
TSMA does not introduce significant misspeculation penalties and does not impose constraints on how traces are built or predicted
This work focuses on developing effective trace selection schemes for TSMA, based on static analysis that uses profiling data
6
Outline
Trace Selection
Graph Construction
Graph Analysis
Performance Evaluation
Conclusions
7
Graph Construction
Profiling uses the test input set of the analyzed benchmarks
An abstract data structure is built based on:
– the control flow graph
– the data dependence graph
– the predictability of values
Each node represents a static instruction and records:
– type of instruction and number of dynamic executions
– pointers and frequencies to succeeding instructions
– pointers and frequencies to preceding instructions
– predictability of live-output values and dead values
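As a rough illustration of that structure, a per-instruction node might be represented as in the sketch below; every name and field choice here is an assumption made for illustration, not the paper's actual data layout.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class GraphNode:
    # One node per static instruction, annotated with profile information.
    # Illustrative sketch; field names are not taken from the paper.
    pc: int                                   # address of the static instruction
    kind: str                                 # e.g. 'alu', 'load', 'branch', 'call', 'return'
    dyn_count: int = 0                        # number of dynamic executions observed
    successors: Dict[int, int] = field(default_factory=dict)    # successor pc -> edge frequency
    predecessors: Dict[int, int] = field(default_factory=dict)  # predecessor pc -> edge frequency
    live_out_predictability: float = 0.0      # fraction of produced values a value predictor got right
    utilization_degree: float = 0.0           # how often the produced value is actually consumed
    produces_dead_value: bool = False         # value never read before being overwritten
```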
8
Graph Analysis
Two important issues:
– initial and final point of a trace: maximize trace length & minimize control flow misspeculations
– predictability of live-output values: prediction accuracy and utilization degree
Three basic heuristics:
– Procedure Trace Heuristic
– Loop Trace Heuristic
– Instruction Chaining Trace Heuristic
9
Procedure Trace Heuristic
Procedures are relatively frequent
Computations that follow a subroutine are fairly independent of the subroutine, except for return values and some memory locations
The end of a trace is quite easy to predict
10
Procedure Trace Heuristic
[Figure: control flow graph with a call at I3, a procedure body ending in a return, and the instructions I11 to I14 that follow the return]
1. The call instruction is marked as the initial point of the trace (I3)
2. The return address is marked as the final point of the trace (I11)
3. The N instructions after the final point of the trace are checked; only significant paths are considered
4. For each instruction in a significant path, it is checked whether any of its operands are produced by any instruction of the procedure
5. In that case, the utilization degree of the produced value and the predictability of the producer instruction are evaluated
6. If a certain threshold is not reached, the trace is discarded
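A minimal sketch of the check in steps 3-6, building on the illustrative GraphNode above; the producers_in_trace attribute, the way significant paths are passed in, and the use of max() to combine predictability with utilization degree (borrowed from the IChaining slides) are all assumptions, not the paper's implementation.

```python
def accept_procedure_trace(call_pc, return_pc, significant_paths, threshold=0.25):
    # significant_paths: the significant control-flow paths covering the N
    # instructions that follow the return point (step 3), each a list of GraphNodes.
    for path in significant_paths:
        for node in path:
            # step 4: operands produced by instructions inside the procedure
            for producer in getattr(node, 'producers_in_trace', []):
                # step 5: evaluate utilization degree and producer predictability
                score = max(producer.live_out_predictability,
                            producer.utilization_degree)
                if score < threshold:
                    return None          # step 6: discard the trace
    return (call_pc, return_pc)          # steps 1-2: keep initial and final points
```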
11
Loop Trace Heuristic
Loops are a traditional source of parallelization and speculation
We consider the whole execution of a loop as a trace
The objective is to detect loops whose live-output values after their whole execution are predictable
12
Loop Trace Heuristic
[Figure: control flow graph of a loop whose backward branch targets I2 and falls through to I8]
1. The backward branch target is marked as the initial point of the trace (I2)
2. The fall-through instruction of the same backward branch is marked as the final point of the trace (I8)
3. The N instructions after the final point of the trace are checked, with the same behaviour as in the procedure trace heuristic
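A minimal sketch of steps 1-2, again on top of the illustrative GraphNode; the fixed 4-byte instruction size reflects the Alpha ISA used in the evaluation, and the live-output check of step 3 would reuse the same test sketched for the procedure heuristic.

```python
def loop_trace_candidates(nodes):
    # nodes: iterable of GraphNode objects for the whole program
    candidates = []
    for node in nodes:
        if node.kind != 'branch':
            continue
        backward_targets = [s for s in node.successors if s < node.pc]
        if not backward_targets:
            continue                      # not a backward (loop-closing) branch
        initial = min(backward_targets)   # step 1: backward-branch target
        final = node.pc + 4               # step 2: fall-through of the same branch (4-byte Alpha instructions)
        candidates.append((initial, final))
    return candidates
```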
13
IChaining Trace Heuristic
Goal: to identify large sequences of dynamic instructions besides procedures and loops
A trace is identified by:
– initial point
– final point
– behaviour of conditional branches within the trace
14
IChaining Trace Heuristic
[Figure: control flow graph with three conditional branches whose taken and not-taken targets are I2, I3, I7, I8, I9 and I10]
1. The taken and not-taken targets of all conditional branches are considered as initial points of a trace (I2, I3, I7, I8, I9, I10)
15
IChaining Trace Heuristic
[Figure: the same control flow graph, growing a trace from initial point I3 towards I5]
2. Given an initial point, the trace is extended by adding successive instructions
3. Every time a conditional branch is found, the trace is split into two
16
IChaining Trace Heuristic
[Figure: the trace grown from I3 now contains I3, I5, I7, I11 and I12]
18
IChaining Trace Heuristic
[Figure: the trace grown from I3 contains I3, I5, I7, I11 and I12]
4. The final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump
19
IChaining Trace Heuristic
[Figure: the final candidate trace I3, I5, I7, I11, I12]
5. For every trace candidate, the live-output values are determined and their predictability is checked (taking, for each value, the highest of prediction accuracy and utilization degree)
6. A trace is considered predictable if the product of these percentages over all live-output values is above a certain threshold
7. If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size)
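A compact sketch of steps 2-7 of this heuristic, still on the illustrative GraphNode; fetch_node, the is_live_output flag and the way live outputs are attributed to individual nodes are assumptions rather than the paper's implementation.

```python
def grow_ichaining_traces(start_pc, fetch_node, max_size=1024):
    # Steps 2-4: grow a trace from an initial point, splitting at every
    # conditional branch, until a stop condition is met.
    finished, worklist = [], [(start_pc, [])]
    while worklist:
        pc, trace = worklist.pop()
        while True:
            node = fetch_node(pc)
            if (node.pc in {n.pc for n in trace}      # instruction already in the trace
                    or len(trace) >= max_size          # maximum size reached
                    or node.kind == 'indirect_jump'):  # indirect jump
                finished.append(trace)                 # step 4: final point reached
                break
            trace = trace + [node]
            succs = sorted(node.successors)
            if node.kind == 'branch' and len(succs) == 2:
                worklist.append((succs[1], list(trace)))  # step 3: split the trace in two
            if not succs:
                finished.append(trace)
                break
            pc = succs[0]
    return finished

def shrink_until_predictable(trace, min_size=16, threshold=0.25):
    # Steps 5-7: a trace is predictable if the product of
    # max(prediction accuracy, utilization degree) over its live outputs
    # exceeds the threshold; otherwise drop the last instruction and retry.
    while len(trace) >= min_size:
        score = 1.0
        for node in trace:
            if getattr(node, 'is_live_output', False):
                score *= max(node.live_out_predictability,
                             node.utilization_degree)
        if score >= threshold:
            return trace
        trace = trace[:-1]
    return None
```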
20
Trace Speculation Engine
Traces are communicated to the hardware at program loading time by filling a special hardware structure (the trace table)
Each entry of the trace table contains:
– initial PC
– final PC
– branch history
– live-output values information
– frequency counter
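One way such an entry could be laid out is sketched below; the concrete types and the encoding of the live-output information are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TraceTableEntry:
    # One entry of the trace table, filled at program loading time.
    initial_pc: int
    final_pc: int
    branch_history: int                   # taken/not-taken bits of the branches inside the trace
    live_outputs: List[Tuple[str, int]]   # (architectural register or memory location, predictor hint)
    frequency: int = 0                    # usage counter
```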
21
Experimental Framework
Simulator: Alpha version of the SimpleScalar toolset
Benchmarks: SPEC2000, ref input
Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
Statistics collected for 250 million instructions, skipping an initial part of 500 million
22
Simulation Parameters
Base microarchitecture:
– out-of-order machine, 4 instructions per cycle
– I-cache: 16KB, D-cache: 16KB, shared L2: 256KB
– bimodal branch predictor
TSMA additional structures:
– each thread: instruction window, reorder buffer, register file
– speculative data cache: 1KB
– verification engine: up to 8 instructions per cycle
– trace table: 128 entries, 4-way set associative
– look ahead buffer: 128 entries
23
Profiling Analysis Parameters
Value predictors: stride & context
Minimum trace size: 16
Maximum trace size: 1024
Maximum number of live-outputs: 32
Threshold to consider a set of live-outputs predictable: 25%
Significant path (minimum frequency): 10%
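For reference, these parameters correspond to the default values assumed in the earlier sketches; a hypothetical grouping:

```python
PROFILING_PARAMS = {
    'value_predictors': ('stride', 'context'),
    'min_trace_size': 16,
    'max_trace_size': 1024,
    'max_live_outputs': 32,
    'live_output_predictability_threshold': 0.25,   # 25%
    'significant_path_min_frequency': 0.10,         # 10%
}
```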
24
Type of Speculated Instructions
[Figure: breakdown of speculated instructions by Loop Heuristic, Procedure Heuristic and IChaining Heuristic]
25
Type of Speculated Instructions
The fraction of procedure and loop traces is relatively low, but their sizes are significantly larger than IChaining traces
Some statistics:
– procedure trace size: 97.3
– loop trace size: 215.8
– IChaining trace size: 36.4
– average size of speculated traces: 65.7
– average number of live-output values: 16.4
– branches within a trace (IChaining): 5.3
– traces with same initial PC (IChaining): 1.57
26
Type of Speculations
[Figure: breakdown of speculations into Spec KO/Path KO, Spec KO/Path OK, Spec OK/Path KO and Spec OK/Path OK]
27
Type of Speculations
Correct speculations: up to 70%
– 65% for correctly predicted paths
– 7% for incorrectly predicted paths (positive misprediction)
Incorrect speculations: close to 30%
– 20% for correctly predicted paths
– 8% for incorrectly predicted paths
This confirms that the mechanism proposed to predict paths and final points provides significant accuracy
28
Speedup
[Figure: speedup results, y-axis from 1.00 to 1.45]
29
Speedup
Average speedup close to 38%, in spite of a misspeculation rate close to 30%
30
Type of Cycles of ST
[Figure: breakdown of ST cycles into 'ST can not speculate' and 'ST can speculate']
31
Type of Cycles of ST
25% of the time the ST can speculate but does not find a trace to be speculated
– performance could be improved with further analysis
75% of the time the ST can not speculate because the NST is executing and verifying a speculated trace
– speculation may be performed only when the NST catches up with the ST
32
Type of Cycles of NST
[Figure: breakdown of NST cycles into 'NST is verifying instructions' and 'NST is executing instructions']
33
Type of Cycles of NST
65% of the time the NST is executing traces speculated by the ST
– more speculated instructions imply more time executing instructions
35% of the time the NST is verifying instructions from the look ahead buffer
– verifying instructions is faster than executing them
34
Useless Cycles of ST
[Figure: percentage of useless ST cycles]
35
Useless Cycles of ST
Up to 20% of the time the ST is executing instructions beyond the misspeculation point, i.e. wasting time on instructions that will be discarded
Ideally, this percentage would be negligible
36
Branch Behaviour Distribution
[Figure: distribution of how often branches within a trace take the same direction]
37
Branch Behaviour Distribution
The instruction chaining heuristic does not provide many traces with the same initial point, despite the significant number of branches within a trace (5.3 on average)
The study concludes that the majority of branches almost always take the same direction
Close to 80% of the branches take the same direction more than 90% of the time
38
Conclusions
Profile-guided analysis to support TSMA identifies large and highly predictable traces, reducing hardware complexity
Three basic heuristics are proposed:
– procedure trace heuristic
– loop trace heuristic
– instruction chaining heuristic
Results show a speedup of 38% with a misspeculation rate close to 30%
Future work:
– aggressive trace-level predictors
– generalization to multiple threads
39
UPC
Questions & Answers
INTERACT-9, San Francisco (USA) - February 13, 2005