
1 UPC. Microarchitectural Techniques to Exploit Repetitive Computations and Values. Carlos Molina Clemente. Thesis defense (Barcelona, 14 December 2005). Advisors: Antonio González and Jordi Tubella

2 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

3 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

4 Motivation: real-world programs and operating systems are general by design, and are often designed with future expansion and code reuse in mind; input sets show little variation. As a result, repetition is relatively common, even with aggressive compilers.

5 Types of Repetition: for a computation z = F(x, y), repetition can appear in the computations themselves or in the values they produce.

6 Repetitive Computations. [Chart: percentage of repetitive computations per benchmark; Spec CPU2000, 500 million instructions]

7 Types of Repetition: for a computation z = F(x, y), repetition can appear in the computations themselves or in the values they produce.

8 Repetitive Values. [Chart: percentage of repetitive values per benchmark; Spec CPU2000, 500 million instructions, analysis of destination value]

9 Objectives. To improve the memory system: exploit value repetition of store instructions, via redundant store instructions and the non redundant data cache. To speed up the execution of instructions: exploit computation repetition of all instructions, via the redundant computation buffer (ILR), trace-level reuse (TLR), and the trace-level speculative multithreaded architecture (TLS).

10 Experimental Framework. Methodology: analysis of benchmarks, definition of proposal, evaluation of proposal. Tools: Atom, Cacti 3.0, SimpleScalar tool set. Benchmarks: Spec CPU95, Spec CPU2000.

11 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

12 Techniques to Improve Memory: value repetition is exploited by eliminating redundant stores and by the non redundant cache.

13 Contributions: analysis of value repetition into the same storage location; redundant stores applied to reduce memory traffic. A redundant store is a store instruction that does not modify memory: for STORE(i, Value Y), if the value already held in memory (Value X) satisfies Value X == Value Y, the write leaves memory unchanged. Molina, González, Tubella, "Reducing Memory Traffic via Redundant Store Instructions", HPCN'99. Main results: 15%-25% of store instructions are redundant, giving a 5%-20% memory traffic reduction.
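The redundant-store test described on slide 13 can be sketched as follows. This is an illustrative model only; the class and counter names are assumptions, not the thesis's actual mechanism.

```python
# Hypothetical model of the redundant-store filter: a store is
# redundant when the value it writes equals the value already held
# at the target address, so the write need not reach memory.
class RedundantStoreFilter:
    def __init__(self):
        self.memory = {}      # address -> value (modeled memory)
        self.stores = 0       # total store instructions seen
        self.redundant = 0    # stores whose value matched memory

    def store(self, addr, value):
        self.stores += 1
        if self.memory.get(addr) == value:
            self.redundant += 1   # filtered: memory traffic saved
            return False          # write not sent to memory
        self.memory[addr] = value
        return True               # write goes to memory

f = RedundantStoreFilter()
f.store(0x100, 7)   # first write: goes to memory
f.store(0x100, 7)   # same value: redundant, filtered
f.store(0x100, 8)   # new value: goes to memory
print(f.redundant, f.stores)   # prints: 1 3
```

In this toy run one of three stores is filtered, mirroring the slide's 15%-25% redundancy figures in spirit only.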

14 Non Redundant Data Cache (NRC). Contributions: analysis of value repetition across several storage locations; if two cached locations hold the same value (e.g. Value A == Value D, as when two lines both hold FFFF), the value needs to be stored only once. Molina, Aliagas, García, Tubella, González, "Non Redundant Data Cache", ISLPED'03; Aliagas, Molina, García, González, Tubella, "Value Compression to Reduce Power in Data Caches", EUROPAR'03. Main results: on average, a value is stored 4 times at any given time; the NRC achieves -32% area, -13% energy, and -25% latency, at the cost of +5% miss rate.

15 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

16 Techniques to Speed Up Instruction Execution. Computation repetition is exploited through data value reuse and data value speculation: avoid serialization caused by data dependences; determine results of instructions without executing them; the target is to speed up the execution of programs.

17 Data Value Reuse is NON SPECULATIVE: it buffers previous inputs and their corresponding outputs. Reuse is only possible if the computation has been done in the past, and the inputs have to be ready at reuse-test time.

18 Data Value Speculation is SPECULATIVE: it predicts values as a function of past history and must confirm the speculation at a later point. It avoids the reuse test but introduces a misspeculation penalty.

19 Applied to a SINGLE instruction: both data value reuse and data value speculation can operate at the instruction level.

20 Applied to a GROUP of instructions: both data value reuse and data value speculation can also operate at the trace level.

21 Techniques to Speed Up Instruction Execution: computation repetition is exploited by data value reuse (instruction level, trace level) and by data value speculation (instruction level, trace level).

22 Redundant Computation Buffer (RCB), an implementation of Instruction-Level Reuse (ILR): a reuse table, indexed at fetch, interacts with decode & rename, out-of-order execution, and commit. Molina, González, Tubella, "Dynamic Removal of Redundant Computations", ICS'99. Contributions: performance potential of ILR; the RCB design. Main results: ideal ILR speed-up of 1.5; RCB speed-up of 1.1 (outperforms previous proposals).
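A minimal sketch of instruction-level reuse as described on slide 22, assuming a simple table keyed by opcode and input operands. The names and structure are illustrative, not the RCB's actual design.

```python
# Instruction-level reuse as memoization: a table maps
# (opcode, input operands) -> a previously computed result.
# On a hit, the result is reused without re-executing the operation.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

class ReuseTable:
    def __init__(self):
        self.table = {}   # (op, a, b) -> result
        self.hits = 0     # successful reuse tests

    def execute(self, op, a, b):
        key = (op, a, b)
        if key in self.table:        # reuse test: inputs match a past case
            self.hits += 1
            return self.table[key]   # result reused, no execution
        result = OPS[op](a, b)       # normal execution
        self.table[key] = result     # record for future reuse
        return result

rt = ReuseTable()
rt.execute("mul", 6, 7)   # computed and recorded
rt.execute("mul", 6, 7)   # same inputs: reused, one table hit
```

A real RCB is a finite hardware structure with an indexing scheme and replacement policy; the unbounded dictionary here only illustrates the reuse test itself.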

23 Trace Level Reuse (TLR): a trace groups consecutive instructions (I1 I2 I3 I4 I5 I6) that are reused as a single unit. González, Tubella, Molina, "Trace-Level Reuse", ICPP'99. Contributions: performance potential of TLR; initial design issues for integrating TLR. Main results: ideal TLR speed-up of 3.6; with a 4K-entry table, 25% of reuse and an average trace size of 6.
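Trace-level reuse can be sketched in the same memoization style, one level up: a whole trace is skipped when its live-input values match a recorded execution, and the buffered live-output values are written instead. The structure below is a hedged illustration, not the thesis's actual buffer design.

```python
# Toy trace-reuse buffer: entries map (trace start PC, live-input
# values) to the live-output values produced by a past execution.
class TraceMemo:
    def __init__(self):
        self.entries = {}   # (trace_pc, live_inputs) -> live_outputs

    def record(self, trace_pc, live_inputs, live_outputs):
        # Store one executed instance of the trace.
        self.entries[(trace_pc, tuple(live_inputs))] = live_outputs

    def try_reuse(self, trace_pc, live_inputs):
        # Reuse test: returns the live outputs on a match, else None.
        return self.entries.get((trace_pc, tuple(live_inputs)))

tm = TraceMemo()
# First execution of the trace at PC 0x400: record inputs -> outputs.
tm.record(0x400, [3, 5], {"r1": 8, "r2": 15})
# Later, with the same live inputs, the whole trace is skipped at once.
out = tm.try_reuse(0x400, [3, 5])
```

Reusing one entry here stands in for skipping the six instructions of the slide's example trace in a single step.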

24 Trace Level Speculation (TLS). Contributions cover two orthogonal issues: microarchitecture support for trace speculation (the Trace Level Speculative Multithreaded Architecture, TSMA) and control and data speculation techniques (static analysis based on profiling info; compiler analysis to support TSMA). Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02; Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05; Molina, Tubella, González, "Reducing Misspeculation Penalty in TSMA", ISHPC'05. Main results: speedup of 1.38 with 20% of misspeculations.

25 Objectives & Proposals. To improve the memory system: redundant store instructions, non redundant data cache. To speed up the execution of instructions: redundant computation buffer (ILR), trace-level reuse buffer (TLR), trace-level speculative multithreaded architecture (TLS).

26 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

27 Motivation: caches take close to 50% of total die area and are responsible for a significant part of the total power dissipated by a processor.

28 Data Value Repetition. [Chart: percentage of repetitive values vs. percentage of time; Spec CPU2000, 1 billion instructions, 256KB data cache]

29 Conventional Cache. [Diagram: two lines, Tag X and Tag Y, each hold full values (Value A through Value D, e.g. FFFF and 1234); if Value A == Value D, the same value is stored twice: value repetition]

30 Non Redundant Data Cache. [Diagram: Tag X and Tag Y point through a Pointer Table into a Value Table, so a repeated value (FFFF) is stored once alongside 1234: die area reduction]

31 Additional Hardware: Pointers. [Diagram: each line (Tag X, Tag Y) holds a pointer into the Value Table (entries 1234, FFFF, 0000) instead of the value itself]

32 Additional Hardware: Counters. [Diagram: each Value Table entry (1234, FFFF, 0000) keeps a counter (1, 2, 1) of how many pointers reference it, so an entry can be freed when no line uses it]
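The pointer table, shared value table, and reference counters of slides 30-32 can be modeled roughly as follows. This is a sketch under assumed semantics (a value-table entry is freed when its counter reaches zero); the class and method names are illustrative.

```python
# Toy non-redundant cache: each line's tag maps to a pointer into a
# shared value table, where each distinct value is stored once with
# a counter of how many lines currently share it.
class NonRedundantCache:
    def __init__(self):
        self.pointers = {}   # tag -> value-table index
        self.values = {}     # index -> value (each value stored once)
        self.counts = {}     # index -> number of sharing lines
        self.next_idx = 0

    def write(self, tag, value):
        self._release(tag)                  # drop the line's old value
        for idx, v in self.values.items():
            if v == value:                  # value already present: share
                self.counts[idx] += 1
                self.pointers[tag] = idx
                return
        idx = self.next_idx                 # new distinct value
        self.next_idx += 1
        self.values[idx] = value
        self.counts[idx] = 1
        self.pointers[tag] = idx

    def _release(self, tag):
        idx = self.pointers.pop(tag, None)
        if idx is not None:
            self.counts[idx] -= 1
            if self.counts[idx] == 0:       # last sharer gone: free entry
                del self.values[idx]
                del self.counts[idx]

    def read(self, tag):
        return self.values[self.pointers[tag]]

c = NonRedundantCache()
c.write("X", 0xFFFF)
c.write("Y", 0xFFFF)   # second line shares the existing FFFF entry
c.write("Z", 0x1234)   # three lines, but only two values stored
```

A hardware NRC would search the value table associatively rather than linearly; the linear scan here only keeps the sketch short.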

33 Data Value Inlining. Some values can be represented with a small number of bits (narrow values). Narrow values can be inlined into the pointer area; simple sign extension is applied to recover them. Benefits: enlarges the effective capacity of the VT, reduces latency, and reduces power dissipation.
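The narrow-value test implied by "simple sign extension is applied" can be sketched as: a value is inlinable when sign-extending its low k bits reproduces the full word. The function names and the 32-bit word size below are assumptions.

```python
# Narrow-value check for data value inlining: if the low k bits,
# sign-extended, equal the full word, the value fits in a k-bit
# pointer field and need not occupy a value-table entry.
def sign_extend(value, k, word_bits=32):
    low = value & ((1 << k) - 1)            # keep the low k bits
    if low >> (k - 1):                      # sign bit of the k-bit field
        low -= 1 << k                       # interpret as negative
    return low & ((1 << word_bits) - 1)     # back to an unsigned word

def is_narrow(value, k, word_bits=32):
    return sign_extend(value, k, word_bits) == value

print(is_narrow(0x0000000F, 8))   # True: small positive value
print(is_narrow(0xFFFFFFF0, 8))   # True: small negative value (-16)
print(is_narrow(0x00001234, 8))   # False: needs more than 8 bits
```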

34 Data Value Inlining. [Diagram: Non Redundant Data Cache with Pointer Table and Value Table; narrow values (e.g. 0, F, 2) are stored directly in the pointer entries, while wide values remain in the Value Table]

35 Miss Rate vs Die Area. [Chart: miss ratio against die area (0.1 to 1.0 cm2) for L2 caches of 256KB, 512KB, 1MB, 2MB, and 4MB, comparing configurations VT50, VT30, VT20, and CONV; Spec CPU2000, 1 billion instructions]

36 Results. [Chart: caches ranging from 256 KB to 4 MB]

37 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

38 Trace Level Speculation: avoids serialization caused by data dependences; skips multiple instructions in a row; predicts values based on the past; avoids the live-input test but introduces penalties due to misspeculations.

39 Trace Level Speculation involves two orthogonal issues. First, microarchitecture support for trace speculation: the Trace Level Speculative Multithreaded Architecture (TSMA), which does not introduce significant misspeculation penalties. Second, control and data speculation techniques: prediction of the initial and final points of a trace and prediction of its live-output values, supported by compiler analysis based on static analysis that uses profiling data.

40 Trace Level Speculation with Live Output Test. [Diagram: the instruction flow passes through instruction execution, instruction speculation, and instruction validation; the non-speculative thread (NST) performs live-output update & trace speculation, while miss trace speculation detection triggers recovery actions on the speculative thread (ST)]

41 TSMA Block Diagram. [Diagram: I-cache, fetch engine, branch predictor, trace speculation engine, decode & rename, functional units, look-ahead buffer, and verification engine; per-thread structures for NST and ST: instruction window, reorder buffer, load/store queue, and architectural register file; data cache hierarchy with L1 and L2 non-speculative data caches (L1NSDC, L2NSDC) and an L1 speculative data cache (L1SDC)]

42 Compiler Analysis: focuses on developing effective trace selection schemes for TSMA, based on static analysis that uses profiling data. Trace selection proceeds in two steps: graph construction (CFG & DDG) and graph analysis.

43 Graph Analysis. Two important issues: the initial and final point of a trace (maximize trace length & minimize misspeculations) and the predictability of live-output values (prediction accuracy and utilization degree). Three basic heuristics: the procedure trace heuristic, the loop trace heuristic, and the instruction chaining trace heuristic.

44 Trace Speculation Engine: traces are communicated to the hardware at program load time, filling a special hardware structure (the trace table). Each entry of the trace table contains the initial PC, the final PC, live-output value information, branch history, and a frequency counter.
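The trace-table entry fields listed on slide 44 can be written down as a record. The field names and types below are assumptions for illustration, not the hardware's actual format.

```python
# Illustrative shape of one trace-table entry, mirroring the slide's
# field list: initial PC, final PC, live-output value information,
# branch history, and a frequency counter.
from dataclasses import dataclass

@dataclass
class TraceTableEntry:
    initial_pc: int        # where the speculated trace starts
    final_pc: int          # where it ends
    live_outputs: dict     # predicted live-output values
    branch_history: int    # control context used to select the trace
    frequency: int = 0     # profiling frequency counter

entry = TraceTableEntry(initial_pc=0x400100, final_pc=0x400140,
                        live_outputs={"r3": 0}, branch_history=0b1011)
```

At speculation time, matching the fetch PC against `initial_pc` (together with the branch history) would select an entry and spawn the speculative thread at `final_pc`.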

45 Simulation Parameters. Base microarchitecture: out-of-order machine, 4 instructions per cycle; I-cache 16KB, D-cache 16KB, shared L2 256KB; bimodal predictor; 64-entry ROB; FUs: 4 int, 2 div, 2 mul, 4 FP. TSMA additional structures: per-thread instruction window, reorder buffer, and register file; 1KB speculative data cache; 128-entry, 4-way set-associative trace table; 128-entry look-ahead buffer; verification engine of up to 8 instructions per cycle.

46 Speedup. [Chart: speedups per benchmark; Spec CPU2000, 250 million instructions]

47 Misspeculations. [Chart: misspeculation rates per benchmark; Spec CPU2000, 250 million instructions]

48 Outline: Motivation & Objectives; Overview of Proposals (to improve the memory system, to speed up the execution of instructions); Non Redundant Data Cache; Trace-Level Speculative Multithreaded Architecture; Conclusions & Future Work

49 Conclusions: repetition is very common in programs and can be exploited both to improve the memory system and to speed up the execution of instructions. Several alternatives were investigated: novel cache organizations, an instruction-level reuse approach, the trace-level reuse concept, and a trace-level speculation architecture.

50 Future Work: value repetition in instruction caches; profiling to support data value reuse schemes; traces starting at different PCs; value prediction in TSMA; multiple speculations in TSMA; multiple threads in TSMA.

51 Publications. Value repetition in cache organizations: Reducing Memory Traffic via Redundant Store Instructions, HPCN'99; Non Redundant Data Cache, ISLPED'03; Value Compression to Reduce Power in Data Caches, EUROPAR'03. Instruction & trace level reuse: The Performance Potential of Data Value Reuse, TR-UPC-DAC'98; Dynamic Removal of Redundant Computations, ICS'99; Trace-Level Reuse, ICPP'99. Trace level speculation: Trace-Level Speculative Multithreaded Architecture, ICCD'02; Compiler Analysis for TSMA, INTERACT'05; Reducing Misspeculation Penalty in TSMA, ISHPC'05.

52 UPC. Microarchitectural Techniques to Exploit Repetitive Computations and Values. Carlos Molina Clemente. Thesis defense (Barcelona, 14 December 2005). Advisors: Antonio González and Jordi Tubella

