
UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González and Jordi Tubella Universitat Politècnica de Catalunya – Barcelona, Spain ICCD'02, Freiburg (Germany) - September 16-18, 2002

Outline Motivation Related Work TSMA Performance Results Conclusions

Motivation Two techniques to avoid the serialization caused by data dependences: Data Value Speculation and Data Value Reuse. Speculation predicts values based on past behavior; reuse is possible only if the computation has already been done in the past. Both may be considered at two levels: Instruction Level and Trace Level.

Trace Level Reuse A set of instructions can be skipped in a row; these instructions do not even need to be fetched. The live input test is not easy to handle. Trace level reuse may be dynamic or static.

Trace Level Speculation Solves the live input test, but introduces penalties due to misspeculations. Two orthogonal issues: microarchitecture support for trace speculation, and control and data speculation techniques – prediction of the initial and final points of the trace, and prediction of the live output values. Trace level speculation comes in two flavors: with live input test and with live output test.

Trace Level Speculation with Live Input Test (timeline diagram of the NST and ST threads; labels: live output actualization & trace speculation, miss trace speculation detection & recovery actions, instruction execution, not executed, live input validation & instruction execution).

Trace Level Speculation with Live Output Test (timeline diagram of the NST and ST threads plus a buffer; labels: live output actualization & trace speculation, miss trace speculation detection & recovery actions, instruction execution, not executed, live output validation).

Related Work Trace Level Reuse: basic blocks (Huang and Lilja, 99), general traces (González et al., 99), traces with compiler support (Connors and Hwu, 99). Trace Level Speculation: DIVA (Austin, 99), Slipstream processors (Rotenberg et al., 99), pre-execution (Sohi et al., 01), precomputation (Shen et al., 01), nearby and distant ILP (Balasubramonian et al., 01).

TSMA Cache I Engine Fetch Rename Decode & Units Functional Predictor Branch Speculation Trace NST Reorder Buffer ST Reorder Buffer NST Ld/St Queue ST Ld/St Queue NST I Window ST I Window Look Ahead Buffer Engine Verification L1NSDC L2NSDC L1SDC Data Cache Register File NST Arch. Register File ST Arch.

Trace Speculation Engine Two issues must be handled: implementing a trace level predictor and communicating a trace speculation opportunity. Trace level predictor: a PC-indexed table with N entries; each entry contains the live output values and the final program counter of the trace. Trace speculation communication: an INI_TRACE instruction plus additional MOVE instructions.
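
The slide describes the predictor only at a high level; below is a minimal, hypothetical C++ sketch of a PC-indexed trace predictor table in which each entry stores the final program counter and the live output values of a trace. The entry layout, the direct-mapped organization and all names are assumptions for illustration, not details taken from the paper.

```cpp
// Minimal sketch of a PC-indexed trace-level predictor table.
// Entry layout and table organization are assumptions, not taken from the paper.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct LiveOutput {
    uint8_t  reg;      // architectural register written by the trace
    uint64_t value;    // predicted live-output value
};

struct TraceEntry {
    bool     valid = false;
    uint64_t start_pc = 0;              // PC that triggers the trace speculation
    uint64_t final_pc = 0;              // predicted program counter after the trace
    std::vector<LiveOutput> live_outs;  // predicted live-output values
};

class TracePredictor {
public:
    explicit TracePredictor(std::size_t n_entries) : table_(n_entries) {}

    // Probe the table with the current PC; a hit means a trace can be speculated.
    std::optional<TraceEntry> lookup(uint64_t pc) const {
        const TraceEntry& e = table_[pc % table_.size()];
        if (e.valid && e.start_pc == pc) return e;
        return std::nullopt;
    }

    // Train the entry when a trace starting at pc retires with the observed
    // final PC and live-output values.
    void update(uint64_t pc, uint64_t final_pc, std::vector<LiveOutput> live_outs) {
        table_[pc % table_.size()] = TraceEntry{true, pc, final_pc, std::move(live_outs)};
    }

private:
    std::vector<TraceEntry> table_;     // direct-mapped, PC-indexed
};
```

In this reading, a hit in lookup() would be what triggers the insertion of the INI_TRACE and MOVE instructions mentioned above.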

Look Ahead Buffer A first-in first-out queue that stores the instructions executed by the ST. The fields of each entry are: program counter; operation type (indicates a memory operation); source register id 1 & source value 1; source register id 2 & source value 2; destination register id & destination value; memory address.
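
As a concrete illustration of the entry format listed above, here is a small C++ sketch of a LAB entry and its FIFO discipline; the field widths and the use of std::deque are assumptions for illustration only.

```cpp
// Minimal sketch of a Look Ahead Buffer entry and its FIFO discipline.
// Field widths and the queue container are assumptions for illustration.
#include <cstdint>
#include <deque>

enum class OpType { Alu, Load, Store };  // operation type: distinguishes memory operations

struct LABEntry {
    uint64_t pc;                           // program counter
    OpType   op;                           // operation type
    uint8_t  src1_id; uint64_t src1_val;   // source register id 1 & source value 1
    uint8_t  src2_id; uint64_t src2_val;   // source register id 2 & source value 2
    uint8_t  dst_id;  uint64_t dst_val;    // destination register id & destination value
    uint64_t mem_addr;                     // effective address (memory operations only)
};

// The ST pushes entries at the tail as it executes instructions;
// the verification engine pops them from the head.
using LookAheadBuffer = std::deque<LABEntry>;
```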

Verification Engine Validates speculated instructions and maintains the non-speculative state. It consumes instructions from the LAB, and the test is performed as follows: the source values of each instruction are checked against the non-speculative state; if they match, the destination value of the instruction can be committed; memory operations also check the effective address; store instructions update memory, the rest update registers. The hardware required is minimal.
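
Building on the LABEntry sketch above, this is a minimal sketch of the verification test as described on the slide: source values are compared against the non-speculative state and, on a match, the destination value updates that state. NonSpecState and its toy memory model are assumptions; misspeculation handling is only signalled through the return value.

```cpp
// Minimal sketch of the verification test, reusing LABEntry and OpType from the
// sketch above. NonSpecState is a hypothetical toy model of the non-speculative
// register file and memory; real misspeculation handling is not shown.
#include <array>
#include <cstdint>
#include <unordered_map>

struct NonSpecState {
    std::array<uint64_t, 32> regs{};              // non-speculative architectural registers
    std::unordered_map<uint64_t, uint64_t> mem;   // non-speculative memory (toy model)
};

// Returns false when a live-input mismatch is detected (trace misspeculation).
bool verify_entry(const LABEntry& e, NonSpecState& ns) {
    // Test the source values used by the ST against the non-speculative state.
    if (ns.regs[e.src1_id] != e.src1_val) return false;
    if (ns.regs[e.src2_id] != e.src2_val) return false;

    if (e.op == OpType::Store) {
        ns.mem[e.mem_addr] = e.dst_val;           // store instructions update memory
    } else if (e.op == OpType::Load) {
        // Memory operations check the effective address: the value kept at that
        // address in the non-speculative state must match what the ST loaded.
        auto it = ns.mem.find(e.mem_addr);
        if (it != ns.mem.end() && it->second != e.dst_val) return false;
        ns.regs[e.dst_id] = e.dst_val;            // the rest update registers
    } else {
        ns.regs[e.dst_id] = e.dst_val;            // the rest update registers
    }
    return true;
}
```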

Thread Synchronization Handles trace mispredictions. The recovery actions involved are: instruction execution is stopped; ST structures are emptied (IW, LSQ, ROB, LAB); the speculative cache and the ST register file are invalidated. Two types of synchronization: total (occurs when the NST is not executing instructions) – penalty due to refilling the pipeline; partial (occurs when the NST is executing instructions) – no penalty, the NST takes the role of the ST.
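
A minimal sketch of the recovery actions and of the total/partial synchronization distinction described above. All structure and function names (Flushable, Invalidatable, recover) are hypothetical placeholders rather than the paper's terminology.

```cpp
// Minimal sketch of the recovery actions on a trace misprediction.
// Flushable/Invalidatable and all names are hypothetical placeholders.
struct Flushable     { void flush()      { /* drop every entry */ } };
struct Invalidatable { void invalidate() { /* mark every entry invalid */ } };

struct SpeculativeThread {
    Flushable iw, lsq, rob, lab;     // ST instruction window, load/store queue, ROB, LAB
    Invalidatable l1sdc, reg_file;   // speculative data cache and ST register file
};

enum class SyncType { Total, Partial };

SyncType recover(SpeculativeThread& st, bool nst_is_executing) {
    // Instruction execution is stopped and the ST structures are emptied.
    st.iw.flush(); st.lsq.flush(); st.rob.flush(); st.lab.flush();
    // The speculative cache and the ST register file are invalidated.
    st.l1sdc.invalidate();
    st.reg_file.invalidate();

    if (!nst_is_executing) {
        return SyncType::Total;      // pipeline must be refilled: misspeculation penalty
    }
    return SyncType::Partial;        // NST keeps running and takes the role of ST: no penalty
}
```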

Memory Subsystem Maintains both the speculative and the non-speculative memory state. The traditional memory subsystem is supported; an additional small first-level cache (L1SDC) is added to maintain the speculative memory state alongside the non-speculative L1NSDC and L2NSDC. Rules: 1 an ST store updates values in the L1SDC only; 2 an ST load gets its value from the L1SDC and, on a miss, from the NS caches; 3 an NST store updates values and allocates space in the NS caches; 4 an NST load gets its value from and allocates space in the NS caches; 5 a line replaced in the L1NSDC is copied back to the L2NSDC.
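
The five rules above can be expressed compactly in code. Below is a toy C++ sketch that only models the routing of ST/NST loads and stores across L1SDC, L1NSDC and L2NSDC; the cache organization, timing and replacement are not modelled, and rule 5 is noted but not implemented.

```cpp
// Toy sketch of the five memory rules above. ToyCache is a fully-associative map
// with no timing or replacement; only the routing of accesses follows the slide.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct ToyCache {
    std::unordered_map<uint64_t, uint64_t> lines;
    std::optional<uint64_t> read(uint64_t a) const {
        auto it = lines.find(a);
        if (it == lines.end()) return std::nullopt;
        return it->second;
    }
    void write(uint64_t a, uint64_t v) { lines[a] = v; }
};

struct MemorySubsystem {
    ToyCache l1sdc;    // small speculative first-level cache
    ToyCache l1nsdc;   // non-speculative L1 data cache
    ToyCache l2nsdc;   // non-speculative L2 data cache

    // Rule 1: an ST store updates values in the L1SDC only.
    void st_store(uint64_t a, uint64_t v) { l1sdc.write(a, v); }

    // Rule 2: an ST load gets its value from the L1SDC; on a miss, from the NS caches.
    uint64_t st_load(uint64_t a) const {
        if (auto v = l1sdc.read(a))  return *v;
        if (auto v = l1nsdc.read(a)) return *v;
        return l2nsdc.read(a).value_or(0);       // miss to main memory not modelled
    }

    // Rule 3: an NST store updates values and allocates space in the NS caches.
    void nst_store(uint64_t a, uint64_t v) { l1nsdc.write(a, v); }

    // Rule 4: an NST load gets its value from, and allocates space in, the NS caches.
    uint64_t nst_load(uint64_t a) {
        if (auto v = l1nsdc.read(a)) return *v;
        uint64_t v = l2nsdc.read(a).value_or(0); // miss to main memory not modelled
        l1nsdc.write(a, v);                      // allocate the line in the L1NSDC
        return v;
    }

    // Rule 5 (a line replaced in the L1NSDC is copied back to the L2NSDC) is not
    // modelled because ToyCache never replaces lines.
};
```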

Register File Slightly modified to permit prompt execution. The register map table contains, for each entry: committed value, ROB tag and a counter. The counter field is maintained as follows: a new ST instruction increases the counter of its destination register; the counter is decreased when the ST instruction commits; after a trace speculation the counter is no longer increased, but it keeps being decreased until it reaches zero.
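
A minimal sketch of the counter maintenance described above, assuming a 32-entry map table; the ready_for_nst() helper is a hypothetical illustration of how the counter could gate prompt execution in the NST, not a mechanism stated on the slide.

```cpp
// Minimal sketch of the per-register counter described above; the entry layout
// and the 32-entry table size are assumptions, and ready_for_nst() is a
// hypothetical helper showing how the counter could be used.
#include <array>
#include <cstdint>

struct MapEntry {
    uint64_t committed_value = 0;   // committed value
    int      rob_tag = -1;          // ROB tag (-1 = no in-flight producer)
    unsigned counter = 0;           // pending ST writes to this register
};

struct RegisterMapTable {
    std::array<MapEntry, 32> entries{};
    bool trace_speculated = false;  // set once a trace speculation has been launched

    // A new ST instruction increases the counter of its destination register;
    // after a trace speculation the counter is no longer increased.
    void on_st_dispatch(unsigned dst) {
        if (!trace_speculated) entries[dst].counter++;
    }

    // The counter is decreased when the ST instruction commits, until it reaches zero.
    void on_st_commit(unsigned dst) {
        if (entries[dst].counter > 0) entries[dst].counter--;
    }

    bool ready_for_nst(unsigned reg) const { return entries[reg].counter == 0; }
};
```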

Working Example (timeline of the NST, ST and VE; legend: instruction execution, not executed, live output validation): 1 ST begins execution; 2 live output actualization & trace speculation; 3 NST begins execution; 4 VE validates instructions; 5 NST executes the speculated trace; 6 NST executes some additional instructions; 7 VE begins verification; 8 VE finishes verification; 9 live output actualization & trace speculation, and NST execution continues.

Experimental Framework Simulator: Alpha version of the SimpleScalar toolset. Benchmarks: SPEC95, compiled at the maximum optimization level with the DEC C & F77 compilers (-non_shared -O5). Statistics: collected for 125 million instructions, skipping initializations.

Base Microarchitecture

TSMA Additional Structures

Performance Evaluation Main objective: show that trace misspeculations cause only minor penalties. Traces are built following a simple rule: from backward branch to backward branch, with minimum and maximum sizes of 8 and 64 instructions respectively. A simple trace predictor is evaluated: stride + context value (history of 9). Results provided: percentage of misspeculations, percentage of predicted instructions, and speedup.
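
As a small illustration of the trace-building rule above (backward branch to backward branch, 8 to 64 instructions), here is a hypothetical C++ sketch over a recorded dynamic instruction stream; the Instr record and the offline scan are illustration-only assumptions.

```cpp
// Hypothetical sketch of the trace-building rule: a trace spans the instructions
// between two backward branches and is accepted only if it contains between 8 and
// 64 instructions. Instr and the offline scan are illustration-only assumptions.
#include <cstdint>
#include <utility>
#include <vector>

struct Instr {
    uint64_t pc;
    bool     is_branch;
    uint64_t target;      // branch target (meaningful only when is_branch)
};

bool is_backward_branch(const Instr& i) { return i.is_branch && i.target <= i.pc; }

// Returns the (start, end) indices of the first valid trace in a recorded
// dynamic instruction stream, or {-1, -1} if no trace qualifies.
std::pair<long, long> build_trace(const std::vector<Instr>& stream) {
    long start = -1;
    for (long i = 0; i < static_cast<long>(stream.size()); ++i) {
        if (!is_backward_branch(stream[i])) continue;
        if (start >= 0) {
            long len = i - start;                 // instructions between the two branches
            if (len >= 8 && len <= 64) return {start, i};
        }
        start = i;                                // next candidate starts at this branch
    }
    return {-1, -1};
}
```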

Misspeculations

Predicted Instructions

Speedup

Conclusions TSMA is designed to exploit trace-level speculation, with special emphasis on minimizing misspeculation penalties. Results show that the architecture is tolerant to misspeculations: a speedup of 16% is achieved with a predictor that misses 70%.

Future Work Aggressive trace level predictors: bigger traces, better value predictors. Generalization to multiple threads: cascade execution. Mixing prediction & execution: speculated traces do not need to be fully speculated.