
Memory Consistency in Vector IRAM David Martin

Consistency model applies to instructions in a single instruction stream (different from multi-processor consistency!).

The Memory Consistency Model

        SaS   SaV   VaS   VaV   VPaVP
RaW      *     +     +     +      +
WaR      *     +     +     +      +
WaW      *     +     +     +      +

a = after    S = scalar    V = vector    VP = virtual processor
R = read     W = write
* = no sync required       + = sync required

Definition of a “XaY” sync: All operations of type Y occurring before the sync in program order appear to execute before any operation of type X occurring after the sync in program order.
Definition of a “XaY” sync to vector register $vr_i: The most recent operation of type Y to $vr_i appears to execute before any operation of type X occurring after the sync in program order.
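
The table above can be read as a lookup from (hazard, ordering) to "sync required or not". A minimal sketch of that lookup follows; the function and constant names are hypothetical and chosen only for illustration, they are not part of the VIRAM toolchain.

```python
# Toy encoding of the sync-requirement table. All names here are
# illustrative assumptions, not part of any real VIRAM software.
HAZARDS = ("RaW", "WaR", "WaW")
ORDERINGS = ("SaS", "SaV", "VaS", "VaV", "VPaVP")


def sync_required(hazard: str, ordering: str) -> bool:
    """Return True if the table marks this hazard/ordering pair with '+'."""
    if hazard not in HAZARDS:
        raise ValueError(f"unknown hazard: {hazard}")
    if ordering not in ORDERINGS:
        raise ValueError(f"unknown ordering: {ordering}")
    # Per the table, only the scalar-after-scalar column is '*'
    # (no sync required); every other column is '+'.
    return ordering != "SaS"
```

For example, `sync_required("RaW", "VaS")` is true (a vector read after a scalar write needs a sync), while `sync_required("RaW", "SaS")` is false.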

Why Relax Memory Consistency?
– Natural microarchitecture has multiple paths to memory
– Want to decouple scalar and vector units without complex hardware
Trade-off between more complex hardware (speculation, disambiguation, cache coherence) and more complex software (sync instructions)
Should explore solutions to this trade-off that involve more hardware: e.g. hardware guarantees SaV and VaS ordering, but leaves VaV and VP orderings to software.
[Figure: block diagram with Fetch, Scalar Core, Vector Unit, Sync, and Memory]

Software Conventions for Syncs
Vector code is responsible for not messing things up.
– Allows us to vectorize libraries to speed up existing programs.
– Don’t want to assume that our compiler will compile and globally optimize all non-vector code that we run.
Alternative model: Pass around flags to communicate sync requirements or history
– Must assume that our compiler compiles all code run on IRAM.
– Not sure we want to accept that restriction.
Vector Function Conventions:
1. Execute VaS and VaV syncs on entry to vector code.
2. Execute SaV sync on exit from vector code.
[Figure: scalar code calling vector code, with VaS and VaV syncs on entry and an SaV sync on return]
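
The two vector function conventions can be sketched as a wrapper that brackets a vector routine with the entry and exit syncs. This is a simulation that appends strings to a trace, not real VIRAM code; the mnemonics ("sync VaS", "vld", "vadd", "vst") and the helper names are hypothetical.

```python
from typing import Callable, List

VectorBody = Callable[[List[str]], None]


def with_sync_convention(body: VectorBody) -> VectorBody:
    """Wrap a simulated vector routine with the entry/exit syncs."""
    def wrapped(trace: List[str]) -> None:
        # Convention 1: VaS and VaV syncs on entry to vector code.
        trace.append("sync VaS")
        trace.append("sync VaV")
        body(trace)
        # Convention 2: SaV sync on exit from vector code.
        trace.append("sync SaV")
    return wrapped


@with_sync_convention
def vector_add(trace: List[str]) -> None:
    # Hypothetical instruction mnemonics, for illustration only.
    trace.extend(["vld", "vld", "vadd", "vst"])
```

Calling `vector_add(trace)` on an empty list leaves the routine body surrounded by the three syncs, so a caller that knows nothing about vector code still sees consistent memory.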

Sync Implementations and Costs
SaV: Stall fetch unit until vector unit has committed all vector memory instructions.
– Could take 1000s of cycles with many indexed vector memory operations in flight!
– Very difficult to delay issue since it is often issued at the end of a vector routine.
VaS: Stall fetch unit until scalar unit has committed all scalar memory instructions.
– Not too expensive (10s of cycles?) because scalar unit is ahead of the vector unit, because the scalar core is simple, and because the data cache is write-thru.
– Easy to delay issue because it is often issued at the start of a vector routine.
VaV and VPaVP: No operation.
– Nop because we have 1 vector memory unit and no vector caches.

Current Sync Analysis Tool
Executes a program and tells you:
1. Whenever two memory references are not:
   – Ordered by architectural guarantees
   – Ordered by register dependencies
   – Ordered by an intervening sync instruction
2. Whenever a sync instruction is not used to resolve any hazard, as described in (1).
Caveats:
– Hazards are detected from a single program execution: information may not hold true for all possible executions of the program.
– Hazard detection is conservative in the presence of synchronization chains.

Two Examples of Synchronization Chains

Example 1:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  WAR SYNC
  Write(A) <- r3

Example 2:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  Write(A) <- r2
  Hazard?
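
The two checks the tool performs can be illustrated with a toy trace checker. This is a sketch under stated assumptions, not the actual analysis tool: the `Mem` and `Sync` record types, the trace format, and the rule that only direct register dependencies count are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Mem:
    unit: str                    # "S" (scalar) or "V" (vector)
    kind: str                    # "R" (read) or "W" (write)
    addr: str
    dst: str = ""                # register a read loads into
    srcs: Tuple[str, ...] = ()   # registers the operation consumes


@dataclass(frozen=True)
class Sync:
    hazard: str                  # "RaW", "WaR" or "WaW"
    ordering: str                # e.g. "VaV", "SaV", "VaS"


def analyze(trace: List[Union[Mem, Sync]]):
    """Return (unordered reference pairs, indices of unused syncs)."""
    hazards, used = [], set()
    for j, later in enumerate(trace):
        if not isinstance(later, Mem):
            continue
        for i in range(j):
            earlier = trace[i]
            if not isinstance(earlier, Mem) or earlier.addr != later.addr:
                continue
            if earlier.kind == "R" and later.kind == "R":
                continue  # two reads never conflict
            if earlier.unit == "S" and later.unit == "S":
                continue  # ordered by architectural guarantee (SaS)
            if earlier.dst and earlier.dst in later.srcs:
                continue  # ordered by a direct register dependency
            hazard = f"{later.kind}a{earlier.kind}"
            ordering = f"{later.unit}a{earlier.unit}"
            matching = [k for k in range(i + 1, j)
                        if isinstance(trace[k], Sync)
                        and trace[k].hazard == hazard
                        and trace[k].ordering == ordering]
            if matching:
                used.update(matching)  # these syncs resolve a hazard
            else:
                hazards.append((i, j, hazard))
    unused = [k for k, op in enumerate(trace)
              if isinstance(op, Sync) and k not in used]
    return hazards, unused
```

Because this sketch only examines pairs of references directly, it reports the write-after-write pair in a chain like Example 1 even though the intermediate RaW and WaR syncs order the two writes transitively. That is one way conservatism around synchronization chains could arise; whether the real tool behaves the same way is not stated in the slides.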

Optimizing Code
Basic problem:
– Vector unit requires setup: VL, VPW, mask, exceptions
– Vector code responsible for issuing syncs
– Both of these are required in a vector routine if nothing is known about the calling context!
All solutions share the notion of giving control of the calling context to the compiler. Two options:
(1) Pass around flags so that syncs and setup code can be avoided at run-time
(2) Do global optimizations so that syncs and setup code can be eliminated at compile-time.
[Figure: Scalar code -> Vector setup -> VaS and VaV sync -> Vector function -> SaV sync -> Scalar code -> Vector setup -> VaS and VaV sync -> Vector function -> SaV sync -> Scalar code]

Optimization Example
Demonstrates potential benefit from optimizing scalar-vector communication
Code computes A+B+C+D+E+F in the following manner:
[Figure: tree of five additions over the operands A, D, B, C, E, F]
– Unoptimized code calls a general vector add routine 5 times
– First optimization inlines the 5 routines and removes vector initialization sequences
– Second optimization also removes unnecessary sync instructions
Optimization goal is to avoid “sawtooth” in instantaneous performance graphs caused by draining the vector pipelines between vector loops
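
To make the three code variants concrete, the sketch below builds a simulated instruction sequence for each and counts the setup and sync operations. The slides do not say which syncs count as unnecessary, so this assumes that no scalar memory operations occur between the inlined loops (making the intermediate SaV and VaS syncs removable) and that the VaV syncs between loops are kept. All names and mnemonics are hypothetical.

```python
from collections import Counter
from typing import List

SETUP = "vector setup"      # stands in for VL, VPW, mask, exception setup
LOOP = "vadd loop"          # one vectorized add loop


def unoptimized(n_adds: int) -> List[str]:
    """Each call to the general routine does its own setup and syncs."""
    seq: List[str] = []
    for _ in range(n_adds):
        seq += [SETUP, "sync VaS", "sync VaV", LOOP, "sync SaV"]
    return seq


def inlined(n_adds: int) -> List[str]:
    """First optimization: inline, keep a single setup sequence."""
    seq = [SETUP]
    for _ in range(n_adds):
        seq += ["sync VaS", "sync VaV", LOOP, "sync SaV"]
    return seq


def inlined_syncs_removed(n_adds: int) -> List[str]:
    """Second optimization, under the assumption stated above:
    one VaS on entry, one SaV on exit, VaV kept before every loop."""
    seq = [SETUP, "sync VaS"]
    for _ in range(n_adds):
        seq += ["sync VaV", LOOP]
    seq.append("sync SaV")
    return seq
```

With five adds, `Counter(unoptimized(5))` shows five setups and five SaV syncs, `Counter(inlined(5))` shows one setup and five SaV syncs, and `Counter(inlined_syncs_removed(5))` shows one setup and one SaV sync. Only the final variant avoids draining the vector pipeline between loops.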

Large optimization potential for short vector loops. SaV syncs are most important to eliminate or delay. VaS sync performance impact is unclear. VaV syncs are virtually free in VIRAM-1. Setup code is expensive. For this example, it is as expensive as the SaV syncs.
