Memory Consistency in Vector IRAM
David Martin

The Memory Consistency Model

The consistency model applies to instructions in a single instruction stream (different from multiprocessor consistency!).

          SaS   SaV   VaS   VaV   VPaVP
   RaW     *     +     +     +     +
   WaR     *     +     +     +     +
   WaW     *     +     +     +     +

Key: a = after, S = scalar, V = vector, VP = virtual processor, R = read, W = write; * = no sync required, + = sync required.

Definition of an "XaY" sync: all operations of type Y occurring before the sync in program order appear to execute before any operation of type X occurring after the sync in program order.

Definition of an "XaY" sync to vector register $vr_i: the most recent operation of type Y to $vr_i appears to execute before any operation of type X occurring after the sync in program order.
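As a concrete reading of the table, consider a vector store followed by a scalar load of the same location: that is a read-after-write across the scalar/vector boundary (an SaV case), so a sync is required. The sketch below is hypothetical C, with made-up intrinsic names (vram_vadd_store, sync_s_v) standing in for the real instructions; it only illustrates the rule, not VIRAM's actual API.

#include <stdint.h>

/* Hypothetical intrinsics (names assumed, not the real VIRAM interface):
 *   vram_vadd_store : a vectorized loop that writes dst[0..n-1]
 *   sync_s_v        : an SaV sync (scalar ops wait for prior vector memory ops) */
extern void vram_vadd_store(int32_t *dst, const int32_t *a, const int32_t *b, int n);
extern void sync_s_v(void);

int32_t sum_first_element(int32_t *dst, const int32_t *a, const int32_t *b, int n)
{
    vram_vadd_store(dst, a, b, n);   /* vector unit writes dst[0..n-1]                        */
    sync_s_v();                      /* RaW, scalar-after-vector: a '+' entry, sync required  */
    return dst[0];                   /* scalar read now sees the vector store                 */
}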

Why Relax Memory Consistency?

The natural microarchitecture has multiple paths to memory, and we want to decouple the scalar and vector units without complex hardware. There is a trade-off between more complex hardware (speculation, disambiguation, cache coherence) and more complex software (sync instructions). We should explore solutions to this trade-off that involve more hardware: e.g., hardware guarantees SaV and VaS ordering but leaves VaV and VPaVP orderings to software.

[Block diagram: instruction fetch feeding the scalar core and the vector unit, with a sync path between them and both units accessing memory.]

Software Conventions for Syncs

Vector code is responsible for not messing things up.
–Allows us to vectorize libraries to speed up existing programs.
–We don't want to assume that our compiler will compile and globally optimize all non-vector code that we run.

Alternative model: pass around flags to communicate sync requirements or history.
–Must assume that our compiler compiles all code run on IRAM.
–Not sure we want to accept that restriction.

Vector function conventions (sketched below):
1. Execute VaS and VaV syncs on entry to vector code.
2. Execute an SaV sync on exit from vector code.

[Diagram: control passes from scalar code into vector code through VaS and VaV syncs, and back to scalar code through an SaV sync.]
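A minimal sketch of a vector routine following these conventions, assuming hypothetical intrinsics sync_v_s(), sync_v_v(), and sync_s_v() for the VaS, VaV, and SaV syncs; the scalar loop stands in for the real vectorized body.

extern void sync_v_s(void);   /* VaS sync: vector memory ops wait for earlier scalar ops        */
extern void sync_v_v(void);   /* VaV sync: ... and for vector memory ops still in flight         */
extern void sync_s_v(void);   /* SaV sync: later scalar ops wait for this routine's vector ops   */

/* Convention 1: VaS and VaV syncs on entry. Convention 2: SaV sync on exit. */
void vsaxpy(float *y, const float *x, float a, int n)
{
    sync_v_s();
    sync_v_v();
    for (int i = 0; i < n; i++)       /* placeholder for the vectorized loop body */
        y[i] = a * x[i] + y[i];
    sync_s_v();
}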

Sync Implementations and Costs

SaV: Stall the fetch unit until the vector unit has committed all vector memory instructions.
–Could take thousands of cycles with many indexed vector memory operations in flight!
–Very difficult to delay issue, since it is often issued at the end of a vector routine.

VaS: Stall the fetch unit until the scalar unit has committed all scalar memory instructions.
–Not too expensive (tens of cycles?) because the scalar unit runs ahead of the vector unit, the scalar core is simple, and the data cache is write-through.
–Easy to delay issue because it is often issued at the start of a vector routine.

VaV and VPaVP: No operation.
–A no-op because we have one vector memory unit and no vector caches.

Current Sync Analysis Tool

Executes a program and tells you:
1. Whenever two memory references are not:
 –ordered by architectural guarantees,
 –ordered by register dependencies, or
 –ordered by an intervening sync instruction.
2. Whenever a sync instruction is not used to resolve any hazard, as described in (1).

Caveats:
–Hazards are detected from a single program execution: the information may not hold for all possible executions of the program.
–Hazard detection is conservative in the presence of synchronization chains.

Two examples of synchronization chains:

  Example 1:              Example 2:
  Write(A) <- r1          Write(A) <- r1
  RAW SYNC                RAW SYNC
  Read(A) <- r2           Read(A) <- r2
  WAR SYNC                Write(A) <- r2   (Hazard?)
  Write(A) <- r3
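A rough sketch of the check the tool performs on a recorded trace, under assumed data structures; MemRef, arch_ordered(), and syncs_between() are all hypothetical stand-ins, not the actual tool. It flags any pair of conflicting references to the same address that is not ordered by an architectural guarantee, a register dependence, or an intervening sync.

/* Hypothetical trace entry: one dynamic memory reference. */
typedef struct {
    unsigned long addr;
    int is_write;        /* 1 = write, 0 = read                                  */
    int is_vector;       /* 1 = vector reference, 0 = scalar                     */
    int dep_ordered;     /* ordered by a register dependence (assumed precomputed) */
} MemRef;

/* Stand-ins: arch_ordered() encodes the '*' cases (e.g. scalar-after-scalar);
 * syncs_between() scans the trace for a sync of the right XaY type between i and j. */
extern int arch_ordered(const MemRef *earlier, const MemRef *later);
extern int syncs_between(int i, int j, const MemRef *earlier, const MemRef *later);

int report_hazards(const MemRef *trace, int n)
{
    int hazards = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            const MemRef *x = &trace[i], *y = &trace[j];
            if (x->addr != y->addr || (!x->is_write && !y->is_write))
                continue;                     /* only RaW, WaR, WaW pairs matter      */
            if (arch_ordered(x, y) || y->dep_ordered || syncs_between(i, j, x, y))
                continue;                     /* ordered somehow: nothing to report   */
            hazards++;                        /* unordered conflicting pair: report it */
        }
    return hazards;
}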

Optimizing Code

Basic problem:
–The vector unit requires setup: VL, VPW, mask, exceptions.
–Vector code is responsible for issuing syncs.
–Both are required in a vector routine if nothing is known about the calling context!

All solutions share the notion of giving the compiler control of the calling context. Two options:
(1) Pass around flags so that syncs and setup code can be avoided at run time.
(2) Do global optimizations so that syncs and setup code can be eliminated at compile time.

[Diagram: the repeated pattern of scalar code, vector setup, VaS and VaV syncs, vector function, SaV sync, scalar code, ...; the setup and sync steps recur around every vector call.]
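A hedged sketch of option (2): vec_add() follows the per-call convention from the earlier slide, while the compile-time-optimized version hoists the setup and the entry/exit syncs out of the sequence of calls. vec_setup(), vec_add(), and vec_add_bare() are hypothetical names, not the actual VIRAM library.

extern void vec_setup(int n);                                               /* hypothetical: set VL, VPW, mask, exception state */
extern void vec_add(float *d, const float *a, const float *b, int n);       /* setup + entry/exit syncs inside                  */
extern void vec_add_bare(float *d, const float *a, const float *b, int n);  /* body only: no setup, no syncs                    */
extern void sync_v_s(void), sync_v_v(void), sync_s_v(void);

/* Unoptimized: every call repeats setup, VaS/VaV syncs on entry, and an SaV sync on exit. */
void sum4_unoptimized(float *t, const float *a, const float *b,
                      const float *c, const float *d, int n)
{
    vec_add(t, a, b, n);
    vec_add(t, t, c, n);
    vec_add(t, t, d, n);
}

/* Optimized at compile time: one setup, one entry sync pair, and one exit sync for the
 * whole vector region. The intermediate syncs can be dropped because dependences within
 * the region already order the vector operations (and VaV is a no-op in VIRAM-1 anyway). */
void sum4_optimized(float *t, const float *a, const float *b,
                    const float *c, const float *d, int n)
{
    vec_setup(n);
    sync_v_s();
    sync_v_v();
    vec_add_bare(t, a, b, n);
    vec_add_bare(t, t, c, n);
    vec_add_bare(t, t, d, n);
    sync_s_v();
}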

Optimization Example

Demonstrates the potential benefit from optimizing scalar-vector communication. The code computes A+B+C+D+E+F as a tree of pairwise vector additions over A, D, B, C, E, and F.
–Unoptimized code calls a general vector add routine 5 times.
–The first optimization inlines the 5 routines and removes the vector initialization sequences.
–The second optimization also removes unnecessary sync instructions.
The optimization goal is to avoid the "sawtooth" in instantaneous performance graphs caused by draining the vector pipelines between vector loops.

There is large optimization potential for short vector loops. SaV syncs are the most important to eliminate or delay. The performance impact of VaS syncs is unclear. VaV syncs are virtually free in VIRAM-1. Setup code is expensive; for this example, it is as expensive as the SaV syncs.