DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

Tae Jun Ham (Princeton Univ.)
Juan Luis Aragón (Univ. of Murcia)
Margaret Martonosi (Princeton Univ.)

Accelerator Communication Challenge

Problems: accelerators require careful communication management.
- Limited scratchpad memory size: data must be carefully divided/blocked
- Little memory latency tolerance: best if data arrives at the accelerator before the computation needs it

Current solution: programmers manage accelerator communication by hand.
- Difficult and error-prone
- Limited portability across varying local memory sizes
- Often results in suboptimal performance
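For concreteness, the kind of hand-managed blocking and double-buffering the slide alludes to looks roughly like the sketch below. The dma_copy_async/dma_wait calls and SPM_BLOCK size are hypothetical stand-ins for a platform's DMA API and scratchpad capacity, not anything defined by DeSC.

    /* Hypothetical hand-managed double buffering for an accelerator scratchpad.
     * dma_copy_async(), dma_wait(), and SPM_BLOCK are illustrative stand-ins. */
    #define SPM_BLOCK 1024                   /* block sized to fit the scratchpad */
    void dma_copy_async(float *dst, const float *src, int n);  /* hypothetical DMA API */
    void dma_wait(const float *dst);

    float spm[2][SPM_BLOCK];                 /* fill one buffer while computing on the other */

    void process(const float *a, float *out, int n) {
        int cur = 0;
        dma_copy_async(spm[cur], &a[0], SPM_BLOCK);          /* prefetch first block */
        for (int blk = 0; blk < n / SPM_BLOCK; blk++) {
            int nxt = cur ^ 1;
            if (blk + 1 < n / SPM_BLOCK)                     /* overlap next transfer */
                dma_copy_async(spm[nxt], &a[(blk + 1) * SPM_BLOCK], SPM_BLOCK);
            dma_wait(spm[cur]);                              /* stall until current data arrives */
            for (int i = 0; i < SPM_BLOCK; i++)              /* compute on current block */
                out[blk * SPM_BLOCK + i] = spm[cur][i] * 2.0f;
            cur = nxt;
        }
    }

Getting the blocking, prefetch distance, and buffer sizing right for every target is exactly the burden DeSC aims to remove.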

DeSC Solution

Problems (recap): accelerators require careful communication management.
- Limited scratchpad memory size: data must be carefully divided/blocked
- Little memory latency tolerance: best if data arrives before the computation needs it

Our solution: DEcoupled Supply-Compute communication management (DeSC) automatically manages and optimizes accelerator communication.
- Portability: works with any local memory size
- Performance: minimizes the effect of memory latency by communicating data to local memory as early as possible
- Specialization: different HW can be used for communication and computation

Decoupling Communication and Computation

- Decoupling communication and computation is the key to DeSC's communication management
  Inspired by James Smith's seminal work "Decoupled Access/Execute Architecture"

Original code:
    for (i = 0; i < N; i++) {
        v1 = LOAD(&a[i]);
        v2 = LOAD(&b[i]);
        val = v1 + v2 * k;
        STORE(&c[i], val);
    }

Data supply slice:
    for (i = 0; i < N; i++) {
        v1 = LOAD(&a[i]);
        PRODUCE(v1);
        v2 = LOAD(&b[i]);
        PRODUCE(v2);
        STORE_ADDR(&c[i]);
    }

Computation slice:
    for (i = 0; i < N; i++) {
        v1 = CONSUME();
        v2 = CONSUME();
        val = v1 + v2 * k;
        STORE_VAL(val);
    }

- Data supply slice: the instructions that calculate addresses for LOAD/STORE instructions, plus PRODUCE instructions
- Computation slice: the instructions that compute values for STORE instructions, plus CONSUME instructions
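As a software analogy of this decoupling, the sketch below runs the two slices as two threads connected by a small bounded FIFO standing in for the communication queue. The produce/consume helpers mirror the slide's PRODUCE/CONSUME, but the code is our own illustration, not DeSC's ISA or hardware.

    /* Software analogy of DeSC's two slices: a supply thread streams loads into
     * a bounded FIFO (the "communication queue"); a compute thread consumes them. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 16
    #define QSIZE 4                            /* illustrative queue capacity */
    static double q[QSIZE];
    static int head, tail, count;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER,
                          notempty = PTHREAD_COND_INITIALIZER;

    static void produce(double v) {            /* blocks when the queue is full */
        pthread_mutex_lock(&m);
        while (count == QSIZE) pthread_cond_wait(&notfull, &m);
        q[tail] = v; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&m);
    }

    static double consume(void) {              /* blocks when the queue is empty */
        pthread_mutex_lock(&m);
        while (count == 0) pthread_cond_wait(&notempty, &m);
        double v = q[head]; head = (head + 1) % QSIZE; count--;
        pthread_cond_signal(&notfull);
        pthread_mutex_unlock(&m);
        return v;
    }

    static double a[N], b[N], c[N];
    static const double k = 3.0;

    static void *supply_slice(void *arg) {     /* loads + PRODUCEs, no value computation */
        for (int i = 0; i < N; i++) { produce(a[i]); produce(b[i]); }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        pthread_t supp;
        pthread_create(&supp, NULL, supply_slice, NULL);
        for (int i = 0; i < N; i++) {          /* computation slice: CONSUMEs + arithmetic */
            double v1 = consume(), v2 = consume();
            c[i] = v1 + v2 * k;
        }
        pthread_join(supp, NULL);
        printf("c[5] = %.1f\n", c[5]);         /* 5 + 10*3 = 35 */
        return 0;
    }

The supply thread can run ahead of the compute thread by up to the queue capacity, which is the latency-hiding effect the rest of the talk builds on.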

DeSC: Decoupled Supply-Compute Communication Management

- DeSC is a HW/SW framework that automatically manages and optimizes communication through decoupling
  1. At compile time, the LLVM-based DeSC compiler pass decouples software into a data supply slice and a computation slice
  2. At run time, each slice is mapped to a different specialized hardware device: the Supplier Device (SuppD), which sits on the memory interface and PRODUCEs data into the communication queue, and the Computation Device (CompD), which CONSUMEs from the queue

Key Benefits of DeSC

1. The SuppD knows exactly which data the CompD will use next
   Portability: DeSC works with any given local storage (communication queue) size
2. The SuppD can pass data to the CompD before the CompD actually needs it
   Performance: DeSC minimizes the memory latency exposed to the computation
3. DeSC allows specialized devices for the SuppD and the CompD
   Specialization: DeSC uses an extended out-of-order core as the SuppD, and an accelerator or an out-of-order core as the CompD

Presentation Outline

- Key optimizations of DeSC
  - Terminal load optimization
  - Loss of decoupling optimization
- DeSC evaluation results
- Conclusions

Challenges in Using an OoO Core as a SuppD

- Challenge: a long-latency load blocks the head of the ROB

Simplified SuppD slice:
    for (i = 1; i < N; i++) {
        v1 = LOAD(&a[i]);
        v2 = LOAD(&b[i]);
    }

Example parameters: issue width = 2, ROB size = 4, LD A latency = 6 cycles, LD B latency = 2 cycles.

    Cycle | Issued       | ROB (4 entries)
    ------+--------------+-----------------------------------
      0   | LD A1, LD B1 | LD A1, LD B1
      1   | LD A2, LD B2 | LD A1, LD B1, LD A2, LD B2
     2-5  | (none)       | full; waiting for LD A1 to commit
      6   | LD A3, LD B3 | LD A3, LD B3
      7   | LD A4, LD B4 | LD A3, LD B3, LD A4, LD B4
     8-11 | (none)       | full; waiting for LD A3 to commit
      12  | LD A5, LD B5 | LD A5, LD B5
      13  | LD A6, LD B6 | ...

Walkthrough: on cycle 0 the core issues LD A1 and LD B1, and on cycle 1 LD A2 and LD B2, filling the ROB. LD B1 finishes quickly but cannot commit because LD A1 has not committed, so no instruction issues until the end of cycle 5. Data is communicated to the communication queue only when loads commit, so supply arrives in bursts: A1, B1, A2, B2 at the end of the first stall window, A3, B3, A4, B4 at the next, and so on.

Challenges in Using an OoO Core as a SuppD

- DeSC question: why should LD B wait until an earlier long-latency LD A commits?
  Problem: all instructions must commit in order
  Idea: allow later instructions to commit before specific earlier instructions

With in-order commit (left side of the slide), the ROB cycles through LD A1/LD B1 through LD A4/LD B4 with "wait for commit" gaps, delivering A1, B1, A2, B2, A3, B3, A4, B4 in bursts. With early commit of terminal loads (right side), the short LD Bs commit and supply their data (B1, B2, B3, ...) while the long LD As are still in flight, each LD A supplies its data as soon as it completes, and the freed ROB slots let the SuppD issue much further ahead (up to LD A8 and LD B7 in the example).
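A back-of-the-envelope model of this example (our arithmetic, not the paper's simulator) shows why early commit matters: in-order commit sustains two A/B pairs every 6 cycles, while committing past stalled terminal loads approaches one pair per cycle once the pipeline is warm.

    /* Toy throughput model of the slide's example: LD A = 6-cycle latency,
     * LD B = 2-cycle, issue width 2, ROB of 4. Illustrative only. */
    #include <stdio.h>

    /* In-order commit: the ROB fills with {A,B,A,B} and everything waits on
     * the 6-cycle A at the head, so 2 pairs drain every 6 cycles. */
    static int cycles_inorder(int pairs)      { return (pairs / 2) * 6; }

    /* Early commit of terminal loads: a stalled A at the ROB head moves aside,
     * so the ROB keeps draining; roughly 1 pair/cycle after the first A. */
    static int cycles_early_commit(int pairs) { return 6 + pairs; }

    int main(void) {
        int pairs = 1000;
        printf("in-order commit:       %d cycles\n", cycles_inorder(pairs));
        printf("early terminal commit: %d cycles\n", cycles_early_commit(pairs));
        return 0;
    }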

Terminal Load Optimization in DeSC

- Allow later instructions to commit before specific earlier instructions
  "Specific instructions" = terminal loads that have reached the head of the ROB
- Terminal loads: loads whose fetched value is used only by a PRODUCE
  Very common in decoupled architectures, but essentially nonexistent in ordinary ones
- The compiler identifies and marks terminal loads with a LOAD_PRODUCE instruction

SuppD slice before marking terminal loads:
    for (i = 0; i < N; i++) {
        idx = LOAD(&a[i]);
        tmp = LOAD(&v[idx]);
        PRODUCE(tmp);
    }

SuppD slice after marking terminal loads:
    for (i = 0; i < N; i++) {
        idx = LOAD(&a[i]);
        LOAD_PRODUCE(&v[idx]);
    }
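A minimal sketch of the identification rule, in our own pseudocode rather than the actual LLVM pass: a load is terminal exactly when every user of its result is a PRODUCE, so the pair can be fused into LOAD_PRODUCE.

    /* Sketch of the compiler-side rule (illustrative IR, not the real pass). */
    typedef struct Instr { int opcode; int nuses; struct Instr **users; } Instr;
    enum { OP_LOAD, OP_PRODUCE, OP_LOAD_PRODUCE /* ... */ };

    static int is_terminal_load(const Instr *load) {
        if (load->opcode != OP_LOAD) return 0;
        for (int i = 0; i < load->nuses; i++)
            if (load->users[i]->opcode != OP_PRODUCE)
                return 0;             /* value feeds real computation: not terminal */
        return load->nuses > 0;       /* used, and only ever by PRODUCE */
    }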

Terminal Load Optimization in DeSC

- When a terminal load reaches the head of the ROB, it is moved to the terminal load buffer if its data is not ready; if the data is ready, it goes straight into the communication queue
- From the terminal load buffer, an entry moves to the communication queue once its data arrives
- Property #1: every entry in the terminal load buffer is non-speculative
  Terminal loads are moved to the buffer only from the head of the ROB
- Property #2: no entry in the terminal load buffer has dependents
  No other ROB entry ever needs to be updated with its load result

[Diagram: a long-latency terminal load A retires from the ROB head into the terminal load buffer (a CAM) while its data is outstanding; a short-latency terminal load B with ready data bypasses directly into the communication queue, and A follows once its data returns.]
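The commit-stage behavior described above can be summarized in pseudocode; the helper functions and structure fields below are illustrative stand-ins for hardware, not DeSC's actual RTL.

    /* Illustrative helper declarations; these model hardware structures. */
    void comm_queue_push(int id, long data);
    void rob_pop(void);
    void tlb_insert(int id);       /* terminal load buffer (CAM) insert */
    int  tlb_remove(int id);       /* returns nonzero if id was present */

    typedef struct { int id; int is_terminal; int data_ready; long data; } RobEntry;

    void try_commit_head(RobEntry *head) {
        if (head->data_ready) {
            comm_queue_push(head->id, head->data);  /* normal commit path */
            rob_pop();
        } else if (head->is_terminal) {
            tlb_insert(head->id);                   /* park in terminal load buffer */
            rob_pop();                              /* head no longer blocks the ROB */
        }
        /* else: a non-terminal load with data outstanding; the ROB head stalls */
    }

    /* When memory returns data, the terminal load buffer forwards it to the
     * communication queue out of program order. */
    void on_mem_response(int id, long data) {
        if (tlb_remove(id))
            comm_queue_push(id, data);
    }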

Terminal Load Optimization in DeSC

- The terminal load optimization allows out-of-order insertion of data toward the communication queue (a 2-4 KB FIFO)
- To support out-of-order data arrival, DeSC adds a CAM structure, the communication buffer (32-64 entries), in front of the queue
- A program-order-based ID is assigned to each PRODUCE and CONSUME so that every CONSUME can find its matching counterpart
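A toy model of the ID-matching scheme, assuming a small direct-mapped buffer in place of the real CAM: PRODUCEs insert out of program order under program-order IDs, and CONSUMEs drain strictly in ID order.

    /* Toy model of ID-matched consumption (our sketch, not DeSC's hardware). */
    #include <stdio.h>
    #include <string.h>

    #define BUF 8                              /* stand-in for the 32-64 entry CAM */
    static struct { int valid; long data; } buf[BUF];

    static void produce(int id, long data) {   /* out-of-order insertion */
        buf[id % BUF].valid = 1;
        buf[id % BUF].data = data;
    }

    static int consume(int id, long *out) {    /* in-program-order matching */
        if (!buf[id % BUF].valid) return 0;    /* counterpart not yet arrived */
        *out = buf[id % BUF].data;
        buf[id % BUF].valid = 0;
        return 1;
    }

    int main(void) {
        memset(buf, 0, sizeof buf);
        produce(1, 200);                       /* short load finishes first */
        produce(0, 100);                       /* long load arrives later */
        long v;
        for (int id = 0; id < 2; id++)         /* consumption still in program order */
            if (consume(id, &v)) printf("CONSUME id=%d -> %ld\n", id, v);
        return 0;
    }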

Using a General-Purpose OoO Core as a SuppD

- Simple microarchitectural support: DeSC's terminal load optimization allows instructions to commit earlier than long-latency terminal loads in specific cases
- Result: better data supply throughput; instead of bursts of A1, B1, A2, B2 every 6 cycles, the SuppD streams B1, B2, B3, A1, A2, ... continuously

DeSC Loss of Decoupling Optimizations

- Loss of Decoupling (LOD): the SuppD cannot run ahead because its data or control depends on the CompD

Data aliasing example:
    for (i = 1; i < 10; i++) {
        a[i] = a[i] * x;
        v = a[5] * y;
    }

Stall reason: on the 5th iteration, a[5] is updated in the CompD (a[5]' = a[5] * x), so the SuppD must stall until the updated a[5]' is passed back from the CompD before it can supply the operand for v = a[5]' * y.
(The comm. buffer is omitted from the slide's diagram for simplicity.)

DeSC Loss of Decoupling Optimizations

- Problem: the SuppD stalls only to return just-received data straight back to the CompD

DeSC solution: store-to-load forwarding across the decoupled pair.
- The CompD holds recently computed values in a temporary buffer for a while so they can be reused
- When the SuppD needs to supply a value that will be computed in the CompD (e.g., the updated a[5]' = a[5] * x), it inserts a pointer packet into the communication queue, pointing into the CompD's temporary buffer, instead of stalling
- When it consumes a pointer packet, the CompD reuses the data from its own temporary buffer (here, for v = a[5]' * y)
(The comm. buffer is omitted from the slide's diagram for simplicity.)
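A sketch of the pointer-packet mechanism under assumed packet and buffer layouts (ours, not the paper's): the queue carries either immediate data or an index into the CompD's temporary buffer of recent results.

    /* Sketch of pointer packets; packet layout and buffer size are illustrative. */
    #include <stdio.h>

    typedef struct { int is_pointer; long payload; } Packet;  /* data or temp-buffer index */

    #define TEMP_ENTRIES 4
    static long temp_buf[TEMP_ENTRIES];        /* CompD-side buffer of recent results */

    static long compd_consume(Packet p) {
        if (p.is_pointer)
            return temp_buf[p.payload];        /* reuse a locally computed value */
        return p.payload;                      /* ordinary data packet from the SuppD */
    }

    int main(void) {
        long x = 3, y = 10, a5 = 7;
        /* CompD computes a[5]' = a[5] * x and keeps it in its temp buffer: */
        temp_buf[0] = compd_consume((Packet){0, a5}) * x;
        /* The SuppD would otherwise stall for a[5]'; instead it sends a pointer: */
        long v = compd_consume((Packet){1, 0}) * y;        /* v = a[5]' * y */
        printf("v = %ld\n", v);                /* 7*3*10 = 210 */
        return 0;
    }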

DeSC Performance Improvements

- DeSC (OoO SuppD + OoO CompD) offers a 2.04x average speedup over a single core
  - Overall speedup on par with a perfect L1 cache
  - Higher speedup on memory-bound workloads
  - The terminal load and LOD optimizations are key to the speedup

[Speedup chart omitted; memory-bound workloads use the scaled axis on the right.]

DeSC Performance Improvements

- DeSC (OoO SuppD + accelerator CompD) offers a 1.56x speedup over accelerators with their own memory hierarchies
  - DeSC hides latency better than an accelerator managing its own memory hierarchy
- Please check the paper for more evaluation results

Conclusions

- DeSC is a HW/SW framework that automatically manages and optimizes communication through decoupling
  - Portability: works with any given local storage size
  - Performance: minimizes the latency exposed to computation
  - Specialization: communication and computation can use different devices
- DeSC provides multiple optimizations
  - Terminal load optimization: lets a general-purpose OoO core serve as a high-throughput data supplier
  - Loss of decoupling optimization: lets the supplier device run ahead without stalling for the computation device
- DeSC achieves 1.5x-2.0x speedups across these configurations
- Please check the paper for more details and explanations

DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

Tae Jun Ham (Princeton Univ.)
Juan Luis Aragón (Univ. of Murcia)
Margaret Martonosi (Princeton Univ.)

Paper URL:

More in the Paper

- Detailed DeSC hardware implementation
- DeSC compiler pass implementation
- More loss of decoupling optimizations
  - More data aliasing LOD cases
  - Computation-dependent control optimization
  - Computed address optimization
- Evaluation methodology
- More evaluation results (only 2 of 7 graphs are presented here)
- Detailed comparisons to related work

Backup: DeSC Hardware Diagram

[Diagram: on the SuppD side, address computation drives LOAD/LOAD_PRODUCE through the cache/memory interface, a store address buffer (FIFO CAM: Addr/Awt/Cnt) tracks STORE_ADDR entries, and the terminal load buffer plus register file feed PRODUCE; on the CompD side, value computation drives CONSUME and STORE_VAL through a store value buffer (FIFO array: Data/Cnt); the two sides are connected by the comm. buffer (CAM: ID/Data/Fwd) and the comm. queue (FIFO: ID/Data/Fwd).]