Presentation transcript:

1 Executive Summary

2 Overall Architecture of ARC
♦ Architecture of ARC
  - Multiple cores and accelerators
  - Global Accelerator Manager (GAM)
  - Shared L2 cache banks and NoC routers between multiple accelerators
[Figure legend: GAM, accelerator + DMA + SPM, shared router, core, shared L2 cache, memory controller]

3 What are the Problems with ARC?
♦ Dedicated accelerators are inflexible
  - An LCA (loosely coupled accelerator) may be useless for new algorithms or new domains
  - LCAs are often under-utilized
  - LCAs contain many replicated structures
    - Examples: FP ALUs, DMA engines, SPM
    - These sit unused whenever the accelerator is unused
♦ We want flexibility and better resource utilization
  - Solution: CHARM
♦ Private SPM is wasteful
  - Solution: BiN

4 A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
♦ Motivation
  - Great deal of data parallelism
    - Tasks performed by accelerators tend to have a great deal of data parallelism
  - Variety of LCAs with possible overlap
    - Utilization of any particular LCA is somewhat sporadic
  - It is expensive to have both:
    - Sufficient diversity of LCAs to handle the various applications
    - Sufficient quantity of a particular LCA to handle the parallelism
  - Overlap in functionality
    - LCAs can be built from a limited number of smaller, more general blocks: accelerator building blocks (ABBs)
♦ Idea
  - Flexible accelerator building blocks (ABBs) that can be composed into accelerators
♦ Leverage economy of scale

5 Micro Architecture of CHARM
♦ ABB
  - Accelerator building blocks (ABBs)
  - Primitive components that can be composed into accelerators
♦ ABB islands
  - Multiple ABBs
  - Shared DMA controller, SPM, and NoC interface
♦ ABC
  - Accelerator Block Composer (ABC)
    - Orchestrates the data flow between ABBs to create a virtual accelerator
    - Arbitrates requests from cores
♦ Other components
  - Cores
  - L2 banks
  - Memory controllers

6 CAMEL
♦ What are the problems with CHARM?
  - What if a new algorithm introduces new ABBs?
  - What if we want to use this architecture in multiple domains?
♦ Adding programmable fabric to increase
  - Longevity
  - Domain span

7 CAMEL (continued)
♦ The ABC is now also responsible for allocating the programmable fabric

8

9 Extensive Use of Accelerators
♦ Accelerators provide high power-efficiency over general-purpose processors
  - IBM wire-speed processor
  - Intel Larrabee
♦ ITRS 2007 system drivers prediction: accelerator number close to 1500 by 2022
♦ Two kinds of accelerators
  - Tightly coupled: part of the datapath
  - Loosely coupled: shared via the NoC
♦ Challenges
  - Accelerator extraction and synthesis
  - Efficient accelerator management
    - Scheduling
    - Sharing
    - Virtualization …
  - Friendly programming models

10 Architecture Support for Accelerator-Rich CMPs (ARC) [DAC’2012]
♦ Motivation
  - Managing accelerators through the OS is expensive
  - In an accelerator-rich CMP, management should be cheaper in terms of both time and energy
♦ OS-based accelerator access
  - Invoke opens the driver and returns the handle to the driver; it is called once
  - RD/WR is called multiple times
[Table: operation latency (# cycles) of Invoke and RD/WR on 1, 2, 4, 8, and 16 cores; the values did not survive extraction]
[Figure: CPU, application, OS, accelerator manager, and accelerator]

11 Overall Architecture of ARC
♦ Architecture of ARC
  - Multiple cores and accelerators
  - Global Accelerator Manager (GAM)
  - Shared L2 cache banks and NoC routers between multiple accelerators
[Figure legend: GAM, accelerator + DMA + SPM, shared router, core, shared L2 cache, memory controller]

12 Overall Communication Scheme in ARC
New ISA instructions: lcacc-req t; lcacc-rsrv t, e; lcacc-cmd id, f, addr; lcacc-free id
[Figure: CPU, memory, LCA, and GAM; each of the following slides highlights the message for its step]
1. The core requests a given type of accelerator (lcacc-req).

13 Overall Communication Scheme in ARC
2. The GAM responds with a “list + waiting time” or a NACK.

14 Overall Communication Scheme in ARC
3. The core reserves an accelerator (lcacc-rsrv) and waits.

15 Overall Communication Scheme in ARC
4. The GAM ACKs the reservation and sends the core ID to the accelerator.

16 Overall Communication Scheme in ARC
5. The core shares a task description with the accelerator through memory and starts it (lcacc-cmd). The task description consists of:
  - Function ID and input parameters
  - Input/output addresses and strides

17 Overall Communication Scheme in ARC
6. The accelerator reads the task description and begins working.
  - Reads from and writes to memory are overlapped with compute
  - The accelerator interrupts the core on a TLB miss

18 Overall Communication Scheme in ARC
7. When the accelerator finishes its current task, it notifies the core.

19 Overall Communication Scheme in ARC
8. The core then sends a message to the GAM freeing the accelerator (lcacc-free).
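
The eight steps above can be sketched as a small software model. This is only an illustration under stated assumptions: the class and method names (`GAM`, `TaskDescription`, `run_task`) are hypothetical and merely mirror the instruction names on the slides, the waiting-time estimate and the `e` operand of lcacc-rsrv are not modeled, and the accelerator is simulated by a plain Python function.

```python
from dataclasses import dataclass, field


@dataclass
class TaskDescription:
    """Hypothetical layout: function ID, parameters, addresses, and stride."""
    function_id: int
    in_addr: int
    out_addr: int
    length: int
    stride: int = 1
    params: tuple = ()


@dataclass
class GAM:
    """Toy Global Accelerator Manager that tracks free accelerator IDs per type."""
    free: dict                                  # accelerator type -> list of free IDs
    owner: dict = field(default_factory=dict)   # accelerator ID -> (type, core ID)

    def lcacc_req(self, acc_type):
        """Steps 1-2: return the list of free accelerators, or None as a NACK."""
        ids = self.free.get(acc_type, [])
        return list(ids) if ids else None

    def lcacc_rsrv(self, acc_type, acc_id, core_id):
        """Steps 3-4: reserve one accelerator; True stands for the ACK."""
        ids = self.free.get(acc_type, [])
        if acc_id not in ids:
            return False
        ids.remove(acc_id)
        self.owner[acc_id] = (acc_type, core_id)
        return True

    def lcacc_free(self, acc_id):
        """Step 8: return the accelerator to the free pool."""
        acc_type, _ = self.owner.pop(acc_id)
        self.free[acc_type].append(acc_id)


def run_task(gam, core_id, acc_type, task, memory, kernels):
    """Walk one task through request, reserve, command, and free."""
    candidates = gam.lcacc_req(acc_type)
    if candidates is None:
        return False                            # NACK: no accelerator of this type
    acc_id = candidates[0]
    if not gam.lcacc_rsrv(acc_type, acc_id, core_id):
        return False
    # Steps 5-7 (lcacc-cmd): the "accelerator" reads the task description
    # from shared memory, computes, and writes its results back.
    kernel = kernels[task.function_id]
    for i in range(task.length):
        src = task.in_addr + i * task.stride
        dst = task.out_addr + i * task.stride
        memory[dst] = kernel(memory[src], *task.params)
    gam.lcacc_free(acc_id)                      # step 8
    return True
```

In this model a NACK simply makes `run_task` return `False`; a real core would presumably retry or fall back to software.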

20 Accelerator Chaining and Composition
♦ Chaining
  - Efficient accelerator-to-accelerator communication
♦ Composition
  - Constructing virtual accelerators
[Figure labels: Accelerator1 and Accelerator2, each with a scratchpad and a DMA controller; M-point 1D FFT units, N-point 2D FFT, 3D FFT, virtualization]

21 Accelerator Virtualization
♦ The application programmer or compilation framework selects high-level functionality
♦ Implementation via
  - Monolithic accelerator
  - Distributed accelerators composed into a virtual accelerator
  - Software decomposition libraries
♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs

22 Accelerator Virtualization (example, continued)
Step 1: 1D FFT on Row 1 and Row 2

23 Accelerator Virtualization (example, continued)
Step 2: 1D FFT on Row 3 and Row 4

24 Accelerator Virtualization (example, continued)
Step 3: 1D FFT on Col 1 and Col 2

25 Accelerator Virtualization (example, continued)
Step 4: 1D FFT on Col 3 and Col 4
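
The row-then-column decomposition in this example can be checked numerically. The sketch below is an illustration rather than the accelerator implementation: `dft_1d` is a naive 4-point transform standing in for the 1-D FFT units, and all function names are hypothetical. It applies the 1-D transform to every row, then to every column of the result, and compares against a direct 2-D DFT.

```python
import cmath


def dft_1d(x):
    """Naive 1-D DFT; stands in for one 4-point 1-D FFT accelerator."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft_2d_rows_then_cols(a):
    """2-D transform built only from 1-D transforms (rows first, then columns)."""
    n_rows, n_cols = len(a), len(a[0])
    rows = [dft_1d(row) for row in a]                       # steps 1-2
    cols = [dft_1d([rows[r][c] for r in range(n_rows)])     # steps 3-4
            for c in range(n_cols)]
    # cols[c][u] holds output element (u, c); transpose back to row-major.
    return [[cols[c][u] for c in range(n_cols)] for u in range(n_rows)]


def dft_2d_direct(a):
    """Reference: direct evaluation of the 2-D DFT definition."""
    n_rows, n_cols = len(a), len(a[0])
    return [
        [
            sum(
                a[r][c] * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                for r in range(n_rows)
                for c in range(n_cols)
            )
            for v in range(n_cols)
        ]
        for u in range(n_rows)
    ]
```

With two 1-D units available, the two rows (or columns) named in each step would be processed in parallel; the sketch runs them sequentially.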

26 Light-Weight Interrupt Support
[Figure: CPU, LCA, and GAM]

27 Light-Weight Interrupt Support
♦ Sent by the GAM: request/reserve confirmation and NACK

28 Light-Weight Interrupt Support
♦ Further interrupt sources shown: TLB miss and task done

29 Light-Weight Interrupt Support
♦ The core sends logical addresses to the LCA
♦ The LCA keeps a small TLB for the addresses that it is working on

30 Light-Weight Interrupt Support
♦ Why logical addresses?
  1. Accelerators can work on irregular addresses (e.g. indirect addressing)
  2. Using a large page size could be a solution, but it would affect other applications

31 Light-Weight Interrupt Support
♦ It is expensive to handle the interrupts via the OS
♦ Interrupt latency to switch to the ISR and back (# cycles):
  - 1 core: 16 K
  - 2 cores: 20 K
  - 4 cores: 24 K
  - 8 cores: 27 K
  - 16 cores: 29 K

32 Light-Weight Interrupt Support
♦ Extending the core with light-weight interrupt (LWI) support

33 Light-Weight Interrupt Support
♦ Two main components are added:
  - A table to store ISR info
  - An interrupt controller to queue and prioritize incoming interrupt packets
♦ Each thread registers:
  - The address of the ISR, its arguments, and the lw-int source
♦ Limitations:
  - Can only be used while the core is running the thread that the LW interrupt belongs to
  - Otherwise the interrupt is handled by the OS
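
The two added components and the fallback rule can be sketched as follows. This is a software model under assumptions, not the hardware design: the class name, the use of a Python callable as the "ISR address", and the convention that a lower number means higher priority are all hypothetical.

```python
import heapq
from itertools import count


class LightWeightInterruptUnit:
    """Toy model: an ISR table plus a priority queue of interrupt packets."""

    def __init__(self):
        self.isr_table = {}     # (thread ID, source) -> (ISR callable, arguments)
        self.queue = []         # heap of (priority, sequence, thread, source, payload)
        self._seq = count()     # tie-breaker that keeps arrival order stable

    def register(self, thread_id, source, isr, args=()):
        """A thread registers the ISR, its arguments, and the interrupt source."""
        self.isr_table[(thread_id, source)] = (isr, args)

    def post(self, thread_id, source, priority, payload=None):
        """Queue an incoming interrupt packet (assumed: lower value = more urgent)."""
        heapq.heappush(
            self.queue, (priority, next(self._seq), thread_id, source, payload)
        )

    def dispatch(self, running_thread, os_fallback):
        """Serve queued packets in priority order.

        The light-weight path is taken only if the packet belongs to the
        thread currently running and an ISR is registered; otherwise the
        packet is handed to the OS fallback handler.
        """
        results = []
        while self.queue:
            _, _, tid, source, payload = heapq.heappop(self.queue)
            entry = self.isr_table.get((tid, source))
            if tid == running_thread and entry is not None:
                isr, args = entry
                results.append(isr(payload, *args))
            else:
                results.append(os_fallback(tid, source, payload))
        return results
```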

34 Programming interface to ARC
[Figure labels: platform creation; application mapping & development]

35 Evaluation methodology
♦ Benchmarks
  - Medical imaging
  - Vision & navigation

36 Application Domain: Medical Image Processing
♦ Pipeline stages: denoising, registration, segmentation, analysis, reconstruction
♦ Algorithms named on the slide: compressive sensing, level set methods, fluid registration, total variational algorithm, Navier-Stokes equations

37 Area Overhead
♦ AutoESL (from Xilinx) for C-to-RTL synthesis
♦ Synopsys for ASIC synthesis
  - 32 nm Synopsys Educational library
♦ CACTI for L2
♦ Orion for NoC
♦ One UltraSparc IIIi core (area scaled to 32 nm)
  - Original area in mm^2 at 0.13 um (value missing in the transcript)
[Table: number of instances/size, area (mm^2), and percentage (%) for Core, NoC, L2, Deblur, Denoise, Segmentation, Registration, and SPM banks; only the fragment “118MB x 2KB” of the instances/size row survives]
♦ Total ARC: 14.3 %

38 Experimental Results: Performance (N cores, N threads, N accelerators)
♦ Performance improvement over OS-based approaches: on average 51x, up to 292x
♦ Performance improvement over SW-only approaches: on average 168x, up to 380x

39 Experimental Results: Energy (N cores, N threads, N accelerators)
♦ Energy improvement over OS-based approaches: on average 17x, up to 63x
♦ Energy improvement over SW-only approaches: on average 241x, up to 641x

40 What are the Problems with ARC?
♦ Dedicated accelerators are inflexible
  - An LCA may be useless for new algorithms or new domains
  - LCAs are often under-utilized
  - LCAs contain many replicated structures
    - Examples: FP ALUs, DMA engines, SPM
    - These sit unused whenever the accelerator is unused
♦ We want flexibility and better resource utilization
  - Solution: CHARM
♦ Private SPM is wasteful
  - Solution: BiN

41 A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
♦ Motivation
  - Great deal of data parallelism
    - Tasks performed by accelerators tend to have a great deal of data parallelism
  - Variety of LCAs with possible overlap
    - Utilization of any particular LCA is somewhat sporadic
  - It is expensive to have both:
    - Sufficient diversity of LCAs to handle the various applications
    - Sufficient quantity of a particular LCA to handle the parallelism
  - Overlap in functionality
    - LCAs can be built from a limited number of smaller, more general blocks: accelerator building blocks (ABBs)
♦ Idea
  - Flexible accelerator building blocks (ABBs) that can be composed into accelerators
♦ Leverage economy of scale

42 Micro Architecture of CHARM
♦ ABB
  - Accelerator building blocks (ABBs)
  - Primitive components that can be composed into accelerators
♦ ABB islands
  - Multiple ABBs
  - Shared DMA controller, SPM, and NoC interface
♦ ABC
  - Accelerator Block Composer (ABC)
    - Orchestrates the data flow between ABBs to create a virtual accelerator
    - Arbitrates requests from cores
♦ Other components
  - Cores
  - L2 banks
  - Memory controllers

43 An Example of ABB Library (for Medical Imaging)
[Figure: internals of the Poly ABB]

44 Example of ABB Flow-Graph (Denoise)
[Figure: flow graph of the Denoise kernel built from subtract (-) and multiply (*) operations, followed by sqrt and 1/x; shown over several animation steps]

46 Example of ABB Flow-Graph (Denoise)
[Figure: the same flow graph partitioned into ABBs]
  - ABB1: Poly
  - ABB2: Poly
  - ABB3: Sqrt
  - ABB4: Inv

48 LCA Composition Process
[Figure: four ABB islands; the ABB types shown are x and y on island 1, x and w on island 2, z and w on island 3, y and z on island 4]

49 LCA Composition Process
1. Core initiation
  - The core sends the task description to the ABC: the task flow-graph of the desired LCA, together with the polyhedral space for input and output
  - Example task description: a flow-graph over ABBs x, y, and z with 10x10 input and output

50 LCA Composition Process
2. Task-flow parsing and task-list creation
  - The ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data size, and fills the task list
  - Needed ABBs: “x”, “y”, “z”
  - With a task size of one 5x5 block, the ABC generates 4 tasks (generated internally by the ABC)

51 LCA Composition Process
3. Dynamic ABB mapping
  - The ABC uses a pattern matching algorithm to assign ABBs to islands
  - It fills the composed LCA table and the resource allocation table
Resource allocation table (island ID / ABB type / ABB ID, then status where the slide shows one): 1/x/1 Free; 1/y/1; 2/x/1; 2/w/1; 3/z/1; 3/w/1; 4/y/1; 4/z/1

52 LCA Composition Process
3. Dynamic ABB mapping (continued)
Resource allocation table after mapping: 1/x/1 Busy; 1/y/1; 2/x/1 Free; 2/w/1; 3/z/1 Busy; 3/w/1 Free; 4/y/1; 4/z/1

53 LCA Composition Process
4. LCA cloning
  - Repeat to generate more LCAs if ABBs are available
Resource allocation table after cloning: 1/x/1 Busy; 1/y/1; 2/x/1; 2/w/1 Free; 3/z/1 Busy; 3/w/1 Free; 4/y/1 Busy; 4/z/1

54 LCA Composition Process
5. ABBs finishing task
  - When ABBs finish, they signal DONE to the ABC. If the ABC has another task, it sends it; otherwise it frees the ABBs
Resource allocation table: 1/x/1 Busy; 1/y/1; 2/x/1; 2/w/1 Free; 3/z/1 Busy; 3/w/1 Free; 4/y/1 Busy; 4/z/1

55 LCA Composition Process
5. ABBs being freed
  - When an ABB finishes, it signals the ABC. If the ABC has another task, it sends it; otherwise it frees the ABBs
Resource allocation table: 1/x/1 Busy; 1/y/1; 2/x/1 Free; 2/w/1; 3/z/1 Busy; 3/w/1 Free; 4/y/1; 4/z/1

56 LCA Composition Process
6. Core notified of end of task
  - When the LCA finishes, the ABC signals DONE to the core
Resource allocation table: 1/x/1 Free; 1/y/1; 2/x/1; 2/w/1; 3/z/1; 3/w/1; 4/y/1; 4/z/1
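
The task-list arithmetic on this slide (a 10x10 space cut into 5x5 blocks yields 4 tasks) can be reproduced with a small tiling helper. The function name and the tuple layout `(row, col, height, width)` are assumptions made for illustration; the slides do not describe how the ABC encodes a task.

```python
def split_into_tasks(rows, cols, block_rows, block_cols):
    """Cut a rows x cols data space into blocks of at most block_rows x block_cols.

    Returns one (row_start, col_start, height, width) tuple per task.
    Edge blocks are clipped when the space is not an exact multiple.
    """
    tasks = []
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            height = min(block_rows, rows - r0)
            width = min(block_cols, cols - c0)
            tasks.append((r0, c0, height, width))
    return tasks
```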
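
Steps 3 to 6 can be mimicked with a dictionary standing in for the resource table. The slides say the ABC uses a pattern matching algorithm; the sketch below substitutes a plain greedy first-fit search, which is an assumption and not the ABC's actual algorithm. Function names are hypothetical, and the table contents follow the island layout from the example (x, y / x, w / z, w / y, z).

```python
def make_resource_table():
    """Resource table keyed by (island ID, ABB type, ABB ID); all ABBs start free."""
    layout = [(1, "x"), (1, "y"), (2, "x"), (2, "w"),
              (3, "z"), (3, "w"), (4, "y"), (4, "z")]
    return {(island, abb_type, 1): "Free" for island, abb_type in layout}


def compose_lca(resource_table, needed_types):
    """Greedy first-fit (assumed): pick the first free ABB of each needed type.

    Returns the chosen table keys and marks them Busy, or returns None and
    leaves the table untouched if some type has no free ABB.
    """
    chosen = []
    for abb_type in needed_types:
        for key, status in resource_table.items():
            if key[1] == abb_type and status == "Free" and key not in chosen:
                chosen.append(key)
                break
        else:
            return None
    for key in chosen:
        resource_table[key] = "Busy"
    return chosen


def release_lca(resource_table, lca):
    """Free every ABB of a composed LCA once it signals completion."""
    for key in lca:
        resource_table[key] = "Free"
```

Calling `compose_lca` twice on a fresh table corresponds to composing one LCA and then cloning it; a third call fails because no free x remains.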

57 ABC Internal Design
♦ ABC sub-components
  - Resource Table (RT): keeps track of available/used ABBs
  - Composed LCA Table (CLT): eliminates the need to re-compose LCAs
  - Task List (TL): queues the broken-down LCA requests (split to a smaller data size)
  - TLB: services and shares the translation requests from ABBs
  - Task Flow-Graph Interpreter (TFGI): breaks the LCA DFG into ABBs
  - LCA Composer (LC): composes the LCA using available ABBs
♦ Implementation
  - RT, CLT, TL, and TLB are implemented using RAM
  - TFGI has a table to keep ABB types and an FSM to read the task-flow graph and compare
  - LC has an FSM to go over the CLT and RT and check-mark the available ABBs
[Figure: Accelerator Block Composer containing the resource table, composed LCA table, TLB, task list, DFG interpreter, and LCA composer; inputs from the cores and from ABBs (done signal); outputs to ABBs (allocate) and to ABBs (TLB service)]

58 CHARM Software Infrastructure
♦ ABB type extraction
  - Input: compute-intensive kernels from different applications
  - Output: ABB super-patterns
  - Currently semi-automatic
♦ ABB template mapping
  - Input: kernels + ABB types
  - Output: covered kernels as an ABB flow-graph
♦ CHARM uProgram generation
  - Input: ABB flow-graph
  - Output:

59 Evaluation Methodology
♦ Simics+GEMS based simulation
♦ AutoPilot/Xilinx + Synopsys for ABB/ABC/DMA-C synthesis
♦ Cacti for memory synthesis (SPM)
♦ Automatic flow to generate the CHARM software and simulation modules
♦ Case studies
  - Physical LCA sharing with the Global Accelerator Manager (LCA+GAM)
  - Physical LCA sharing with the ABC (LCA+ABC)
  - ABB composition and sharing with the ABC (ABB+ABC)
♦ Medical imaging benchmarks
  - Denoise, Deblur, Segmentation, and Registration

60 Area Overhead Analysis
♦ Area-equivalent
  - The total area consumed by the ABBs equals the total area of all LCAs required to run a single instance of each benchmark
♦ Total CHARM area is 14% of the 1cm x 1cm chip
  - A bit less than the LCA-based design

61 Results: Improvement Over LCA-based Design
♦ A configuration labeled “Nx” has N times the area-equivalent accelerators
♦ Performance
  - 2.5X vs. LCA+GAM (max 5X)
  - 1.4X vs. LCA+ABC (max 2.6X)
♦ Energy
  - 1.9X vs. LCA+GAM (max 3.4X)
  - 1.3X vs. LCA+ABC (max 2.2X)
♦ ABB+ABC gives better energy and performance
  - The ABC starts composing ABBs to create new LCAs
  - This creates more parallelism

62 Results: Platform Flexibility
♦ Two applications from two domains unrelated to medical imaging (MI)
  - Computer vision: Log-Polar Coordinate Image Patches (LPCIP)
  - Navigation: Extended Kalman Filter-based Simultaneous Localization and Mapping (EKF-SLAM)
♦ Only one ABB is added
  - Indexed Vector Load
♦ Benefit
  - MAX benefit over LCA+GAM: 3.64X
  - AVG benefit over LCA+GAM: 2.46X
  - MAX benefit over LCA+ABC: 3.04X
  - AVG benefit over LCA+ABC: 2.05X

63 CAMEL
♦ What are the problems with CHARM?
  - What if a new algorithm introduces new ABBs?
  - What if we want to use this architecture in multiple domains?
♦ Adding programmable fabric to increase
  - Longevity
  - Domain span

64 CAMEL (continued)
♦ The ABC is now also responsible for allocating the programmable fabric

65 CAMEL Microarchitecture

66 CAMEL Programmable Fabric Block

67 Programmable Fabric Allocation by ABC
♦ ABB replication in order to rate-match
♦ ABB assignment on the paths that have larger slack, in order to decrease the negative effect
♦ Interval-based allocation to improve performance across multiple requests
♦ When is an ABB placed on the programmable fabric (PF)?
  - The ABB does not exist in the system
  - The ABB is being used by other accelerators
    ● Beneficial if it can be rate-matched
    ● Potential slow-down

68 FPGA-allocation by ABC: Block Diagram
♦ Greedy approach
  - Single application service
♦ Interval-based approach
  - Wait for some fixed interval
  - Use a greedy approach to find the best allocation to satisfy multiple requests
[Flowchart nodes: LCA task flow graph; available ABBs; find needed ABBs on PF; min area feasible? (no: return FALSE); set maximum rate; fit PF? (yes: program fabric, return TRUE; no: decrease rate); lowest rate? (return FALSE)]
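
The decision loop suggested by the block diagram (start at the maximum rate, lower the rate until the ABBs fit on the fabric, give up at the lowest rate) can be sketched as below. The cost model is an assumption introduced only for illustration: each ABB's fabric area is taken to scale linearly with its rate, as if it were replicated to rate-match. The function name, parameters, and area numbers are hypothetical.

```python
def allocate_on_fabric(needed_abbs, base_area, fabric_area, rates):
    """Return the highest rate at which all needed ABBs fit, or None.

    needed_abbs: ABB types that must be placed on the programmable fabric
    base_area:   assumed area of each ABB type at rate 1
    fabric_area: total programmable-fabric area available
    rates:       candidate rates (any order)
    """
    def area_at(rate):
        # Assumed cost model: area grows linearly with the rate.
        return sum(base_area[abb] * rate for abb in needed_abbs)

    ordered = sorted(rates, reverse=True)
    # "Min area feasible?": even the lowest rate must fit, else fail early.
    if not ordered or area_at(ordered[-1]) > fabric_area:
        return None
    # "Set maximum rate", then "decrease rate" until the design fits.
    for rate in ordered:
        if area_at(rate) <= fabric_area:
            return rate
    return None
```

A returned rate corresponds to the "program fabric, return TRUE" branch; `None` corresponds to either "return FALSE" exit.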

69 CAMEL Results – All Domains

70 CAMEL Results - Speedup

71 CAMEL Results – Energy improvement