Jason Cong, Yu-Ting Chen, Zhenman Fang, Bingjun Xiao, Peipei Zhou

Slides:



Advertisements
Similar presentations
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
Advertisements

1 Hardware Support for Isolation Krste Asanovic U.C. Berkeley MURI “DHOSA” Site Visit April 28, 2011.
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA A Parameterizable.
Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.
1 CDSC CHP Prototyping Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou.
Feng-Xiang Huang A Low-Cost SOC Debug Platform Based on On-Chip Test Architectures.
1 Network Packet Generator Characterization presentation Supervisor: Mony Orbach Presenting: Eugeney Ryzhyk, Igor Brevdo.
ECE 699: Lecture 2 ZYNQ Design Flow.
ABACUS: A Hardware-Based Software Profiler for Modern Processors Eric Matthews Lesley Shannon School of Engineering Science Sergey Blagodurov Sergey Zhuravlev.
1b.1 Types of Parallel Computers Two principal approaches: Shared memory multiprocessor Distributed memory multicomputer ITCS 4/5145 Parallel Programming,
Programmable Logic- How do they do that? 1/16/2015 Warren Miller Class 5: Software Tools and More 1.
SOC Consortium Course Material ASIC Logic National Taiwan University Adopted from National Chiao-Tung University IP Core Design.
Department of Electrical Engineering Electronics Computers Communications Technion Israel Institute of Technology High Speed Digital Systems Lab. High.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
© 2004 Mercury Computer Systems, Inc. FPGAs & Software Components Graham Bardouleau & Jim Kulp Mercury Computer Systems, Inc. High Performance Embedded.
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,
XStream: Rapid Generation of Custom Processors for ASIC Designs Binu Mathew * ASIC: Application Specific Integrated Circuit.
Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili.
PROJECT - ZYNQ Yakir Peretz Idan Homri Semester - winter 2014 Duration - one semester.
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
A new perspective on processing-in-memory architecture design These data are submitted with limited rights under Government Contract No. DE-AC52-8MA27344.
Hardware Accelerator for Hot-word Recognition Gautam Das Govardan Jonathan Mathews Wasim Shaikh Mojes Koli.
Full and Para Virtualization
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Final Presentation Hardware DLL Real Time Partial Reconfiguration Management of FPGA by OS Submitters:Alon ReznikAnton Vainer Supervisors:Ina RivkinOz.
Los Alamos National Laboratory Streams-C Maya Gokhale Los Alamos National Laboratory September, 1999.
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.
SUBJECT : DIGITAL ELECTRONICS CLASS : SEM 3(B) TOPIC : INTRODUCTION OF VHDL.
Introduction to Operating Systems Concepts
Introduction to the FPGA and Labs
PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration Zhenman Fang, Michael Gill Jason Cong,
Fast iteration and prototyping in high-performance computing medical applications: a case study with Mentor Vista 8th INFIERI WORKSHOP 10/21/2016.
Lab 4 HW/SW Compression and Decompression of Captured Image
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
NFV Compute Acceleration APIs and Evaluation
Please do not distribute
System-on-Chip Design
Please do not distribute
Lynn Choi School of Electrical Engineering
Please do not distribute
Current Generation Hypervisor Type 1 Type 2.
Please do not distribute
Microarchitecture.
Ph.D. in Computer Science
James Coole PhD student, University of Florida Aaron Landy Greg Stitt
Self Healing and Dynamic Construction Framework:
Introduction to Programmable Logic
Please do not distribute
Application-Specific Customization of Soft Processor Microarchitecture
Texas Instruments TDA2x and Vision SDK
ENG3050 Embedded Reconfigurable Computing Systems
FPGA: Real needs and limits
CS 286 Computer Organization and Architecture
FPGAs in AWS and First Use Cases, Kees Vissers
CMPE419 Mobile Application Development
Improving java performance using Dynamic Method Migration on FPGAs
A Fully Pipelined and Dynamically Composable Architecture of CGRA
Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu
Highly Efficient and Flexible Video Encoder on CPU+FPGA Platform
Hot & Spicy: Improving Productivity with Python and HLS for FPGAs
FPro Bus Protocol and MMIO Slot Specification
Mihir Awatramani Lakshmi kiran Tondehal Xinying Wang Y. Ravi Chandra
ECE 699: Lecture 3 ZYNQ Design Flow.
Implementation of a GNSS Space Receiver on a Zynq
Network-on-Chip Programmable Platform in Versal™ ACAP Architecture
Application-Specific Customization of Soft Processor Microarchitecture
CMPE419 Mobile Application Development
(Lecture by Hasan Hassan)
Presentation transcript:

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture Jason Cong, Yu-Ting Chen, Zhenman Fang, Bingjun Xiao, Peipei Zhou Computer Science Department, UCLA Center for Domain-Specific Computing Center for Future Architectures Research

Our Motivation and Goal A stack of research tools for accelerator-rich architecture Standalone accelerator simulation: Aladdin Standalone accelerator generation: HLS System-level HLS-based ARA simulation: PARADE System-level pre-RTL SoC simulation: gem5 + Aladdin ARA FPGA prototyping: ARAPrototyper Evaluate an ARA on the real prototype Bridge the gap between simulations and real chips Reduce the hardware design efforts and system integration efforts for the users early-stage late-stage

Evaluation of an Accelerator-Rich Architecture Final stage FPGA prototyping Collect real runtime and system power numbers System integration (software + hardware)

Prototyping an ARA Using Xilinx Zynq SoC (FPGA fabrics + ARM) Major components of an ARA General processor cores A sea of heterogeneous accelerators Memory system + interconnects (NoC)

ARAPrototyper: A Rapid Prototyping Flow ARAPrototyper features Highly-automated accelerator prototyping flow Reusable baseline prototype Application programming interfaces (user APIs) Efficient system software stack (runtime for accelerators) Prototype platform Xilinx Zynq ZC706 SoC: Dual-core Cortex-A9 ARM + FPGA fabrics FPGA: implement accelerators, on-chip memories, and interconnects Linux is ported

Advantages of ARAPrototyper Design efforts reduction Automatically generated hardware IPs, system software, and APIs Reusable interconnect and memory system templates Push-button flow Short evaluation cycle Can run native binaries Can run large input data set User can adjust the microarchitecture of an accelerator easily Use HLS design flow See the impact at the system-level interconnect and memory

Architecture Overview

System Overview (What Can be Modeled?) IOMMU TLB Memory Requests (Virtual) Linux System software stack reservations, starts, releases App1 App2 Appk CMP C L1 LLC Acc1 Acc2 Acc3 Accm ARA memory system MPC MP MP1 MP2 MP3 MPN Memory Controller DRAM DIMMs 1. General-purpose cores 5. Coherency level 2. Heterogeneous accelerators 6. Virtual -> physical address translation 3. on-chip ARA memory system 7. System software (runtime) & APIs 4. Off-chip memory system

ARA Template (What Can be Reused?) User-Designed Accelerators Highly Parameterized Hardware Templates Memory Requests (Physical) IOMMU TLB Memory Requests (Virtual) IOMMU and Dedicated TLB Heterogeneous Accelerators Acc1 Acc2 Acc3 Acc4 AccM DMAC1 DMAC2 DMAC3 DMACN MP1 MP2 MP3 MPN Interconnect Layer 1 Partial Crossbar Homogeneous Shared Memory Banks Interconnect Layer 2 Interleaved Network DMACs Physical Memory Ports

Hardware Prototyping Flow

ARA Design Flow Accelerator design Integrated with high-level synthesis (Xilinx Vivado HLS) ARA system design: ARA specification file Specify accelerators, on-chip memories, and interconnects Highly parameterized HW templates of on-chip memories and interconnects can be reused Push-button flow Use Vivado HLS, Xilinx PlanAhead flow for FPGA bitstream synthesis Automatically generates the APIs based on the ARA spec. file

ARA Specification File Accelerator kernels <system> <ACCs> <acc type=“gradient” num=“2” num_params=“5”> <port size=“16384” num=“6”/> </acc> <acc type=“segmentation” num=“1” num_params=“13”> <port size=“16384” num=“8”/> <acc type=“rician” num=“1” num_params=“7”> <port size=“16384” num=“12”/> <acc type=“gaussian” num=“1” num_params=“7”> <port size=“16384” num=“5”/> </ACCs> <SharedBuffers size=“16384” num=“32” numDMACs=“4” /> <Interconnects> <ACCs_to_Buffers type=“crossbar” ON=“4”/> <Buffers_to_DMACs type=“interleaved”/> </Interconnects> <IOMMU> <TLB size=“8192” evict=“LRU”/> </IOMMU> <CoherentCache use=“0”/> <AccFrequency hz=“75000000”/> </system> IOMMU TLB grad1 grad2 seg1 rician1 gauss1 1 2 3 4 5 31 32 Shared memory banks & DMACs Interconnect configurations DMAC1 DMAC2 DMAC3 DMAC4 MP1 MP2 MP3 MP4 IOMMU/TLB Coherent L2 cache Accelerator frequency

Accelerator Design with HLS void gradient ( volatile unsigned int* param1, … volatile unsinged int* param5, volatile unsigned float mem_port1[SIZE], volatile unsigned float mem_port6[SIZE], volatile unsigned int* toIOMMU_FIFO, volatile unsigned int* fromIOMMU_FIFO) { // read function parameters int image_vddr = *param1; // main computation for(int i = 1; i < P; i++) for(int j = 1; j < M; j++) for(int k = 1; k < N; k++) { g[CENTER] = 1.0/sqrt( EPSILON + SQR(u[CENTER] – u[RIGHT]) + SQR(u[CENTER] – u[LEFT]) + SQR(u[CENTER] – u[DOWN]) + SQR(u[CENTER] – u[UP]) + SQR(u[CENTER] – u[ZOUT]) + SQR(u[CENTER] – u[ZIN]) ); } Parameters sent from the application (passed through AXILite interface) grad1 Connection to the homogeneous memory banks Interfaces to IOMMU Read input parameters Perform optimization on the original C codes: Pipelining Unrolling (Manually) partition memory into multiple memory banks Goal (initiation interval = 1) Minimizing off-chip bandwidth (data reuse optimization)

Hardware Design Automation Flow ARA memory system and interconnects (highly parameterized) User-designed accelerators (HLS) User-designed ACCs in C ACC designs in RTL High-level synthesis Platform- specific modules Platform name ARA specification file Module configurations Platform-independent modules (interconnects and memory system) Hardware templates ARA memory system optimization Module instantiation ARAPrototyper design automation flow System integration Platform backend flow: logic synthesis, mapping, placement and route (Xilinx PlanAhead) Xilinx PlanAhead flow FPGA bitstream

APIs and System Software Stack

User APIs (How to Program Accelerators) Include the header files of accelerator definition Header is generated from the ARA description file (XML format) User APIs: reserve(), check_reserved(), send_param(), check_done(), free() Communicate with the software global accelerator manager (GAM) Header file; declare a class for each type of an accelerator Declaration: use acc object to manipulate Gaussian accelerator in the ARA 1. reserve(): reserve the Gaussian accelerator (send a signal to GAM) 2. check_reserved(): wait until GAM confirms the reservation send_param(): send parameters; start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array 1. check_done(): check the done signal from GAM 2. free(): free the Gaussian accelerator; GAM can use it for the other applications

User APIs (Cont’d) What’s behind run(): acc.run(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1); 1. reserve(): reserve the Gaussian accelerator (send a signal to GAM) 2. check_reserved(): wait until GAM confirms the reservation acc.reserve(); while(acc.check_reserved() == 0); acc.send_param(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1); while(acc.check_done() == 0); acc.free(); send_param(): send parameters; start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array 1. check_done(): check the done signal from GAM 2. free(): free the Gaussian accelerator; GAM can use it for the other applications

Underlying System Software Stack Major components: GAM: global accelerator manager DBA: dynamic buffer allocator TLB miss handler Coherence manager

Flow Efficiency & Case Study

ARA Evaluation Runtime One flow can be finished within four hours (1) ARAPrototyper Flow runtime, (2) Xilinx tool (> 98%): Vivado HLS, PlanAhead (Map, P&R) and (3) native execution on prototype 2.9x ~ 42.6x evaluation time reduction compared to full-system simulations Native execution on prototype vs. simulation: 105 ~ 106 difference

Case Study: A Real ARA Prototype Medical Imaging Applications Heterogeneous Accelerators Acc0: gradient Acc1: gaussian Acc2: rician Acc3: segmentation

ARA Prototype vs. Xeon vs. ARM Measured from the real prototype and real machines 7.44x better energy efficiency over Intel Sandy-Bridge processor 2.22x better energy efficiency over ARM Cortex-A9 24x - 84x energy efficiency for ASIC projection Cortex-A9 Xeon (24 threads, OpenMP) ARA Frequency 667 MHz 1.9 GHz ACCs@100MHz CPU@667MHz Runtime (seconds) 28.34 0.55 4.53 Power 1.1W 190W(TDP) 3.1W Total Energy 2.22x 7.44x 1x

Use Case 1: Accelerator Microarchitecture Modification Data reuse optimization Goal: II=1 and reduce the required off-chip bandwidth by storing data in the local memory banks 2.5x – 5.9x performance gain

Use Case 2: Interconnect Resource Utilization Use FPGA prototype to project resource utilization Two partial crossbar synthesis methods (Acc #: 30) Left-hand side (much more congested and consume more resources)

Summary An alternative for ARA evaluation Goals of the ARAPrototyper Collect the performance and energy data from real silicon Goals of the ARAPrototyper Reduce the design efforts Shorten the evaluation time ARAPrototyper features: Highly automated ARA prototyping flow Reusable interconnect and memory system templates Automatically generated APIs System software stack (runtime)

Thank You! Zhenman is on the academic job market, please check his website: https://sites.google.com/site/fangzhenman/