ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture Jason Cong, Yu-Ting Chen, Zhenman Fang, Bingjun Xiao, Peipei Zhou Computer Science Department, UCLA Center for Domain-Specific Computing Center for Future Architectures Research
Our Motivation and Goal A stack of research tools for accelerator-rich architecture Standalone accelerator simulation: Aladdin Standalone accelerator generation: HLS System-level HLS-based ARA simulation: PARADE System-level pre-RTL SoC simulation: gem5 + Aladdin ARA FPGA prototyping: ARAPrototyper Evaluate an ARA on the real prototype Bridge the gap between simulations and real chips Reduce the hardware design efforts and system integration efforts for the users early-stage late-stage
Evaluation of an Accelerator-Rich Architecture Final stage FPGA prototyping Collect real runtime and system power numbers System integration (software + hardware)
Prototyping an ARA Using Xilinx Zynq SoC (FPGA fabrics + ARM) Major components of an ARA General processor cores A sea of heterogeneous accelerators Memory system + interconnects (NoC)
ARAPrototyper: A Rapid Prototyping Flow ARAPrototyper features Highly-automated accelerator prototyping flow Reusable baseline prototype Application programming interfaces (user APIs) Efficient system software stack (runtime for accelerators) Prototype platform Xilinx Zynq ZC706 SoC: Dual-core Cortex-A9 ARM + FPGA fabrics FPGA: implement accelerators, on-chip memories, and interconnects Linux is ported
Advantages of ARAPrototyper Design efforts reduction Automatically generated hardware IPs, system software, and APIs Reusable interconnect and memory system templates Push-button flow Short evaluation cycle Can run native binaries Can run large input data set User can adjust the microarchitecture of an accelerator easily Use HLS design flow See the impact at the system-level interconnect and memory
Architecture Overview
System Overview (What Can be Modeled?) IOMMU TLB Memory Requests (Virtual) Linux System software stack reservations, starts, releases App1 App2 Appk CMP C L1 LLC Acc1 Acc2 Acc3 Accm ARA memory system MPC MP MP1 MP2 MP3 MPN Memory Controller DRAM DIMMs 1. General-purpose cores 5. Coherency level 2. Heterogeneous accelerators 6. Virtual -> physical address translation 3. on-chip ARA memory system 7. System software (runtime) & APIs 4. Off-chip memory system
ARA Template (What Can be Reused?) User-Designed Accelerators Highly Parameterized Hardware Templates Memory Requests (Physical) IOMMU TLB Memory Requests (Virtual) IOMMU and Dedicated TLB Heterogeneous Accelerators Acc1 Acc2 Acc3 Acc4 AccM DMAC1 DMAC2 DMAC3 DMACN MP1 MP2 MP3 MPN Interconnect Layer 1 Partial Crossbar Homogeneous Shared Memory Banks Interconnect Layer 2 Interleaved Network DMACs Physical Memory Ports
Hardware Prototyping Flow
ARA Design Flow Accelerator design Integrated with high-level synthesis (Xilinx Vivado HLS) ARA system design: ARA specification file Specify accelerators, on-chip memories, and interconnects Highly parameterized HW templates of on-chip memories and interconnects can be reused Push-button flow Use Vivado HLS, Xilinx PlanAhead flow for FPGA bitstream synthesis Automatically generates the APIs based on the ARA spec. file
ARA Specification File Accelerator kernels <system> <ACCs> <acc type=“gradient” num=“2” num_params=“5”> <port size=“16384” num=“6”/> </acc> <acc type=“segmentation” num=“1” num_params=“13”> <port size=“16384” num=“8”/> <acc type=“rician” num=“1” num_params=“7”> <port size=“16384” num=“12”/> <acc type=“gaussian” num=“1” num_params=“7”> <port size=“16384” num=“5”/> </ACCs> <SharedBuffers size=“16384” num=“32” numDMACs=“4” /> <Interconnects> <ACCs_to_Buffers type=“crossbar” ON=“4”/> <Buffers_to_DMACs type=“interleaved”/> </Interconnects> <IOMMU> <TLB size=“8192” evict=“LRU”/> </IOMMU> <CoherentCache use=“0”/> <AccFrequency hz=“75000000”/> </system> IOMMU TLB grad1 grad2 seg1 rician1 gauss1 1 2 3 4 5 31 32 Shared memory banks & DMACs Interconnect configurations DMAC1 DMAC2 DMAC3 DMAC4 MP1 MP2 MP3 MP4 IOMMU/TLB Coherent L2 cache Accelerator frequency
Accelerator Design with HLS void gradient ( volatile unsigned int* param1, … volatile unsinged int* param5, volatile unsigned float mem_port1[SIZE], volatile unsigned float mem_port6[SIZE], volatile unsigned int* toIOMMU_FIFO, volatile unsigned int* fromIOMMU_FIFO) { // read function parameters int image_vddr = *param1; // main computation for(int i = 1; i < P; i++) for(int j = 1; j < M; j++) for(int k = 1; k < N; k++) { g[CENTER] = 1.0/sqrt( EPSILON + SQR(u[CENTER] – u[RIGHT]) + SQR(u[CENTER] – u[LEFT]) + SQR(u[CENTER] – u[DOWN]) + SQR(u[CENTER] – u[UP]) + SQR(u[CENTER] – u[ZOUT]) + SQR(u[CENTER] – u[ZIN]) ); } Parameters sent from the application (passed through AXILite interface) grad1 Connection to the homogeneous memory banks Interfaces to IOMMU Read input parameters Perform optimization on the original C codes: Pipelining Unrolling (Manually) partition memory into multiple memory banks Goal (initiation interval = 1) Minimizing off-chip bandwidth (data reuse optimization)
Hardware Design Automation Flow ARA memory system and interconnects (highly parameterized) User-designed accelerators (HLS) User-designed ACCs in C ACC designs in RTL High-level synthesis Platform- specific modules Platform name ARA specification file Module configurations Platform-independent modules (interconnects and memory system) Hardware templates ARA memory system optimization Module instantiation ARAPrototyper design automation flow System integration Platform backend flow: logic synthesis, mapping, placement and route (Xilinx PlanAhead) Xilinx PlanAhead flow FPGA bitstream
APIs and System Software Stack
User APIs (How to Program Accelerators) Include the header files of accelerator definition Header is generated from the ARA description file (XML format) User APIs: reserve(), check_reserved(), send_param(), check_done(), free() Communicate with the software global accelerator manager (GAM) Header file; declare a class for each type of an accelerator Declaration: use acc object to manipulate Gaussian accelerator in the ARA 1. reserve(): reserve the Gaussian accelerator (send a signal to GAM) 2. check_reserved(): wait until GAM confirms the reservation send_param(): send parameters; start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array 1. check_done(): check the done signal from GAM 2. free(): free the Gaussian accelerator; GAM can use it for the other applications
User APIs (Cont’d) What’s behind run(): acc.run(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1); 1. reserve(): reserve the Gaussian accelerator (send a signal to GAM) 2. check_reserved(): wait until GAM confirms the reservation acc.reserve(); while(acc.check_reserved() == 0); acc.send_param(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1); while(acc.check_done() == 0); acc.free(); send_param(): send parameters; start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array 1. check_done(): check the done signal from GAM 2. free(): free the Gaussian accelerator; GAM can use it for the other applications
Underlying System Software Stack Major components: GAM: global accelerator manager DBA: dynamic buffer allocator TLB miss handler Coherence manager
Flow Efficiency & Case Study
ARA Evaluation Runtime One flow can be finished within four hours (1) ARAPrototyper Flow runtime, (2) Xilinx tool (> 98%): Vivado HLS, PlanAhead (Map, P&R) and (3) native execution on prototype 2.9x ~ 42.6x evaluation time reduction compared to full-system simulations Native execution on prototype vs. simulation: 105 ~ 106 difference
Case Study: A Real ARA Prototype Medical Imaging Applications Heterogeneous Accelerators Acc0: gradient Acc1: gaussian Acc2: rician Acc3: segmentation
ARA Prototype vs. Xeon vs. ARM Measured from the real prototype and real machines 7.44x better energy efficiency over Intel Sandy-Bridge processor 2.22x better energy efficiency over ARM Cortex-A9 24x - 84x energy efficiency for ASIC projection Cortex-A9 Xeon (24 threads, OpenMP) ARA Frequency 667 MHz 1.9 GHz ACCs@100MHz CPU@667MHz Runtime (seconds) 28.34 0.55 4.53 Power 1.1W 190W(TDP) 3.1W Total Energy 2.22x 7.44x 1x
Use Case 1: Accelerator Microarchitecture Modification Data reuse optimization Goal: II=1 and reduce the required off-chip bandwidth by storing data in the local memory banks 2.5x – 5.9x performance gain
Use Case 2: Interconnect Resource Utilization Use FPGA prototype to project resource utilization Two partial crossbar synthesis methods (Acc #: 30) Left-hand side (much more congested and consume more resources)
Summary An alternative for ARA evaluation Goals of the ARAPrototyper Collect the performance and energy data from real silicon Goals of the ARAPrototyper Reduce the design efforts Shorten the evaluation time ARAPrototyper features: Highly automated ARA prototyping flow Reusable interconnect and memory system templates Automatically generated APIs System software stack (runtime)
Thank You! Zhenman is on the academic job market, please check his website: https://sites.google.com/site/fangzhenman/