1 CDSC CHP Prototyping Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou.

Presentation transcript:

1 CDSC CHP Prototyping Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou

2 Accelerator-Rich Architectures: ARC, CHARM, BiN

3 Goals
- Implement the architecture features and support in the prototype system
  - Architecture proposals
    - Accelerator-rich CMPs
    - CHARM
    - Hybrid cache
    - Buffer-in-NUCA
    - etc.
  - Bridge different thrusts in CDSC

4 Server-Class Platform: HC-1ex Architecture
- Xeon Quad Core LV W TDP
- Tesla C GB/s off-chip bandwidth 200W TDP
- 4XC6vlx760 FPGAs 80GB/s off-chip bandwidth 90W Design Power

5 Drawbacks of Commodity Systems
- Limited ability to customize the architecture
- Board-level integration rather than chip-level integration
- Commodity systems can only reach a certain level; further innovation is needed

6 CHP Prototyping Plan
- Create the working hardware and software
  - Use the FPGA Extensible Processing Platform (EPP) as the platform
    - Reuse existing FPGA IPs as much as possible
- Work in multiple phases

7 Target Platforms: Xilinx ML605 and Zynq
- Zynq: dual-core A9 with programmable logic
- ML605: Virtex-6-based board

8 CHP Prototyping Phases
- ARC Implementation
  - Phase 1: Basic platform
    - Accelerator and software GAM
  - Phase 2: Adding modularity using available IP
    - E.g. Xilinx DMAC IP
  - Phase 3: First step toward BiN
    - Shared buffer
    - Customized modules (e.g. DMA-controller, plug-n-play accelerator)
  - Phase 4: System enhancement
    - Crossbar
    - AXI implementation
- CHARM Implementation

9 ARC Phase 1 Goals
- Set up a basic environment
  - Multi-core + simple accelerators + OS
    - Understand the system interactions in more detail
  - Simple controller as GAM (global accelerator manager)
    - Supports system-level sharing of multiple accelerators of the same type
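The sharing role of the GAM can be illustrated with a small software model. This is a minimal sketch, not the prototype's firmware: the class name `SoftwareGAM`, its methods, and the pool sizes are all hypothetical, and the first-come-first-served queue is an assumed policy that the slides do not specify.

```python
from collections import deque


class SoftwareGAM:
    """Toy global accelerator manager: shares a pool of accelerators per type."""

    def __init__(self, pool):
        # pool maps an accelerator type (e.g. "vecadd") to its instance count.
        self.free = {kind: deque(range(count)) for kind, count in pool.items()}
        self.waiting = {kind: deque() for kind in pool}

    def request(self, kind, core_id):
        """Return a free instance id, or None after queuing the requesting core."""
        if self.free[kind]:
            return self.free[kind].popleft()
        self.waiting[kind].append(core_id)
        return None

    def release(self, kind, instance):
        """Hand the instance to the oldest waiter, else mark it free.

        Returns the (core_id, instance) grant, or None if nobody was waiting.
        """
        if self.waiting[kind]:
            return (self.waiting[kind].popleft(), instance)
        self.free[kind].append(instance)
        return None


gam = SoftwareGAM({"vecadd": 2, "vecsub": 1})
first = gam.request("vecadd", core_id=0)
second = gam.request("vecadd", core_id=1)
third = gam.request("vecadd", core_id=2)   # no instance left, core 2 is queued
grant = gam.release("vecadd", first)       # instance goes straight to core 2
print(first, second, third, grant)
```

Keeping one queue per accelerator type means a request for a busy `vecadd` never blocks a request for an idle `vecsub`.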

10 ARC Phase 1 Example System Diagram
[Diagram: MicroBlaze-0 (Linux with MMU); MicroBlaze-1 (GAM; bare-metal, no MMU); AXI4 (xbar); AXI4-Lite (bus); DDR3; timer; UART; mutex; Mailbox (vecadd) and Mailbox (vecsub); vecadd and vecsub accelerators attached via FSL]

11 ARC Phase-2 Goals
- Implement a system similar to the original ARC design
  - GAM, accelerator, DMA controller, SPM
- Add modularity using available IP
  - E.g. Xilinx DMAC IP

12 ARC Phase-2 Architecture

ARC Phase-2 Performance and Power Results
- Benchmarking kernel:
- Results (columns: Runtime (us), Power (W), EDP (energy-delay product), Gain)
  - CHP prototype on Xilinx FPGA 100MHz: 1,746217,570X
  - 2x Quad-core Intel Xeon CPU E GHz, 1 FPU per core: ,365X
  - Dual-core Intel Xeon CPU GHz, 1 FPU per core: 10, X
  - 16-Core UltraSPARC 1.2 GHz, 1 shared FPU: 852,163721X
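The relationship between the runtime, power, and EDP columns can be sketched as follows. This assumes the common definition EDP = energy × delay = power × runtime², and assumes the gain is the ratio of a baseline's EDP to the prototype's. The slides state neither explicitly. The function names and input values are hypothetical, not the slide's measurements.

```python
def energy_delay_product(runtime_us, power_w):
    """EDP in joule-seconds, assuming EDP = (power * runtime) * runtime."""
    runtime_s = runtime_us * 1e-6
    energy_j = power_w * runtime_s
    return energy_j * runtime_s


def edp_gain(baseline, candidate):
    """How many times lower the candidate's EDP is than the baseline's."""
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)


# Hypothetical (runtime in us, power in W) pairs for illustration only.
baseline = (4000.0, 100.0)
candidate = (2000.0, 4.0)
gain = edp_gain(baseline, candidate)
print(f"EDP gain: {gain:.1f}X")
```

Because runtime enters the product squared, a platform that is both faster and lower-power gains much more in EDP than in energy alone.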

ARC Phase-2 Runtime Breakdown

ARC Phase-2 Area Breakdown
- Slice Logic Utilization
  - Number of Slice Registers: 45,283 out of 301,440: 15%
  - Number of Slice LUTs: 40,749 out of 150,720: 27%
    - Number used as logic: 32,505 out of 150,720: 21%
    - Number used as Memory: 5,248 out of 58,400: 8%
- Slice Logic Distribution:
  - Number of occupied Slices: 17,621 out of 37,680: 46%
  - Number of LUT Flip Flop pairs used: 54,323
    - Number with an unused Flip Flop: 14,617 out of 54,323: 26%
    - Number with an unused LUT: 13,574 out of 54,323: 24%
    - Number of fully used LUT-FF pairs: 26,132 out of 54,323: 48%
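The percentages in this report can be recomputed from the raw counts. The sketch below uses the Phase-2 numbers from the slide; the helper name `utilization_percent` is hypothetical. The reported figures appear to be truncated rather than rounded (5,248 out of 58,400 is about 8.99% but is listed as 8%), so the helper uses integer division.

```python
def utilization_percent(used, available):
    """Whole-number utilization, truncated the way the report appears to be."""
    return used * 100 // available


# (used, available) pairs copied from the Phase-2 area breakdown.
phase2 = {
    "Slice Registers": (45_283, 301_440),
    "Slice LUTs": (40_749, 150_720),
    "LUTs used as logic": (32_505, 150_720),
    "LUTs used as memory": (5_248, 58_400),
    "Occupied Slices": (17_621, 37_680),
}

report = {name: utilization_percent(u, a) for name, (u, a) in phase2.items()}
for name, pct in report.items():
    print(f"{name}: {pct}%")
```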

ARC Phase-3 Goals
- First step toward BiN:
  - Shared buffer
- Designing our customized modules
  - Customized DMA-controller
    - Handles batch TLB misses
  - Plug-n-play accelerator design
    - Making the interface general enough for at least a class of accelerators

ARC Phase-3 Architecture
- A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
  - Global accelerator manager (GAM) for accelerator sharing
  - Shared on-chip buffers: many more accelerators than buffer bank resources
  - Virtual addressing in the accelerators, accelerator virtualization
  - Virtual addressing DMA, with on-demand TLB filling from the core
  - No network-on-chip, no buffer sharing with cache, no customized instruction in the core
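The virtual-addressing DMA with on-demand TLB filling can be modeled in a few lines. This is a sketch under stated assumptions rather than the prototype's design: the 4 KiB page size, the `VirtualDMA` class, and the dictionary standing in for the core's page table are all hypothetical. It also shows one plausible reading of batch TLB miss handling: all missing pages of a transfer are resolved in a single round trip to the core.

```python
PAGE_SIZE = 4096  # assumed page size; the slides do not state one


class VirtualDMA:
    """Toy DMA engine that translates virtual addresses through its own TLB."""

    def __init__(self, page_table):
        self.page_table = page_table  # stands in for the core's page table
        self.tlb = {}                 # virtual page number -> physical page number
        self.fill_requests = 0        # round trips made to the core

    def _fill_from_core(self, vpns):
        """One request to the core resolves a whole batch of TLB misses."""
        self.fill_requests += 1
        for vpn in vpns:
            self.tlb[vpn] = self.page_table[vpn]

    def translate_range(self, vaddr, length):
        """Return the (physical address, size) chunks covering one transfer."""
        first = vaddr // PAGE_SIZE
        last = (vaddr + length - 1) // PAGE_SIZE
        missing = [v for v in range(first, last + 1) if v not in self.tlb]
        if missing:
            self._fill_from_core(missing)
        chunks = []
        addr, end = vaddr, vaddr + length
        while addr < end:
            vpn, offset = divmod(addr, PAGE_SIZE)
            size = min(PAGE_SIZE - offset, end - addr)  # stop at the page boundary
            chunks.append((self.tlb[vpn] * PAGE_SIZE + offset, size))
            addr += size
        return chunks


dma = VirtualDMA(page_table={0: 7, 1: 3, 2: 9})
chunks = dma.translate_range(vaddr=4000, length=5000)
print(chunks, dma.fill_requests)
```

A transfer that spans three virtual pages is split into three physically contiguous chunks, and a repeated transfer over the same range needs no further help from the core.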

Performance and Power Results
- Benchmarking kernel:
- Results (columns: Runtime (us), Power (W), EDP (energy-delay product), Gain)
  - CHP prototype on Xilinx FPGA 100MHz: 1,80228,050,786X
  - 2x Quad-core Intel Xeon CPU E GHz, 1 FPU per core: ,069,261X
  - Dual-core Intel Xeon CPU GHz, 1 FPU per core: 10,061657,947X
  - 16-Core UltraSPARC 1.2 GHz, 1 shared FPU: 852,163721X

Impact of Communication & Computation Overlapping
[Figure: pipelined communication & computation vs. no pipeline; annotated difference: 19%]
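Why overlapping helps can be seen in a simple timing model. The sketch below assumes double buffering, where the transfer for the next chunk runs while the accelerator computes on the current one, and it ignores write-back traffic. The function names and chunk timings are hypothetical and are not meant to reproduce the 19% figure.

```python
def runtime_no_pipeline(chunks):
    """Each chunk is transferred, then computed, strictly in sequence."""
    return sum(comm + comp for comm, comp in chunks)


def runtime_pipelined(chunks):
    """Double buffering: load chunk i+1 while computing on chunk i."""
    if not chunks:
        return 0
    total = chunks[0][0]  # the first transfer cannot be hidden
    for i, (_, comp) in enumerate(chunks):
        next_comm = chunks[i + 1][0] if i + 1 < len(chunks) else 0
        total += max(comp, next_comm)  # the slower of the two sets the pace
    return total


# Hypothetical (communication, computation) times per chunk, in arbitrary units.
chunks = [(2, 8)] * 4
serial = runtime_no_pipeline(chunks)
overlapped = runtime_pipelined(chunks)
print(f"saving: {100 * (serial - overlapped) / serial:.0f}%")
```

In this model the saving is bounded by the smaller of the two phases, so a compute-dominated kernel gains only as much as its transfer time allows.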

Overhead of Buffer Sharing: Bank Access Contention (1)
[Figure: the 4 logical buffers allocated to 4 separate buffer banks vs. the 4 logical buffers allocated to 1 buffer bank; annotated difference: 3.2%]
Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time.

Overhead of Buffer Sharing: Bank Access Contention (2)
[Figure: the 4 logical buffers allocated to 4 separate buffer banks vs. the 4 logical buffers allocated to 1 buffer bank; annotated difference: 2.7%]

Area Breakdown
- Slice Logic Utilization
  - Number of Slice Registers: 105,969 out of 301,440: 35%
  - Number of Slice LUTs: 93,755 out of 150,720: 62%
    - Number used as logic: 80,410 out of 150,720: 53%
    - Number used as Memory: 7,406 out of 58,400: 12%
- Slice Logic Distribution:
  - Number of occupied Slices: 32,779 out of 37,680: 86%
  - Number of LUT Flip Flop pairs used: 112,772
    - Number with an unused Flip Flop: 25,037 out of 112,772: 22%
    - Number with an unused LUT: 19,017 out of 112,772: 16%
    - Number of fully used LUT-FF pairs: 68,718 out of 112,772: 60%

Phase-4 ARC Goals
- Finding bottlenecks and enhancing the system
- Communication bottleneck
  - Crossbar design instead of AXI bus
  - Speed up the AXI non-burst implementation

24 Accelerator Memory System Design
- Crossbar
  - In addition to what was previously proposed, now supports partial configuration
    - Will not affect working LCAs
  - Passed on-board test
- Hierarchical DMACs
  - Data transfer between
    - Main memory
    - Shared buffer banks
  - The number of buffer banks can be large
  - Want to keep the AXI bus size
  - Hierarchical DMACs and buses
[Diagram: IOMMU; buffer banks 1-4 and 9; AXI buses; DMAC1, DMAC2, DMAC3; select-bit receiver; GAM; main AXI bus to DDR; LCA1-LCA4; OC core]
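The idea behind the hierarchical DMACs can be sketched as a routing function: each DMAC serves a small group of buffer banks on its own AXI bus, so no single bus grows with the total bank count. The even grouping of nine banks across three DMACs is an assumption read off the diagram labels, and `route_bank` and `banks_per_dmac` are hypothetical names.

```python
def route_bank(bank, banks_per_dmac=3):
    """Map a 1-based buffer bank number to (dmac, local_port), both 1-based."""
    if bank < 1:
        raise ValueError("bank numbers start at 1")
    dmac, port = divmod(bank - 1, banks_per_dmac)
    return dmac + 1, port + 1


# Assumed layout: banks 1-3 on DMAC1, 4-6 on DMAC2, 7-9 on DMAC3.
routes = {bank: route_bank(bank) for bank in range(1, 10)}
for bank, (dmac, port) in routes.items():
    print(f"bank{bank} -> DMAC{dmac}, port {port}")
```

Adding banks under this scheme means adding another DMAC and local bus, while the main AXI bus to DDR sees only one extra master.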

25 Crossbar Results