Towards a Heterogeneous Computer Architecture for CACTuS
Anthony Milton
Supervisors: Assoc. Prof. David Kearney (UniSA), Dr. Sebastien Wong (DSTO)
Reconfigurable Computing Lab

Collaboration Partners

Motivation for Heterogeneous CACTuS
 CACTuS was originally developed and prototyped in MATLAB:
 a great testbed for algorithm development,
 BUT poor computational performance.
 As CACTuS is a visual tracking algorithm, real-time operation is desired.

Variant               Time/frame    Frames/sec
"Standard"            269.24 ms     3.71
"Frequency Domain"    203.74 ms     4.91

Motivation – Data Parallelism
[Figure: data-parallel flow between Input Frame, Observed Image, Posterior Position and Posterior Velocity]

Motivation – Task Parallelism
[Figure: n independent SEFs (SEF #1, SEF #2, …, SEF #n-1, SEF #n) that can run in parallel]

Motivation – GPUs & FPGAs
 It is well known that GPUs and FPGAs are well suited to data-parallel computation.
 GPUs were originally used for computer graphics, and are now used in a huge number of application areas (GPGPU).
 FPGAs are used for specialised applications requiring high performance but low power (radar processing, TCP/IP packet processing, …).

Heterogeneous Computing
 Each computational resource has strengths and weaknesses.
 Heterogeneous computing uses a mix of different computing resources (CPUs, GPUs, FPGAs) for computation, drawing on the strengths of each.

Heterogeneous Computing Systems
 Construction of a hardware prototype with disparate compute resources is easy.
 Application development for such a system is hard:
 Algorithm translation
 Design partitioning
 Languages and development environments
 Models of computation
 Communication and data transfer
 etc.
 How to create designs that are partitioned across the different computing resources?

Project Goals
 Develop a heterogeneous computer architecture for CACTuS.
 Maintain tracking accuracy relative to the MATLAB "gold standard".
 Improve execution performance of the algorithm.

Our Research Platform
 Xenon Systems workstation:
 Intel Xeon X5677 quad-core 3.46 GHz
 6 GB DDR3 DRAM
 NVIDIA Quadro 4000 GPU (2 GB GDDR5 DRAM, OpenCL 1.1 device, CUDA compute capability 2.0)
 Alpha Data ADM-XRC-6T1 FPGA board:
 Xilinx Virtex-6 XC6VLX550T FPGA (549,888 logic cells, 864 DSP slices, 2.844 MB BRAM)
 2 GB off-chip DDR3 DRAM
 Connects to host via PCIe 2.0 x4

Development Approach
 Maintain similar high-level abstractions across all versions.
 Use 3rd-party libraries and designs, open source where possible.
 Incremental approach to overall development.

Design Decision – Common Infrastructure
 It was necessary to develop the C++/CPU version first, as much of the infrastructure code would be re-used for the GPU & FPGA versions.
 This included video, MATLAB and text file I/O, visualisation, timing & unit testing.
 Third-party libraries used for this infrastructure included:
 Qt – visualisation
 Boost – non-MATLAB file I/O, timing, unit testing
 MatIO – MATLAB file I/O
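As an illustration of the timing side of this infrastructure, here is a minimal sketch using Boost.Timer; runObserveVelocity() is a hypothetical placeholder for a CACTuS stage, not the project's actual API.

```cpp
// Minimal per-stage timing sketch with Boost.Timer.
#include <boost/timer/timer.hpp>
#include <iostream>

void runObserveVelocity() { /* hypothetical stage body omitted */ }

int main() {
    boost::timer::cpu_timer timer;   // starts timing on construction
    runObserveVelocity();
    timer.stop();
    // elapsed().wall is reported in nanoseconds
    std::cout << "Observe Velocity: " << timer.elapsed().wall / 1e6 << " ms\n";
    return 0;
}
```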

Design Decisions – C++/CPU
 To reduce development time, and to help ensure high-level similarity with the MATLAB code, the open-source C++ linear algebra library Armadillo was utilised.
 At the start of development (late 2011), Armadillo did not feature any 2D FFT implementation¹, so the industry-standard FFTW library was used.

1. 2D FFT support has since been added (September). MATLAB itself uses FFTW for computing FFTs.
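A minimal sketch of how the two libraries can pair, assuming double precision; the fft2 wrapper name is ours, and this is illustrative rather than the project's actual code:

```cpp
// 2D FFT of an Armadillo matrix via FFTW (pre-dating Armadillo's own fft2).
#include <armadillo>
#include <fftw3.h>

arma::cx_mat fft2(const arma::mat& in) {
    arma::cx_mat x = arma::conv_to<arma::cx_mat>::from(in);
    arma::cx_mat out(x.n_rows, x.n_cols);
    // Armadillo is column-major while FFTW assumes row-major, so the plan
    // swaps the dimensions; the 2D DFT commutes with transposition, so the
    // result lands correctly in the column-major output buffer.
    fftw_plan p = fftw_plan_dft_2d(
        x.n_cols, x.n_rows,
        reinterpret_cast<fftw_complex*>(x.memptr()),
        reinterpret_cast<fftw_complex*>(out.memptr()),
        FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    return out;
}
```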

Design Decisions – OpenCL/GPU
 Essentially two choices for a GPU programming framework: CUDA and OpenCL.
 CUDA: limited to NVIDIA hardware, mature, good supporting libraries such as CUBLAS and CUFFT, good development tools.
 OpenCL: vendor agnostic, less mature, not limited to just GPUs – multicore CPUs, GPUs, DSPs, FPGAs (portable).
 OpenCL was selected to avoid vendor lock-in and with an eye to the future, as OpenCL seemed likely to become dominant.
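Part of OpenCL's appeal is that the same host code enumerates devices of any type from any vendor. A minimal sketch using the standard OpenCL C API:

```cpp
// Enumerate all OpenCL platforms and devices (vendor agnostic).
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    if (numPlatforms > 8) numPlatforms = 8;   // keep the sketch simple
    cl_platform_id platforms[8];
    clGetPlatformIDs(numPlatforms, platforms, nullptr);

    for (cl_uint i = 0; i < numPlatforms; ++i) {
        cl_device_id devices[8];
        cl_uint numDevices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);
        for (cl_uint j = 0; j < numDevices && j < 8; ++j) {
            char name[256];
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("Platform %u, device %u: %s\n", i, j, name);
        }
    }
    return 0;
}
```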

Design Decisions – OpenCL/GPU
 To reduce development time, and to help ensure high-level similarity with the MATLAB code, the open-source OpenCL computing library ViennaCL was utilised.
 Provided methods for most linear algebra operations required for CACTuS.
 Did not support complex numbers, but as complex numbers were only required for the 2D f-domain convolution, workarounds were possible.
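One such workaround: hold the real and imaginary parts in separate real-valued GPU matrices and build the element-wise complex multiply from real operations. A minimal sketch (the helper function is ours, not ViennaCL API):

```cpp
// Element-wise complex multiply via real/imaginary parts:
// (ar + i*ai) .* (br + i*bi) -> (cr + i*ci).
#include <viennacl/matrix.hpp>
#include <viennacl/linalg/matrix_operations.hpp>

typedef viennacl::matrix<float> Mat;

void complexElementMultiply(const Mat& ar, const Mat& ai,
                            const Mat& br, const Mat& bi,
                            Mat& cr, Mat& ci) {
    using viennacl::linalg::element_prod;
    Mat t1(ar.size1(), ar.size2()), t2(ar.size1(), ar.size2());
    t1 = element_prod(ar, br);
    t2 = element_prod(ai, bi);
    cr = t1 - t2;                       // real part
    t1 = element_prod(ar, bi);
    t2 = element_prod(ai, br);
    ci = t1 + t2;                       // imaginary part
}
```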

Design Decisions – FPGA/Bluespec
 The traditional HDLs, Verilog & VHDL, are very low level and require the designer to design control logic, implement hardware flow control, etc.
 Design flexibility, but lower productivity.
 Bluespec (BSV) – a modern, rule-based, high-level HDL:
 The rule-based approach naturally matches the parallel nature of hardware.
 The designer is freed from (error-prone) control logic design.
 Alpha Data hardware infrastructure & SDK:
 Gets data to and from the FPGA via 4 DMA channels.
 Drivers & SDK on the PC side; support hardware and reference designs on the FPGA side.

Design Decisions – Heterogeneous
 How best to map the algorithm to the heterogeneous platform?
 Still a work in progress and currently being explored.

Bottleneck – Observe Velocity
 The Observe Velocity stage of CACTuS was the primary focus for the FPGA; it generally accounts for between 40% and 90%+ of the algorithm's total FLOPs.
 To perform Observe Velocity in the f-domain (see the sketch below):
 2D FFT on X_s to give X_s_freq
 2D FFT on X_m to give X_m_freq
 Per-element multiply between X_s_freq and X_m_freq to give V_m_freq
 2D IFFT on V_m_freq to give V_m
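In software terms the four steps are just the convolution theorem, V_m = IFFT2(FFT2(X_s) .* FFT2(X_m)). A minimal FFTW-based sketch, assuming N x N complex buffers (illustrative only, not the FPGA pipeline):

```cpp
// f-domain convolution for Observe Velocity: FFT, element-wise multiply, IFFT.
#include <fftw3.h>
#include <complex>
#include <vector>

void observeVelocityFreq(int n,
                         std::vector<std::complex<double>>& Xs,
                         std::vector<std::complex<double>>& Xm,
                         std::vector<std::complex<double>>& Vm) {  // Vm sized n*n
    std::vector<std::complex<double>> XsF(n * n), XmF(n * n), VmF(n * n);
    auto plan = [n](std::complex<double>* in, std::complex<double>* out, int sign) {
        return fftw_plan_dft_2d(n, n,
                                reinterpret_cast<fftw_complex*>(in),
                                reinterpret_cast<fftw_complex*>(out),
                                sign, FFTW_ESTIMATE);
    };
    fftw_plan pXs = plan(Xs.data(), XsF.data(), FFTW_FORWARD);    // X_s_freq
    fftw_plan pXm = plan(Xm.data(), XmF.data(), FFTW_FORWARD);    // X_m_freq
    fftw_execute(pXs);
    fftw_execute(pXm);
    for (int i = 0; i < n * n; ++i)                               // V_m_freq
        VmF[i] = XsF[i] * XmF[i];
    fftw_plan pInv = plan(VmF.data(), Vm.data(), FFTW_BACKWARD);  // V_m
    fftw_execute(pInv);
    for (auto& v : Vm) v /= double(n) * n;   // FFTW's inverse is unnormalised
    fftw_destroy_plan(pXs); fftw_destroy_plan(pXm); fftw_destroy_plan(pInv);
}
```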

Implementation – Hardware
[Figure: hardware implementation]

Evaluation Methods – Performance and Accuracy
 Need for evaluation of computational performance and tracking accuracy:
 Verification of development
 Provides a basis for comparison
 Functionality for evaluating computational performance (timing) and tracking accuracy (tracking metrics) is integrated into the common infrastructure.
 Allows evaluation of single executions.
 External scripts allow for evaluation of batch jobs.
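As an illustration of the tracking-accuracy side, here is a minimal sketch of one common metric (per-frame Euclidean centre error against ground truth); the project's actual metrics are detailed in the backup slides and may differ:

```cpp
// Mean Euclidean distance between tracked and ground-truth target centres.
#include <cmath>
#include <vector>

struct Point { double x, y; };

double meanCentreError(const std::vector<Point>& track,
                       const std::vector<Point>& truth) {
    if (track.empty() || track.size() != truth.size()) return -1.0;  // invalid input
    double sum = 0.0;
    for (std::size_t i = 0; i < track.size(); ++i)
        sum += std::hypot(track[i].x - truth[i].x, track[i].y - truth[i].y);
    return sum / track.size();
}
```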

Results – Performance
[Table: Time/frame (ms) and Frames/sec (fps) for MATLAB (std, freq), CPU (std, freq), GPU (std, std¹, std²), FPGA (std, freq) and Hetero (std, freq); most cell values were lost in extraction. The heterogeneous variants achieved ~110 ms/frame (~9.09 fps) and ~98 ms/frame.]
1. V_m performed on CPU.
2. V_m on GPU, padded to nearest power of 2.

Limitations & Problems
 Early phase of exploring algorithm mappings to the heterogeneous platform:
 3rd-party libraries not efficient
 Task parallelism not yet exploited
 Refactoring the algorithm's flow control would lose the connection with the MATLAB "gold standard" version
 The software aspect of the project is complex:
 Multiple developers, multiple third-party libraries
 The FPGA pipeline is currently limited to the 2D f-domain convolution, which is only relevant to the predict and observe stages:
 Also limited in size due to resource-utilisation constraints
 Many issues encountered with FPGA development

Lessons Learnt
 Developer (in)experience greatly impacts development time and achieved performance.
 OpenCL is difficult to develop with, though it is becoming easier as it matures and its associated libraries improve:
 CUDA might have been a better initial choice.
 Using immature libraries is not the best idea (unless frequent code changes are your idea of fun).
 FPGA functionality takes a lot of time and effort to develop:
 Evaluate exactly what functionality is required to meet performance constraints.

Future Work
 Continue to improve exploitation of data parallelism:
 Likely to be inefficient due to the use of small kernels; consider combining small kernels.
 Task parallelism not yet exploited:
 Incorporate multi-core threading to fully exploit it (see the sketch below).
 Investigate the problem of scheduling computational resources in the system.
 DRAM integration would benefit FPGA performance greatly (images are currently not large enough to amortise DMA overheads) and open up further application mappings.
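Since the SEFs are independent (see "Motivation – Task Parallelism"), a natural first step is one CPU thread per SEF. A minimal sketch; SEF and update() are hypothetical stand-ins for the real tracker objects:

```cpp
// Update independent SEFs concurrently with std::thread.
#include <thread>
#include <vector>

struct SEF {
    void update() { /* hypothetical per-SEF tracking work */ }
};

void updateAllSEFs(std::vector<SEF>& sefs) {
    std::vector<std::thread> workers;
    workers.reserve(sefs.size());
    for (auto& sef : sefs)
        workers.emplace_back([&sef] { sef.update(); });
    for (auto& w : workers)
        w.join();   // all SEFs finished before the next frame
}
```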

Questions?

CPUs
 Single Instruction Single Data (SISD)
 Excel at sequential code, heavily branched code and task-parallel workloads
 Easiest to develop for: software languages, environments, strong debugging
 Easily understood memory architecture – generally transparent to the developer

GPUs
 Single Instruction Multiple Data (SIMD)
 Excel at data-parallel workloads with deterministic memory accesses
 Best architecture for floating-point arithmetic
 Moderate to develop for: few languages, but a rapidly maturing ecosystem
 Moderately complex memory architecture – the developer must be aware of its structure

FPGAs
 No fixed model of computation – designer defined
 Flexible enough to excel at a variety of tasks
 Best architecture for fixed-point arithmetic
 Excel at bit, integer and logic operations
 Difficult to develop for: hardware languages (HDLs), simulators
 Memory architecture must be defined by the designer

Specs – Alpha Data ADM-XRC-6T1

Tracking Metrics

Results – Detailed Accuracy

Results – Detailed Performance

Nature of FPGA Design
 The reconfigurable nature of FPGAs is both the major strength & weakness of the platform:
 Freedom to create custom hardware structures & circuits for specific purposes = specialised, efficient, high-performance hardware.
 No existing microarchitecture, so the designer must create one = long development time, hard to debug, huge number of options to consider & choices to be made.
 Design Space Exploration (DSE) – the design space encapsulates all possible variations & permutations of designs that implement a system.

Challenges of HW Design
 Debugging FPGA designs is hard and time consuming – a combination of simulation and run-time techniques is needed:
 To simulate in software, testbenches must be developed and waveforms analysed.
 To analyse behaviour at run time in hardware, ChipScope cores must be inserted (modifying the design) and waveforms analysed.

Limitations of Current FPGA Implementation
 Using the 128-point configuration:
 BRAM utilisation is 24%
 Timing constraints are just met
 Using the 256-point configuration:
 BRAM utilisation is 85%
 Timing constraints are not met: timing paths associated with the larger BRAMs are the main cause of problems.
 Because the 256-point configuration fails its timing constraints, we are currently restricted to the 128-point configuration (2D FFTs on 128 x 128 images).
 Moving away from exclusive use of BRAM by incorporating off-chip DRAM will likely allow much larger input images.
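A rough, illustrative estimate of why the 256-point configuration strains BRAM, assuming single-precision complex samples and four working frame buffers (the actual design's buffering may differ):

$$256 \times 256 \times 8\,\mathrm{B} = 512\,\mathrm{KiB}\ \text{per buffer}, \qquad \frac{4 \times 512\,\mathrm{KiB}}{2.844\,\mathrm{MB}} \approx 72\%$$

which is the same order as the 85% reported; at 128 points each buffer shrinks by 4x, consistent with the drop to 24%.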

Future Work FPGA – DRAM
 Complete integration of DRAM into the infrastructure; unfortunately not a plug-and-play solution:
 We have a reference design, but additional components have been required to interface between the existing Alpha Data infrastructure and the system components developed with Bluespec.
 There are also additional clock domains, and many additional constraints to be considered.
 Close to finalising a design for testing initial integration of DRAM into the system.
 Modules to perform transpose operations in DRAM have already been developed, so once integration is verified, using DRAM with the 2D frequency-domain convolution design will be straightforward.

Future Work FPGA – Further Integration of HW Modules
 Developed a functional spatial convolution array in VHDL:
 Not yet used or integrated into the system.
 Has a transpose linear-filtering architecture, essentially a systolic array.
 Highly parallel, so exhibits high performance, but high DSP utilisation.
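For reference, a minimal software model of the transpose-form linear-filter structure in 1D (the structure the VHDL array generalises to 2D); purely illustrative, not the VHDL design itself:

```cpp
// Transpose-form FIR model: each input sample is broadcast to all taps,
// and partial sums shift toward the output, mirroring a systolic array.
#include <vector>

class TransposeFormFIR {
public:
    explicit TransposeFormFIR(std::vector<int> coeffs)
        : h_(std::move(coeffs)), regs_(h_.size(), 0) {}

    int step(int x) {
        // regs_[i+1] still holds last cycle's value when regs_[i] is updated.
        for (std::size_t i = 0; i + 1 < regs_.size(); ++i)
            regs_[i] = h_[i] * x + regs_[i + 1];
        regs_.back() = h_.back() * x;
        return regs_.front();   // this cycle's filtered output
    }

private:
    std::vector<int> h_;      // filter coefficients (fixed-point style)
    std::vector<int> regs_;   // pipeline registers holding partial sums
};
```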

Misc Information
 MATLAB R2013a was used, with the Image Processing Toolbox.