Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.

Slides:



Advertisements
Similar presentations
Clare Smtih SHARC Presentation1 The SHARC Super Harvard Architecture Computer.
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
DSPs Vs General Purpose Microprocessors
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
Computer Science, University of Oklahoma Reconfigurable Versus Fixed Versus Hybrid Architectures John K. Antonio Oklahoma Supercomputing Symposium 2008.
Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project.
Jan 28, 2004Blackfin Compute Unit REV B A comparison of DSP Architectures BlackFin ADSP-BFXXX Compute Unit Based on a ENEL white paper prepared by.
Computer Architecture & Organization
Blackfin ADSP Versus Sharc ADSP-21061
TigerSHARC and Blackfin Different Applications. Introduction Quick overview of TigerSHARC Quick overview of Blackfin low power processor Case Study: Blackfin.
1 SHARC ‘S’uper ‘H’arvard ‘ARC’hitecture Nagendra Doddapaneni ER hit HAR ect VARD ure SUP Arc.
Introduction to ARM Architecture, Programmer’s Model and Assembler Embedded Systems Programming.
20 October 2003WASPAA New Paltz, NY1 Implementation of real time partitioned convolution on a DSP board Enrico Armelloni, Christian Giottoli, Angelo.
ClearSpeed CSX620 Overview. References ClearSpeed Technical Training Slides for ClearSpeed Accelerator 620, software version 3.0, Slide Sets 1-6, Presentor:
Performance Analysis of Processor Midterm Presentation Performed by : Winter 2005 Alexei Iolin Alexander Faingersh Instructor: Evgeny.
Embedded Systems Programming
PlayStation 2 Architecture Irin Jose Farid Momin Quy Ngo Olivia Wong.
Eye-RIS. Vision System sense – process - control autonomous mode Program stora.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems John D. Davis, Lance Hammond, Kunle Olukotun Computer Systems Lab Stanford.
Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.
FPGA Based Fuzzy Logic Controller for Semi- Active Suspensions Aws Abu-Khudhair.
1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.
Real time DSP Professors: Eng. Julian Bruno Eng. Mariano Llamedo Soria.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
© 2005 Mercury Computer Systems, Inc. Yael Steinsaltz, Scott Geaghan, Myra Jean Prelle, Brian Bouzas,
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
2007 Sept 06SYSC 2001* - Fall SYSC2001-Ch1.ppt1 Computer Architecture & Organization  Instruction set, number of bits used for data representation,
Computational Technologies for Digital Pulse Compression
Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.
1 Chapter 04 Authors: John Hennessy & David Patterson.
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
NDA Confidential. Copyright ©2005, Nallatech.1 Implementation of Floating- Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High- Level.
Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture William Lundgren Gedae), Rick Pancoast.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 24, 2003 Authors Ken Cameron
SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.
Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.
Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD September 7, 2007 PaCT-2007, Pereslavl-Zalessky.
Overview of Super-Harvard Architecture (SHARC) Daniel GlickDaniel Glick – May 15, 2002 for V (Dewar)
M U N - February 17, Phil Bording1 Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration R. Phillip Bording February.
A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
Playstation2 Architecture Architecture Hardware Design.
ZHAO(176/MAPLD2004)1 FFT Mapping on Mathstar’s FPOA FilterBuilder Platform MathStar, Inc. Sept 2004.
Lx: A Technology Platform for Customizable VLIW Embedded Processing.
DIGITAL SIGNAL PROCESSORS. What are Digital Signals? Digital signals have finite precision in both the time (sampled) and amplitude (quantized) domains.
Approximate Computing on FPGA using Neural Acceleration Presented By: Mikkel Nielsen, Nirvedh Meshram, Shashank Gupta, Kenneth Siu.
A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.
WorldScape Defense Company, L.L.C. Company Proprietary Slide 1 An Ultra-High Performance Scalable Processing Architecture for HPC and Embedded Applications.
An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.
Advanced Rendering Technology The AR250 A New Architecture for Ray Traced Rendering.
HPEC-1 SMHS 7/7/2016 MIT Lincoln Laboratory Focus 3: Cell Sharon Sacco / MIT Lincoln Laboratory HPEC Workshop 19 September 2007 This work is sponsored.
A next-generation many-core processor with reliability, fault tolerance and adaptive power management features optimized for embedded.
HPEC 2003 Linear Algebra Processor using FPGA Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa Drexel University.
GCSE Computing - The CPU
Hardware design considerations of implementing neural-networks algorithms Presenter: Nir Hasidim.
Embedded Systems Design
Presented by: Tim Olson, Architect
Session 4: Reconfigurable Computing
FPGAs in AWS and First Use Cases, Kees Vissers
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
ClearSpeed CSX620 Overview
GCSE Computing - The CPU
CSE 502: Computer Architecture
ADSP 21065L.
Martin Croome VP Business Development GreenWaves Technologies.
Presentation transcript:

Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004 Authors Stewart Reddaway / WorldScape Inc. Brad Atwater / Lockheed Martin MS2 Paul Bruno / WorldScape Inc. Dairsie Latimer / ClearSpeed Technology, plc. Rick Pancoast / Lockheed Martin MS2 Pete Rogina / WorldScape Inc. Leon Trevito / Lockheed Martin MS2

Overview  Work Objective ŸProvide working hardware benchmark for Multi-Threaded Array Processing Technology –Enable embedded processing decisions to be accelerated for upcoming platforms (radar and others) –Validate Pulse Compression benchmark with hardware, and with data flowing from and to external DRAM –Support customers’ strategic technology investment decisions ŸShare results with industry –New standard for performance AND performance per watt

Architecture  ClearSpeed’s Multi Threaded Array Processor Architecture – MTAP  Fully programmable at high level with Cn (parallel variant of C)  Hardware multi-threading  Extensible instruction set  Fast processor initialization and restart  High performance, low power –~ 10 GFLOPS/Watt  Scalable internal parallelism –Array of Processor Elements (PEs) –Compute and bandwidth scale together –From 10s to 1,000s of PEs –Multiple specialized execution units per PE  Multiple high speed I/O channels Architectural DSP Features:  Multiple operations per cycle –Data-parallel array processing –Internal PE parallelism –Concurrent I/O and compute –Simultaneous mono and poly operations  Specialized execution units in each PE –Integer MAC, Floating-Point Units  On-chip memories –Instruction and data caches –High bandwidth PE “poly” memories –Large scratchpad “mono” memory  Zero overhead looping –Concurrent mono and poly operations

Architecture  Processor Element Structure  ALU + accelerators: integer MAC, Dual FPU, DIV/SQRT  High-bandwidth, multi-port register file  Closely-coupled SRAM for data  High-bandwidth per PE DMA: PIO, SIO  High-bandwidth inter-PE communication  Supports multiple data types: –8, 16, 24, 32-bit,... fixed point –32-bit IEEE floating point

Applications  Power Comparison Results (Table presented at HPEC 2003) ProcessorClockPower FFT/sec /Watt PC/sec/ Watt Mercury PowerPC MHz 8.3 Watts WorldScape / ClearSpeed 64 PE Chip 200 MHz 2.0 Watts** Speedup X 31.9 X ** 2.0 Watts was the worst case result from Mentor Mach PA Tools. Actual Measured Hardware Results < 1.85 Watts HPEC 2003 Cycle Accurate Simulations were validated on actual hardware. Results matched to within 1%.

Benchmark WorldScape and Lockheed Martin collaborated to provide demonstration using realistic Pulse Compression data on actual hardware Input Data FFT Complex Multiply IFFT Output Data Pulse Compression – 1K FFT and IFFT implemented on 8 PEs with 128 complex points per PE (8 FFTs performed in parallel over 64 PEs) –Pulse Compression based upon optimized instructions: FFT, complex multiply by a realistic reference FFT, IFFT –32-bit IEEE standard floating point Reference FFT

Benchmark Benchmark Measurements: Validate Pulse Compression performance with hardware and with data flowing from and to external DRAM (1 MTAP processor) 1) Input Data and reference Function loaded from Host onto DRAM 2) Data input from DRAM to MTAP #1, processed, and output into DRAM 3) Results returned to Host for display MTAP #1 MTAP #2 Host DRAM Per Second Per Watt ( /s/W) Pulse Compression * Per Second ( /s) GFLOP FFTs (within PC) * Adjusted for CM = FFT/s, FFT/s/W

Benchmark  1 KHz PRF (1ms PRI)  20 MHz sampling rate  870 samples  Echo  10 us pulse  LFM chirp up  200 samples ŸPulse Compression Input (MatLab) ŸPulse Compression Reference (MatLab)  Frequency Domain Reference  10 us  LFM chirp up  1024 samples  Hamming weighting  Bit-reversed to match optimized implementation ŸPulse Compression Output (MatLab)  671 samples out of PC

Benchmark ŸPulse Compression Input/Output (Actual) ŸPulse Compression Reference (Actual)*

Benchmark Benchmark Measurements: Validate Pulse Compression performance with hardware and with data flowing from and to external DRAM (Average Performance across 2 MTAP processors) 1) Input Data and reference Function loaded from Host onto DRAM 2) Data input to MTAP #1 and (via MTAP #1) to MTAP #2, processed, and output (via MTAP #1) into DRAM 3) Results returned to Host for display Per Second Per Watt ( /s/W) Pulse Compression * Per Second ( /s) GFLOP FFTs (within PC) * Adjusted for CM = FFT/s, FFT/s/W MTAP #1 MTAP #2 Host DRAM 1 3 2

Summary Hardware validation of HPEC 2003 results to within 1% Optimized Pulse Compression functions modified using COTS SDK and integrated onto Host platform VSIPL Core Lite Libraries under development Image Processing Signal Processing Compression/De-compression Wide Ranging Applicability to DoD/Commercial Processing Requirements World-class radar processing benchmark results Encryption/De-cryption Network Processing Search Engine Supercomputing Applications Application Areas