FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.

Slides:



Advertisements
Similar presentations
David Hansen and James Michelussi
Advertisements

Very Large Fast DFT (VL FFT) Implementation on KeyStone Multicore Applications.
Implementing Large Scale FFTs on Heterogeneous Multicore Systems Yan Li 1, Jeff Diamond 2, Haibo Lin 1, Yudong Yang 3, Zhenxing Han 3 June 4 th,
Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.
MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms Mancia Anguita Universidad de Granada J. Manuel Martinez – Lechado Vitelcom.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
Digital Kommunikationselektronik TNE027 Lecture 5 1 Fourier Transforms Discrete Fourier Transform (DFT) Algorithms Fast Fourier Transform (FFT) Algorithms.
Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
Masters Presentation at Griffith University Master of Computer and Information Engineering Magnus Nilsson
Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen University of Oslo Cell Broadband Engine Structure SPE PPE MIC EIB.
A Matlab Playground for JPEG Andy Pekarske Nikolay Kolev.
Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.
Data Parallel Algorithms Presented By: M.Mohsin Butt
Development of a Ray Casting Application for the Cell Broadband Engine Architecture Shuo Wang University of Minnesota Twin Cities Matthew Broten Institute.
Parallelization of FFT in AFNI Huang, Jingshan Xi, Hong Department of Computer Science and Engineering University of South Carolina.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Fast Fourier Transformation of Vibration Signals using Microsoft EXCEL Mohammed Aref Department of Information Technology Salalah.
Fast (finite) Fourier Transforms (FFTs) Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com December 5,
Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.
1 Parallel computing and its recent topics. 2 Outline 1. Introduction of parallel processing (1)What is parallel processing (2)Classification of parallel.
Cell Architecture. Introduction The Cell concept was originally thought up by Sony Computer Entertainment inc. of Japan, for the PlayStation 3 The architecture.
Evaluation of Multi-core Architectures for Image Processing Algorithms Masters Thesis Presentation by Trupti Patil July 22, 2009.
SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick Lawrence Berkeley National Laboratory ACM International Conference.
© 2005 Mercury Computer Systems, Inc. Yael Steinsaltz, Scott Geaghan, Myra Jean Prelle, Brian Bouzas,
Processor Architecture Needed to handle FFT algoarithm M. Smith.
High Performance Linear Transform Program Generation for the Cell BE
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Cell processor implementation of a MILC lattice QCD application.
CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.
High Performance Computing on the Cell Broadband Engine
1/21 Cell Processor (Cell Broadband Engine Architecture) Mark Budensiek.
Adaptive Parallel Sorting Algorithms in STAPL Olga Tkachyshyn, Gabriel Tanase, Nancy M. Amato
NDA Confidential. Copyright ©2005, Nallatech.1 Implementation of Floating- Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High- Level.
FFT: Accelerator Project Rohit Prakash Anand Silodia.
Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture William Lundgren Gedae), Rick Pancoast.
Fast Memory Addressing Scheme for Radix-4 FFT Implementation Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Xin Xiao, Erdal Oruklu and.
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.
© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.
Seunghwa Kang David A. Bader Optimizing Discrete Wavelet Transform on the Cell Broadband Engine.
Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.
Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.
HPEC SMHS 9/24/2008 MIT Lincoln Laboratory Large Multicore FFTs: Approaches to Optimization Sharon Sacco and James Geraci 24 September 2008 This.
Sep 08, 2009 SPEEDUP – Optimization and Porting of Path Integral MC Code to New Computing Architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić,
Sam Sandbote CSE 8383 Advanced Computer Architecture The IBM Cell Architecture Sam Sandbote CSE 8383 Advanced Computer Architecture April 18, 2006.
Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung Wong Chung Hoi Supervised by Prof. Michael R. Lyu Department of Computer.
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
Reconfigurable FFT architecture
Optimizing Ray Tracing on the Cell Microprocessor David Oguns.
A Data-Driven Approach for Pipelining Sequences of Data-Dependent LOOPs João M. P. Cardoso ITIV, University of Karlsruhe, July 2, 2007 Portugal.
3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,
Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.
Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.
An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.
Institute of Software,Chinese Academy of Sciences An Insightful and Quantitative Performance Optimization Chain for GPUs Jia Haipeng.
IBM Cell Processor Ryan Carlson, Yannick Lanner-Cusin, & Cyrus Stoller CS87: Parallel and Distributed Computing.
 presented by- ARPIT GARG ISHU MISHRA KAJAL SINGHAL B.TECH(ECE) 3RD YEAR.
1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)
Martin Kruliš by Martin Kruliš (v1.0)1.
HPEC-1 SMHS 7/7/2016 MIT Lincoln Laboratory Focus 3: Cell Sharon Sacco / MIT Lincoln Laboratory HPEC Workshop 19 September 2007 This work is sponsored.
Automated Software Generation and Hardware Coprocessor Synthesis for Data Adaptable Reconfigurable Systems Andrew Milakovich, Vijay Shankar Gopinath, Roman.
CUDA Interoperability with Graphical Environments
High Performance Computing on an IBM Cell Processor --- Bioinformatics
9/18/2018 Accelerating IMA: A Processor Performance Comparison of the Internal Multiple Attenuation Algorithm Michael Perrone Mgr, Cell Solution Dept.,
Array Processor.
Implementation of neural gas on Cell Broadband Engine
Mapping the FFT Algorithm to the IBM Cell Processor
Large data arrays processing on Cell Broadband Engine
Presentation transcript:

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 2 Cell System Features  Heterogeneous multi-core system architecture  Power Processor Element for control tasks  Synergistic Processor Elements for data- intensive processing  Synergistic Processor Element (SPE) consists of  Synergistic Processor Unit (SPU)  Synergistic Memory Flow Control (MFC)  Data movement & synchronization  Interface to high-performance Element Interconnect Bus

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 3 Fast Fourier Transform FFTs are used in: - Data Compression, Seismic Imaging, Fluid Dynamics, Image Processing, etc. Medium Size FFTs - Complex single-precision 1D FFT. - Input samples and output results reside in main memory. - Radix 2, 3 and 5. - Optimized for 1K-16K points.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 4 Existing FFT Research on Cell/B.E. [Williams et al., 2006], analyzed peak performance. [Cico, Cooper and Greene, 2006] estimated 22.1 GFlops/s for an 8K complex 1D FFT that resides in the Local Store of one SPE. –8 independent FFTs in local store of 8 SPEs gives GFlops/s. [Chow, Fossum and Brokenshire, 2005] achieved 46.8 GFlops/s for 16M complex FFT. –Highly specialized for this particular input size. FFTW is a highly portable FFT library of various types, precision and input size.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 5 Our FFTC is based on Cooley Tukey  Out of Place 1D FFT requires two arrays A & B for computation at each stage

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 6  Butterflies of the ordered DIF (Discrete in Frequency) FFT Algorithm.  Does not require bit reveral at the end of all stages. Illustration of the Algorithm

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 7 FFTC design on Cell/B.E.  Synchronize step after every stage leads to significant overhead  Load balancing to achieve better SPU utilization  Vectorization difficult for every stage  Limited local store -require space for twiddle factors and input data. -loop unrolling and duplication increases size of the code.  Algorithm is branchy: -Doubly nested for loop within the outer while loop -Lack of branch predictor compromises performance.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 8 Paralleling FFTC  Number of chunks := 2*p p: Number of SPEs  Chunk i and i+p are allocated to SPE i.  Each chunk is fetched using DMA get with multibuffering.  Achieves load balancing.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 9 Tree Synchronization  Each synchronization stage takes 1 microsec - In comparison to PPU coordinated sync. that takes 20 microsec.  Minimizes sync. overhead  Synchronization after every stage using Inter-SPE DMA communication - Achieved in (2*logn) stages.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 10 Optimization for SPE  Loop duplication for Stages 1 & 2 - For vectorization of these stages we need to use spu_shuffle on output vector.  Loop duplication for NP<buffersize and otherwise. - Need to stall for DMA get at different places within the loop.  Code size increases which limits the size of FFT that can be computed.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 11 Pseudo Code for the Algorithm on Cell

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 12 Experimental Setup  Manual Loop unrolling, multi-buffering, inter SPE communication, odd-even pipelining, vectorization.  Instruction level profiling and performance analysis using Cell SDK 2.1  FLOP analysis  Operation Count : (5*N log N) floating point operations  For 2 complex value computations we require - One complex subtraction (2 FLOP), One complex addition (2 FLOP) and one complex multiplication (6 FLOP).

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 13 Performance analysis : Scaling across SPEs

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 14 Performance Comparison of FFTs * Performance numbers from BenchFFT.

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 15  FFTC is our high performance design of an FFT for a single 1-Dimensional DIF FFT.  Use various optimization techniques such as Manual Loop unrolling, multi-buffering, inter SPE communication, odd-even pipelining, vectorization.  Demonstrate superior performance of 18.6 GigaFlop/s for an FFT of size 8k-16K.  Code available at Conclusions

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 16 Acknowledgment of Support National Science Foundation –CSR : A Framework for Optimizing Scientific Applications ( ) –CAREER: High-Performance Algorithms for Scientific Applications ( ; ) –ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO ) –DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering ( ). IBM Shared University Research (SUR) Grant Sony-Toshiba-IBM (STI) Microsoft Research Sun Academic Excellence Grant

Virat Agarwal, FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine 17 Thank you Questions?