Implementing Large Scale FFTs on Heterogeneous Multicore Systems. Yan Li 1, Jeff Diamond 2, Haibo Lin 1, Yudong Yang 3, Zhenxing Han 3. June 4th, 2011.

Presentation transcript:

Implementing Large Scale FFTs on Heterogeneous Multicore Systems
Yan Li 1, Jeff Diamond 2, Haibo Lin 1, Yudong Yang 3, Zhenxing Han 3
June 4th, 2011
1 IBM China Research Lab, 2 University of Texas at Austin, 3 IBM Systems Technology Group

Current FFT Libraries
2nd most important HPC application
  ◦ after dense matrix multiply
Post-PC emerging applications
Power efficiency
  ◦ custom VLSI / augmented DSPs
  ◦ Increasing interest in heterogeneous MC
Target original HMC: IBM Cell B.E.

FFT on Cell Broadband Engine
Best implementations not general
  ◦ FFT must reside on single accelerator (SPE)
    ▪ Not “large scale”
  ◦ Only certain FFT sizes supported
  ◦ Not “end to end” performance
First high performance general solution
  ◦ Any size FFT spanning all cores on two chips
  ◦ Extensible to any size
  ◦ Performance 50% greater

Paper Contributions
First high performance, general FFT library on HMC
  ◦ 67% faster than FFTW “end to end”
  ◦ 36 FFT Gflops for SP 1-D complex FFT
Explore FFT design space on HMC
  ◦ Quantitative performance comparisons
    ▪ Nontraditional FFT solutions superior
  ◦ Novel factorization and buffer strategies
Extrapolate lessons to general HMC

Talk Outline

Fourier Transform is a Change of Basis
[Figure: complex unit circle with real axis X, imaginary axis iY, angle θ, and point P(x, y)]
$P(\cos\theta + i\sin\theta) = Pe^{i\theta}$

Discrete Fourier Transform
$\omega_N = e^{-2\pi i/N}$
$Y[k] = \sum_{j=0}^{N-1} X[j]\,\omega_N^{jk}$
Cost is $O(N^2)$
* Graphs from Wikipedia entry “DFT matrix”

Fast Fourier Transform
J. Cooley and J. Tukey, 1965
n = n1 * n2
Can do this recursively, factoring n1 and n2 further…
For prime sizes, can use Rader’s algorithm, which recasts the DFT as a convolution:
  ◦ Increase FFT size to next power of 2
  ◦ Perform two FFTs and one inverse FFT to get answer
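
The factorization n = n1 * n2 can be made concrete by splitting both indices in two. The following is a sketch of the standard Cooley-Tukey identity, not taken from the slides: the index names $j_1, j_2, k_1, k_2$ are introduced here for illustration, and $\omega_m = e^{-2\pi i/m}$ assumes the usual forward-transform sign convention.

$$
j = j_1 n_2 + j_2, \qquad k = k_1 + n_1 k_2, \qquad 0 \le j_1, k_1 < n_1, \quad 0 \le j_2, k_2 < n_2
$$

$$
Y[k_1 + n_1 k_2] = \sum_{j_2=0}^{n_2-1} \left[ \omega_n^{j_2 k_1} \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{j_1 k_1} \right) \right] \omega_{n_2}^{j_2 k_2}
$$

Read from the inside out, the inner sum is $n_2$ independent FFTs of size $n_1$, the factor $\omega_n^{j_2 k_1}$ is a twiddle whose exponent is the product of the two coordinates, and the outer sum is $n_1$ independent FFTs of size $n_2$.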

Cooley-Tukey Example
Highest level is simple factorization
  ◦ Example: N = 35, row major

Cooley-Tukey Example
Step 1: strided 1-D FFT across columns
Replaces columns with all new values

Cooley-Tukey Example
Step 2: multiply by twiddle factors
Exponents are product of coordinates
1     1     1     1     1     1     1
1     W     W^2   W^3   W^4   W^5   W^6
1     W^2   W^4   W^6   W^8   W^10  W^12
1     W^3   W^6   W^9   W^12  W^15  W^18
1     W^4   W^8   W^12  W^16  W^20  W^24
(Ws are base N=35)

Cooley-Tukey Example
Step 3: 1-D FFT across rows
This gather is all-to-all communication
Replaces rows with all new values

Cooley-Tukey Example
Step 4: do final logical transpose
Frequencies are in the wrong places
  ▪ Really a scatter
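
The four steps of the example can be sketched in a few lines of Python. This is a minimal illustration, not the library's code: the function names `dft` and `four_step_fft` are made up here, the small FFTs are done with a naive DFT for brevity, and the 5 x 7 row-major layout for N = 35 is an assumption consistent with the twiddle table.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT; serves as the small kernel and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Cooley-Tukey on an n1 x n2 row-major grid (hypothetical helper)."""
    n = n1 * n2
    assert len(x) == n
    grid = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]

    # Step 1: strided 1-D FFT down each column (length n1).
    for c in range(n2):
        col = dft([grid[r][c] for r in range(n1)])
        for r in range(n1):
            grid[r][c] = col[r]

    # Step 2: twiddle factors; the exponent is the product of coordinates.
    for r in range(n1):
        for c in range(n2):
            grid[r][c] *= cmath.exp(-2j * cmath.pi * r * c / n)

    # Step 3: 1-D FFT across each row (length n2).
    grid = [dft(row) for row in grid]

    # Step 4: logical transpose, reading the grid column by column.
    return [grid[r][c] for c in range(n2) for r in range(n1)]

signal = [complex(i % 7, -0.5 * i) for i in range(35)]
result = four_step_fft(signal, 5, 7)
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(result, reference))
```

On a distributed machine, steps 1 and 3 are the local compute phases, while the column-to-row switch between them and the final transpose are where the gather and scatter traffic occurs.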

First Heterogeneous Multicore
Cell 2006 – 90nm, 3.2 GHz – a Low Latency Throughput Architecture
  ◦ 234MT, 235mm^2, 204 SP GFLOPS
25.6 GB/sec bidirectional ring bus, 1 cycle hop
256KB scratchpad per SPE, 6-cycle latency
4-wide, dual issue 128-bit SIMD, 128 registers
SPE DMA control with true scatter/gather via address list
64-bit PowerPC
8 vector processors

IBM BladeCenter Blade
Dual 3.2 GHz PowerXCell 8i
8GB DDR2 DRAM over XDR interface

Key Implementation Issues*
Communication Topology
  ◦ Centralized (classic accelerator)
  ◦ Peer to peer
FFT factorization
Scratchpad allocation
  ◦ Twiddle computation
* For additional implementation details, see IPDPS 2009 paper

1. Communication Topology

2. Factorization Strategy (N1xN2)
Extreme aspect ratio – nearly 1-D
Choose N1 = 4 x number of SPEs
  ◦ Each SPU has exactly 4 rows
  ◦ Each row starts on consecutive addresses
    ▪ Exact match for 4-wide SIMD
    ▪ Exact match for 128-bit random access and DMA
Use DMA for scatters and gathers
  ◦ All-to-all exchange, initial gather, final scatter
  ◦ Need to store large DMA list of destinations
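
As a rough illustration of the N1 = 4 x number of SPEs rule, the sketch below computes the grid shape and a row-to-SPE assignment. The function name `plan_factorization` is hypothetical, and giving each SPE four contiguous rows is an assumption made here; the slides state only that each SPU has exactly 4 rows.

```python
def plan_factorization(n, num_spes, rows_per_spe=4):
    """Split an n-point FFT into an N1 x N2 grid with 4 rows per SPE.

    Hypothetical helper: contiguous row ownership is an assumption.
    """
    n1 = rows_per_spe * num_spes
    if n % n1 != 0:
        raise ValueError("n must be a multiple of N1 = 4 x number of SPEs")
    n2 = n // n1
    owners = {spe: list(range(spe * rows_per_spe, (spe + 1) * rows_per_spe))
              for spe in range(num_spes)}
    return n1, n2, owners

# Two chips with 8 SPEs each, transforming 2**20 points.
n1, n2, owners = plan_factorization(2 ** 20, 16)
aspect_ratio = n2 / n1
```

With 16 SPEs the grid is 64 rows by 16,384 columns, which shows how extreme the aspect ratio becomes for large transforms.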

Fewer SPEs Improve Throughput

3. Allocating Scratchpad Memory
Need to store EVERYTHING in 256KB
  ◦ Code, stack, DMA address lists, buffers…
  ◦ 64KB for 8,192 complex points
  ◦ 64KB for output (FFT result) buffer
  ◦ 64KB to overlap communication
Only 64KB left to fit…
  ◦ 120KB for kernel code
  ◦ 64KB for twiddle factor storage
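
The budget arithmetic on this slide can be checked directly. The sketch below uses only the figures quoted above; the variable names are illustrative, and 8 bytes per point assumes single-precision complex values (two 4-byte floats), which matches 64KB for 8,192 points.

```python
KB = 1024
SCRATCHPAD = 256 * KB
BYTES_PER_POINT = 8  # assumed: single-precision complex, 2 x 4-byte floats

data_buffers = {
    "input (8,192 complex points)": 8192 * BYTES_PER_POINT,
    "output (FFT result)": 64 * KB,
    "communication overlap": 64 * KB,
}
wanted = {
    "kernel code": 120 * KB,
    "twiddle factors": 64 * KB,
}

remaining = SCRATCHPAD - sum(data_buffers.values())
shortfall = sum(wanted.values()) - remaining
print(remaining // KB, shortfall // KB)  # prints "64 120"
```

The three data buffers leave 64KB, while code and twiddles together would need 184KB, so 120KB has to be saved somewhere.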

Multimode Twiddle Buffers
Allocate 16KB in each SPU
  ◦ Supports local FFTs up to 2,048 points
Three Kernel Modes
  ◦ < 2KP, use twiddle factors directly
  ◦ 2KP-4KP, store half and compute rest
  ◦ 4KP-8KP, store ¼ and compute rest
Only 0.5% performance drop
Leaves 30KB for code
  ◦ Dynamic code overlays
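
The slides do not say how the missing twiddle factors are computed. One standard option, sketched below as an assumption rather than as the authors' method, uses the symmetry of the unit circle: W^(k + N/2) = -W^k and W^(k + N/4) = -i * W^k, so a half or quarter table is enough. All names here are hypothetical, and n is assumed to be a power of two.

```python
import cmath

# Multiplying by these constants only swaps and negates components,
# so deriving a missing twiddle needs no trigonometry.
ROTATIONS = {1: [1], 2: [1, -1], 4: [1, -1j, -1, 1j]}
TABLE_CAPACITY = 2048  # 16KB / 8 bytes per single-precision complex value

def stored_fraction(n):
    """Pick the mode: store all, half, or a quarter of the n twiddles."""
    for fraction in (1, 2, 4):
        if n // fraction <= TABLE_CAPACITY:
            return fraction
    raise ValueError("local FFT too large for the twiddle buffer")

def build_table(n):
    """Store only the first n / fraction twiddles W_n^k = exp(-2*pi*i*k/n)."""
    size = n // stored_fraction(n)
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(size)]

def twiddle(k, n, table):
    """Return W_n^k, rotating a stored entry when k is past the table."""
    fraction = n // len(table)
    block, offset = divmod(k % n, len(table))
    return table[offset] * ROTATIONS[fraction][block]

table = build_table(8192)        # quarter mode: 2,048 stored entries
w = twiddle(5000, 8192, table)   # derived from table[904] times -1
```

In every mode the table holds at most 2,048 entries, so it fits the 16KB buffer regardless of the local FFT size up to 8,192 points.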

FFT Is Memory Bound! Transfer takes % longer than entire FFT

67% faster than state of the art Excellent power of two performance

Conclusion
Best in class general purpose FFT library
  ◦ 67% faster than FFTW
Heterogeneous MC effective platform
  ◦ Different implementation strategies
Peer-to-peer communication superior
Case for autonomous, low latency accelerators

Thank You Any Questions?