High Performance Linear Transform Program Generation for the Cell BE

Presentation transcript:

High Performance Linear Transform Program Generation for the Cell BE. Vas Chellappa, Franz Franchetti, Markus Püschel. Electrical & Computer Engineering, Carnegie Mellon University. Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.

Cell Broadband Engine. Multicore CPU (8 SPEs + 1 PPE). SPEs: SIMD cores designed for numerical computing. 256 KB “local store” (LS) per SPE (scratchpad-like). Programmer-driven DMA. 204 Gflop/s peak. [Figure: Cell BE chip with SPEs and their local stores, the EIB, and main memory.] How do we harness the Cell’s impressive peak performance?
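
The 204 Gflop/s figure can be reproduced with a short calculation. This is a sketch under assumptions not stated on the slide: a 3.2 GHz clock, 4-way single-precision SIMD per SPE, and one fused multiply-add (2 flops) per lane per cycle.

```latex
\underbrace{8}_{\text{SPEs}} \times \underbrace{4}_{\text{SIMD lanes}} \times \underbrace{2}_{\text{flops per FMA}} \times \underbrace{3.2\ \text{GHz}}_{\text{clock}} = 204.8\ \text{Gflop/s}
```

Under the same assumptions, a single SPE peaks at 25.6 Gflop/s in single precision.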

DFT on the Cell BE. [Plot: DFT performance of Spiral-generated code (this paper), FFTC, FFTW, and Numerical Recipes, with a 350x gap marked.] Platform-tuned code is 350x faster. But hard to write!

Overview: Background, Spiral Overview; Generating DFTs for the Cell; Performance Results; Concluding Remarks. Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

“Fitting” Dataflow to Hardware. To “fit” the DFT to an architecture: various traversals, various factorizations. [Figure: the same dataflow stages arranged three ways: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore).] How to map dataflow to architecture automatically?

“Fitting” Dataflow to Platform (contd.). [Figure: numbered dataflow stages before and after being split across Core 0 and Core 1.] Intuition: rewrite formulas to obtain suitable dataflow.

Program Generation in Spiral. Flow: Transform (user specified) → Fast algorithm in SPL (many choices) → ∑-SPL → C code. Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, ... Iteration of this process to search for the fastest. But that’s not all ...

Common Abstraction: SPL. SPL (signal processing language): tensor-product representation; mathematical notation exposes structure. Algorithms in SPL: products of structured sparse matrices. Algorithms reduce arithmetic cost from O(n^2) to O(n log n). E.g.: Cooley-Tukey fast Fourier transform (FFT) [formula not preserved in the transcript]. Tensor products in SPL represent loop structures.
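
For reference, the Cooley-Tukey FFT is commonly written in SPL as shown below. This is the standard form from the Spiral literature, not necessarily the exact instance on the slide. The symbols are assumptions in that sense: I_n is the n×n identity, T^{mn}_n is a diagonal matrix of twiddle factors, and L^{mn}_m is the stride permutation that reads its input at stride m.

```latex
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\; T^{mn}_n\; (I_m \otimes \mathrm{DFT}_n)\; L^{mn}_m

% Tensor products as loops: I_m \otimes A_n applies A_n to m contiguous blocks of length n
y = (I_m \otimes A_n)\,x
\quad\Longleftrightarrow\quad
y_{[in,\;in+n)} = A_n\, x_{[in,\;in+n)}, \qquad 0 \le i < m
```

The second line is the sense in which a tensor product represents a loop: the index i becomes the loop variable, and A_n becomes the loop body.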

Mapping DFTs to the Cell. Objective: high-performance DFT transform library for the Cell BE. [Figure: Cell BE chip with SPEs and local stores (LS), the EIB, and main memory.] Cell’s architectural paradigms: Vectorization: vectorize the DFT for vector length ν. Parallelization: parallelize the DFT across p SPEs, and use a DMA packet size of μ. Multibuffering: optimize the DFT for throughput (s DFTs required). Tags guide formula rewriting.

SPL to Parallel Code. Natural parallel construct in SPL: [Figure: input x is split across Processors 0 to 3, each applying A to produce its part of y.] Independent, load-balanced, communication-free operation. Parallelizing other constructs in SPL: permutations require message exchange (on-chip DMA communication). Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA.
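
The two kinds of construct on this slide can be sketched in plain C. This is an illustration under assumptions, not Spiral's generated code: the names `apply_block_diagonal`, `stride_permute`, and `butterfly2` are hypothetical, the "processors" are simulated by a sequential loop, and the permutation is done by indexing instead of on-chip DMA.

```c
#include <stddef.h>

/* Hypothetical kernel type: computes y = A x for one block of length n. */
typedef void (*kernel_fn)(const float *x, float *y, size_t n);

/* Example kernel A (assumption for illustration): a size-2 butterfly. */
void butterfly2(const float *x, float *y, size_t n)
{
    (void)n; /* block length is fixed at 2 for this kernel */
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* y = (I_p (x) A) x: each of the p "processors" applies A to its own
 * contiguous block of length n. No iteration touches another block, so
 * the work is independent, load-balanced, and communication-free. */
void apply_block_diagonal(kernel_fn A, size_t p, size_t n,
                          const float *x, float *y)
{
    for (size_t proc = 0; proc < p; proc++) {
        A(x + proc * n, y + proc * n, n);
    }
}

/* y = L x for the stride permutation that reads x at stride m:
 * y[i*n + j] = x[j*m + i]. Every output block gathers elements from all
 * input blocks, which is why permutations need data exchange. */
void stride_permute(size_t m, size_t n, const float *x, float *y)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            y[i * n + j] = x[j * m + i];
        }
    }
}
```

Calling `stride_permute(4, 2, x, t)` followed by `apply_block_diagonal(butterfly2, 4, 2, t, y)` on an 8-element input first gathers elements that are 4 apart into adjacent pairs and then runs four independent butterflies, one per simulated processor.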

SPL to Streaming Code. Streaming: overlapping computation with communication, both on-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory). Idea: tensor loops become multi-buffered loops. In the i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}. (Trickier for other SPL constructs.) Useful for: throughput-optimized code; large, out-of-chip sizes. Idea: rewrite algorithm at SPL level to achieve largest DMA packets.
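
The write/compute/read schedule of a multi-buffered loop can be sketched as follows. This is a minimal illustration under stated assumptions: `dma_get` and `dma_put` are hypothetical stand-ins implemented with synchronous `memcpy` (on the Cell they would be asynchronous MFC DMA commands followed by a tag wait), `BLOCK` is an arbitrary block length, and `kernel` is a placeholder for the DFT kernel.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4 /* hypothetical block length that fits in the local store */

/* Stand-ins for DMA transfers between main memory and the local store. */
void dma_get(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof(float));
}

void dma_put(float *main_mem, const float *local, size_t n)
{
    memcpy(main_mem, local, n * sizeof(float));
}

/* Placeholder kernel A: doubles every element of a block. */
void kernel(const float *in, float *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        out[k] = 2.0f * in[k];
    }
}

/* Streams `blocks` blocks of x through two input and two output buffers. */
void stream_blocks(const float *x, float *y, size_t blocks)
{
    float in[2][BLOCK], out[2][BLOCK];

    if (blocks == 0) {
        return;
    }
    dma_get(in[0], x, BLOCK); /* prologue: read block 0 */

    for (size_t i = 0; i < blocks; i++) {
        size_t cur = i % 2;       /* buffers holding block i */
        size_t oth = (i + 1) % 2; /* buffers holding blocks i-1 and i+1 */

        if (i > 0) {              /* write A_{i-1} */
            dma_put(y + (i - 1) * BLOCK, out[oth], BLOCK);
        }
        if (i + 1 < blocks) {     /* read A_{i+1} */
            dma_get(in[oth], x + (i + 1) * BLOCK, BLOCK);
        }
        kernel(in[cur], out[cur], BLOCK); /* compute A_i */
    }

    /* epilogue: write the last block */
    dma_put(y + (blocks - 1) * BLOCK, out[(blocks - 1) % 2], BLOCK);
}
```

With real asynchronous DMA, the two transfers would be issued before the kernel call and waited on after it, so that communication for blocks i-1 and i+1 overlaps with computation on block i.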

Generating Cell Code. Flow: Transform (user specified) → Fast algorithm in SPL → tag-guided rewriting → Loop operations in ∑-SPL → Cell-specific optimized C code (intrinsics, DMA, etc.). The rewritten formula is: streamed from memory for throughput; load balanced across p SPEs; built on a SIMD kernel optimized for the memory hierarchy; connected by all-to-all communication (on-chip).

Generated Code Sample (annotated on the slide as vectorized, DMA, and parallelized code). DFT 2^16: 4,000+ lines of code!

    /* Complex-to-complex DFT size 64 on 2 SPEs */
    void dft_c2c_64(float *X, float *Y, int spuid)
    {
        int i;
        /* T1, T2: temporary buffers (declarations not shown on the slide) */

        // Block 1: (I x A) L
        for (i = 0; i <= 7; i++)              // Rightmost gather
        {
            DMA_GATHER(gath_func(X, i), gath_func(T1, i), 4);   // uses spu_mfcdma()
        }
        spu_mfcstat(MFC_TAG_UPDATE_ALL);      // Wait on gather

        // compute vectorized DFT kernel of size m

        for (i = 0; i <= 7; i++)              // Scatter at interface
        {
            DMA_SCATTER(scat_func(T1, i), scat_func(T2, i), 4);
        }
        all_to_all_synchronization_barrier(); // uses mailbox msgs

        // Block 2: (A x I)
        /* Gather is a no-op since the scatter above accounted for it */

        // compute vectorized DFT kernel of size n

        for (i = 0; i <= 7; i++)              // Leftmost scatter
        {
            DMA_SCATTER(scat_func(T1, i), scat_func(Y, i), 4);
        }
        all_to_all_synchronization_barrier();
    }

Problem Space: Options. [Figures: SPE/DFT diagrams, one per option.] Parallelization. Base (vectorized): vectorization assumed. Single DFT parallelized across multiple SPEs. Main memory operations (only for small DFTs). Multiple independent DFTs on multiple SPEs. Latency optimized (default). Multiple parallelized independent DFTs. Throughput, multibuffered.

Problem Space: Combinations. Throughput-optimized usage scenarios and latency-optimized usage scenarios. [Figures: SPE/DFT diagrams.] Parallel, multibuffered DFT; single DFT from main memory; independent DFTs multibuffered in parallel. Devise rewrite rules for tags. Nestings describe all scenarios.

SPE DFT. [Plot: single precision, IBM QS22; series for 1 SPE, 2 SPEs, 4 SPEs, and 8 SPEs.]

SPE DFT. [Plot: series for Spiral on 1 SPE, Spiral on 8 SPEs, FFTC, and FFTW.] 4.5x faster than FFTW, 1.63x faster than FFTC.

More Performance Results. Single-SPE DFT code; split/interleaved complex formats; non-2-power sizes; double precision (PowerXCell 8i). [Plots: series for Mercury, Spiral, Chow, and IBM SDK.]

Other Linear Transforms. Discrete sine and cosine transforms, DFT with real inputs (single-SPE). 2-D DFTs. Out-of-core sizes. Limited to 2D DFTs on 1-SPE (for now). More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of International Conference on Supercomputing (ICS), 2009.

Conclusion. Automatic generation of transform libraries: high performance; variety of scenarios, formats. High performance on Cell requires: vectorization, multi-core parallelization, streaming, DMA code. Future processors likely to have similar paradigms, tradeoffs. Spiral approach: common abstraction of transform, algorithm, architecture (SPL); rewrite rules to go from transform to architecture. [Figure: architecture space and algorithm space.]