Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus.

Slides:

Advertisements

Similar presentations

Parallel Processing (CS 730) Lecture 7: Shared Memory FFTs*

Advertisements

Carnegie Mellon Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension Automatic Generation.

The Study of Cache Oblivious Algorithms Prepared by Jia Guo.

Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)

Offline Adaptation Using Automatically Generated Heuristics Frédéric de Mesmay, Yevgen Voronenko, and Markus Püschel Department of Electrical and Computer.

Title of your presentation Jane Doe Hans Müller TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAA.

Carnegie Mellon Lessons From Building Spiral The C Of My Dreams Franz Franchetti Carnegie Mellon University Lessons From Building Spiral The C Of My Dreams.

CompSci Searching & Sorting. CompSci Searching & Sorting The Plan  Searching  Sorting  Java Context.

Automatic Performance Tuning of Sparse Matrix Kernels Observations and Experience Performance tuning is tedious and time- consuming work. Richard Vuduc.

Carnegie Mellon 1 Optimal Scheduling “in a lifetime” for the SPIRAL compiler Frédéric de Mesmay Theodoros Strigkos based on Y. Voronenko’s idea.

1 Lecture 6 Performance Measurement and Improvement.

A Fast Fourier Transform Compiler Silvio D Carnevali.

Carnegie Mellon μ-op Fission: Hyper-threading without the Hyper-headache Anthony Cartolano Robert Koutsoyannis Daniel S. McFarlin Carnegie Mellon University.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions IEEE/ACM Asia South Pacific Design Automation.

Concept of Basic Time Complexity Problem size (Input size) Time complexity analysis.

Carnegie Mellon Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic Lawrence J. Chang Inpyo Hong Yevgen Voronenko Markus Püschel Department.

Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.

Algorithmic Complexity Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.

Parallelization and CUDA libraries Lei Zhou, Yafeng Yin, Hong Man.

Generative Programming. Generic vs Generative Generic Programming focuses on representing families of domain concepts Generic Programming focuses on representing.

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Chapter 12 Fast Fourier Transform. 1.Metropolis algorithm for Monte Carlo 2.Simplex method for linear programming 3.Krylov subspace iteration (CG) 4.Decomposition.

A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and.

Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus.

SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.

Short Vector SIMD Code Generation for DSP Algorithms

Processor Architecture Needed to handle FFT algoarithm M. Smith.

College of Nanoscale Science and Engineering A uniform algebraically-based approach to computational physics and efficient programming James E. Raynolds.

SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU)

High Performance Linear Transform Program Generation for the Cell BE

1 Chapter 5 Divide and Conquer Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.

CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.

Introduction. 2COMPSCI Computer Science Fundamentals.

Jeff Wang Kay-Won Chang March 18, DEMO Harmonic Product Spectrum (HPS) pitch detection: obtain fundamental frequency from FFT Fast Fourier Transform.

Implementation of Fast Fourier Transform on General Purpose Computers Tianxiang Yang.

Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University.

FFT: Accelerator Project Rohit Prakash Anand Silodia.

Spiral: an empirical search system for program generation and optimization David Padua Department of Computer Science University of Illinois at Urbana-

Generative Programming. Automated Assembly Lines.

Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western.

Distributed WHT Algorithms Kang Chen Jeremy Johnson Computer Science Drexel University Franz Franchetti Electrical and Computer Engineering.

R. Arce-Nazario, M. Jimenez, and D. Rodriguez Electrical and Computer Engineering University of Puerto Rico – Mayagüez.

Eyes detection in compressed domain using classification Eng. Alexandru POPA Technical University of Cluj-Napoca Faculty.

2007/11/2 First French-Japanese PAAP Workshop 1 The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite Daisuke Takahashi Center for Computational.

Adaptive Fourier analysis based on the Walsh Hadamard transformation is a further development of the transform domain AFA presented Instead of the recursive.

Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura.

Inverse DFT. Frequency to time domain Sometimes calculations are easier in the frequency domain then later convert the results back to the time domain.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

FORTRAN History. FORTRAN - Interesting Facts n FORTRAN is the oldest Language actively in use today. n FORTRAN is still used for new software development.

Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.

A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)

Carnegie Mellon Program Generation with Spiral: Beyond Transforms This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury Inc., and.

A Language for the Compact Representation of Multiple Program Versions Sébastien Donadio 1,2, James Brodman 3, Thomas Roeder 4, Kamen Yotov 4, Denis Barthou.

1 MOTIVATION AND OBJECTIVE  Discrete Signal Transforms (DSTs) –DFT, DCT: major performance component in many applications –Hardware accelerated but at.

Professor A G Constantinides 1 Discrete Fourier Transforms Consider finite duration signal Its z-tranform is Evaluate at points on z-plane as We can evaluate.

1 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004.

Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science.

Custom Reduction of Arithmetic in Linear DSP Transforms S. Misra, A. Zelinski, J. C. Hoe, and M. Püschel Dept. of Electrical and Computer Engineering Carnegie.

In Search of the Optimal WHT Algorithm J. R. Johnson Drexel University Markus Püschel CMU

Fang Fang James C. Hoe Markus Püschel Smarahara Misra

Building a Program Generator With Scala

Linear Filters in StreamIt

Embedded Systems Design

Automatic Performance Tuning

High Performance Computing (CS 540)

Tze Meng Low, Qi Guo, Franz Franchetti

CS 201 Fundamental Structures of Computer Science

Kenneth Moreland Edward Angel Sandia National Labs U. of New Mexico

VSIPL++: Parallel Performance HPEC 2004

Presentation transcript:

Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus Püschel Carnegie Mellon University HPEC, September 2008, Lexington, MA, USA This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A

Carnegie Mellon The Problem: Example DFT Standard desktop computer Same operations count ≈4nlog 2 (n) Similar plots can be shown for all numerical problems 12x 30x Numerical recipes Best code 2

Carnegie Mellon DFT Plot: Analysis Memory hierarchy: 5x Vector instructions: 3x Multiple threads: 2x High performance library development = nightmare Automation? 3

Carnegie Mellon Idea: Textbook to Adaptive Library ? ? 4 “FFTW” Textbook FFT

Carnegie Mellon Goal: Teach Computers to Write Libraries Spiral Input: Transform: Algorithm: Hardware: 2-way SIMD + multithreaded Output: FFTW equivalent library For general input size Vectorized and multithreaded Performance competitive Key technologies: Layered domain specific language Algorithm manipulation via rewriting Feedback-driven search Result: Full automation 5

Carnegie Mellon Contribution: General Size Library DFT of size 1024 Spiral Transform T library for DFT of any size or Fundamentally different problems Env_1 dft(1024); dft.compute(X, Y); dft_1024(X, Y); 6

Carnegie Mellon Beyond Fourier Transform and FFTW Spiral 7 “FFTW” Cooley-Tukey FFT Spiral “FCTW” “Cooley-Tukey” DCT Spiral “FIRW” Overlap-save/add FIR Spiral “WHTW” Fast Walsh Transform Spiral “FWTW” Fast Wavelet Transform Spiral “FHTW” Fast Hartley Transform

Carnegie Mellon Examples of Generated Libraries 2-way vectorized, 2-threaded Most are faster than hand-written libs Code size: 8–120 KLOC or 0.5–5 MB Generation time: 1–3 hours 2-way vectorized, 2-threaded Most are faster than hand-written libs Code size: 8–120 KLOC or 0.5–5 MB Generation time: 1–3 hours DFT RDFT DHT DCT2DCT3DCT4 FilterWavelet Total: 300 KLOC / 13.3 MB of code generated in < 20 hours from a few simple algorithm specs Intel IPP library 6.0 will include Spiral generated code 8

Carnegie Mellon I. Background II. Library Generation III. Experimental Results IV. Conclusions and Future Work 9

Carnegie Mellon Linear Transforms Mathematically: matrix-vector product Examples: Transform matrixInput vectorOutput vector 10

Carnegie Mellon Fast Algorithms, Example: 4-point FFT Fast algorithms = matrix factorizations SPL = mathematical, declarative specification Space of algorithms generated using breakdown rules 12 adds 4 mults 4 adds 1 mult (when multiplied with input vector x ) Identity Permutation Kronecker product Fourier transform 11

Carnegie Mellon Examples of Breakdown Rules DFT Cooley-Tukey DCT “Cooley-Tukey” 12  “Teach” Spiral domain knowledge of algorithms. Never obsolete.  Each rule leads to a library  “Teach” Spiral domain knowledge of algorithms. Never obsolete.  Each rule leads to a library

Carnegie Mellon I. Background II. Library Generation III. Experimental Results IV. Conclusions and Future Work 13

Carnegie Mellon How Library Generation Works Library Structure High-performance library Transforms + Breakdown rules recursion step closure as Σ-SPL formulas Library Target (FFTW, VSIPL, IPP FFT,...) Library Implementation 14

Carnegie Mellon Cooley-Tukey Fast Fourier Transform (FFT) Breakdown Rules to Library Code 2 extra functions needed 15 void dft(int n, cplx X[], cplx Y[]) { k = choose_factor(n); m = n/k; Z = permute(X) for i=0 to k-1 dft_subvec(m, Z, Y, …) for i=0 to n-1 Y[i] = Y[i]*T[i]; for i=0 to m-1 dft_strided(k, Y, Y, …) } DFT Naive implementation k=4

Carnegie Mellon Cooley-Tukey Fast Fourier Transform (FFT) Breakdown Rules to Library Code void dft(int n, cplx X[], cplx Y[]) { k = choose_factor(n); m = n/k; for i=0 to k-1 dft_strided2(m, X, Y, …) for i=0 to m-1 dft_strided3_scaled(k, Y, Y, T, …) } 2 extra functions needed 16 void dft(int n, cplx X[], cplx Y[]) { k = choose_factor(n); m = n/k; Z = permute(X) for i=0 to k-1 dft_subvec(m, Z, Y, …) for i=0 to n-1 Y[i] = Y[i]*T[i]; for i=0 to m-1 dft_strided(k, Y, Y, …) } DFT Optimized implementation Naive implementation 2 extra functions needed How to discover these specialized variants automatically?

Carnegie Mellon Library Structure Input:  Breakdown rules Output:  Recursion step closure  Σ-SPL Implementation of each recursion step Parallelization/Vectorization  Adds additional breakdown rules  Orthogonal to the closure generation Library Structure 17

Carnegie Mellon Computing Recursion Step Closure Input: transform T and a breakdown rule Output: spawned recursion steps + Σ-SPL implementation Algorithm: 1. Apply the breakdown rule 2. Convert to  -SPL 3. Apply loop merging + index simplification rules. 4. Extract recursion steps 5. Repeat until closure is reached 18 Parametrization (not shown) derives the independent parameter set for each recursion step

Carnegie Mellon Recursion Step Closure Examples DFT (scalar) DCT4 (vectorized) 19 4 mutually recursive functions - computed automatically - described using Σ-SPL formulas 17 mutually recursive functions

Carnegie Mellon Base Cases Base cases are called “codelets” in FFTW Why needed:  Closure is converted into mutually recursive functions  Recursion must be terminated  Larger base cases eliminate overhead from recursion How many:  In FFTW 3.2:183 codelets for complex DFT (21 types) 147 codelets for real DFT (18 types)  In our generator: # codelet types · # recursion steps Obtained by using standard Spiral to generate fixed size code... 20

Carnegie Mellon Library Implementation Input:  Recursion step closure  Σ-SPL implementation of each recursion step (base cases + recursions) Output:  High-performance library  Target language: C++, Java, etc. Process:  Build library plan  Perform hot/cold partitioning  Generate target language code Library Implementation High-performance library 21

Carnegie Mellon I. Background II. Library Generation III. Experimental Results IV. Conclusions and Future Work 22

Carnegie Mellon Double Precision Performance: Intel Xeon way vectorization, up to 2 threads Complex DFTReal DFT DCT-2 WHT 23 Generated library FFTW Intel IPP

Carnegie Mellon FIR Filter Performance 2- and 4-way vectorization, up to 2 threads 8-tap wavelet 32-tap filter 32-tap wavelet 8-tap filter 24 Generated library Intel IPP Generated library Generated library Generated library

Carnegie Mellon 2-D Transforms Performance 2- or 4-way vectorization, up to 2 threads 2-D DFT double 2-D DFT single 2-D DCT-2 double 2-D DCT-2 single 25 Generated library FFTW Intel IPP

Carnegie Mellon Customization: Code Size 1 KLOC 13 KLOC 2 KLOC 1.3 KLOC 3 KLOC FFTW: 150 KLOC 26

Carnegie Mellon Backend Customization: Java Complex DFT Real DFT DCT-2 FIR Filter 27 Generated library JTransforms Generated library Portable, but only 50% of scalar C performance

Carnegie Mellon Summary Full automation: Textbook to adaptive library Performance  SIMD  Multicore Customization Industry collaboration  Intel IPP 6.0 will include Spiral generated code 28 Spiral “FFTW” FFT Spiral “FIRW” FIR