Carnegie Mellon Spiral: Automatic Generation of Industry Strength Performance Libraries. Franz Franchetti, Carnegie Mellon University. www.ece.cmu.edu/~franzf

Presentation transcript:

Carnegie Mellon Spiral: Automatic Generation of Industry Strength Performance Libraries. Franz Franchetti, Carnegie Mellon University; CTO and Co-Founder, SpiralGen. This work was supported by the DARPA DESA program, NSF, ONR, Mercury Inc., Intel, and Nvidia.

Carnegie Mellon The Future is Parallel and Heterogeneous. Before 2000: conventional CPU platforms. Multicore era: Core2 Duo, Core2 Extreme, Cell BE (8+1 cores), Sun Niagara (32 threads), IBM Cyclops64 (80 cores), Tilera TILEPro (64 cores), IBM POWER7 (2x8 cores), Intel Sandy Bridge (8-way float vectors). Accelerators and heterogeneous systems: Nvidia GPUs (240 streaming cores), Nvidia Fermi, Intel Larrabee, AMD Fusion, BlueGene/Q, Intel Haswell (vector coprocessors), ClearSpeed (192 cores), Virtex 5 FPGA + 4 CPUs, SGI RASC (Itanium + FPGA), Xtreme DATA (Opteron + FPGA). Open questions: Programmability? Performance portability? Rapid prototyping?

Carnegie Mellon Spiral: Computer Writes Best SAR Code. Programming [1]: special hardware (Cell blade). Synthesis [2]: COTS Intel. Result: same performance, 1/10th the human effort, non-expert user. Key ideas: restrict the domain, use mathematics, program synthesis. [1] Rudin, J.: Implementation of Polar Format SAR Image Formation on the IBM Cell Broadband Engine. Proceedings of High Performance Embedded Computing (HPEC). Best Paper Award. [2] D. McFarlin, F. Franchetti, M. Püschel, and J. M. F. Moura: High Performance Synthetic Aperture Radar Image Formation on Commodity Multicore Architectures. Proceedings of SPIE.

Carnegie Mellon What is Spiral? Traditionally: a high-performance library optimized for a given platform is hand-written by experts. Spiral approach: Spiral generates the high-performance library optimized for the given platform, with comparable performance.

Carnegie Mellon Spiral’s Domain-Specific Program Synthesis. Model: common abstraction = spaces of matching formulas. Algorithm space: the kernel (problem size, algorithm choice). Architecture space: architectural parameters (vector length, #processors, ...). Rewriting defines the spaces of formulas; search picks the optimized implementation.

Carnegie Mellon Related Work. Compiler-based autotuning: polyhedral frameworks (IBM XL, Pluto, CHiLL); transformation prescription (CHiLL, POET); profile-guided optimization (Intel C, IBM XL). Synthesis from domain math: Spiral (signal and image processing, SDR); Tensor Contraction Engine (quantum chemistry code synthesizer); FLAME (numerical linear algebra, LAPACK). Autotuning numerical libraries: ATLAS (BLAS generator); FFTW (kernel generator); vendor math libraries (code generation scripts).

Carnegie Mellon Organization. Spiral overview; Validation and Verification; Results; Concluding remarks. References: M. Püschel, F. Franchetti, Y. Voronenko: Spiral. Encyclopedia of Parallel Computing, D. A. Padua (editor). Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2).

Carnegie Mellon Spiral’s Origin: Linear Transforms. Transform = matrix-vector multiplication: output vector (signal) = transform matrix x input vector (signal). Example: the discrete Fourier transform (DFT). Fast algorithm = sparse matrix factorization = SPL formula. Example: the Cooley-Tukey FFT algorithm.
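The two equalities on this slide can be made concrete in a short sketch (plain Python for illustration, not Spiral output): the DFT as a dense matrix-vector product, and the radix-2 Cooley-Tukey recursion as the corresponding fast algorithm. Both must agree on any input vector.

```python
import cmath

def dft_matrix(n):
    # Dense transform matrix: entry (k, j) = omega^(k*j), omega = e^(-2*pi*i/n)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * j) for j in range(n)] for k in range(n)]

def matvec(m, x):
    # Transform = matrix-vector multiplication
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def fft(x):
    # Fast algorithm: recursive radix-2 Cooley-Tukey (n must be a power of two)
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
direct = matvec(dft_matrix(8), x)   # O(n^2): the definition
fast = fft(x)                       # O(n log n): the factored algorithm
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, fast))
```

The assertion at the end is exactly the "fast algorithm = sparse factorization of the same matrix" claim, checked numerically.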

Carnegie Mellon Beyond Transforms: General Operators. A transform is a linear operator with one vector input and one vector output. Key ideas: generalize to (possibly nonlinear) operators with several inputs and several outputs; generalize SPL (including the tensor product) to OL (Operator Language); generalize the rewriting systems for parallelization.
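As a toy illustration of this generalization (the representation and names are mine, not Spiral's OL): model an operator as a function from a tuple of input vectors to a tuple of output vectors. A linear transform is then the one-input, one-output special case, while matrix-matrix multiplication is a two-input operator.

```python
# Toy model of the OL view: an operator maps a tuple of input vectors to a
# tuple of output vectors. Illustrative only; not Spiral's actual representation.

def dft2(inputs):
    # Linear transform: one input vector, one output vector (the DFT_2 butterfly)
    (x,) = inputs
    return ([x[0] + x[1], x[0] - x[1]],)

def mmm2(inputs):
    # Two-input operator: 2x2 matrices A and B (row-major), one output C = A*B.
    # Bilinear in (A, B), hence not a linear operator of a single input.
    a, b = inputs
    return ([a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
             a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]],)

(y,) = dft2(([3.0, 1.0],))                   # butterfly: [4.0, 2.0]
(c,) = mmm2(([1, 0, 0, 1], [5, 6, 7, 8]))    # identity * B gives back B
```

The point of the slide is that the same formula language and rewriting machinery can describe both kinds of objects once the single-input restriction is lifted.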

Carnegie Mellon Expressing Kernels as Operator Formulas. Linear transforms: 2D FFT, interpolation. Software defined radio: convolutional encoder, Viterbi decoder. Matrix-matrix multiplication. Synthetic aperture radar (SAR).

Carnegie Mellon One Approach for all Types of Parallelism.
- Multithreading (multicore)
- Vector SIMD (SSE, VMX/Altivec, ...)
- Message passing (clusters, MPP)
- Streaming/multibuffering (Cell)
- Graphics processors (GPUs)
- Gate-level parallelism (FPGA)
- HW/SW partitioning (CPU + FPGA)

Carnegie Mellon Autotuning in Constraint Solution Space. Example: DFT 256 on Intel MIC. The space of candidate implementations is spanned by base cases, breakdown rules, and transformation rules.
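A minimal sketch of what search over such a space looks like (illustrative Python, not Spiral's search): dynamic programming over one breakdown rule, Cooley-Tukey DFT_rs -> (r, s), with made-up base-case and combine costs. A real autotuner measures runtimes; here a fixed cost model keeps the example deterministic.

```python
# Toy autotuner: dynamic programming over a breakdown rule.
# BASE costs (unrolled base-case codelets) and the combine cost model are
# invented for illustration; only the search structure matters.

BASE = {2: 3, 4: 8, 8: 30}

def plan(n, memo={}):
    """Return (cost, choice) for DFT_n: either a base case or a (r, s) split."""
    if n in memo:
        return memo[n]
    best = (BASE[n], ('base', n)) if n in BASE else (float('inf'), None)
    r = 2
    while r * 2 <= n:
        if n % r == 0:
            s = n // r
            cr, _ = plan(r)
            cs, _ = plan(s)
            # DFT_n = (DFT_r x I_s) T (I_r x DFT_s) L:
            # s copies of DFT_r, r copies of DFT_s, plus n twiddle multiplies
            cost = s * cr + r * cs + n
            if cost < best[0]:
                best = (cost, ('split', r, s))
        r *= 2
    memo[n] = best
    return best

cost, choice = plan(16)   # under this model: cost 80 via a (4, 4) split
```

Base cases terminate the recursion, the breakdown rule spans the space, and memoized search picks the cheapest derivation for each size, which is the shape of the constraint-solution-space search the slide refers to.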

Carnegie Mellon Translating a Formula into Code. Input: OL formula. Intermediate representation: Σ-OL, produced by a constraint solver. Output: C code.

Carnegie Mellon Auto-Generation of a Performance Library. Input to Spiral: transform; algorithms; vectorization (2-way SSE); threading (yes). Output: an optimized high-performance library (FFTW-like, MKL-like, IPP-like): 10,000 lines of C++; for general input size (not a collection of fixed sizes); vectorized; multithreaded; with a runtime adaptation mechanism. Performance competitive with hand-written code.

Carnegie Mellon Core Idea: Recursion Step Closure. Input: a transform T and a breakdown rule. Output: problem specifications for recursive functions and codelets. Algorithm: 1. Apply the breakdown rule. 2. Convert to Σ-SPL. 3. Apply loop merging and index simplification rules. 4. Extract the recursion steps. 5. Repeat until closure is reached.

Carnegie Mellon Spiral-Generated Code (Intel MIC/LRBni): void dft64(float *Y, float *X) { __m512 U912, U913, U914, U915, U916, U917, U918, U919, U920, U921, U922, U923, U924, U925,...; a2153 = ((__m512 *) X); s1107 = *(a2153); s1108 = *((a )); t1323 = _mm512_add_ps(s1107,s1108);... U926 = _mm512_swizupconv_r32(_mm512_set_1to16_ps( ),_MM_SWIZ_REG_CDAB); s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi( _mm512_set_1to16_ps( ),0xAAAA,a2154,U926),t1341), _mm512_mask_sub_ps(_mm512_set_1to16_ps( ),0x5555,a2154,U926), _mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB)); U927 = _mm512_swizupconv_r32(_mm512_set_16to16_ps( , ( ), , ( ), , ( ), , ( )),_MM_SWIZ_REG_CDAB);... s1166 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(_mm512_set_16to16_ps( , ( ), , ( ), , ( ), , ( )), 0xAAAA,a2154,U951),t1362), _mm512_mask_sub_ps(_mm512_set_16to16_ps( , ( ), , ( ), , ( ), , ( )),0x5555,a2154,U951), _mm512_swizupconv_r32(t1362,_MM_SWIZ_REG_CDAB));... }

Carnegie Mellon Support For Library-Specific Interfaces.
Complex FFT (the function name encodes the data type):
IPPAPI(IppStatus, ippgDFTFwd_CToC_32fc, (const Ipp32fc *pSrc, Ipp32fc *pDst, int length, int flag))
Walsh-Hadamard Transform (size given as log(size); flag for scaling; explicit work memory):
IPPAPI(IppStatus, ippgWHT_32f, (const Ipp32f *pSrc, Ipp32f *pDst, int order, int flag, Ipp8u *pBuf))
IPPAPI(IppStatus, ippgWHTGetBufferSize_32f, (int order, Ipp32u *pBufferSize))

Carnegie Mellon Industry-Strength Code: Spiral and Intel IPP 6.0. Spiral-generated code ships in Intel’s library IPP. IPP = Intel’s performance primitives, used by 1000s of companies. Generated: 3984 C functions (signal processing) = 1M lines of code. Full parallelism support. Computer-generated code: faster than what was achievable by hand.

Carnegie Mellon Organization. Spiral overview; Validation and Verification; Results; Concluding remarks.

Carnegie Mellon Symbolic Verification. Transform = matrix-vector multiplication: the matrix fully defines the operation. Algorithm = formula: represents a matrix expression and can be evaluated to a matrix. Symbolic verification: evaluate the formula to a matrix and check that it equals the transform matrix.
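A concrete instance of this check (a plain-Python sketch, not Spiral's actual formula evaluator): evaluate the Cooley-Tukey SPL formula for DFT_4, (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L, to a dense matrix and compare it entry by entry against the transform matrix.

```python
import cmath

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    # Tensor (Kronecker) product of two dense matrices
    ra, ca, rb, cb = len(a), len(a[0]), len(b), len(b[0])
    return [[a[i // rb][j // cb] * b[i % rb][j % cb]
             for j in range(ca * cb)] for i in range(ra * rb)]

def diag(d):
    return [[d[i] if i == j else 0 for j in range(len(d))] for i in range(len(d))]

def perm(p):
    # Permutation matrix: row i picks input element p[i]
    return [[1 if j == p[i] else 0 for j in range(len(p))] for i in range(len(p))]

F2 = [[1, 1], [1, -1]]                      # DFT_2
I2 = [[1, 0], [0, 1]]
T4 = diag([1, 1, 1, cmath.exp(-2j * cmath.pi / 4)])   # twiddles: diag(1,1,1,-i)
L42 = perm([0, 2, 1, 3])                    # stride permutation L^4_2

# Formula: (DFT_2 (x) I_2) T_4 (I_2 (x) DFT_2) L^4_2
formula = matmul(kron(F2, I2), matmul(T4, matmul(kron(I2, F2), L42)))

# Transform matrix: DFT_4 by definition
dft4 = [[cmath.exp(-2j * cmath.pi * k * j / 4) for j in range(4)] for k in range(4)]

assert max(abs(formula[i][j] - dft4[i][j])
           for i in range(4) for j in range(4)) < 1e-9
```

Because the formula is just a matrix expression, "evaluate and compare" is an exact correctness proof for a fixed size, which is what makes symbolic verification attractive in this domain.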

Carnegie Mellon Empirical Verification. Run the program on all basis vectors and compare the outputs to the columns of the transform matrix: DFT4([0,1,0,0]) = second column of the DFT_4 matrix? Compare the program output on random vectors to the output of a random implementation of the same kernel: DFT4([0.1,1.77,2.28,-55.3]) =? DFT4_rnd([0.1,1.77,2.28,-55.3]).
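Both checks can be sketched in a few lines (illustrative Python; an ordinary recursive FFT stands in here for the generated kernel under test):

```python
import cmath
import random

def dft_direct(x):
    # Independent reference: the O(n^2) definition of the DFT
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def fft(x):
    # Implementation under test: recursive radix-2 Cooley-Tukey
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

n = 8
# Check 1: feed every basis vector e_j; the output must be column j
# of the transform matrix.
for j in range(n):
    e = [0.0] * n
    e[j] = 1.0
    col = fft(e)
    assert all(abs(col[k] - cmath.exp(-2j * cmath.pi * k * j / n)) < 1e-9
               for k in range(n))

# Check 2: compare against an independent implementation on random inputs.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft_direct(x)))
```

The basis-vector test is exhaustive for a linear kernel of fixed size; the random-vector test is the cheap cross-check used when exhaustive testing is infeasible.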

Carnegie Mellon Verification of the Generator. A rewrite rule replaces its left-hand side by its right-hand side when the preconditions match. Test a rule by evaluating the expressions before and after rule application and comparing the results.

Carnegie Mellon Verification of Autotuning Libraries. For an auto-generated FFTW-like library, a verifier is needed for each function. Verifiers are auto-generated from the specification, and the test harness is auto-generated as well. Drop-in replacement into the existing infrastructure.

Carnegie Mellon Organization. Spiral overview; Validation and Verification; Results; Concluding remarks.

Carnegie Mellon Results: Spiral Outperforms Humans. Benchmarks: FFT on multicore, FFT on FPGA, SAR, SDR.

Carnegie Mellon From Cell Phone To Supercomputer. Global FFT (1D FFT, HPC Challenge) performance [Gflop/s]. Cell phone: Samsung i9100 Galaxy S II, dual-core ARM at 1.2 GHz with NEON ISA; SIMD vectorization + multi-threading. Supercomputer: BlueGene/P at Argonne National Laboratory, 128k cores (quad-core CPUs) at 850 MHz; SIMD vectorization + multi-threading + MPI; 6.4 Tflop/s. G. Almási, B. Dalton, L. L. Hu, F. Franchetti, Y. Liu, A. Sidelnik, T. Spelce, I. G. Tānase, E. Tiotto, Y. Voronenko, X. Xue: 2010 IBM HPC Challenge Class II Submission. Winner of the 2010 HPC Challenge Class II Award (Most Productive System).

Carnegie Mellon Organization. Spiral overview; Validation and Verification; Results; Concluding remarks.

Carnegie Mellon Summary: Spiral in a Nutshell. Application domains: signal processing, matrix algorithms, software defined radio, image formation (SAR). Target machines: CMU research platforms and commercial platforms. Joint abstraction across domains and machines; verification of the generated code.

Carnegie Mellon Acknowledgement. James C. Hoe, Jeremy Johnson, José M. F. Moura, David Padua, Markus Püschel, Volodymyr Arbatov, Paolo D’Alberto, Peter A. Milder, Yevgen Voronenko, Qian Yu, Berkin Akin, Christos Angelopoulos, Srinivas Chellappa, Frédéric de Mesmay, Daniel S. McFarlin, Marek R. Telgarsky. Special thanks to: Randi Rost, Scott Buck (Intel); Jon Greene (Mercury Inc.); Yuanwei Jin (UMES); Gheorghe Almasi, Jose E. Moreira, Jim Sexton (IBM); Saeed Maleki (UIUC); Francois Gygi (LLNL, UC Davis); Kim Yates (LLNL); Kalyan Kumaran (ANL).

Carnegie Mellon More Information: