Short Vector SIMD Code Generation for DSP Algorithms

Slides:

Advertisements

Similar presentations

Course Outline Presentation Term: F09 Faculty Name : Asma Sanam Larik Course Name :INTRO TO COMPUTING Course Code : CSE145 Section :1 Semester : 1.

Advertisements

Optimizing Compilers for Modern Architectures Syllabus Allen and Kennedy, Preface Optimizing Compilers for Modern Architectures.

Carnegie Mellon Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension Automatic Generation.

MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms Mancia Anguita Universidad de Granada J. Manuel Martinez – Lechado Vitelcom.

Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property.

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.

Offline Adaptation Using Automatically Generated Heuristics Frédéric de Mesmay, Yevgen Voronenko, and Markus Püschel Department of Electrical and Computer.

The Assembly Language Level

Carnegie Mellon Lessons From Building Spiral The C Of My Dreams Franz Franchetti Carnegie Mellon University Lessons From Building Spiral The C Of My Dreams.

CPSC Compiler Tutorial 9 Review of Compiler.

Compilation Techniques for Multimedia Processors Andreas Krall and Sylvain Lelait Technische Universitat Wien.

COMP3221: Microprocessors and Embedded Systems Lecture 2: Instruction Set Architecture (ISA) Lecturer: Hui Wu Session.

Carnegie Mellon μ-op Fission: Hyper-threading without the Hyper-headache Anthony Cartolano Robert Koutsoyannis Daniel S. McFarlin Carnegie Mellon University.

Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.

Carnegie Mellon Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic Lawrence J. Chang Inpyo Hong Yevgen Voronenko Markus Püschel Department.

Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Semi-Automatic Composition of Data Layout Transformations for Loop Vectorization Shixiong Xu, David Gregg University of Dublin, Trinity College

Source Code Basics. Code For a computer to execute instructions, it needs to be in binary Each instruction is given a number Known as “operation code”

A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and.

Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado U C M.

ITEC 352 Lecture 11 ISA - CPU. ISA (2) Review Questions? HW 2 due on Friday ISA –Machine language –Buses –Memory.

Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus.

SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.

General Computer Science for Engineers CISC 106 Lecture 02 Dr. John Cavazos Computer and Information Sciences 09/03/2010.

SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU)

High Performance Linear Transform Program Generation for the Cell BE

CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.

Exploiting SIMD parallelism with the CGiS compiler framework Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm Saarland University.

Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University.

Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus.

Classifying GPR Machines TypeNumber of Operands Memory Operands Examples Register- Register 30 SPARC, MIPS, etc. Register- Memory 21 Intel 80x86, Motorola.

5-1 Chapter 5 - Languages and the Machine Department of Information Technology, Radford University ITEC 352 Computer Organization Principles of Computer.

Spiral: an empirical search system for program generation and optimization David Padua Department of Computer Science University of Illinois at Urbana-

Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western.

Distributed WHT Algorithms Kang Chen Jeremy Johnson Computer Science Drexel University Franz Franchetti Electrical and Computer Engineering.

1 Optimizing compiler tools and building blocks project Alexander Drozdov, PhD Sergey Novikov, PhD.

5-1 Chapter 5 - Languages and the Machine Principles of Computer Architecture by M. Murdocca and V. Heuring © 1999 M. Murdocca and V. Heuring Principles.

2007/11/2 First French-Japanese PAAP Workshop 1 The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite Daisuke Takahashi Center for Computational.

Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura.

Crosscutting Issues: The Rôle of Compilers Architects must be aware of current compiler technology Compiler Architecture.

Introduction to MMX, XMM, SSE and SSE2 Technology

MICROOCESSORS AND MICROCONTROLLER:

Introdution to SSE or How to put your algorithms on steroids! Christian Kerl

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign.

Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.

Carnegie Mellon Program Generation with Spiral: Beyond Transforms This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury Inc., and.

The Instruction Set Architecture. Hardware – Software boundary Java Program C Program Ada Program Compiler Instruction Set Architecture Microcode Hardware.

1 MOTIVATION AND OBJECTIVE  Discrete Signal Transforms (DSTs) –DFT, DCT: major performance component in many applications –Hardware accelerated but at.

A parallel High Level Trigger benchmark (using multithreading and/or SSE)‏ Håvard Bjerke.

Design of A Custom Vector Operation API Exploiting SIMD Intrinsics within Java Presented by John-Marc Desmarais Authors: Jonathan Parri, John-Marc Desmarais,

Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science.

SSE and SSE2 Jeremy Johnson Timothy A. Chagnon All images from Intel® 64 and IA-32 Architectures Software Developer's Manuals.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.

An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.

In Search of the Optimal WHT Algorithm J. R. Johnson Drexel University Markus Püschel CMU

Addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine.

A LECTURE NOTE. Introduction to Programming languages.

Fang Fang James C. Hoe Markus Püschel Smarahara Misra

COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE

Getting Started with Automatic Compiler Vectorization

Vector Processing => Multimedia

Automatic Performance Tuning

Centar ( Global Signal Processing Expo

EE 445S Real-Time Digital Signal Processing Lab Spring 2014

Samuel Larsen Saman Amarasinghe Laboratory for Computer Science

Presentation transcript:

Short Vector SIMD Code Generation for DSP Algorithms Franz Franchetti Christoph Ueberhuber Applied and Numerical Mathematics Technical University of Vienna Austria Markus Püschel José Moura Electrical and Computer Engineering Carnegie Mellon University http://www.ece.cmu.edu/~spiral http://www.math.tuwien.ac.at/~aurora

Sponsors Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

Outline Short vector extensions Digital signal processing (DSP) transforms SPIRAL Vectorization of SPL formulas Experimental results

SIMD Short Vector Extensions vector length = 4 (4-way) + x Extension to instruction set architecture Available on most current architectures Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up Name n-way Precision Processors SSE 4-way float Intel Pentium III and 4, AMD AthlonXP SSE2 2-way double Intel Pentium 4 3DNow! 2-way float AMD K6, K7, AthlonXP AltiVec 4-way float Motorola G4 IPF 2-way Float Intel Itanium, Itanium 2

Problems Requires expert programmers Our Goal: SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization (by compilers) very limited Requires expert programmers Our Goal: Automation for digital signal processing (DSP) transforms

DSP (digital signal processing) transforms sampled signal (a vector) transform (a matrix) Example: Discrete Fourier Transform (DFT) size 4 Fast algorithm = product of structured sparse matrices Represented as formula using few constructs (e.g., ) and primitives (diagonal, permutation) Captures a large class of transforms (DFT, DCT, wavelets, …)

Tensor (Kronecker) Product of Matrices for coarse structure fine structure Examples: identity matrix key construct in many DSP transform algorithms (DFT, WHT, all multidimensional)

SPIRAL: A Library Generator for Platform-Adapted DSP Transform José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Markus Püschel (CMU) Viktor Prasanna (USC) Manuela Veloso (CMU) SPIRAL: A Library Generator for Platform-Adapted DSP Transform www.ece.cmu.edu/~spiral Observation: For a given transform there are maaaany different algorithms (equal in arithmetic cost, differ in data flow) The best algorithm and its implementation is platform-dependent It is not clear what the best algorithm/implementation is SPIRAL: Automatic algorithm generation + Automatic translation into code + Intelligent search for “best” = generated platform-adapted implementation

SPIRAL’s Mathematical Framework Transform parameterized matrix Rule a breakdown strategy product of sparse matrices Formula by recursive application of rules few constructs and primitives can be translated into code Used as mathematical high-level representation of algorithms (SPL = signal processing language)

implementation options SPIRAL system DSP transform (user specified) controls algorithm generation Formula Generator fast algorithm as SPL formula S P I R A L Search Engine Our Goal: extend SPL compiler to generate vector code controls implementation options SPL Compiler C/Fortran/SIMD code runtime on given platform platform-adapted implementation

Generating SIMD Code from SPL Formulas Example: A x y naturally represents vector operation vector length Formulas contain all structural information for vectorization Construct above captures DFT, WHT, all multi-dimensional Pi, Qi permutations Di, Ei diagonals Ai arbitrary formulas ν SIMD vector length (Current) generic construct completely vectorizable:

The Approach Use macro layer as API to hide machine specifics Vector code generation in two steps 1. Symbolic vectorization (formula manipulation) 2. Code generation

Symbolic Vectorization Formula manipulation (automatic using manipulation rules) Pattern matching Manipulate to match vectorizable construct Separate vectorizable parts and scalar parts

Formula Manipulation Normalizing formulas Converting complex to real arithmetic

Vector Code Generation fuse with load/store operations difficult part (easy to loose performance) { Pi, Qi permutations Di, Ei diagonals Ai arbitrary formulas ν SIMD vector length arithmetic vector instructions use standard SPL compiler on Ai replace scalar with vector instructions easy part (due to existing SPL compiler)

Challenge: Data Access Example: Required: Available: Memory Memory Registers Registers Strided load of complex numbers Vector load plus in-register permutations highest performance code requires properly aligned data access permutation support differs between architectures performance differs between permutations (some are good, most very bad) Solution: use formula manipulation to get “good” permutations macro layer API for efficient and machine transparent implementation

Portable High-level API restricted set of short vector operations requires C compiler with „intrinsics“-interface high-level operations Vector arithmetic operations Vector load/store operations Special and arbitrary multi-vector permutations Vector constant handling (declaration, usage) Implemented by C macros Example: Memory Unit-stride load of 4 complex numbers: LOAD_L_8_2(reg1, reg2, *mem) Registers

Portable SIMD API: Details All SIMD extensions supported: gcc 3.0, gcc-vec Intel C++ Compiler, MS VisualC++ with ProcessorPack Various PowerPC compilers (Motorola standard) Examples: Memory Reverse load of 4 real numbers: LOAD_J_4(reg, *mem) Register Memory Reverse load of 4 complex numbers: LOAD_J_4_x_I_2(r1, r2, *mem) Registers

Generated Code Vector parts: portable SIMD API Scalar parts: standard C Pi, Qi, Di, Ei handled by load/store operations Ai handled by vector arithmetics /* Example vector code: DFT_16 */ void DFT_16(vector_float *y, vector_float *x) { vector_float xl0, xl1, xl2; ... LOAD_VECT(xl0, x + 0); LOAD_VECT(xl4, x + 16); f0 = SIMD_SUB(xl0, xl4); LOAD_VECT(xl1, x + 4); LOAD_VECT(xl5, x + 20); f1 = SIMD_SUB(xl1, xl5); yl7 = SIMD_SUB(f1, f4); STORE_L_8_4(yl6, yl7, y + 24); yl2 = SIMD_SUB(f0, f5); yl3 = SIMD_ADD(f1, f4); STORE_L_8_4(yl2, yl3, y + 8); } /* Intel SSE: portable SIMD API Intel C++ Compiler 5.0 */ typedef __m128 vector_float; #define LOAD_VECT(a, b) \ (a) = *(b) #define SIMD_ADD(a, b) \ _mm_add_ps((a), (b)) #define SIMD_SUB(a, b) \ _mm_sub_ps((a), (b)) #define STORE_L_8_4(re, im, out) \ { \ vector_float _sttmp1,_sttmp2; \ _sttmp1 = _mm_unpacklo_ps(re, im); \ _sttmp2 = _mm_unpackhi_ps(re, im); \ _mm_store_ps(out, _sttmp1); \ _mm_store_ps((out) + VLEN, _sttmp2); \ }

Experimental Results our code is generated, found by dynamic programming search different searches for different types of code (scalar, vector) results in (Pseudo) gigaflops (higher = better)

Generated DFT Code: Pentium 4, SSE hand-tuned vendor assembly code (Pseudo) gflops n DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 speedups (to C code) up to factor of 3.1

Generated DFT Code: Pentium 4, SSE2 gflops n DFT 2n double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 speedups (to C code) up to factor of 1.8

Generated DFT Code: Pentium III, SSE gflops n DFT 2n single precision, Pentium III, 1 GHz, using Intel C compiler 6.0 speedups (to C code) up to factor of 2.1

Generated DFT Code: Athlon XP, SSE gflops n DFT 2n single precision, Pentium III, 1 GHz, using Intel C compiler 6.0 speedups (to C code) up to factor of 1.6

Other transforms speedups (to C code) up to factor of 3 WHT 2n Pentium 4, 2.53 GHz, SSE 2-dim DCT 2n x 2n Pentium 4, 2.53 GHz, SSE gflops transform size transform size WHT has only additions very simple transform speedups (to C code) up to factor of 3

Different search strategies Slowdown factor w.r.t. best n DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 standard DP looses up to 25 % performance

Best DFT Trees, size 210 = 1024 trees platform/datatype dependent Pentium 4 float Pentium 4 double Pentium III float AthlonXP float 10 10 10 10 2 8 4 6 scalar 2 8 4 6 2 6 2 2 2 4 2 5 2 2 3 3 2 4 2 2 2 3 2 2 10 10 10 10 C vect 4 6 2 8 6 4 4 6 2 2 4 2 2 5 2 4 2 2 2 4 2 2 2 2 2 3 2 2 2 2 10 10 10 10 8 2 9 1 SIMD 5 5 5 5 1 7 2 7 2 5 2 3 2 3 2 3 2 3 2 5 2 3 2 3 trees platform/datatype dependent

Crosstiming of best trees on Pentium 4 Slowdown factor w.r.t. best n DFT 2n single precision, runtime of best found of other platforms software adaptation is necessary

Summary Automatically generated vectorized DSP code Code is platform-adapted (SPIRAL) We implement “constructs”, not transforms tensor product, permutations, … DFT, WHT, arbitrary multi-dim supported Very competitive performance Ongoing work: port to other SIMD architectures include filters and wavelets http://www.ece.cmu.edu/~spiral http://www.math.tuwien.ac.at/~aurora