Automatic Performance Tuning


1 Automatic Performance Tuning
Jeremy Johnson Dept. of Computer Science Drexel University

2 Outline
Scientific Computation Kernels
  Matrix Multiplication
  Fast Fourier Transform (FFT)
  Integer Multiplication
Automated Performance Tuning (Proc. IEEE, Vol. 93, No. 2, Feb. 2005)
  ATLAS
  FFTW
  SPIRAL
  GMP

3 Matrix Multiplication and the FFT

4 Basic Linear Algebra Subprograms (BLAS)
Level 1: vector-vector, O(n) data, O(n) operations
Level 2: matrix-vector, O(n^2) data, O(n^2) operations
Level 3: matrix-matrix, O(n^2) data, O(n^3) operations = data reuse = locality!
LAPACK is built on top of BLAS (level 3)
Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms
GEMM: General Matrix Multiplication
  SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
  C := alpha*op( A )*op( B ) + beta*C, where op( X ) = X or X'

5 DGEMM
      …
*
*           Form  C := alpha*A*B + beta*C.
*
            DO 90, J = 1, N
               IF( BETA.EQ.ZERO )THEN
                  DO 50, I = 1, M
                     C( I, J ) = ZERO
   50             CONTINUE
               ELSE IF( BETA.NE.ONE )THEN
                  DO 60, I = 1, M
                     C( I, J ) = BETA*C( I, J )
   60             CONTINUE
               END IF
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE
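
For readers less familiar with Fortran, the same loop nest can be sketched in Python. This is only a transliteration of the no-transpose case on this slide, assuming matrices stored as lists of rows; the name `dgemm_nn` is made up for illustration and is not part of any BLAS interface.

```python
def dgemm_nn(alpha, A, B, beta, C):
    """Update C in place with alpha*A*B + beta*C (no transposes).

    A is M x K, B is K x N, C is M x N, each stored as a list of rows.
    The loop order (j, l, i) mirrors the Fortran code on the slide.
    """
    M, K, N = len(A), len(B), len(B[0])
    for j in range(N):
        # First scale column j of C by beta.
        if beta == 0:
            for i in range(M):
                C[i][j] = 0
        elif beta != 1:
            for i in range(M):
                C[i][j] = beta * C[i][j]
        # Then accumulate alpha * A[:, l] * B[l, j] into column j of C.
        for l in range(K):
            if B[l][j] != 0:
                temp = alpha * B[l][j]
                for i in range(M):
                    C[i][j] = C[i][j] + temp * A[i][l]
    return C
```

The innermost loop walks down a column of A and C, which is the contiguous direction in Fortran's column-major layout; with Python lists of rows that locality argument does not carry over, so the sketch shows the arithmetic only.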

6 Matrix Multiplication Performance

7 Matrix Multiplication Performance

8 Numerical Recipes
Numerical Recipes in C: The Art of Scientific Computing, 2nd Ed. William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery. Cambridge University Press, 1992.
“This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.”
  1. Preliminaries
  2. Solutions of Linear Algebraic Equations
  12. Fast Fourier Transform
  19. Partial Differential Equations
  20. Less Numerical Algorithms

9 four1

10 four1 (cont)

11 FFT Performance

12 ATLAS Architecture and Search Parameters
NB: L1 data cache tile size
NCNB: L1 data cache tile size for non-copying version
MU, NU: register tile size
KU: unroll factor for k' loop
LS: latency for computation scheduling
FMA: 1 if fused multiply-add is available, 0 otherwise
FF, IF, NF: scheduling of loads
Source: Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

13 ATLAS Code Generation Optimization for locality
Cache tiling, Register tiling
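
Cache tiling can be sketched as follows. This is a minimal illustration, assuming square matrices stored as lists of rows and a tile size `nb` playing the role of ATLAS's NB parameter; the function name `matmul_tiled` is hypothetical and the code is not what ATLAS generates.

```python
def matmul_tiled(A, B, nb):
    """Return C = A*B for n x n matrices, computed tile by tile.

    The three outer loops walk over nb x nb tiles; the three inner loops
    multiply one tile of A by one tile of B, so each tile is reused many
    times while it is (ideally) resident in the L1 data cache.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # Mini matrix multiply on the current tiles.
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The `min(...)` bounds handle matrix sizes that are not a multiple of `nb`, which is the job of the cleanup codes mentioned in the ATLAS search.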

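14 ATLAS Code Generation
Register tiling
  MU + NU + MU×NU ≤ NR
Loop unrolling
Scalar replacement
Add/mul interleaving
Loop skewing
Register-tile update: C(i'',j'') = C(i'',j'') + A(i'',k'')*B(k'',j'')
[Figure: MU × NU register tile of C with the corresponding parts of A and B (dimension labels MU, NU, K, NB); skewed schedule mul_1, mul_2, ..., mul_Ls, add_1, mul_Ls+1, add_2, ..., mul_MU×NU, add_MU×NU-Ls+2, ..., add_MU×NU]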
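
A small sketch of register tiling with scalar replacement, assuming a fixed 2 × 2 register tile (MU = NU = 2) and an even tile size `nb`. The helper names `register_tile_fits` and `mini_mmm_2x2` are made up for illustration; Python locals merely stand in for machine registers.

```python
def register_tile_fits(mu, nu, nr):
    """Check the slide's constraint MU + NU + MU*NU <= NR."""
    return mu + nu + mu * nu <= nr


def mini_mmm_2x2(A, B, C, nb):
    """C += A*B on nb x nb tiles using a 2 x 2 register tile (nb even)."""
    for j in range(0, nb, 2):
        for i in range(0, nb, 2):
            # Scalar replacement: keep the 2 x 2 block of C in local scalars.
            c00, c01 = C[i][j], C[i][j + 1]
            c10, c11 = C[i + 1][j], C[i + 1][j + 1]
            for k in range(nb):
                # MU + NU = 4 operand loads feed MU*NU = 4 multiply-adds.
                a0, a1 = A[i][k], A[i + 1][k]
                b0, b1 = B[k][j], B[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            # Write the register tile back once per (i, j) block.
            C[i][j], C[i][j + 1] = c00, c01
            C[i + 1][j], C[i + 1][j + 1] = c10, c11
    return C
```

With MU = NU = 2 the tile needs 2 + 2 + 4 = 8 values live at once, so it fits when NR is at least 8.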

15 ATLAS Search
Estimate machine parameters (C1, NR, FMA, LS)
  Used to bound the search
Orthogonal line search (fix all parameters except one and search for the optimal value of this parameter)
Search order
  NB
  MU, NU
  KU
  LS
  FF, IF, NF
  NCNB
  Cleanup codes
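
The orthogonal line search can be sketched as below. This is an illustration under stated assumptions: `cost` is any caller-supplied function of a parameter dictionary (in ATLAS it would be a measured runtime of the generated code), and the function name, candidate lists, and starting point are hypothetical.

```python
def orthogonal_line_search(cost, start, candidates, order):
    """Optimize one parameter at a time, holding all the others fixed.

    cost       -- function mapping a dict of parameters to a number (lower is better)
    start      -- dict with an initial value for every parameter
    candidates -- dict mapping each parameter name to the values to try
    order      -- parameter names in the order they are searched
    """
    best = dict(start)
    for name in order:
        best_value, best_cost = best[name], cost(best)
        for value in candidates[name]:
            trial = dict(best)
            trial[name] = value
            c = cost(trial)
            if c < best_cost:
                best_value, best_cost = value, c
        # Freeze this parameter before moving on to the next one.
        best[name] = best_value
    return best
```

The search evaluates roughly the sum of the candidate-list lengths rather than their product, at the price of possibly missing interactions between parameters.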

16 Using FFTW

17 FFTW Infrastructure
Use dynamic programming to find an efficient way to combine code sequences.
Combine code sequences using the divide-and-conquer structure of the FFT.
Codelets: optimized code sequences for small FFTs.
Plan: encodes the divide-and-conquer strategy and stores the “twiddle factors”.
Executor: computes the FFT of given data using the algorithm described by the plan.
[Figure: right-recursive plan tree with nodes 15 → (3, 12), 12 → (4, 8), 8 → (3, 5)]
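
The dynamic-programming idea can be sketched for right-recursive plans of size 2^k. This is not FFTW's planner: the cost functions `leaf_cost` and `combine_cost` are hypothetical stand-ins for the runtimes FFTW would measure, and the name `plan_fft` is made up for illustration.

```python
from functools import lru_cache


def plan_fft(k, leaf_cost, combine_cost, max_leaf):
    """Return (cost, tree) for an FFT of size 2**k.

    A tree is either an int e (a codelet for size 2**e, allowed when
    e <= max_leaf) or a pair (left, right_tree) meaning: split 2**e into a
    codelet of size 2**left and a recursively planned FFT of size 2**(e - left).
    """
    @lru_cache(maxsize=None)
    def best(e):
        options = []
        if e <= max_leaf:
            options.append((leaf_cost(e), e))
        for left in range(1, min(max_leaf, e - 1) + 1):
            right_cost, right_tree = best(e - left)
            # 2**(e-left) codelet calls, 2**left sub-FFTs, plus twiddle work.
            total = (2 ** (e - left)) * leaf_cost(left) \
                + (2 ** left) * right_cost \
                + combine_cost(left, e - left)
            options.append((total, (left, right_tree)))
        return min(options, key=lambda option: option[0])

    return best(k)
```

Memoizing `best` is what makes this dynamic programming: the best plan for each smaller exponent is computed once and reused by every larger size that splits down to it.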

18 SPIRAL system
DSP transform (user specifies)
Formula Generator (Mathematician): algorithm generation
  fast algorithm as SPL formula
SPL Compiler (Expert Programmer): implementation options
  C/Fortran/SIMD code
Search Engine: controls algorithm generation and implementation options, using the runtime on the given platform
User goes for a coffee (or an espresso for small transforms)
Platform-adapted implementation comes back

19 DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4)
[Equation: the size-4 Fourier transform written as a product of factors labeled Kronecker product, identity, diagonal matrix (twiddles), and permutation]
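
The equation on this slide did not survive in the transcript. As a hedged reconstruction, the standard size-4 Cooley-Tukey factorization in Kronecker-product notation is given below, where ω_4 denotes a primitive 4th root of unity (i or -i depending on the sign convention, which the transcript does not state).

```latex
\[
\mathrm{DFT}_4
= (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2
\]
\[
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & \omega_4 & \omega_4^2 & \omega_4^3 \\
1 & \omega_4^2 & 1 & \omega_4^2 \\
1 & \omega_4^3 & \omega_4^2 & \omega_4
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
1 & 0 & -1 & 0 \\
0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
1 & & & \\
 & 1 & & \\
 & & 1 & \\
 & & & \omega_4
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]
```

Reading right to left: the permutation L^4_2 reorders the input to (x0, x2, x1, x3), I_2 ⊗ DFT_2 applies two butterflies, the diagonal matrix T^4_2 applies the twiddle factors, and DFT_2 ⊗ I_2 applies two more butterflies at stride 2.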

20 Cooley-Tukey Factorization
[Equation labels: Fourier transform, identity, permutation, diagonal matrix (twiddles), Kronecker product]
Algorithms reduce arithmetic cost: O(N^2) → O(N log N)
Product of structured sparse matrices
Mathematical notation exhibits structure
Introduces degrees of freedom (different breakdown strategies) which can be optimized
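
The degrees of freedom can be made concrete with a small sketch that evaluates a DFT of size 2^e according to a given breakdown tree. Assumptions: the forward sign convention ω_n = exp(-2πi/n), trees encoded as nested pairs of exponents, and a direct O(n^2) DFT at the leaves. The names `exponent` and `dft_by_tree` are hypothetical; SPIRAL itself generates C/Fortran/SIMD code rather than interpreting trees.

```python
import cmath


def exponent(tree):
    """Total exponent of a tree: an int leaf e, or a pair (left, right)."""
    if isinstance(tree, int):
        return tree
    return exponent(tree[0]) + exponent(tree[1])


def dft_by_tree(x, tree):
    """DFT of x (length 2**exponent(tree)) using the breakdown given by tree."""
    n = len(x)
    if n != 2 ** exponent(tree):
        raise ValueError("input length does not match the tree")
    w = cmath.exp(-2j * cmath.pi / n)
    if isinstance(tree, int):
        # Leaf: direct DFT by definition.
        return [sum(x[j] * w ** (j * k) for j in range(n)) for k in range(n)]
    left, right = tree
    m = 2 ** exponent(left)
    r = n // m
    # Stride permutation, then m inner DFTs of size r (right subtree).
    inner = [dft_by_tree(x[a::m], right) for a in range(m)]
    # Twiddle factors (the diagonal matrix).
    for a in range(m):
        for k2 in range(r):
            inner[a][k2] *= w ** (a * k2)
    # r outer DFTs of size m at stride r (left subtree).
    out = [0j] * n
    for k2 in range(r):
        column = dft_by_tree([inner[a][k2] for a in range(m)], left)
        for k1 in range(m):
            out[k1 * r + k2] = column[k1]
    return out
```

For example, `dft_by_tree(x, (2, 2))` and `dft_by_tree(x, (1, (1, 2)))` compute the same size-16 transform with different breakdowns; which one runs fastest depends on the platform, and that is the choice the search optimizes.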

21 Algorithms = Ruletrees = Formulas

22 Generated DFT Vector Code: Pentium 4, SSE
[Figure: (pseudo) Gflop/s versus n for DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0; includes hand-tuned vendor assembly code for comparison]
Speedups (relative to C code) up to a factor of 3.1

23 Best DFT Trees, size 2^10 = 1024
[Figure: best trees (root exponent 10) for scalar C, vectorized C ("C vect"), and SIMD code on Pentium 4 float, Pentium 4 double, Pentium III float, and AthlonXP float]
Trees are platform/datatype dependent

24 Crosstiming of best trees on Pentium 4
[Figure: slowdown factor w.r.t. best versus n for DFT 2^n, single precision; runtime of the best trees found on other platforms]
Software adaptation is necessary

25 GMP Integer Multiplication
Polyalgorithm
  Classical O(n^2) algorithm
  Karatsuba
  Toom-Cook
  Schönhage-Strassen (FFT)
Algorithm thresholds
  Thresholds determine when one algorithm switches to the next
  Tune-up program empirically sets threshold parameters for the given platform
  FFT 5888
  Toom3 117
  Karatsuba 34
  ...
*Default AMD Athlon X2 2.8GHz / 32-bit Linux
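
A threshold-driven polyalgorithm can be sketched as follows. Assumptions: the slide's thresholds are operand sizes in limbs, the names `THRESHOLDS`, `choose_algorithm`, and `karatsuba` are made up for illustration, and Python's built-in integer product stands in for the classical algorithm below the cutoff. This is not GMP's code.

```python
# Thresholds from the slide, largest first (assumed to be sizes in limbs).
THRESHOLDS = [("fft", 5888), ("toom3", 117), ("karatsuba", 34)]


def choose_algorithm(limbs):
    """Pick the multiplication algorithm for operands of the given size."""
    for name, threshold in THRESHOLDS:
        if limbs >= threshold:
            return name
    return "classical"


def karatsuba(a, b, threshold_bits=34 * 32):
    """Multiply nonnegative ints, switching algorithms at a size threshold.

    The default cutoff assumes 34 limbs of 32 bits each.
    """
    if threshold_bits < 8:
        raise ValueError("threshold_bits must be at least 8")
    if a.bit_length() < threshold_bits or b.bit_length() < threshold_bits:
        return a * b  # stand-in for the classical O(n^2) algorithm
    half = max(a.bit_length(), b.bit_length()) // 2
    mask = (1 << half) - 1
    a1, a0 = a >> half, a & mask
    b1, b0 = b >> half, b & mask
    # Three recursive products instead of four.
    z2 = karatsuba(a1, b1, threshold_bits)
    z0 = karatsuba(a0, b0, threshold_bits)
    z1 = karatsuba(a0 + a1, b0 + b1, threshold_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0
```

A tune-up step would time the competing algorithms over a range of sizes on the target machine and set each threshold at the crossover point, which is why the values differ between platforms.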

