Presentation is loading. Please wait.

Presentation is loading. Please wait.

Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science.

Similar presentations


Presentation on theme: "Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science."— Presentation transcript:

1 Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science Drexel University www.spiral.net

2 LACSI 2006 – Automatic Tuning of Libraries and Applications Motivation On modern machines operation count is not always the most important performance metric. Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code. Automatic Performance Tuning and Architecture Adaptation –Generate and Test –FFT, Matrix Multiplication, … Explain performance distribution

3 LACSI 2006 – Automatic Tuning of Libraries and Applications Outline Space of WHT Algorithms WHT Package and Performance Distribution Performance Model –Instruction Count –Cache

4 LACSI 2006 – Automatic Tuning of Libraries and Applications Walsh-Hadamard Transform y = WHT N x, N = 2 n

5 LACSI 2006 – Automatic Tuning of Libraries and Applications Factoring the WHT Matrix AC  D  C  D  A  A   C) = (A   C I m  n  mn WHT 2  WHT 2  WHT 2      WHT 2 

6 LACSI 2006 – Automatic Tuning of Libraries and Applications Recursive Algorithm (WHT 2  I 4 )(I 2  (WHT 2  I 2 ) (I 2  WHT 2 ))

7 LACSI 2006 – Automatic Tuning of Libraries and Applications Iterative Algorithm (WHT 2  I 4 )(I 2  WHT 2  I 2 ) (I 4  WHT 2 ))

8 LACSI 2006 – Automatic Tuning of Libraries and Applications WHT Algorithms Recursive Iterative General

9 LACSI 2006 – Automatic Tuning of Libraries and Applications WHT Implementation –N = N 1 * N 2 *  *N t N i =2 n i –x = WHT N *x x =(x(b),x(b+s), … x(b+(M-1)s)) Implementation(nested loop) R=N; S=1; for i=t, …,1 R=R/N i for j=0, …,R-1 for k=0, …,S-1 S=S* N i; M b,s t i1 )     nn WHT 2 2 2 II( n 1 + ··· +n i- 1 2 n i+1 + ··· +n t i

10 LACSI 2006 – Automatic Tuning of Libraries and Applications Partition Trees 4 13 12 11 4 1111 Right Recursive Iterative 9 342 121 11 4 13 12 11 Left Recursive 4 22 1111 Balanced

11 LACSI 2006 – Automatic Tuning of Libraries and Applications Number of Algorithms

12 LACSI 2006 – Automatic Tuning of Libraries and Applications Outline WHT Algorithms WHT Package and Performance Distribution Performance Model –Instruction Count –Cache

13 LACSI 2006 – Automatic Tuning of Libraries and Applications WHT Package Püschel & Johnson (ICASSP ’00) Allows easy implementation of any of the possible WHT algorithms Partition tree representation W(n)=small[n] | split[W(n 1 ), … W(n t )] Tools –Measure runtime of any algorithm –Measure hardware events (coupled with PCL/PAPI) –Search for good implementation Dynamic programming Evolutionary algorithm

14 LACSI 2006 – Automatic Tuning of Libraries and Applications Algorithm Comparison

15 LACSI 2006 – Automatic Tuning of Libraries and Applications Cache Miss Data

16 LACSI 2006 – Automatic Tuning of Libraries and Applications Histogram (n = 16, 10,000 samples) Wide range in performance despite equal number of arithmetic operations (n2 n flops) Pentium III vs. UltraSPARC II

17 LACSI 2006 – Automatic Tuning of Libraries and Applications Outline WHT Algorithms WHT Package and Performance Distribution Performance Model –Instruction Count –Cache

18 LACSI 2006 – Automatic Tuning of Libraries and Applications WHT Implementation –N = N 1 * N 2  N t N i =2 n i –x = WHT N *x x =(x(b),x(b+s), … x(b+(M-1)s)) Implementation(nested loop) R=N; S=1; for i=t, …,1 R=R/N i for j=0, …,R-1 for k=0, …,S-1 S=S* N i; M b,s t i1 )     nn WHT 2 2 2 II( n 1 + ··· +n i- 1 2 n i+1 + ··· +n t i

19 LACSI 2006 – Automatic Tuning of Libraries and Applications Instruction Count Model A(n) = number of calls to WHT procedure  = number of instructions outside loops A l (n) = Number of calls to base case of size l  l = number of instructions in base case of size l L i = number of iterations of outer (i=1), middle (i=2), and outer (i=3) loop  i = number of instructions in outer (i=1), middle (i=2), and outer (i=3) loop body

20 LACSI 2006 – Automatic Tuning of Libraries and ApplicationsSmall[1].file"s_1.c".version"01.01" gcc2_compiled.:.text.align 4.globl apply_small1.type apply_small1,@function apply_small1: movl 8(%esp),%edx //load stride S to EDX movl 12(%esp),%eax //load x array's base address to EAX fldl (%eax) // st(0)=R7=x[0] fldl (%eax,%edx,8) //st(0)=R6=x[S] fld %st(1) //st(0)=R5=x[0] fadd %st(1),%st // R5=x[0]+x[S] fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S] fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ????? fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0] fstpl (%eax) //store x[0]=x[0]+x[S] fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S] ret

21 LACSI 2006 – Automatic Tuning of Libraries and Applications Recurrences

22 Histogram using Instruction Model (P3)  l = 12,  l = 34, and  l = 106  = 27  1 = 18,  2 = 18, and  1 = 20

23 LACSI 2006 – Automatic Tuning of Libraries and Applications Cache Model Different WHT algorithms access data in different patterns All algorithms with the same set of leaf nodes have the same number of memory accesses Count misses for accesses to data array –Parameterized by cache size, associativity, and block size –simulate using program traces (restrict to data vector accesses) –Analytic formula?

24 LACSI 2006 – Automatic Tuning of Libraries and Applications Blocked Access 4 1 12 3 0123456789101112131415

25 LACSI 2006 – Automatic Tuning of Libraries and Applications Interleaved Access 4 3 21 1 0123456789101112131415

26 LACSI 2006 – Automatic Tuning of Libraries and Applications Cache Simulator 4 1 12 3 4 3 21 1 144 memory accesses C = 4, A = 1, B = 1 (80, 112) C = 4, A = 4, B = 1 (48, 48) C = 4, A = 1, B = 2 (72, 88) Iterative vs. Recursive (192 memory accesses) C = 4, A = 1, B = 1 (128, 112)

27 LACSI 2006 – Automatic Tuning of Libraries and Applications Cache Misses as a Function of Cache Size C=2 2 C=2 3 C=2 4 C=2 5

28 LACSI 2006 – Automatic Tuning of Libraries and Applications Formula for Cache Misses M(L,W N,R) = Number of misses for  L  WHT N   R 

29 LACSI 2006 – Automatic Tuning of Libraries and Applications Closed Form M(L,W N,R) = Number of misses for  L  WHT N   R  M(0,W_n,0) = 3(n-c)*2 n + k*2 n C = 2 c, k = number of parts in the rightmost c positions c = 3, n = 4 4 1111 Iterative 4 13 12 11 Right Recursive 4 22 1111 Balanced k = 3 k = 2 k = 1

30 LACSI 2006 – Automatic Tuning of Libraries and Applications Summary of Results and Future Work Instruction Count Model –min, max, expected value, variance, limiting distribution Cache Model –Direct mapped (closed form solution, distribution, expected value, and variance) Combine models Extend cache formula to include A and B Use as heuristic to limit search and predict performance

31 LACSI 2006 – Automatic Tuning of Libraries and Applications Sponsors Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting, DESA: Intelligent HW-SW Compilers for Signal Processing Applications, and NSF ITR/NGS #0325687: Intelligent HW/SW Compilers for DSP.


Download ppt "Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science."

Similar presentations


Ads by Google