Presentation is loading. Please wait.

Presentation is loading. Please wait.

. G.Bilardi. –University of Padua, Dipartimento di Elettronica e informatica. P.D’Alberto and A.Nicolau. –University of California at Irvine, Information.

Similar presentations


Presentation on theme: ". G.Bilardi. –University of Padua, Dipartimento di Elettronica e informatica. P.D’Alberto and A.Nicolau. –University of California at Irvine, Information."— Presentation transcript:

1 . G.Bilardi. –University of Padua, Dipartimento di Elettronica e informatica. P.D’Alberto and A.Nicolau. –University of California at Irvine, Information and Computer Science += * C C0 C1 C2C3 A A0A1 A2A3 B B0B1 B2B3 Fractal Matrix Multiply

2 Talk Organization Motivations –Alias “why matrix multiply is so popular ?’’ Why did we jump into the Project ? Matrix multiply as it is done –How we differ Our Approach (performance related stuff) –How we did it –Experimental results Conclusions

3 Motivations Matrix Multiply as Example –For data reuse every element is used for n multiplications –Space requirements Sizes, layouts Matrix Multiply as Kernel –3-BLAS applications E.g. LU-decomposition

4 Why did we jump into the Project? Matrix Multiply is asymptotically optimal Cache hierarchy oblivious –Alias Cache hierarchy oblivious –n 3 -multiplications and kn 3 -misses (k <=3) –By Hung-Kung We can study safely different algorithms: –Safely: we do not loose optimality –Different algorithms: computation orders

5 Why we jumped in the project ? Cont. Optimal use of caches = optimal performance ? –Not really Performance: –Register allocation, scheduling, layouts, recursion/no recursion, RISC/no RISC architecture, compiler optimizations, ….. etcetera We want performance –MFLOPS

6 Multiplication as it is done 1.Tiling for L1 –Reduction to a single simple common problem –Then L2, L3 …. 2.Register allocation on the simple problem : –Number of registers –No RISC/RISC (Pentium/no Pentium) 3.Scheduling by compiler 4.Feedback and start over again if necessary

7 ATLAS for example: CA B Tiles fixed in size Registers = Tiles Copied in a Contiguous Workspace

8 How we differ from the others? We present –A unique Recursive Algorithm The Decomposition function of the problem size –Recursive Layout (Fractal layout alias Z-Morton) –Register allocation tuned on the number of registers for Register-file-based architecture Automatic generated –Optimization of the index computation and recursion –Scheduling by compiler

9 Our Approach: Fractal Layout (alias Z-Morton) A is near square matrix then A0, A1, A2, A3 are near square matrixes about ¼ the size of A and A0 is the largest. Near square Near square: |row-columns| <= 1 A0 A2A3 A1 A Layout in memory Sequential

10 Our Approach 1.A square problem is decomposed into 8 near square problems of size between and 2.Each sub-problem has the operands stored contiguously –TNX: the recursive layout 3.A sub-problem is decomposed if min(k,j,l) >32 4.Otherwise is solved directly –The operands are in row major format –Optimized at register-file level Reuse of common optimizations

11 Our Approach, cont. The Type DAG A recursion tree for problem has O(8 log n) different types The type determines the index computation for the sub-problems The types and the matrix offsets are determined and stored in a tree-like structure “type DAG’’ Reduction of index computations by 30% –With moderate extra space.

12 Recursive Tree and Type DAG C0+=A0B0 C0+=A1B2 C1+=A1B1 C1+=A0B3 C3+=A3B3 C3+=A2B1 C2+=A2B2 C2+=A3B0

13 Our approach, cont. Register Allocation When the recursion stops: 1.Sub-Problems smaller than are computed directly 2.Sub-Matrix smaller than 32 by 32 are stored in row major 3.Register Allocation 1.Fractal register allocation 2.C-tiling register allocation

14 Register Allocation, Fractal We applied the recursive decomposition at register level –We balance the distribution of registers for each matrix Adv: –Register file is considered as L0 Disadv: –The computation is expressed as straight line code, code explosion

15 Register Allocation, C-tiling No balanced distribution of registers R –s 2 registers for C, s for A and s for B (Use of 2s+s 2 Registers) The C is tiled further in sub-squares s x s and for each of them –s x s square of C tile is loaded in registers 1.s x 1 of A Tile is loaded in registers 2.1 x s of B Tile is loaded in registers 3.Scalar product

16 C-tiling, cont. Adv: more efficient than Fractal, reducing loads+stores Disadv: the register file is considered differently C A B

17 Cache Performance ULTRA5

18 Cache Performance SPARC5

19 MFLOPS Performance Pentium II

20 MFLOPS R5K_ip32 ultra2

21 Conclusions Algorithms exploiting cache hierarchy without taking in account cache parameters Performance is achieved optimizing the recursion: –Carefully pruning –Index computation optimization We used the matrix Multiply: –For LU-decomposition Improving further the performance

22 Thank you


Download ppt ". G.Bilardi. –University of Padua, Dipartimento di Elettronica e informatica. P.D’Alberto and A.Nicolau. –University of California at Irvine, Information."

Similar presentations


Ads by Google