Published by Clarence Parslow. Modified over 3 years ago.

1
Fractal Matrix Multiply
G. Bilardi
  – University of Padua, Dipartimento di Elettronica e Informatica
P. D’Alberto and A. Nicolau
  – University of California at Irvine, Information and Computer Science
[Figure: C += A * B, with C, A and B each split into four quadrants: C0 C1 over C2 C3, A0 A1 over A2 A3, B0 B1 over B2 B3]

2
Talk Organization
Motivations
  – Alias "why is matrix multiply so popular?"
Why did we jump into the Project?
Matrix multiply as it is done
  – How we differ
Our Approach (performance-related material)
  – How we did it
  – Experimental results
Conclusions

3
Motivations
Matrix multiply as an example
  – For data reuse: every element is used for n multiplications
  – Space requirements: sizes, layouts
Matrix multiply as a kernel
  – Level 3 BLAS applications, e.g. LU decomposition

4
Why did we jump into the Project?
Matrix multiply is asymptotically optimal and cache-hierarchy oblivious
  – n³ multiplications and kn³ misses (k <= 3)
  – By Hong-Kung
We can safely study different algorithms:
  – Safely: we do not lose optimality
  – Different algorithms: different computation orders

5
Why did we jump into the Project? (cont.)
Optimal use of caches = optimal performance?
  – Not really
Performance also depends on:
  – Register allocation, scheduling, layouts, recursion/no recursion, RISC/non-RISC architecture, compiler optimizations, etc.
We want performance
  – MFLOPS

6
Multiplication as it is done
1. Tiling for L1
  – Reduction to a single, simple, common problem
  – Then L2, L3, ...
2. Register allocation on the simple problem:
  – Number of registers
  – Non-RISC/RISC (Pentium/non-Pentium)
3. Scheduling by the compiler
4. Feedback, and start over again if necessary

7
ATLAS, for example:
[Figure: tiles of C, A and B]
Tiles fixed in size
Registers = 4+2+2
Tiles copied into a contiguous workspace

8
How do we differ from the others?
We present
  – A single recursive algorithm
      The decomposition is a function of the problem size
  – Recursive layout (fractal layout, alias Z-Morton)
  – Register allocation tuned to the number of registers, for register-file-based architectures
      Automatically generated
  – Optimization of the index computation and recursion
  – Scheduling by the compiler

9
Our Approach: Fractal Layout (alias Z-Morton)
If A is a near-square matrix, then A0, A1, A2, A3 are near-square matrices about ¼ the size of A, and A0 is the largest.
Near square: |rows - columns| <= 1
[Figure: a matrix A of size 17 split into quadrants A0, A1, A2, A3 with sides 9 and 8; the quadrants are laid out sequentially in memory]
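
The near-square split can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the quadrant numbering (A0 top-left, A1 top-right, A2 bottom-left, A3 bottom-right) and the storage order A0, A1, A2, A3 are assumptions read off the figures.

```python
def split(n):
    """Halve a dimension as (ceil(n/2), floor(n/2)) so the first part is the larger."""
    return (n + 1) // 2, n // 2


def quadrants(rows, cols):
    """Shapes of A0, A1, A2, A3 (assumed order: top-left, top-right, bottom-left, bottom-right)."""
    r0, r1 = split(rows)
    c0, c1 = split(cols)
    return [(r0, c0), (r0, c1), (r1, c0), (r1, c1)]


def fractal_offsets(rows, cols, base=0):
    """Start offset of each quadrant when the four quadrants are stored one after another."""
    offsets = []
    for r, c in quadrants(rows, cols):
        offsets.append(base)
        base += r * c
    return offsets


if __name__ == "__main__":
    print(quadrants(17, 17))        # shapes of the four quadrants of a 17 x 17 matrix
    print(fractal_offsets(17, 17))  # where each quadrant starts in memory
```

Applying the same split to each quadrant in turn gives the full recursive layout; every quadrant of a near-square matrix is again near square, and A0 is never smaller than the others.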

10
Our Approach
1. A square problem is decomposed into 8 near-square problems of about half the size in each dimension
2. Each sub-problem has its operands stored contiguously
  – Thanks to the recursive layout
3. A sub-problem is decomposed further if min(k,j,l) > 32
4. Otherwise it is solved directly
  – The operands are in row-major format
  – Optimized at the register-file level
Reuse of common optimizations

11
Our Approach, cont.: The Type DAG
The recursion tree for a problem has O(8 log n) different types
The type determines the index computation for the sub-problems
The types and the matrix offsets are determined and stored in a tree-like structure, the "type DAG"
Reduction of index computations by 30%
  – With moderate extra space
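
A rough way to see why so few types arise: at every recursion level each dimension can take at most two consecutive values, so there are at most 8 distinct shapes per level. The sketch below assumes that a "type" is simply the shape triple (m, k, n) of a sub-problem; the structure described on the slide also stores matrix offsets, which are omitted here. All names are hypothetical.

```python
def split(n):
    """Near-square halving: (ceil(n/2), floor(n/2))."""
    return (n + 1) // 2, n // 2


def build_type_dag(m, k, n, leaf=32, dag=None):
    """Map each distinct shape (m, k, n) to the shapes of its eight sub-problems."""
    if dag is None:
        dag = {}
    shape = (m, k, n)
    if shape in dag:              # this type was already expanded: share it
        return dag
    if min(shape) <= leaf:        # leaf type: solved directly, no children
        dag[shape] = []
        return dag
    children = [(mm, kk, nn)
                for mm in split(m) for nn in split(n) for kk in split(k)]
    dag[shape] = children
    for child in set(children):
        build_type_dag(*child, leaf=leaf, dag=dag)
    return dag


def tree_size(m, k, n, leaf=32):
    """Number of nodes in the full recursion tree, for comparison with the DAG."""
    if min(m, k, n) <= leaf:
        return 1
    return 1 + sum(tree_size(mm, kk, nn, leaf)
                   for mm in split(m) for nn in split(n) for kk in split(k))


if __name__ == "__main__":
    dag = build_type_dag(1000, 1000, 1000)
    print(len(dag), "types versus", tree_size(1000, 1000, 1000), "tree nodes")
```

Index computations done once per type can then be reused by every tree node of that type, which is the saving the slide refers to.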

12
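Recursive Tree and Type DAG
With the quadrants numbered row by row, as on the title slide, the eight sub-problems are:
C0 += A0·B0, C0 += A1·B2
C1 += A0·B1, C1 += A1·B3
C2 += A2·B0, C2 += A3·B2
C3 += A2·B1, C3 += A3·B3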
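
The eight-way recursion can be sketched as follows. This is an illustration under stated assumptions: quadrants are numbered row by row (0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right), the matrices are ordinary nested Python lists rather than the fractal layout, and the order in which the eight products are visited is arbitrary. The function names and the `leaf` parameter are hypothetical.

```python
LEAF = 32  # the slides recurse only while the smallest dimension exceeds 32


def split(n):
    """Near-square halving: (ceil(n/2), floor(n/2)), larger part first."""
    return (n + 1) // 2, n // 2


def leaf_multiply(C, A, B, c0, a0, b0, dims):
    """Direct C += A*B on sub-blocks; c0, a0, b0 are (row, col) origins."""
    m, k, n = dims
    for i in range(m):
        for j in range(n):
            acc = C[c0[0] + i][c0[1] + j]
            for l in range(k):
                acc += A[a0[0] + i][a0[1] + l] * B[b0[0] + l][b0[1] + j]
            C[c0[0] + i][c0[1] + j] = acc


def fractal_multiply(C, A, B, c0=(0, 0), a0=(0, 0), b0=(0, 0), dims=None, leaf=LEAF):
    """Recursive C += A*B: eight near-square sub-problems per level."""
    if dims is None:
        dims = (len(A), len(B), len(B[0]))
    m, k, n = dims
    if min(m, k, n) <= max(leaf, 1):   # max(..., 1) keeps every half non-empty
        leaf_multiply(C, A, B, c0, a0, b0, dims)
        return C
    (m0, m1), (k0, k1), (n0, n1) = split(m), split(k), split(n)
    for ro, rm in ((0, m0), (m0, m1)):          # top / bottom half of C and A
        for co, cn in ((0, n0), (n0, n1)):      # left / right half of C and B
            for ko, kk in ((0, k0), (k0, k1)):  # the two products per quadrant of C
                fractal_multiply(
                    C, A, B,
                    (c0[0] + ro, c0[1] + co),
                    (a0[0] + ro, a0[1] + ko),
                    (b0[0] + ko, b0[1] + co),
                    (rm, kk, cn), leaf)
    return C
```

For instance, `fractal_multiply(C, A, B, leaf=2)` on a 9 x 8 matrix A and an 8 x 9 matrix B accumulates the product into a 9 x 9 matrix C, recursing until the smallest dimension is at most 2.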

13
Our Approach, cont.: Register Allocation
When the recursion stops:
1. Sub-problems that are not decomposed further are computed directly
2. Sub-matrices smaller than 32 by 32 are stored in row-major order
3. Register allocation
  1. Fractal register allocation
  2. C-tiling register allocation

14
Register Allocation, Fractal
We applied the recursive decomposition at the register level
  – We balance the distribution of registers among the matrices
Advantage:
  – The register file is treated as L0
Disadvantage:
  – The computation is expressed as straight-line code, which leads to code explosion

15
Register Allocation, C-tiling
No balanced distribution of the registers R
  – s² registers for C, s for A and s for B (2s + s² registers in total)
C is tiled further into s x s sub-squares, and for each of them
  – The s x s square of the C tile is loaded into registers
    1. An s x 1 column of the A tile is loaded into registers
    2. A 1 x s row of the B tile is loaded into registers
    3. Scalar product
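
The loop structure of C-tiling can be sketched as below. Python has no registers, so the small lists `c`, `a` and `b` merely stand in for the s² + s + s scalar variables that a generated, unrolled kernel would use; the function names and the handling of ragged edges are assumptions made for this illustration.

```python
def registers_needed(s):
    """Register budget from the slide: s*s for C, s for A, s for B."""
    return s * s + 2 * s


def c_tiled_leaf(C, A, B, s=2):
    """C += A*B on row-major leaf tiles, keeping one s x s block of C 'in registers'."""
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, s):
        for j0 in range(0, n, s):
            h = min(s, m - i0)   # block height (smaller at a ragged edge)
            w = min(s, n - j0)   # block width
            # Load the s x s square of C.
            c = [[C[i0 + i][j0 + j] for j in range(w)] for i in range(h)]
            for l in range(k):
                a = [A[i0 + i][l] for i in range(h)]   # s x 1 column of the A tile
                b = [B[l][j0 + j] for j in range(w)]   # 1 x s row of the B tile
                for i in range(h):
                    for j in range(w):
                        c[i][j] += a[i] * b[j]         # one scalar product per C entry
            # Store the block back only after the whole k loop.
            for i in range(h):
                for j in range(w):
                    C[i0 + i][j0 + j] = c[i][j]
    return C
```

Each entry of C is loaded and stored once per block, while every loaded element of A and B is reused s times, which is where the reduction in loads and stores comes from. With s = 2 the budget is `registers_needed(2) == 8`.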

16
C-tiling, cont.
Advantage: more efficient than Fractal, reducing loads + stores
Disadvantage: the register file is treated differently
[Figure: tiles of C, A and B]

17
Cache Performance ULTRA5

18
Cache Performance SPARC5

19
MFLOPS Performance Pentium II

20
MFLOPS R5K_ip32 ultra2

21
Conclusions
Algorithms can exploit the cache hierarchy without taking cache parameters into account
Performance is achieved by optimizing the recursion:
  – Careful pruning
  – Index computation optimization
We used the matrix multiply:
  – For LU decomposition
Improving the performance further

22
Thank you
