
1 High Performance Dense Linear Algebra on Spatially Distributed Processors. Jeffrey Diamond and Behnam Robatmili; Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger. Department of Computer Science, University of Texas at Austin. *Texas Advanced Computing Center, University of Texas at Austin.

2 Trends in Chip Level Parallelism
- Emerging architectures are more fine grained: on-chip networks, precise control over communication, tight orchestration of computation across ALUs
- Algorithmic insight comes from the most fine-grained case
(Figure: a coarse-grained to fine-grained spectrum, with Quad Core (MIMD), Cell, Tilera, and TRIPS (SDU) as example points.)

3 Parallel Programming Paradigms
- Programming occurs at many levels
- Trend is towards an optimized library model: special low-level APIs for high performance
- We're interested in these low-level APIs
(Figure: a stack from high-level APIs (Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.) through dynamic run times / compilation and classic multithreading down to high-performance, low-level libraries.)

4 Case Study: Matrix Multiply
- Implementing full-scale DGEMM
- High-performance dense linear algebra libraries (Level 3 BLAS) are layered on top of high-performance matrix multiply kernels:
  - SYMM, SYRK, TRSM, TRMM, etc.
  - Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
  - Control theory: Sylvester equation, Lyapunov equation, and many, many others...
- The regular operation is very amenable to algorithmic transformations and easy to reason about (see the sketch below)
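As a concrete reference point (an addition, not from the slides): all of the routines above funnel most of their floating-point work through one general matrix-matrix multiply interface. A minimal sketch of calling it, assuming a CBLAS-style C binding is installed:

```c
#include <cblas.h>   /* assumes a CBLAS-style binding is available */

/* C = 1.0*A*B + 1.0*C for column-major double-precision matrices:
 * A is m x k, B is k x n, C is m x n.  Level-3 BLAS and LAPACK
 * routines spend most of their flops inside calls like this one. */
void dgemm_example(int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, m,    /* alpha, A, leading dimension of A */
                     B, k,    /* B, leading dimension of B        */
                1.0, C, m);   /* beta,  C, leading dimension of C */
}
```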

5 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

6 Spatially Distributed Uniprocessors (SDUs)
- Single-threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth...
  - Solution: SDU - partitioned register banks, functional units, ...
- Still executing a single thread across multiple ALUs
- Where an instruction executes matters: the program statically determines the location of instructions
  - Examples include advanced VLIW processors in the embedded market
- TRIPS partitions most aspects of a single core into tiles:
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed ALUs, registers, data ports
  - Enormous aggregate bandwidth to registers and data, but...
  - Communication between ALUs must go through the network

7 TRIPS - a modern SDU

8 (Chip diagram: Core 1 and Core 2 with a shared L2.)

9 TRIPS - a modern SDU. (Diagram labels: register banks, L1 banks, L2 banks, grid of ALUs.)

10 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

11 Implementing Matrix Multiply
- Outer level: Goto streaming algorithm
  - The heart of the GotoBLAS linear algebra libraries
  - Licensed by many of the top computer vendors
  - Used by many supercomputers in the top-500 list
- Mid level: enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
- Inner kernel: novel algorithm suited to SDUs

12 Goto Streaming Algorithm
- Classical blocking algorithm (C += AB): break the matrices into square blocks just big enough for blocks of A, B and C to fit in the L1 cache
- Goto: the L2 cache is actually fast enough to access directly from the inner kernel
- Instead of small, square matrix blocks, use huge block-panel multiplies
  - Traversal order chosen to maximize reuse
  - Stream full-sized panels of B and C directly out of DRAM
(A loop-structure sketch follows below.)
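A plain-C sketch of the loop structure this implies (an illustration only: the block sizes MC, KC, NR are hypothetical placeholders, the real GotoBLAS code packs A and B into contiguous buffers, and the innermost loops here stand in for a hand-tuned assembly kernel):

```c
enum { MC = 128, KC = 512, NR = 16 };  /* hypothetical block sizes */

/* C += A*B for row-major matrices: A is m x k, B is k x n, C is m x n.
 * The A block selected by (ic, pc) is meant to stay resident in L2,
 * each NR-column slice of B in L1, while panels of C stream from memory. */
void gemm_goto_sketch(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    for (int pc = 0; pc < k; pc += KC)            /* k-dimension panels */
        for (int ic = 0; ic < m; ic += MC)        /* A block -> L2      */
            for (int jc = 0; jc < n; jc += NR)    /* B slice -> L1      */
                /* inner kernel: one block-panel multiply */
                for (int i = ic; i < ic + MC && i < m; i++)
                    for (int j = jc; j < jc + NR && j < n; j++) {
                        double cij = C[i*n + j];
                        for (int p = pc; p < pc + KC && p < k; p++)
                            cij += A[i*k + p] * B[p*n + j];
                        C[i*n + j] = cij;
                    }
}
```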

13 Goto: High Level Blocking. (Figure: the original problem C += A B, with dimensions in the thousands, is partitioned into block-panel multiplies C' += A' B' with panel widths in the hundreds; A' is kept in L2, panel slices of B' pass through DRAM/L1, and C' passes through DRAM/registers.)

14 Enhancing the Goto Algorithm
- 128 registers can hold non-trivially sized blocks
- The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
  - Additionally store blocks of A in registers
  - Bring in elements of A and B simultaneously to maximize bandwidth
  - Maximize use of both horizontal and vertical network links
- But to amortize the use of the elements of A held in registers, we need to add another level of low-level blocking to the hierarchy

15 15 B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’  a’ block and c mini panel held in registers  4x4 a’ amortized over 4x16 b’ Careful ordering of data movement preserves computational properties of larger block-panel multiply  B slice stays in L1 for a LONG time, A stays even longer A’C’B’ (L2) (L1) (DRAM) 16 4 444 += Hundreds Low Level Blocking Scheme

16 How do we traverse? (Figure: C' += A' x B', with block dimensions annotated as 128, 512, and 16; the B' slice fits in the L1 cache, the A' block fits in the L2 cache, and C' streams from DRAM.) Load the c' and a' blocks into registers.

17 17 A’ C’ B’ 128 512 128 16 512 X Stream b’(4x16) from L1 & multiply by a’(4x4) (Reuse a’ four times!) += B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM How do we traverse? 4 4

18 18 A’ C’ B’ 128 512 128 512 X += B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM How do we traverse? 16 4 4

19 19 A’ C’ B’ 128 512 128 512 X += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM 16 4 4

20 20 A’ C’ B’ 128 512 128 512 X += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM 16 4 4

21 21 A’ C’ B’ 128 512 128 16 512 X Reuse register c’, next a’ right, next b’ below: += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

22 22 A’ C’ B’ 128 512 128 16 512 X Repeat until at bottom of B slice, right of A row += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

23 23 A’ C’ B’ 128 512 128 16 512 X Save c’s, load next row of a’ and c’, reuse entire B’ slice’ += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

24 24 A’ C’ B’ 128 512 128 16 512 X Repeat process over slice of B’ += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

25 25 A’ C’ B’ 128 512 128 16 512 X Continue over entire block of A’ and C’ += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

26 26 C’ B’ A’ C’ B’ 128 512 128 16 X Fetch next slice of B’ and move into next slice of C’ += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

27 27 A’ C’ B’ 128 512 128 16 X Complete B’, C’ Panels, load next A’ and repeat… C’ B B += How do we traverse? B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

28 Defined Inner Kernel. (Figure: the full blocking hierarchy. The original problem C += A B, with dimensions in the thousands, is split into block-panel multiplies C' += A' B' with panel widths in the hundreds (A' in L2, B' slices through DRAM/L1, C' through DRAM/registers), and at the lowest level into the mini block-panel multiply c' (4x16) += a' (4x4) x b' (4x16) held in registers and L1.)

29 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

30 Optimizing the Inner Kernel
- Developed several optimization principles; first to apply these principles to TRIPS
- Avoiding network contention is critical!
  - A single overscheduled link can cut performance in half
  - Avoided by datapath routing, direction-oriented computation (DOC), register mirroring, and data interleaving; this gave a 5x jump in instructions per cycle, exceeding 10 IPC
- Load-balance every resource in the system (see the bound sketched below)
  - In a loop, total performance is limited by the most heavily used wire link or execution slot
  - The loop body is scaled to match register and data usage and to minimize architectural overheads
- The result is the "fragility" of optimization typical of spatial architectures with shared resources
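One way to state the load-balancing principle (an added formulation, not from the slides): if one iteration of the inner loop issues W_r operations on resource r (a particular network link, ALU slot, register port, or cache port) and that resource can service C_r operations per cycle, then

    cycles per iteration >= max over r of (W_r / C_r)

so the single most oversubscribed resource sets the performance ceiling no matter how well everything else is scheduled; balancing means driving all of the W_r / C_r ratios toward the same value.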

31 Simplified Schedule. (Figure: a 4x4 grid of ALUs flanked by register tiles R0-R3, data tiles D0-D3, and a global tile GT.) Step 1: read A from the register files. Step 2: load B and broadcast it across the rows. Step 3: do the multiplies, then add across the columns. Step 4: write the results back to C.

32 What are the complications?
- Every register use must be retrieved across the network
- Every load and store needs to get an address
- Need to interleave prefetching, writing, updating pointers, and counters
- Need to account for data movement instructions

33 Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

34 Comparison of FPC across major processors. (Chart: kernel FPC and DGEMM FPC for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS; y-axis 0-7 flops per cycle.*)
- Execution bottlenecks: integer/network ops vs FLOPS; single operand per cycle
- Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth
* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), 2008.

35 Performance vs Matrix Size. (Chart: DGEMM flops per cycle versus matrix size from 0 to 4096, in steps of 512, for the C kernel with the Goto blocking and the C kernel without it; y-axis 0-6 FPC.)

36 Role of the Compiler
- The hand-optimized kernel has 8x the performance of code from the TRIPS C compiler
  - Did exhaustive empirical studies to determine the individual performance contributions of optimizations and their interaction with the TRIPS compiler
- The TRIPS compiler does scheduling as a post process
- Determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: if the assembly for the inner loop is specified, the scheduler obtained 75% of total performance
- Lesson: orchestration is not the difficult part
  - Need to consider basic topology during compilation
  - Blocking compilers and register clustering are active topics of research
  - Annotations / hints to the compiler?

37 Conclusions
- Fine-grained architectures can boost single-thread performance
- The optimization principles we learned can be applied to many levels of architectural granularity, but are critical for fine-grained architectures
- In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate

38 Thank You :) Any questions?

40 Back Up Slides
Just a list for now:
- Comparison of GotoBLAS against ATLAS/LAPACK
- More detailed diagrams of the algorithm
- Other performance graphs
- Systolic array
- Diagrams of other canonical processors

41 Future work
- Explore the applicability of the optimization principles beyond dense linear algebra, to irregular, control-intensive algorithms
- Quantify the degree to which the principles apply to coarser-grained architectures (CMPs) and different memory topologies

42 Trends in Chip Level Parallelism
- Multiple ways to exploit parallelism: instruction/thread/data level parallelism; coarse grained vs fine grained
- What's the programming model?
  - High-level paradigm of your choice...
  - Dynamic compilation and run-time systems
  - Low-level APIs for writing optimized libraries
- Likely need to rewrite applications

43 Trends in Computer Architecture
- Emerging architectures are trending towards more fine-grained control, e.g. Intel Terascale, RAW, Tilera
  - Tightly orchestrated computation
  - On-chip networks
  - Precise control over communication
- These represent steps down a common path
- Algorithmic insight can be gained by looking at the most fine-grained examples

44 Spatially Distributed Uniprocessors
- Scalability issues for both architectures and the underlying technology: wire delay, power, issue width...
- More and more components of microprocessors are becoming distributed: partitioned register banks, functional units, ...
- An SDU partitions all aspects of a single core into tiles
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed registers and data ports
  - Enormous aggregate bandwidth to registers and data, but...
  - Communication between ALUs must go through the network
- Key performance characteristic: where an instruction executes matters!

45 TRIPS - a modern SDU
- Grid of ALUs (16)
- Large number of distributed registers
- Large number of data ports
- On-chip 2-D mesh network
- S-NUCA distributed L1 and L2 cache

46 TRIPS - a modern SDU
- Potential advantages for matrix multiply: a large number of ALUs and precise placement of instructions
- Not a MIMD machine
  - The model of execution is block dataflow graphs
  - Bring in graphs one at a time and execute them
  - Must also deal with data movement, registers, data bandwidth, and control

47 Classical Matrix Multiply
- Need to compute C = AB + C; once, this was just a triply nested loop (shown below for reference)...
- Want to amortize the O(n²) data movement over the 2n³ computation of matrix multiply
  - Break the A, B and C matrices into square blocks just small enough for blocks of A, B and C to fit in the L1 cache
  - The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
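The triply nested loop baseline in plain C, for reference (textbook form, not the TRIPS kernel):

```c
/* Naive C += A*B for n x n row-major matrices: 2*n^3 flops are spread
 * over only 3*n^2 matrix elements, which is the reuse blocking exploits. */
void gemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i*n + j];
            for (int p = 0; p < n; p++)
                cij += A[i*n + p] * B[p*n + j];
            C[i*n + j] = cij;
        }
}
```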

48 Performance for thin panels: C (m×n) = A (m×k) × B (k×n). (A back-of-the-envelope explanation follows below.)
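Why thin panels hurt, as an added back-of-the-envelope calculation: the multiply performs 2mnk flops but touches mn + mk + kn matrix elements, so the flop-to-data ratio is 2mnk / (mn + mk + kn). When the panel is thin (k much smaller than m and n), this ratio collapses to roughly 2k, so each element fetched is reused only a handful of times and memory traffic, not computation, dominates the running time.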

49 Goto's Streaming Algorithm
- The classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in the L1 cache
- Goto realized the L2 cache is actually fast enough to access directly from the inner kernel!
  - Use most of the L2 cache for a giant block of A
  - The inner kernel uses all levels of the memory hierarchy simultaneously: cache large slices of the B panel in the L1 cache, and cache a small piece of C in registers
- Instead of square matrix blocks, use block-panel multiplies, with a traversal order that maximizes reuse
  - Stream full-sized contiguous panels of B and C directly out of DRAM
- Use extremely optimized, hand-tuned assembly

50 Methodology
So we compiled code using the TRIPS compiler. And we ran it on a hardware prototype. We kept making changes and seeing how fast it ran. We made notes of the changes. We made graphs from the notes. We made slides based on the graphs. We made conclusions based on the slides. It's 130nm and 366 MHz, but that's OK.

51 Controlling The Cache. (Figure: C += A x B; the B slice fits in the L1 cache, the A block fits in the L2 cache, and C chunks come from L2.) How do we keep B in the L1 cache while streaming all of A through?

52 A Buffer Size

53-57 Block Panel Multiply (animation, five frames): C += A x B, doing multiple GEMDOTS in parallel.
