Published by Meghan Hutchins. Modified about 1 year ago.

1
Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel (1), Oreste Villa (2), Sriram Krishnamoorthy (2), Antonino Tumeo (2) and Xiaoming Li (1)
(1) University of Delaware
(2) Pacific Northwest National Laboratory
September 24th, 2010

2
Overview
Introduction
Cluster level
Node level
Results
Conclusion
Future Work

4
Sparse Matrix-Matrix Multiply - Challenges
The efficient implementation of sparse matrix-matrix multiplications on HPC systems poses several challenges:
Large size of the input matrices, e.g. ×10^6 with 30×10^6 nonzero elements.
Compressed representation.
Partitioning.
Density of the output matrices.
Load balancing: large differences in density and computation times.
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at:

5
Sparse Matrix-Matrix Multiply
Cross-cluster implementation: partitioning, data distribution, load balancing, communication/scaling, result handling.
In-node implementation: multiple efficient SpGEMM algorithms, CPU/GPU implementation, double buffering, exploiting heterogeneity.
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at:

7
Sparse Matrix-Matrix Multiply - Cluster level
Blocking: the block size depends on the sparsity of the input matrices and on the number of processing elements, with NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements.
Data layout: which format and order allow easy and fast access.
Communication and storage are implemented using Global Arrays (GA), which offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.

8
Sparse Matrix-Matrix Multiply - Data representation and Tiling
[Figure: blocked matrices A, B and C with C = A×B]
Blocked matrix representation: each block is stored in CSR (Compressed Sparse Row) form, as the three arrays data, col and row.
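A minimal sketch of the blocked CSR idea, for illustration only. The names `CsrTile`, `dense_to_csr` and `tile_matrix` are hypothetical, the input is a small dense list-of-lists matrix, and column indices are assumed to be local to each tile; the slides do not state which convention the authors use.

```python
from dataclasses import dataclass


@dataclass
class CsrTile:
    """One tile in CSR form (hypothetical helper type)."""
    data: list  # nonzero values, row by row
    col: list   # column index of each value, local to the tile (assumption)
    row: list   # row pointers, length = number of tile rows + 1


def dense_to_csr(block):
    """Convert a dense block (list of rows) to a CsrTile."""
    data, col, row = [], [], [0]
    for values in block:
        for j, x in enumerate(values):
            if x != 0:
                data.append(x)
                col.append(j)
        row.append(len(data))  # end of this row in data/col
    return CsrTile(data, col, row)


def tile_matrix(dense, tile_size):
    """Split a dense matrix into a 2D grid of CSR tiles of at most tile_size x tile_size."""
    n_rows, n_cols = len(dense), len(dense[0])
    grid = []
    for r0 in range(0, n_rows, tile_size):
        grid_row = []
        for c0 in range(0, n_cols, tile_size):
            block = [r[c0:c0 + tile_size] for r in dense[r0:r0 + tile_size]]
            grid_row.append(dense_to_csr(block))
        grid.append(grid_row)
    return grid


example = [
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [4, 0, 0, 5],
]
example_grid = tile_matrix(example, 2)
```

A real implementation would build the tiles directly from a sparse input rather than from a dense matrix; the dense input here only keeps the sketch short.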

9
Sparse Matrix-Matrix Multiply - Data representation and Tiling
[Figure: A, B and C with C = A×B; serialized layout "data, column, row, data, col, …" for Tile 0, Tile 2, …]
Matrix A: the individual CSR tiles are stored serialized in the GA space. Tile sizes and offsets are stored in a 2D array. Tiles with 0 nonzero elements are not represented in the GA dataset.
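The serialized layout can be sketched as follows. This is an illustration under assumptions, not the GA-based code from the talk: a plain Python list stands in for the GA space, tiles are dictionaries with `data`, `col` and `row` keys, and the function names and the `(offset, nnz, n_rows)` index entry are hypothetical.

```python
def serialize_tiles(grid):
    """Flatten a 2D grid of CSR tiles into one buffer.

    Returns (buffer, index). index[i][j] is (offset, nnz, n_rows) for a
    stored tile, or None for a tile with no nonzero elements, which is
    not written to the buffer at all.
    """
    buffer, index = [], []
    for tile_row in grid:
        index_row = []
        for tile in tile_row:
            nnz = len(tile["data"])
            if nnz == 0:
                index_row.append(None)  # empty tile: nothing stored
                continue
            index_row.append((len(buffer), nnz, len(tile["row"]) - 1))
            # Same order as on the slide: data, then col, then row.
            buffer.extend(tile["data"])
            buffer.extend(tile["col"])
            buffer.extend(tile["row"])
        index.append(index_row)
    return buffer, index


def load_tile(buffer, index, i, j):
    """Rebuild tile (i, j) from the buffer, or return None if it was empty."""
    entry = index[i][j]
    if entry is None:
        return None
    offset, nnz, n_rows = entry
    return {
        "data": buffer[offset:offset + nnz],
        "col": buffer[offset + nnz:offset + 2 * nnz],
        "row": buffer[offset + 2 * nnz:offset + 2 * nnz + n_rows + 1],
    }
```

Mixing values and indices in one list is a Python convenience; a typed global array would need a single element type or separate arrays.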

10
Sparse Matrix-Matrix Multiply - Data representation and Tiling
[Figure: matrix B, with the tile contents marked "not transposed or transposed"]
Matrix B: the tiles are serialized in transposed order. Depending on the algorithm used to compute the individual tiles, the data inside the tiles can be stored transposed or not transposed. For the Gustavson algorithm, the data inside the tiles is not transposed.

11
Sparse Matrix-Matrix Multiply - Tasking and Data Movement
[Figure: matrix C, with labels 0, 1, …, N-1]
Each block in C represents a task. Nodes grab tasks, and any additional data they need, when they have computational power available. Results are stored locally. The metadata of the result blocks in each node is distributed to determine the offsets of the tiles in the GA space. Tiles are then put into the GA space in the right order.
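The tasking scheme above can be sketched with threads standing in for nodes. This is only an illustration: the shared counter, `run_tasks` and `compute_tile` are hypothetical names, a Python list stands in for the GA space, and the exclusive prefix sum over tile sizes is one straightforward way to turn the gathered metadata into offsets.

```python
import threading
from itertools import accumulate


class TaskCounter:
    """Shared counter that hands out task ids one at a time."""

    def __init__(self, num_tasks):
        self._next = 0
        self._num_tasks = num_tasks
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            if self._next >= self._num_tasks:
                return None  # no tasks left
            task = self._next
            self._next += 1
            return task


def worker(counter, compute_tile, local_results):
    # Grab a new task whenever this worker is free; keep results locally.
    while (task := counter.next()) is not None:
        local_results[task] = compute_tile(task)


def run_tasks(num_tasks, num_workers, compute_tile):
    counter = TaskCounter(num_tasks)
    per_worker = [dict() for _ in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(counter, compute_tile, res))
               for res in per_worker]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Exchange metadata: the size of every result tile, indexed by task id.
    sizes = [0] * num_tasks
    for res in per_worker:
        for task, tile in res.items():
            sizes[task] = len(tile)
    # Exclusive prefix sum gives each tile's offset in the output space.
    offsets = list(accumulate(sizes, initial=0))[:-1]

    # Every worker writes its own tiles at the agreed offsets.
    out = [None] * sum(sizes)
    for res in per_worker:
        for task, tile in res.items():
            out[offsets[task]:offsets[task] + len(tile)] = tile
    return per_worker, offsets, out
```

Which worker computes which tile varies from run to run, but the offsets depend only on the tile sizes, so the final layout is the same every time.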

12
Sparse Matrix-Matrix Multiply - Tasking and Data Movement
[Figure: A, B and C = A×B with the stripes S_a and S_b]
Each node fetches the data needed by the task it handles. E.g. here, for task/tile 5, the node has to load the data of stripes s_a = 1 and s_b = 0.
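One possible mapping from a task id to the stripes it needs is sketched below. It assumes that the tiles of C are numbered row by row and that C has 5 tiles per row; both are assumptions chosen because they reproduce the slide's example (task 5 needs s_a = 1 and s_b = 0), and the function name is hypothetical.

```python
def tiles_needed(task_id, tiles_per_row, inner_tiles):
    """Return the stripes and the tile coordinates a task has to fetch.

    Assumes row-major numbering of the tiles of C. Tile (s_a, s_b) of C is
    the sum over k of A[s_a][k] x B[k][s_b], so the task needs the whole
    row stripe s_a of A and the whole column stripe s_b of B.
    """
    s_a, s_b = divmod(task_id, tiles_per_row)
    a_tiles = [(s_a, k) for k in range(inner_tiles)]
    b_tiles = [(k, s_b) for k in range(inner_tiles)]
    return s_a, s_b, a_tiles, b_tiles


s_a, s_b, a_tiles, b_tiles = tiles_needed(task_id=5, tiles_per_row=5, inner_tiles=5)
```

If the tiles of B are serialized in transposed order, the column stripe s_b is contiguous in memory, so both stripes can be fetched as single ranges.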

13
Sparse Matrix-Matrix Multiply - Next Step: Locality-aware Tasking
[Figure: A, B and C with C = A×B, highlighting the tasks that should have a higher priority to be assigned to the node that handled task 5]
Assign tasks depending on how the global array is distributed over the cluster. The task queue should be aware of which data is already available in a node and assign the follow-up task based on that.
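A locality-aware choice of the follow-up task could look like the sketch below. The scoring rule (count how many of the two stripes a task needs are already held by the node) and all names are assumptions made for illustration; the slides describe this only as a next step and give no concrete policy. Row-major task numbering is assumed again.

```python
def pick_next_task(pending, cached_a, cached_b, tiles_per_row):
    """Pick the pending task that reuses the most stripes already on the node.

    cached_a / cached_b are the sets of A row stripes and B column stripes
    the node currently holds. Ties go to the lowest task id.
    """
    def reuse(task):
        s_a, s_b = divmod(task, tiles_per_row)
        return (s_a in cached_a) + (s_b in cached_b)

    return max(pending, key=lambda task: (reuse(task), -task))


# A node that just handled task 5 (with 5 tiles per row) holds s_a = 1 and s_b = 0.
next_task = pick_next_task([12, 7, 13], cached_a={1}, cached_b={0}, tiles_per_row=5)
```

Here task 7 is chosen because it lies in row stripe 1, which the node already holds, while tasks 12 and 13 share no stripe with task 5.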

15
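Sparse Matrix-Matrix Multiply - Gustavson
The algorithm is based on the equation
c_{i\cdot} = \sum_{v \,:\, a_{iv} \neq 0} a_{iv} \, b_{v\cdot}
that is, the i-th row of C is a linear combination of the rows v of B for which a_iv is nonzero. A has the dimensions p×q and B has the dimensions q×r.
Matrix A in CSR form:
data (2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4)
col (0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5)
row (0, 2, 5, 7, 9, 12, 15)
Matrix B in CSR form:
data (1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2)
col (0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4)
row (0, 2, 3, 5, 8, 10, 12)
[Figure: A × B = C, building row i=1 of C step by step from the rows v=1, v=3 and v=4 of B]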
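A sequential sketch of the row-wise Gustavson scheme, using the CSR arrays from the slide. It assumes that the first CSR triple is A and the second is B, which is consistent with the figure labels: row 1 of the first matrix has its nonzeros in columns 1, 3 and 4. The function name and the dictionary used as a sparse accumulator are illustrative choices, not the authors' implementation.

```python
def spgemm_gustavson(a_data, a_col, a_row, b_data, b_col, b_row):
    """Compute C = A x B row by row, with all three matrices in CSR form."""
    c_data, c_col, c_row = [], [], [0]
    for i in range(len(a_row) - 1):
        acc = {}  # sparse accumulator for row i of C: column -> value
        for p in range(a_row[i], a_row[i + 1]):
            v, a_iv = a_col[p], a_data[p]
            # Add a_iv times row v of B to the accumulator.
            for q in range(b_row[v], b_row[v + 1]):
                j = b_col[q]
                acc[j] = acc.get(j, 0) + a_iv * b_data[q]
        for j in sorted(acc):  # emit the row with ascending column indices
            c_col.append(j)
            c_data.append(acc[j])
        c_row.append(len(c_data))
    return c_data, c_col, c_row


# CSR arrays as given on the slide (first triple taken as A, second as B).
A_DATA = [2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4]
A_COL = [0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5]
A_ROW = [0, 2, 5, 7, 9, 12, 15]
B_DATA = [1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2]
B_COL = [0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4]
B_ROW = [0, 2, 3, 5, 8, 10, 12]

C_DATA, C_COL, C_ROW = spgemm_gustavson(A_DATA, A_COL, A_ROW, B_DATA, B_COL, B_ROW)
```

For row i=1, the accumulator combines -1 times row 1, 2 times row 3 and 3 times row 4 of B, which gives the values -4, -2, 14 and 7 in columns 0, 1, 3 and 4. Entries that cancel to zero are kept as explicit zeros in this sketch.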

16
Sparse Matrix-Matrix Multiply - Gustavson
[Figure: A, B and C, with the rows of C assigned to half-warp 0, half-warp 1, half-warp 2, …]
In the CUDA implementation, each result row c_i is handled by the 16 threads of a half warp (1/2W). For each nonzero element a_iv in A, one 1/2W performs the multiplications for row v· of B in parallel. The results are kept in dense form until all calculations are complete. Then the results are compressed on the device.
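The dense-then-compress strategy can be emulated sequentially as below. This is a Python stand-in, not CUDA code and not the authors' kernel: the loop over `lane` imitates the 16 threads of a half warp, and the strided assignment of row elements to lanes is an assumption. The CSR arrays are the ones from the previous slide, with the first triple taken as A.

```python
HALF_WARP = 16  # threads per half warp, as stated on the slide


def row_dense_then_compress(i, a_data, a_col, a_row, b_data, b_col, b_row, n_cols_b):
    """Compute row i of C in a dense buffer, then compress it to (values, columns)."""
    dense = [0] * n_cols_b
    for p in range(a_row[i], a_row[i + 1]):
        v, a_iv = a_col[p], a_data[p]
        start, end = b_row[v], b_row[v + 1]
        # Emulated half warp: lane k handles elements start+k, start+k+16, ...
        # Column indices within one CSR row are distinct, so the lanes
        # never write to the same position of the dense buffer.
        for lane in range(HALF_WARP):
            for q in range(start + lane, end, HALF_WARP):
                dense[b_col[q]] += a_iv * b_data[q]
    # Compression step: keep only the nonzero entries.
    cols = [j for j, x in enumerate(dense) if x != 0]
    return [dense[j] for j in cols], cols


A_DATA = [2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4]
A_COL = [0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5]
A_ROW = [0, 2, 5, 7, 9, 12, 15]
B_DATA = [1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2]
B_COL = [0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4]
B_ROW = [0, 2, 3, 5, 8, 10, 12]

row1_values, row1_cols = row_dense_then_compress(
    1, A_DATA, A_COL, A_ROW, B_DATA, B_COL, B_ROW, n_cols_b=5)
```

The dense buffer avoids searching for existing entries during accumulation, at the cost of memory proportional to the tile width and a separate compression pass.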

18
Sparse Matrix-Matrix Multiply – Case Study
[Figure: density plot of the matrix. Darker colors represent higher densities of nonzero elements.]
Midsize matrix from the University of Florida Sparse Matrix Collection*, 2D/3D problem.
Size 72,000 × 72,… with …,715,634 nonzero elements.
Blocked into 5041 tiles.
The matrix is multiplied with itself.
*http://www.cise.ufl.edu/davis/sparse

19
Sparse Matrix-Matrix Multiply - Results
[Figure: Scaling of SpGEMM with the different approaches]

20
Sparse Matrix-Matrix Multiply - Results

21
Sparse Matrix-Matrix Multiply - Results
Even inside a node where different compute elements are used, the load balancing mechanism still performs well. The processes using the CUDA devices complete almost 5x more tasks than the pure CPU processes.

23
Sparse Matrix-Matrix Multiply
We presented a parallel framework using a co-design approach that takes into account the characteristics of the selected application (here SpGEMM) and of the underlying hardware (a heterogeneous cluster).
The difficulties of static partitioning approaches show that a global load balancing method is needed.
Different optimized implementations of the Gustavson algorithm are presented and are used depending on the available compute element.
For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.

25
Future Work – General Tasking Framework for Heterogeneous GPU Clusters
More general task definition.
More flexibility in input and output data definition.
Exploring the limits imposed on tasks by a heterogeneous system.
A feedback loop during execution that allows more efficient assignment of tasks.
Introducing heterogeneous execution on GPU and CPU in one process/core.
Locality-aware task queue(s) and work stealing.
Task reinsertion or generation at the node level.

26
Thank you. Questions?
