
1 Programming for Parallelism and Locality with Hierarchically Tiled Arrays
Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua, Christoph von Praun* — University of Illinois at Urbana-Champaign, *IBM T. J. Watson Research Center, +Universidade da Coruña

2 Motivation
The memory hierarchy determines performance and is increasingly complex: registers – cache (L1..L3) – memory – distributed memory. There is a gap between programs and their mapping onto the memory hierarchy: the programmer must understand how data flows from registers to cache to memory to distributed memory. Given this importance, programming constructs that allow the control of locality are necessary. Moreover, some algorithms are better described in terms of tiles. 1/12/2019 HTA

3 Contributions
Design of a programming paradigm that facilitates control of locality (in sequential and parallel programs) and of communication (in parallel programs). Implementations in MATLAB, C++ and X10. Implementation of the NAS benchmarks in our paradigm. Evaluation of productivity and performance.

4 Outline of the talk
Motivation of our research, Hierarchically Tiled Arrays, Examples, Evaluation, Conclusion & Future Work.

5 Hierarchically Tiled Arrays (HTA)
Tiled array: an array partitioned into tiles. Hierarchically Tiled Array: a tiled array whose tiles are, in turn, HTAs.

6 Hierarchically Tiled Arrays
An example tiling: 2 x 2 tiles at the top (distributed/locality level), each containing 4 x 4 tiles, each containing 2 x 2 tiles.

7 Hierarchically Tiled Arrays
An example with sizes: 2 x 2 tiles of 64 x 64 elements (distributed/locality level); 4 x 4 tiles of 32 x 32 elements; 2 x 2 tiles of 8 x 8 elements.

8 Hierarchically Tiled Arrays
An example mapping: the 2 x 2 tiles of 64 x 64 elements map to distinct nodes of a cluster (distributed/locality level); the 4 x 4 tiles of 32 x 32 elements map to the L1 cache; the 2 x 2 tiles of 8 x 8 elements map to registers.

9 Why tiles as first-class objects?
Many algorithms are formulated in terms of tiles, e.g. the linear algebra algorithms in LAPACK and ATLAS. What is special about HTAs in particular? HTAs provide a convenient way for the programmer to express these algorithms, and their recursive nature makes them easy to fit into any memory hierarchy.

10 HTA Construction
hta(MATRIX, {[delim1], [delim2], …}, [processor mesh])
Example: ha = hta(a, {[1,4], [1,3]}, [2,2]); partitions the matrix a into 2 x 2 tiles, with tiles starting at rows 1 and 4 and at columns 1 and 3, and maps them onto a 2 x 2 processor mesh.
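To make the constructor example concrete, here is a minimal pure-Python sketch of the partitioning step. The `tile` helper is hypothetical (not the HTA library's API), and it assumes the delimiter lists give the 1-based starting row/column of each tile, as in the `ha = hta(a, {[1,4],[1,3]}, [2,2])` example above:

```python
def tile(matrix, row_starts, col_starts):
    """Partition a dense matrix (list of lists) into a 2D grid of tiles.

    row_starts / col_starts are 1-based starting indices of each tile,
    mirroring the delimiter lists of the hta() constructor."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Convert 1-based start indices to 0-based slice boundaries.
    r = [s - 1 for s in row_starts] + [n_rows]
    c = [s - 1 for s in col_starts] + [n_cols]
    return [[[row[c[j]:c[j + 1]] for row in matrix[r[i]:r[i + 1]]]
             for j in range(len(col_starts))]
            for i in range(len(row_starts))]

a = [[i * 4 + j for j in range(4)] for i in range(6)]  # a 6 x 4 matrix
ha = tile(a, [1, 4], [1, 3])  # a 2 x 2 grid of tiles, as in the slide
```

The processor-mesh argument (omitted here) would only affect where each tile lives, not this partitioning logic.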

11 HTA Addressing

12 HTA Addressing
Tiles: h{1,1:2}, h{2,1} (hierarchical addressing).

13 HTA Addressing
Tiles: h{1,1:2}, h{2,1} (hierarchical). Elements: h(3,4) (flattened).

14 HTA Addressing
Hierarchical: h{1,1:2}, h{2,1}. Flattened: h(3,4). Hybrid: h{2,2}(1,2), which addresses the same element as the flattened h(3,4).
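The three addressing modes can be sketched in pure Python. The helpers (`tile_at`, `hybrid`, `flat`) are hypothetical stand-ins for the HTA notation, and the example assumes a 2 x 2 grid of 2 x 2 tiles so that the flattened h(3,4) and the hybrid h{2,2}(1,2) name the same element, as on the slide:

```python
def tile_at(h, ti, tj):            # hierarchical: h{ti, tj}
    return h[ti - 1][tj - 1]

def hybrid(h, ti, tj, i, j):       # hybrid: h{ti, tj}(i, j)
    return h[ti - 1][tj - 1][i - 1][j - 1]

def flat(h, i, j, tile_rows=2, tile_cols=2):   # flattened: h(i, j)
    # Split a 1-based global index into (tile index, index within tile).
    ti, ii = divmod(i - 1, tile_rows)
    tj, jj = divmod(j - 1, tile_cols)
    return h[ti][tj][ii][jj]

# A 4 x 4 matrix 1..16 stored as a 2 x 2 grid of 2 x 2 tiles.
h = [[[[1, 2], [5, 6]], [[3, 4], [7, 8]]],
     [[[9, 10], [13, 14]], [[11, 12], [15, 16]]]]

flat(h, 3, 4)          # element (3,4) of the flattened matrix -> 12
hybrid(h, 2, 2, 1, 2)  # the same element through tile {2,2} -> 12
```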

15 F90 Conformability
An assignment is conformable if size(rhs) == 1, or if dim(lhs) == dim(rhs) .and. size(lhs) == size(rhs).

16 HTA Conformability
All conformable HTAs can be operated on with the primitive operations (add, subtract, etc.).

17 HTA Assignment
h{:,:} = t{:,:} and h{1,:} = t{2,:} (the triplet ':' means all indices). Assignments between tiles on different processors imply implicit communication.

18 Data parallel functions in F90

19 Map r = map (@sin, h) (recursive)

20 Map r = map (@sin, h): @sin is applied to each tile (recursive)

21 Map r = map (@sin, h): @sin is applied recursively, down to every leaf tile

22 Reduce r = reduce (@max, h) (recursive)

23 Reduce r = reduce (@max, h): @max is applied within each tile (recursive)

24 Reduce r = reduce (@max, h): the per-tile results are then reduced across tiles (recursive)
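The recursive behaviour of map and reduce can be sketched in pure Python, representing an HTA as nested lists with numbers at the leaves (a simplification of the real tile structure; `hta_map` and `hta_reduce` are hypothetical names):

```python
from functools import reduce as _fold

def hta_map(f, h):
    """Apply f to every leaf element, preserving the tile hierarchy
    (the behaviour of r = map(@sin, h) on the slides)."""
    if isinstance(h, list):
        return [hta_map(f, t) for t in h]
    return f(h)

def hta_reduce(f, h):
    """Reduce within each tile, then across tiles
    (the behaviour of r = reduce(@max, h))."""
    if isinstance(h, list):
        return _fold(f, (hta_reduce(f, t) for t in h))
    return h

h = [[[1, 7], [3, 2]], [[9, 4], [5, 6]]]   # two levels of tiling
r = hta_reduce(max, h)                      # -> 9
doubled = hta_map(lambda x: 2 * x, h)
```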

25 Higher level operations
repmat(h, [1, 3]), circshift(h, [0, -1]), transpose(h). On distributed HTAs these operations imply communication.

26 Outline of the talk
Motivation of our research, Hierarchically Tiled Arrays, Examples, Evaluation, Conclusion & Future Work.

27 Blocked Recursive Matrix Multiplication
Blocked-recursive implementation; A{i,k}, B{k,j} and C{i,j} are sub-matrices of A, B and C.
function C = matmul (A, B, C)
  if (level(A) == 0)
    C = matmul_leaf (A, B, C);
  else
    for i = 1:size(A, 1)
      for k = 1:size(A, 2)
        for j = 1:size(B, 2)
          C{i, j} = matmul (A{i, k}, B{k, j}, C{i, j});
        end
      end
    end
  end
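The recursive structure above can be sketched in pure Python. This is a minimal translation, not the HTA library: a tiled matrix is a dict {'tiles': 2D grid of sub-matrices} and a leaf is a plain 2D list, so the `isinstance` test plays the role of level(A) == 0:

```python
def matmul_leaf(A, B, C):
    """Ordinary triple-loop multiply-accumulate on plain 2D lists."""
    for i in range(len(A)):
        for k in range(len(B)):
            for j in range(len(B[0])):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul(A, B, C):
    """Blocked-recursive multiply: recurse tile-by-tile until the leaves."""
    if not isinstance(A, dict):                 # level(A) == 0 on the slide
        return matmul_leaf(A, B, C)
    At, Bt, Ct = A['tiles'], B['tiles'], C['tiles']
    for i in range(len(At)):
        for k in range(len(At[0])):
            for j in range(len(Bt[0])):
                Ct[i][j] = matmul(At[i][k], Bt[k][j], Ct[i][j])
    return C

# [[1,2],[3,4]] stored as a 2 x 2 grid of 1 x 1 leaf tiles:
A = {'tiles': [[[[1]], [[2]]], [[[3]], [[4]]]]}
C = {'tiles': [[[[0]], [[0]]], [[[0]], [[0]]]]}
matmul(A, A, C)   # C now holds [[7,10],[15,22]] tile by tile
```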

28 Matrix Multiplication
The recursion descends one level of the hierarchy per call: matmul (A, B, C) → matmul (A{i,k}, B{k,j}, C{i,j}) → matmul (AA{i,k}, BB{k,j}, CC{i,j}) (distributed/locality levels) → matmul_leaf (AAA{i,k}, BBB{k,j}, CCC{i,j}).

29 Matrix Multiplication
matmul (A, B, C) → matmul (A{i,k}, B{k,j}, C{i,j}) → matmul (AA{i,k}, BB{k,j}, CC{i,j}) (distributed/locality levels) → matmul_leaf (AAA{i,k}, BBB{k,j}, CCC{i,j}). At the leaf level:
for i = 1:size(A, 1)
  for k = 1:size(A, 2)
    for j = 1:size(B, 2)
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end

30 SUMMA Matrix Multiplication
Outer-product method:
for k=1:n
  for i=1:n
    for j=1:n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
is equivalent to the vectorized outer-product form:
for k=1:n
  C(:,:) = C(:,:) + A(:,k) * B(k,:);
end

31 SUMMA Matrix Multiplication
Here C{i,j}, A{i,k}, B{k,j} represent submatrices, and the * operator is tile-by-tile matrix multiplication. At each step k, repmat broadcasts the k-th tile column of A and the k-th tile row of B:
function C = summa (A, B, C)
for k=1:m
  T1 = repmat(A{:, k}, 1, m);
  T2 = repmat(B{k, :}, m, 1);
  C = C + T1 * T2;
end
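The SUMMA loop above can be sketched in pure Python over a grid of tiles. This is an illustrative translation under our own representation (tiles as nested lists; `mm`, `add` and `summa` are hypothetical helpers), with the repmat broadcasts written out as list replication:

```python
def mm(X, Y):
    """Local (leaf) matrix multiply of two 2D lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    """Elementwise sum of two 2D lists."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def summa(A, B, C, m):
    for k in range(m):
        T1 = [[A[i][k]] * m for i in range(m)]   # repmat(A{:,k}, 1, m)
        T2 = [B[k][:] for _ in range(m)]         # repmat(B{k,:}, m, 1)
        for i in range(m):
            for j in range(m):                   # local multiply-accumulate
                C[i][j] = add(C[i][j], mm(T1[i][j], T2[i][j]))
    return C

# [[1,2],[3,4]] squared, stored as a 2 x 2 grid of 1 x 1 tiles:
A = [[[[1]], [[2]]], [[[3]], [[4]]]]
C = summa(A, A, [[[[0]], [[0]]], [[[0]], [[0]]]], 2)
```

In the distributed setting each (i, j) iteration runs on the processor owning tile C{i,j}; the repmat lines are where communication happens.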

32 Outline of the talk
Motivation of our research, Hierarchically Tiled Arrays, Examples, Evaluation, Conclusion & Future Work.

33 Implementation
MATLAB extension with HTA: global view & single threaded; the front-end is pure MATLAB; MPI calls through the MEX interface; MATLAB library routines at the leaf level. X10 extension: emulation only (no experimental results). C++ extension with HTA (in progress): communication using MPI (& UPC).

34 Evaluation
Implemented the full set of NAS benchmarks (except SP) in pure MATLAB and in MATLAB + HTA. Performance metrics: running time and relative speedup on processors. Productivity metric: source lines of code, compared with the hand-optimized F77+MPI version. Here tiles are used only for parallelism. Also: matrix-matrix multiplication using HTAs in C++, where tiles are used for locality.

35 Illustrations

36 MG
3D stencil convolution: primitive HTA operations and assignments implement the communication (restrict and interpolate operators).

37 MG
r{1,:}(2,:) = r{2,:}(1,:);   (communication)
r{:,:}(2:n-1, 2:n-1) = 0.5 * u{:,:}(2:n-1, 2:n-1) * (u{:,:}(1:n-2, 2:n-1) + u{:,:}(3:n, 2:n-1) + u{:,:}(2:n-1, 1:n-2) + u{:,:}(2:n-1, 3:n))   (computation, n = 3)
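The communication line above, r{1,:}(2,:) = r{2,:}(1,:), copies a boundary row between neighbouring tiles so each tile can compute its stencil locally. A minimal pure-Python sketch of that exchange, with tiles as nested lists (`boundary_exchange` is a hypothetical helper, and the stencil coefficients are deliberately left out since only the exchange is illustrated):

```python
def boundary_exchange(r):
    """r is a grid of tiles (each tile a 2D list): copy row 1 of every
    tile in tile-row 2 into row 2 of the tile above it (1-based, as in
    r{1,:}(2,:) = r{2,:}(1,:) on the slide)."""
    for tj in range(len(r[0])):
        r[0][tj][1] = list(r[1][tj][0])   # tile-row 1 receives ghost data
    return r

# A 2 x 1 grid of 2 x 2 tiles; the lower tile owns the data [7, 8].
r = [[[[0, 0], [0, 0]]], [[[7, 8], [9, 9]]]]
boundary_exchange(r)   # r[0][0] is now [[0, 0], [7, 8]]
```

On a distributed HTA this assignment is exactly where the MPI message would be sent.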

38 CG
Sparse matrix-vector multiplication (A*p): 2D decomposition of A into tiles A00..A33, replication and 2D decomposition of p, and a sum reduction across columns. The tiles are distributed to different processors, and the vector tiles p0T..p3T are transposed.

39 CG
Step 1: A = hta(MX, {partition_A}, [M N]);
Step 2: p = repmat( hta(vector', {partition_p}, [1 N]), [M, 1]);   (replicates the transposed vector tiles p0T..p3T to every tile row)

40 CG
Step 3: p = map (@transpose, p);   (turns the replicated row tiles p0T..p3T back into column tiles p0..p3, as shown in the figure)

41 CG
Step 4: R = reduce ('sum', A * p, 2, true);   (each processor multiplies its tile of A by its copy of p, and the partial tiles R00..R33 are sum-reduced across tile columns)
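The multiply-then-reduce pattern of steps 1-4 can be sketched in pure Python. This is our own illustration (`matvec_tiled` is a hypothetical helper; tiles are nested lists, and the replication of p is assumed to have already happened):

```python
def matvec_tiled(A, p):
    """A: M x N grid of dense matrix tiles (2D lists); p: list of N
    vector tiles, replicated to every tile row as repmat does above.
    Each tile computes a local product; the partial results are then
    sum-reduced across the tile columns (reduce('sum', ..., 2))."""
    M, N = len(A), len(A[0])
    R = []
    for i in range(M):
        # Local matrix-vector product in every tile of tile-row i.
        partial = [[sum(row[k] * p[j][k] for k in range(len(row)))
                    for row in A[i][j]] for j in range(N)]
        # Sum reduction across tile columns (dimension 2 on the slide).
        R.append([sum(col) for col in zip(*partial)])
    return R

A = [[[[1]], [[2]]], [[[3]], [[4]]]]   # 2 x 2 grid of 1 x 1 tiles
p = [[1], [1]]                         # two replicated vector tiles
R = matvec_tiled(A, p)                 # -> [[3], [7]]
```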

42 FT
3D Fourier transform: the corner turn is done with dpermute, and the Fourier transform is applied to each tile. A 2D illustration:
h = ft (h, 1);
h = dpermute (h, [2 1]);
h = ft (h, 1);
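The transform-permute-transform sequence above can be sketched in pure Python. A toy per-line "transform" stands in for the Fourier transform (the point is the corner turn, not the FFT itself; `dpermute` and `ft_dim1` are hypothetical helpers):

```python
def dpermute(h):
    """Transpose a 2D array: dpermute(h, [2 1]) on the slide."""
    return [list(row) for row in zip(*h)]

def ft_dim1(h, transform):
    """Apply a 1D transform along the first dimension of each line."""
    return [transform(row) for row in h]

toy = lambda v: [sum(v)] * len(v)   # placeholder for a real 1D FFT

h = [[1, 2], [3, 4]]
step1 = ft_dim1(h, toy)        # transform along one dimension
step2 = dpermute(step1)        # corner turn (the only communication step)
step3 = ft_dim1(step2, toy)    # transform along the other dimension
```

On a distributed HTA only the dpermute step moves data between processors; both ft steps are tile-local.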

43 IS
Bucket sort of integers; the bucket exchange is done with assignments. Steps: distribute keys, local bucket sort, aggregate, bucket exchange, local sort.

44 IS
Bucket exchange with assignments; h and r are distributed at the top level:
for i = 1:n
  for j = 1:n
    r{i}{j} = h{j}{i};
  end
end
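The loop above is a transpose of tile ownership: processor i gathers, from every tile j, the bucket destined for i. A minimal pure-Python sketch (`bucket_exchange` is a hypothetical helper; h[j][i] holds the bucket for owner i produced on tile j):

```python
def bucket_exchange(h):
    """r{i}{j} = h{j}{i}: owner i collects its buckets from all tiles j."""
    n = len(h)
    return [[h[j][i] for j in range(n)] for i in range(n)]

# Two tiles, each with two locally sorted buckets.
h = [[[1], [8]], [[2], [9]]]
r = bucket_exchange(h)   # r[0] gathers [1] and [2]; r[1] gathers [8] and [9]
```

In the distributed HTA each inner assignment between tiles owned by different processors is an implicit message.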

45 Experimental Results
NAS benchmarks in MATLAB (results in the paper): CG & FT show acceptable performance and speedup; MG, LU and BT show poor performance but fair speedup. C++ results are yet to be obtained.

46 Matrix Multiplication (C++ implementation)

47 Lines of Code (chart comparing source lines of code for EP, CG, MG, FT and LU)

48 Outline of the talk
Motivation of our research, Hierarchically Tiled Arrays, Examples, Evaluation, Conclusion & Future Work.

49 Conclusion
HTA is the first attempt to make tiles part of a language. Treating tiling as a syntactic entity helps improve productivity; extending the vector operations to tiles makes programs more readable; and HTAs provide a convenient way to express algorithms with a tiled formulation. Future work: C++ library + NAS benchmarks, scalability experiments, and evolving the language for locality.

50 THE END

51 Backup slides

52 Conjugate Gradient benchmark in C++:
void conj_grad (SparseHTA A, HTA x, Matrix z) {
  z = 0.0; q = 0.0;
  r = x; p = r;
  rho = HTA::dotproduct(r, r);
  for (int i = 0; i < CGITMAX; i++) {
    qt = 0;
    pt = p.ltranspose();
    SparseHTA::lmatmul(A, pt, &qt);  /* compute A*p */
    qt = qt.reduce(plus<double>(), 0.0, 1, 0x2);
    q = qt.transpose();
    alpha = rho / HTA::dotproduct(p, q);
    z = z + alpha * p;
    r = r - alpha * q;
    rho0 = rho;
    rho = HTA::dotproduct(r, r);
    beta = rho / rho0;
    p = r + beta * p;
  }
  /* compute the norm */
  qt = 0;
  HTA zt = z.ltranspose();
  SparseHTA::lmatmul(A, zt, &qt);
  r = qt.transpose();
  HTA f = x - r;
  f = f * f;
  norm = f.reduce(plus<double>(), 0.0);
  rnorm = sqrt(norm);
}

53 Related Work
Languages: ZPL, CAF, UPC, HPF, X10, Chapel (parallel MATLABs excluded). Classification axes: global vs segmented memory; single vs multiple threads of execution; language vs library.

54 Classification
(Table classifying CAF, Titanium, UPC, X10, HPF, ZPL, POOMA, HTA, Global Arrays, POET and MPI/PVM along the axes: global view, single threaded, library.)

55 Uni-processor performance challenges
MATLAB: limited conversion to vector format; interpretation costs due to loops; copy-on-write during assignments; poor data abstraction with higher dimensionality. C++: run-time costs (e.g. virtual functions); extensible, polymorphic design; syntax constraints leading to proxies.

56 Multi-processor performance challenges
Eliminating temporaries in expression evaluation (e.g. the stencil convolutions in MG); scalable communication operations (e.g. reduction & transpose in CG, permute in FT, cumsum & bucket exchange in IS); many-to-one and complex mappings (e.g. BT); overlapping computation & communication (e.g. LU).

