
1 Anatomy of a High-Performance Many-Threaded Matrix Multiplication
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee

2 Introduction
- Shared-memory parallelism for GEMM
- Many-threaded architectures require more sophisticated methods of parallelism
- We explore the opportunities for parallelism in order to explain which ones we exploit
- Finer-grained parallelism is needed

3 Outline
→ GotoBLAS approach
- Opportunities for Parallelism
- Many-threaded Results

4 GotoBLAS Approach
The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n.
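For concreteness, here is a minimal reference rendering of the operation in C. It assumes row-major storage with explicit leading dimensions; this is the textbook triple loop, not the blocked algorithm the talk builds up.

    #include <stddef.h>

    /* Reference GEMM: C += A * B, where C is m x n, A is m x k,
     * and B is k x n, all row-major with leading dimensions. */
    void gemm_ref(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
    {
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                for (size_t p = 0; p < k; ++p)
                    C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
    }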

5 [Figure: the memory hierarchy targeted by the algorithm: main memory, L3 cache, L2 cache, L1 cache, and registers, with the blocked update C += A B]

6 [Figure: the 5th loop partitions C and B into column panels of width nc]

7 [Figure: the 4th loop partitions the k dimension into panels of depth kc; the packed panel of B resides in the L3 cache]

8 [Figure: the 3rd loop partitions A into blocks of mc rows; the packed block of A resides in the L2 cache]

9 [Figure: the 2nd loop partitions the panel of B into micro-panels of width nr, streamed through the L1 cache]

10 [Figure: the 1st loop partitions the block of A into micro-panels of mr rows; the micro-kernel updates an mr × nr block of C in registers]
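Slides 5 through 10 build up the five loops of the GotoBLAS algorithm. Below is a minimal sketch of that loop nest in C, under stated assumptions: the block sizes are illustrative placeholders (real values are tuned per architecture), the packing of A and B into contiguous buffers is omitted, and the micro-kernel is a plain loop rather than the hand-written assembly kernel a real implementation uses.

    #include <stddef.h>

    /* Illustrative block sizes; real NC, KC, MC, NR, MR are tuned
     * per architecture. NC and MC are multiples of NR and MR. */
    enum { NC = 512, KC = 256, MC = 128, NR = 4, MR = 4 };

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Micro-kernel: an mr x nr block of C += micro-panel of A times
     * micro-panel of B. In a real implementation this is assembly
     * that keeps the mr x nr block of C in registers. */
    static void micro_kernel(size_t kc, size_t mr, size_t nr,
                             const double *A, size_t lda,
                             const double *B, size_t ldb,
                             double *C, size_t ldc)
    {
        for (size_t p = 0; p < kc; ++p)
            for (size_t i = 0; i < mr; ++i)
                for (size_t j = 0; j < nr; ++j)
                    C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
    }

    /* The five loops around the micro-kernel (packing omitted). */
    void gemm_blocked(size_t m, size_t n, size_t k,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *C, size_t ldc)
    {
        for (size_t jc = 0; jc < n; jc += NC)              /* 5th loop: nc */
          for (size_t pc = 0; pc < k; pc += KC)            /* 4th loop: kc */
            for (size_t ic = 0; ic < m; ic += MC)          /* 3rd loop: mc */
              for (size_t jr = jc; jr < MIN(jc + NC, n); jr += NR)    /* 2nd: nr */
                for (size_t ir = ic; ir < MIN(ic + MC, m); ir += MR)  /* 1st: mr */
                    micro_kernel(MIN(KC, k - pc),
                                 MIN(MR, m - ir), MIN(NR, n - jr),
                                 &A[ir*lda + pc], lda,
                                 &B[pc*ldb + jr], ldb,
                                 &C[ir*ldc + jr], ldc);
    }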

11 Outline
- GotoBLAS approach
→ Opportunities for Parallelism
- Many-threaded Results

12 3 Loops to Parallelize in GotoBLAS
[Figure: the blocked loop nest, with three of its loops highlighted]

13 5 Opportunities for Parallelism
[Figure: the blocked loop nest, with all five loops around the micro-kernel highlighted]

14 Multiple Levels of Parallelism: the ir loop
- All threads share the micro-panel of B
- Each thread has its own micro-panel of A
- Fixed number of iterations
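A hedged sketch of this option, building on the blocked code above (loop_ir_parallel is a hypothetical name; BLIS itself uses its own threading infrastructure rather than a bare OpenMP pragma):

    /* Parallelize the 1st (ir) loop. Shared: one kc x nr micro-panel
     * of B. Per thread: MR-row micro-panels of A and C. The trip
     * count, ceil(mc/MR), is fixed by the blocking. Uses MR, MIN,
     * and micro_kernel from the sketch above; compile with -fopenmp. */
    void loop_ir_parallel(size_t mc, size_t nr, size_t kc,
                          const double *Ablock,  /* mc x kc, packed */
                          const double *Bmicro,  /* kc x nr, packed */
                          double *C, size_t ldc)
    {
        #pragma omp parallel for
        for (size_t ir = 0; ir < mc; ir += MR)
            micro_kernel(kc, MIN(MR, mc - ir), nr,
                         &Ablock[ir * kc], kc,
                         Bmicro, nr,
                         &C[ir * ldc], ldc);
    }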

15 Multiple Levels of Parallelism: the jr loop
- All threads share the block of A
- Each thread has its own micro-panel of B
- Fixed number of iterations
- Good if there is a shared L2 cache
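The analogous sketch for this loop (again a hypothetical function built on the earlier code, not the BLIS implementation):

    /* Parallelize the 2nd (jr) loop. Shared: the packed mc x kc
     * block of A, which suits cores that share an L2 cache. Per
     * thread: kc x NR micro-panels of B. */
    void loop_jr_parallel(size_t mc, size_t nc, size_t kc,
                          const double *Ablock,  /* mc x kc, packed */
                          const double *Bpanel,  /* kc x nc, packed */
                          double *C, size_t ldc)
    {
        #pragma omp parallel for
        for (size_t jr = 0; jr < nc; jr += NR)
            for (size_t ir = 0; ir < mc; ir += MR)   /* serial inner loop */
                micro_kernel(kc, MIN(MR, mc - ir), MIN(NR, nc - jr),
                             &Ablock[ir * kc], kc,
                             &Bpanel[jr], nc,
                             &C[ir * ldc + jr], ldc);
    }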

16 Multiple Levels of Parallelism: the ic loop
- All threads share the panel of B
- Each thread has its own block of A
- The number of iterations is not fixed
- Good if there are multiple L2 caches
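A sketch of this option (hypothetical name; for simplicity A is assumed pre-packed with row stride kc, whereas in a real implementation each thread packs its own block of A into its L2 cache):

    /* Parallelize the 3rd (ic) loop. Shared: the packed kc x nc
     * panel of B. Per thread: MC-row blocks of A. The trip count
     * depends on m, so it is not fixed. */
    void loop_ic_parallel(size_t m, size_t nc, size_t kc,
                          const double *Apacked,  /* m x kc, row stride kc */
                          const double *Bpanel,   /* kc x nc, packed */
                          double *C, size_t ldc)
    {
        #pragma omp parallel for
        for (size_t ic = 0; ic < m; ic += MC) {
            size_t mb = MIN(MC, m - ic);
            for (size_t jr = 0; jr < nc; jr += NR)
                for (size_t ir = 0; ir < mb; ir += MR)
                    micro_kernel(kc, MIN(MR, mb - ir), MIN(NR, nc - jr),
                                 &Apacked[(ic + ir) * kc], kc,
                                 &Bpanel[jr], nc,
                                 &C[(ic + ir) * ldc + jr], ldc);
        }
    }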

17 Multiple Levels of Parallelism: the pc loop
- Each iteration updates the entire matrix C
- Iterations of the loop are not independent
- Requires a mutex when updating C, or a reduction
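A sketch of why this loop is different (hypothetical function; builds on KC and MIN from the blocked sketch). Each thread accumulates its partial products into a private copy of C, and the copies are merged under a mutex, here an OpenMP critical section; an OpenMP reduction clause can express the same merge.

    #include <stdlib.h>

    /* Parallelize a loop over k (the pc loop): every iteration
     * touches all of C, so threads reduce into private copies. */
    void gemm_k_parallel(size_t m, size_t n, size_t k,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc)
    {
        #pragma omp parallel
        {
            double *Cp = calloc(m * n, sizeof *Cp);  /* private C, zeroed */
            #pragma omp for nowait
            for (size_t pc = 0; pc < k; pc += KC) {
                size_t kb = MIN(KC, k - pc);
                for (size_t i = 0; i < m; ++i)
                    for (size_t j = 0; j < n; ++j)
                        for (size_t p = 0; p < kb; ++p)
                            Cp[i*n + j] += A[i*lda + pc + p]
                                         * B[(pc + p)*ldb + j];
            }
            #pragma omp critical   /* mutex-protected update of shared C */
            for (size_t i = 0; i < m; ++i)
                for (size_t j = 0; j < n; ++j)
                    C[i*ldc + j] += Cp[i*n + j];
            free(Cp);
        }
    }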


19 Multiple Levels of Parallelism: the jc loop
- All threads share the matrix A
- Each thread has its own panel of B
- The number of iterations is not fixed
- Good if there are multiple L3 caches
- Good for NUMA reasons
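Sketch (hypothetical name, reusing gemm_blocked from above): splitting the outermost jc loop gives each thread its own nc-wide panels of B and C while all of A is read-shared, which suits machines where each socket or NUMA domain has its own L3 cache.

    /* Parallelize the 5th (jc) loop over column panels of B and C. */
    void gemm_jc_parallel(size_t m, size_t n, size_t k,
                          const double *A, size_t lda,
                          const double *B, size_t ldb,
                          double *C, size_t ldc)
    {
        #pragma omp parallel for
        for (size_t jc = 0; jc < n; jc += NC)
            gemm_blocked(m, MIN(NC, n - jc), k,
                         A, lda,
                         &B[jc], ldb,   /* this thread's panel of B */
                         &C[jc], ldc);  /* this thread's panel of C */
    }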

20 Outline
- GotoBLAS approach
- Opportunities for Parallelism
→ Many-threaded Results

21 Intel Xeon Phi
- Many threads: 60 cores, 4 threads per core
  - Need to use at least 2 threads per core to utilize the FPU
- We do not block for the L1 cache
  - Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
  - We consider part of the L2 cache as a virtual L1
- Each core has its own L2 cache
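With 240 hardware threads, no single loop exposes enough parallelism, so it must be extracted from more than one loop at once. Purely as an illustration (CORES, SMT, and the pairing of loops here are assumptions, not necessarily the configuration used in the talk; BLIS manages its own thread decomposition), nested OpenMP can split the ic loop across cores and the jr loop across the threads within each core:

    #include <omp.h>

    enum { CORES = 60, SMT = 4 };  /* hypothetical tuning knobs */

    /* Two levels of parallelism: the ic loop across cores, the jr
     * loop across the hardware threads of each core. Reuses MC, NR,
     * MR, MIN, and micro_kernel from the blocked sketch above. */
    void loop_ic_jr_parallel(size_t m, size_t nc, size_t kc,
                             const double *Apacked, const double *Bpanel,
                             double *C, size_t ldc)
    {
        omp_set_max_active_levels(2);            /* enable nesting */
        #pragma omp parallel for num_threads(CORES)
        for (size_t ic = 0; ic < m; ic += MC) {
            size_t mb = MIN(MC, m - ic);
            #pragma omp parallel for num_threads(SMT)
            for (size_t jr = 0; jr < nc; jr += NR)
                for (size_t ir = 0; ir < mb; ir += MR)
                    micro_kernel(kc, MIN(MR, mb - ir), MIN(NR, nc - jr),
                                 &Apacked[(ic + ir) * kc], kc,
                                 &Bpanel[jr], nc,
                                 &C[(ic + ir) * ldc + jr], ldc);
        }
    }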

22-25 [Figures: performance results on the Intel Xeon Phi]

26 IBM Blue Gene/Q
- (Not quite as) many threads: 16 cores, 4 threads per core
  - Need to use at least 2 threads per core to utilize the FPU
- We do not block for the L1 cache
  - Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
  - We consider part of the L2 cache as a virtual L1
- Single large, shared L2 cache

27-32 [Figures: performance results on IBM Blue Gene/Q]

33 Thank You
Questions?
Source code available at: code.google.com/p/blis/

