A Concurrent Matrix Transpose Algorithm Pourya Jafari.

1 A Concurrent Matrix Transpose Algorithm Pourya Jafari

2 Application: A frequently used linear algebra operation. Used in scientific applications, FFT, and matrix multiplication.

3 Transpose Matrix: $B_{ij}$ is the item (cell) at row i and column j of matrix B. If B is the transpose of A, then for all i, j we have $B_{ij} = A_{ji}$. Simply exchange rows and columns. For simplicity we only consider square matrices: N rows and N columns, labeled 0 to N-1.

4 An Example: Each cell is filled with its row|column number. Transposing the 4×4 matrix takes 6 swaps: (4*4 – 4)/2 = 6. In general, for a square matrix of size N we have $(N^2 - N)/2 = N(N-1)/2$ swaps.
   00 01 02 03
   10 11 12 13
   20 21 22 23
   30 31 32 33
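The swap count can be checked with a short sketch. The function and variable names here are illustrative and not from the presentation; the code swaps each cell above the diagonal with its mirror image and counts the swaps.

```python
def transpose_in_place(matrix):
    """Transpose a square matrix in place and return the number of swaps."""
    n = len(matrix)
    swaps = 0
    for i in range(n):
        # Only cells above the diagonal (j > i) need a swap partner.
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
            swaps += 1
    return swaps


def labeled_matrix(n):
    """Build the example from the slide: each cell holds its row|column label."""
    return [[f"{i}{j}" for j in range(n)] for i in range(n)]


example = labeled_matrix(4)
swap_count = transpose_in_place(example)
print(swap_count)   # number of swaps performed
for row in example:
    print(" ".join(row))
```

For N = 4 the counter agrees with (4*4 – 4)/2, and the first row of the result reads 00 10 20 30.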

5 Parallelizing: Naïve algorithm. A thread for each swap. Quadratic number of threads and quadratic number of communication links → impractical.

6 Parallelizing - 2: A more efficient way. Assign a column to each thread: O(N) threads. Communication links? Depends on the approach.

7 Measure dislocation: View a single swap operation as row and column shifts. For a column shift of length K: i = j + K → K = i - j. The shift length is i - j, taken modulo N, so its value ranges from 0 to N-1.

8 Concurrency Scheme: Minimize communication. Pre-process inside thread: shift each row. Intra-process/thread communication: shift each column. Post-process inside thread: shift each row again.

9 Concurrency Scheme - 2: We have the row shifts fixed based on row index. The shift has range 0 to N-1, consistent with our initial finding. Now arrange the rows so that the column shifts get us to i: i - L = i', L + i' = i, L = j. So we shift each column j cells up.

10 Steps so far: 1 → 2: column shift j up. 2 → 3: row shift based on row indices. 3 → 4: ? Change of indices so far: (i - j, j) → (i - j, i - j + j) → (i - j, i) = (m, n). One operation is needed to change the row index to j: n - m = (i - (i - j)) = j. Matrices shown on the slide, which labels them (1), (2-a), (2-b), (3), (4):
   00 01 02 03    00 11 22 33    00 01 02 03    00 11 22 33    00 10 20 30
   10 11 12 13    10 21 32 03    10 11 12 13    03 10 21 32    01 11 21 31
   20 21 22 23    20 31 02 13    20 21 22 23    02 13 20 31    02 12 22 32
   30 31 32 33    30 01 12 23    30 31 32 33    01 12 23 30    03 13 23 33
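The steps so far can be sketched in Python. The helper names are hypothetical and not from the presentation. The slide leaves step 3 → 4 open ("?"); this sketch assumes it moves each cell in column n from row m to row (n - m) mod N, as suggested by n - m = j. It runs sequentially and only illustrates the index arithmetic, not the threading.

```python
def shift_column_up(grid, col, amount):
    """Cyclically shift one column up by `amount` cells."""
    n = len(grid)
    column = [grid[row][col] for row in range(n)]
    for row in range(n):
        grid[row][col] = column[(row + amount) % n]


def shift_row_right(grid, row, amount):
    """Cyclically shift one row right by `amount` cells."""
    n = len(grid)
    old = grid[row][:]
    for col in range(n):
        grid[row][(col + amount) % n] = old[col]


def transpose_by_shifts(matrix):
    """Return the two intermediate grids and the final result."""
    n = len(matrix)
    grid = [row[:] for row in matrix]
    # Step 1 -> 2: shift column j up by j, so (i, j) -> (i - j, j).
    for j in range(n):
        shift_column_up(grid, j, j)
    after_columns = [row[:] for row in grid]
    # Step 2 -> 3: shift row m right by m, so (i - j, j) -> (i - j, i).
    for m in range(n):
        shift_row_right(grid, m, m)
    after_rows = [row[:] for row in grid]
    # Step 3 -> 4 (assumed): in column n, move row m to row (n - m) mod N.
    result = [[None] * n for _ in range(n)]
    for m in range(n):
        for col in range(n):
            result[(col - m) % n][col] = grid[m][col]
    return after_columns, after_rows, result


example = [[f"{i}{j}" for j in range(4)] for i in range(4)]
after_columns, after_rows, transposed = transpose_by_shifts(example)
for row in transposed:
    print(" ".join(row))
```

On the 4×4 example the two intermediate grids match the column-shifted and row-shifted matrices from the slide, and the last step yields the transpose.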

11 Efficiency of algorithm so far: O(N) row and column operations; $O(N^2)$ overall considering both rows and columns. O(N) communication links. Communication is a major bottleneck. Group row shifts to reduce communication and overall complexity.

12 Radix Representation: Radix r means base-r numbers. For each digit place k (starting from the least significant), for steps l from 0 to r-1: group all row shifts for the current step. Example, radix 3: the possible digits are 0, 1 and 2, so the second loop is { for l = 0 to 2 }. Shift all rows whose index has l in its k-th digit place by l*r^k to the right.
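A minimal sketch of the grouping, assuming the goal is for row m to end up shifted by m in total. The function name and the returned tuple layout are illustrative only. The loop skips l = 0, since a shift by 0 moves nothing.

```python
def grouped_row_shifts(n, radix):
    """Group row shifts by digit place k and digit value l.

    Returns (total, steps): total[row] is the accumulated shift of each row,
    and steps lists (k, l, amount, rows) for every grouped shift.
    """
    total = [0] * n
    steps = []
    k = 0
    while radix ** k < n:              # one pass per digit place
        for l in range(1, radix):      # l = 0 would be a shift by zero
            amount = l * radix ** k
            # Rows whose index has digit l in the k-th place (base `radix`).
            rows = [row for row in range(n) if (row // radix ** k) % radix == l]
            if rows:
                steps.append((k, l, amount, rows))
                for row in rows:
                    total[row] += amount
        k += 1
    return total, steps


total, steps = grouped_row_shifts(9, 3)
for k, l, amount, rows in steps:
    print(f"k={k} l={l}: shift rows {rows} right by {amount}")
print(total)
```

For N = 9 and radix 3 this produces four grouped steps, and every row m accumulates a total shift of m. With radix 2 the digit test reduces to checking whether bit k of the row index is set.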

13 Special Case: Radix-2: Two steps only, 0 and 1, and we only shift for 1. Digits are the bit representation: shift all rows whose index has its k-th bit on. Figure: for rows 0, 1, 2, 3, the shift for each row = shift at k=0 + shift at k=1.

14 Algorithm complexity: Depends on r (radix): $C_1 = (r-1)\lceil \log_r N \rceil$, $C_2 = b(r-1)\lceil N/r \rceil \lceil \log_r N \rceil$. Special cases: r=2 is important when communication cost is high and good when message size is small; r=N is good when message size is large. The best value depends on communication costs, message size, communication link performance, number of ports, etc.
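To see how the two costs trade off, the formulas can be evaluated for a few radices. This is a sketch under assumptions: the bracketed terms are taken to round up (ceilings), b is treated as a message size of 1 unit, and the function names are not from the presentation.

```python
def ceil_log(n, r):
    """Smallest k with r**k >= n, using integers to avoid float rounding."""
    k, power = 0, 1
    while power < n:
        power *= r
        k += 1
    return k


def costs(n, r, b):
    """Evaluate C1 and C2 from the slide for N = n, radix r, and b."""
    digits = ceil_log(n, r)
    blocks = -(-n // r)                # integer ceiling of n / r
    c1 = (r - 1) * digits
    c2 = b * (r - 1) * blocks * digits
    return c1, c2


for r in (2, 4, 8, 64):
    c1, c2 = costs(64, r, 1)
    print(f"r={r:2d}  C1={c1:3d}  C2={c2:4d}")
```

For N = 64 and b = 1, r = 2 gives the smallest C1 and the largest C2 of the radices tried, while r = N = 64 gives the reverse, which matches the two special cases on the slide.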

15 Radix vs. message size vs. index time for 64 processors

