1 Implementation of Polymorphic Matrix Inversion using Viva
Arvind Sudarsanam, Dasu Aravind
Utah State University

2 Sudarsanam MAPLD2005/171 2/29 Overview
- Problem definition
- Matrix inverse algorithm
- Types of Polymorphism
- Design Set-up
- Hardware design flow (for LU Decomposition)
- Results
- Conclusions

3 Problem Definition
Given a 2-D matrix A[N][N],

    A = | A[1,1]  A[1,2]  A[1,3]  ...  A[1,N] |
        | A[2,1]  A[2,2]  A[2,3]  ...  A[2,N] |
        | A[3,1]  A[3,2]  A[3,3]  ...  A[3,N] |
        |   .                                 |
        | A[N,1]  A[N,2]  A[N,3]  ...  A[N,N] |

determine the inverse matrix A^-1, defined by A x A^-1 = I.

4 Algorithm flow
Step 1: LU Decomposition
- Matrix A is split into two triangular matrices, L and U.

    For i = 1:N
      For j = i+1:N
        A(j,i) = A(j,i)/A(i,i);
        A(j,(i+1):N) = A(j,(i+1):N) - A(j,i)*A(i,(i+1):N);
      End For j
    End For i
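
The loop nest on this slide can be tried out in software. The following is a minimal Python sketch, not the Viva implementation: it assumes 0-based indices, floating-point data and no pivoting (the slide does not mention pivoting), and the function names `lu_decompose_in_place` and `split_lu` are illustrative only.

```python
def lu_decompose_in_place(a):
    """In-place LU decomposition without pivoting, following the slide's loops.

    Afterwards the strict lower triangle of `a` holds the multipliers (L, with
    an implied unit diagonal) and the upper triangle holds U.
    """
    n = len(a)
    for i in range(n):
        if a[i][i] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for j in range(i + 1, n):
            a[j][i] = a[j][i] / a[i][i]           # A(j,i) = A(j,i)/A(i,i)
            for c in range(i + 1, n):             # A(j,i+1:N) update
                a[j][c] -= a[j][i] * a[i][c]
    return a


def split_lu(a):
    """Split the packed result into explicit L (unit diagonal) and U matrices."""
    n = len(a)
    lower = [[a[r][c] if c < r else (1.0 if c == r else 0.0) for c in range(n)]
             for r in range(n)]
    upper = [[a[r][c] if c >= r else 0.0 for c in range(n)] for r in range(n)]
    return lower, upper
```

For example, decomposing [[4, 3], [6, 3]] stores the multiplier 1.5 below the diagonal and leaves U = [[4, 3], [0, -1.5]].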

5 Algorithm flow
Step 2: Inverse computation for triangular matrices
- L^-1 and U^-1 are computed using a variation of Gaussian elimination.

    For i = 1:N
      For j = i+1:N
        Linv(j,i+1:N) = Linv(j,i+1:N) - L(j,i)*Linv(i,i+1:N);
      End For j
    End For i
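
A software sketch of the L^-1 step, under stated assumptions rather than as the authors' exact method: L is taken to be unit lower triangular, Linv starts as the identity, and the row update is applied to columns 1..i of Linv, because with an identity start every entry of row i to the right of the diagonal is still zero. The name `invert_unit_lower` is hypothetical.

```python
def invert_unit_lower(l):
    """Invert a unit lower triangular matrix by forward elimination.

    Each step subtracts L[j][i] times row i of Linv from row j of Linv,
    which is the row operation that clears L[j][i] in Gaussian elimination.
    """
    n = len(l)
    linv = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            factor = l[j][i]
            # Row i of Linv is only non-zero in columns 0..i (assumption above).
            for c in range(i + 1):
                linv[j][c] -= factor * linv[i][c]
    return linv
```

U^-1 can be obtained along the same lines by working upwards from the last row and dividing by the diagonal entries, which this sketch does not cover.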

6 Algorithm flow
Step 3: Matrix multiplication
- U^-1 and L^-1 are multiplied together to generate A^-1 (since A = L x U, A^-1 = U^-1 x L^-1).

    For i = 1:N
      For j = 1:N
        For k = 1:N
          Ainv[i,j] = Ainv[i,j] + Uinv[i,k]*Linv[k,j]
        End For k
      End For j
    End For i
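
The multiplication step can be checked on a small case. This Python sketch writes the summation index k as an explicit innermost loop and starts the accumulator at zero; the 2x2 matrices in the usage note are illustrative values, not data from the presentation.

```python
def matmul(x, y):
    """Plain triple-loop product of two square matrices of equal size."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i][j] += x[i][k] * y[k][j]
    return out
```

For A = [[2, 1], [4, 4]] the factors are L = [[1, 0], [2, 1]] and U = [[2, 1], [0, 2]], with Linv = [[1, 0], [-2, 1]] and Uinv = [[0.5, -0.25], [0, 0.5]]. Calling `matmul(Uinv, Linv)` gives [[1, -0.25], [-1, 0.5]], and multiplying that by A returns the identity.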

7 Types of Polymorphism
The following parameters can be varied for the input matrix:
- Data type: variable precision, signed/unsigned, and float
- Information rate: the rate at which input arrives into, and leaves, the system (pipelining/parallelism)
- Order tensor: matrix size (16x16, 32x32, etc.)

8 Polymorphism and Viva
Viva supports polymorphic hardware implementation, much as a software programming language supports polymorphic code. A large library of polymorphic arithmetic, control and memory modules is available.

9 Data Type Polymorphism
[Figure: Viva design with an object labelled "Polymorphic"]

10 Information Rate Polymorphism
The clock speed can be changed based on the input data rate.
The ‘Mul’ unit is a truly polymorphic object: based on the input list size, the Viva compiler generates the required number of parallel multiplier units. The number of parallel units is denoted ‘K’.

11 Order Tensor Polymorphism
The value of ‘N’ is set at run time.

12 Design Flow – Top level block diagram
[Block diagram. Blocks shown:]
- Central Control Unit (CCGU)
- Memory units for A, L, U, L^-1, U^-1 and A^-1
- Loop units: LU Decompose, Inverse of L, Inverse of U, U^-1 x L^-1
- Input: from files

13 Design Flow

    Main Step  Operation     Sub Step  Sub Module
    1          Initialize    0         Generate address
                             1         Write A onto BRAM
    2          LU Decompose  0         Generate ‘i’, ‘j’, ‘k’
                             1         Read A[j,i], A[j,()], ...
                             2         Compute new A[j,()]
                             3         Write A[j,()], A[j,i]
    3          A2LU Convert  0         Generate ‘j’, ‘k’
                             1         Read A[j,()]
                             2         Compute L[j,()], U[j,()]
                             3         Write L[j,()], U[j,()]

14 Design Flow

    Main Step  Operation  Sub Step  Sub Module
    4          L inverse  0         Generate ‘i’, ‘j’, ‘k’
                          1         Read L[j,()], L^-1[j,()], ...
                          2         Compute new L^-1[j,()]
                          3         Write L^-1[j,()]
    5          U inverse  0         Generate ‘i’, ‘j’, ‘k’
                          1         Read U[j,()], U^-1[j,()], ...
                          2         Compute U^-1[j,()]
                          3         Write U^-1[j,()]
    6          A inverse  0         Generate ‘i’, ‘j’, ‘k’
                          1         Read L[i,()], U[j,()]
                          2         Compute Ainv[i,j,()]
                          3         Update Ainv[i,j]

15 Hardware Design Set-up
Hardware: PE6 (Xilinx 2V6000 FPGA; 66 MHz, 33,768 slices) of the Starbridge Hypercomputer, connected to an Intel x86 processor.
Software: Viva 2.3, developed at Starbridge Systems.

16 Implementation – LU Decomposition
[Block diagram. Blocks: Loop Unit, Address Generation Unit, Memory Unit, Computation Unit. Signals: i, j, k from the Loop Unit; A[j,()], A[i,()], A[j,i], A[i,i] from the Memory Unit; A[j,()], A[j,i] from the Computation Unit.]

17 Loop Unit - Functionality
Given the order of the matrix ‘N’ and the parallelism to be supported ‘K’, the following loop structure needs to be generated:

    For i = 1 to N
      For k = ((i-1)/K)*K to N+1-K in steps of K
        For j = i to N
          Generate(i,k,j);
        End j
      End k
    End i
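
The index stream produced by the Loop Unit can be modelled in software. This Python sketch assumes that (i-1)/K is an integer (truncating) division, so that k starts at the first column block that still contains column i, and that the upper bound N+1-K is inclusive. The generator name `loop_unit` is illustrative.

```python
def loop_unit(n, k_par):
    """Yield (i, k, j) triples in the order of the slide's loop nest.

    i and j are 1-based as on the slide; k is the 0-based start of a block
    of k_par columns. Integer division in the k start is an assumption.
    """
    for i in range(1, n + 1):
        k_start = ((i - 1) // k_par) * k_par
        for k in range(k_start, n + 1 - k_par + 1, k_par):
            for j in range(i, n + 1):
                yield (i, k, j)
```

With N=16 and K=8, rows i=1..8 visit two column blocks (k=0 and k=8) and rows i=9..16 visit only the block at k=8, so blocks that lie entirely to the left of the pivot column are skipped.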

18 Loop Unit - Architecture
A simple register-based implementation is shown. The overall latency is 2 clock cycles.

19 Memory Unit - Distribution
[Figure: the matrix divided into blocks of 8 consecutive elements of a row; one such group is labelled "One Block"]

    A[1,1:8]  A[1,9:16]  A[1,17:24]  A[1,25:32]  ...
    A[2,1:8]  A[2,9:16]  A[2,17:24]  ...
    A[3,1:8]  A[3,9:16]  ...
    A[4,1:8]  ...

20 Memory Unit - Architecture
- BRAM memories are used to store data internally. (The matrix is expected to fit into the BRAMs; the maximum value of N is 128.)
- There are ‘K’ individual BRAMs, each of size [(NxN)/K] x (variable data size).
- The ‘K’ values in each block of the matrix are distributed over the ‘K’ BRAMs. This results in a single-clock access time for internal memory.
- A[j,()] and A[j,i] are fetched one after the other on every iteration.
- The overall latency was found to be 3 clock cycles.

21 Address Generation - Functionality
Inputs: i, j, k from the Loop Unit
Outputs:
- Address in the BRAM of the A[j,()] and A[i,()] blocks of data
- Address in the BRAM of A[j,i] and A[i,i]
Notes:
- The computations have been organized in such a way that A[i,()] needs to be fetched only once for processing a complete column of blocks.
- Thus, only one port is required to access both A[i,()] and A[j,()].

22 Address Generation - Architecture
Shifts are used instead of multipliers, since N and K are assumed to be powers of 2. (Latency = 1 clock cycle)
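
One way the shift-based addressing could work is sketched below in Python. The mapping itself is an assumption, since the slides do not give the exact layout: element (row, col), 0-based, goes to BRAM number `col mod K`, at word address `row * (N/K) + col div K`, so that the K elements of one block share a single address across the K BRAMs. The function name `bram_location` is hypothetical.

```python
def bram_location(row, col, n, k_par):
    """Map a 0-based matrix element to (bram_index, word_address).

    Uses only shifts and masks, which is valid when n and k_par are powers
    of two. The layout is an illustrative assumption, not the Viva design.
    """
    log2_k = k_par.bit_length() - 1                  # log2(K)
    log2_blocks = (n // k_par).bit_length() - 1      # log2(N/K) blocks per row
    bank = col & (k_par - 1)                         # col mod K
    address = (row << log2_blocks) | (col >> log2_k) # row*(N/K) + col div K
    return bank, address
```

With N=16 and K=8 each BRAM holds (16x16)/8 = 32 words, and all eight elements of a block such as row 2, columns 8..15 land at address 5 in BRAMs 0..7, so the block can be read in one access.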

23 Computation Units - Functionality
Inputs:
- A[j,()] and A[i,()] blocks from the BRAM unit
- A[j,i] and A[i,i] from the BRAM unit
- Indices i, j, k from the Loop Unit
Output: the modified A[j,()] block and the A[j,i] value.
Three steps are performed:
1. Modify A[i,()] based on the loop indices
2. Perform computations: Divide, Multiply, Subtract
3. Include A[j,i] on A[j,()] if required
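
The three steps can be read as follows; this Python sketch is one interpretation, with the details assumed rather than taken from the design. Step 1 is modelled as ignoring the entries of A[i,()] at or left of the pivot column, step 2 as the divide, multiply and subtract of the LU update, and step 3 as writing the new A[j,i] into the block when column i falls inside it. Indices are 0-based and `compute_block` is a hypothetical name.

```python
def compute_block(a_j_block, a_i_block, a_ji, a_ii, i, k):
    """Update one block of row j during LU decomposition.

    The block covers columns k .. k+len(block)-1; i is the pivot column.
    Returns the modified A[j,()] block and the new A[j,i] multiplier.
    """
    factor = a_ji / a_ii                      # Divide
    out = []
    for offset, (aj, ai) in enumerate(zip(a_j_block, a_i_block)):
        col = k + offset
        if col > i:
            out.append(aj - factor * ai)      # Multiply, Subtract
        elif col == i:
            out.append(factor)                # Step 3: include new A[j,i]
        else:
            out.append(aj)                    # Left of the pivot: unchanged
    return out, factor
```

For a block of four columns starting at k=0 with pivot i=0, pivot row [2, 1, 4, 6] and row j [4, 4, 2, 0], the multiplier is 2 and the updated block is [2, 2, -6, -12].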

24 Computation Units – Architecture (K=8)
[Figure: architecture of the computation units]

25 Results for LUD – Slice Counts (N=16)

    List Type  Fix16      Fix32        Float
    Size=4     1862 (8)   7305 (32)    5012 (12)
    Size=8     3731 (16)  14472 (64)   9802 (24)
    Size=16    7502 (32)  29018 (128)  19024 (48)

The number of ROM multipliers used is shown in brackets.

26 Results for LUD – Time Taken (in cycles)

    List Type  Fix16  Fix32  Float
    Size=4     1212   1276   1232
    Size=8     590    654    610
    Size=16    279    343    299

27 Time taken Vs Size of Matrix (Fix16, K = 8)

    Size of the matrix  Time taken (in cycles)
    16x16               590
    32x32               4348
    64x64               33528
    128x128             264688 (3970320 ns)

Equivalent ‘C’ code (N=128, Fix16) on an Intel Centrino 1.5 GHz will take O(M*N^3) time, about 702545*M ns, where ‘M’ is the number of cycles per iteration (~30). This corresponds to a speed-up of about M/6.
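
The figures on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted on the slide; the variable and function names are illustrative.

```python
# Cycle counts from the slide (Fix16, K = 8), keyed by matrix order N.
CYCLES = {16: 590, 32: 4348, 64: 33528, 128: 264688}
HW_TIME_NS_128 = 3970320          # hardware time quoted for N = 128
SW_NS_PER_M = 702545              # software estimate: 702545 * M ns

# Clock period implied by the N = 128 row.
ns_per_cycle = HW_TIME_NS_128 / CYCLES[128]

# Growth in cycle count each time N doubles (cubic scaling would give 8).
growth = [CYCLES[2 * n] / CYCLES[n] for n in (16, 32, 64)]


def speedup(m):
    """Software time divided by hardware time for M cycles per iteration."""
    return SW_NS_PER_M * m / HW_TIME_NS_128
```

The implied clock period is 15 ns per cycle, in line with the 66 MHz clock of the set-up. The growth factors are roughly 7.4, 7.7 and 7.9, approaching the factor of 8 expected from O(N^3) scaling. The ratio 702545/3970320 is about 1/5.65, which matches the slide's rounded M/6 and gives a speed-up of about 5.3 for M = 30.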

28 Conclusions
A polymorphic design for matrix inverse was implemented:
- Data type: Float/Fix16/Fix32
- Information rate (K): 4/8/16
- Order tensor (N): 16/32/64/128
Viva’s effectiveness in polymorphic implementation was evaluated.
The hardware design flow and results were shown for LU Decomposition.

29 Lessons learned
Pseudo polymorphism
- Some of the polymorphic objects in the Viva library are pseudo-polymorphic, e.g. the floating-point and fixed-point implementations of the adder unit.
Need for a timing analysis tool
- It was difficult to compute the delays associated with each block in the Viva library.
Fix32 vs. Float
- The division unit in the Viva library is optimized for floating point and not for fixed point (as shown in the results).

