1 Matrix Multiplication on SOPC. Project instructor: Ina Rivkin. Students: Shai Amara, Shuki Gulzari. Project duration: one semester.



2 Project Goals: Implement a matrix multiplication IP. The IP multiplies an N x M matrix A by an M x L matrix B and produces an N x L result matrix. Integrate the IP into a system on a programmable chip (SOPC).

3 Specification of Matrix IP: The matrix dimensions N, M and L can each range from 1 to 127. The elements of the two input matrices can range from -2^15 to +2^15 - 1 (16-bit signed). The elements of the result matrix are 32-bit integers and can range from -2^31 to +2^31 - 1.

4 Specification of Matrix IP (continued): Each matrix is stored in a separate address range of the IP. An address range is limited to 64 KB, which is enough for 16K (2^14) 32-bit integers, so the largest square matrix that fits is 128x128. Our maximum is 127x127.
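The capacity arithmetic on this slide can be checked with a few lines of C. This is only a sketch: the macro and function names are made up for illustration, and it assumes every matrix element occupies one 32-bit word, which is how the slide counts "16K integers". The 7-bit limit on each dimension comes from the size fields of the IP's inner register.

```c
#include <assert.h>
#include <stdint.h>

/* Figures from the slides: 64 KB per address range, 7-bit size fields.
 * One 32-bit word per element is an assumption. */
#define AR_SIZE_BYTES  (64u * 1024u)
#define WORD_BYTES     4u
#define AR_WORDS       (AR_SIZE_BYTES / WORD_BYTES)  /* 16384 = 2^14 */
#define MAX_DIM        ((1u << 7) - 1u)              /* 127 */

/* Largest d such that a d x d matrix of words fits in one address range. */
static uint32_t max_square_dim(void)
{
    uint32_t d = 0;
    while ((d + 1u) * (d + 1u) <= AR_WORDS)
        d++;
    assert(d * d <= AR_WORDS);
    return d;
}

/* Returns 1 if a rows x cols matrix is accepted: both dimensions must be
 * in 1..127 and the matrix must fit in one address range. */
static int matrix_fits(uint32_t rows, uint32_t cols)
{
    return rows >= 1u && cols >= 1u &&
           rows <= MAX_DIM && cols <= MAX_DIM &&
           rows * cols <= AR_WORDS;
}
```

Under these assumptions the memory alone would allow 128x128, and it is the 7-bit size fields that bring the limit down to 127x127.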

5 Implementation: General hardware scheme. [Block diagram: Processor, Matrix Multiplication IP, PLB/OPB bridge and UART, connected through the PLB and OPB buses.]

6 IP's inner address ranges: The IP has three address ranges: AR0 (matrix A), AR1 (matrix B) and AR2 (result matrix). The CPU may only write to AR0 and AR1 and may only read from AR2. The IP's FSM may only read from AR0 and AR1 and may only write to AR2.

7 General Implementation Idea. [Block diagram: a memory block holding matrix A, matrix B and the result matrix; a logic block with the FSM and the multiplier-accumulator (Mult Accum); register R0 with the start signal and the sizes of the matrices; register R1 with the finish bit; clock, address, data, data-out and write-enable signals.]

8 Actual Implementation. [Block diagram.]

9 Implementation: a Simple Example. First, the processor writes the two matrices into the IP's first and second address ranges. ADDRESS RANGE 0: 0x0 = 3, 0x1 = -1, 0x2 = 2, 0x3 = 5. ADDRESS RANGE 1: 0x0 = 4, 0x1 = 6.

10 a Simple Example (continued): Second, the processor writes the matrix sizes (N, M, L) and the start bit to the IP's inner register in the following format: bits 0-6: L; bits 7-13: M; bits 14-20: N; bit 21: Start; bits 22-31: don't care. The IP's FSM reads N, M and L as unsigned numbers, so the maximum value of each is 2^7 - 1 = 127.

11 a Simple Example (continued): In our example the sizes could be 2x2 and 2x1, or 4x1 and 1x2. Let's take the case of 2x2 and 2x1 (N = 2, M = 2, L = 1). The inner register will be written with: Start = 1, N = 0000010, M = 0000010, L = 0000001, that is, bits 21 down to 0 = 1 0000010 0000010 0000001.
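The register format can be expressed as a small packing helper. The field positions are the ones given on the slides (L in bits 0-6, M in bits 7-13, N in bits 14-20, Start in bit 21); the function name pack_control and the macro names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions from the slides; bits 22-31 are don't-care. */
#define L_SHIFT      0u
#define M_SHIFT      7u
#define N_SHIFT      14u
#define START_SHIFT  21u
#define DIM_MAX      0x7Fu   /* 7-bit unsigned field, 1..127 */

/* Packs the matrix sizes and the start bit into one 32-bit word.
 * pack_control is an illustrative name, not taken from the project. */
static uint32_t pack_control(uint32_t n, uint32_t m, uint32_t l, int start)
{
    assert(n >= 1u && n <= DIM_MAX);
    assert(m >= 1u && m <= DIM_MAX);
    assert(l >= 1u && l <= DIM_MAX);
    return ((start ? 1u : 0u) << START_SHIFT) |
           (n << N_SHIFT) | (m << M_SHIFT) | (l << L_SHIFT);
}
```

For the 2x2 by 2x1 case, pack_control(2, 2, 1, 1) yields 0x208101, which is the bit pattern 1 0000010 0000010 0000001.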

12 IP's FSM. [State diagram. Labels: Idle (start = '0' stays in Idle, start = '1' leaves it); n, m, l <= sizes; row <= 0, col <= 0, i <= 0; Address_A <= i + row*m, Address_B <= i*l + col, Sel_A <= 1, Sel_B <= 1; i <= i + 1, Sel_A <= 0, Sel_B <= 0; WE <= 1, Data_out <= data_in, Add_out <= row*l + col, col <= col + 1; WE <= 0, i <= 0; row <= row + 1; Finish bit <= 1. Transition conditions: i < m-1, i = m-1, col < l-1, col = l-1, row < n-1, row = n-1.]
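The addressing used by the FSM can be mirrored in a software model. The sketch below follows the three address expressions from the diagram (Address_A = i + row*m, Address_B = i*l + col, Add_out = row*l + col) and ignores timing, the select signals and the idle/finish handshake. The function name fsm_model and the use of plain C arrays for the three address ranges are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the FSM's address sequence. ar0, ar1 and ar2 stand in
 * for the IP's three address ranges. Inputs are assumed small enough that
 * every accumulated sum fits in 32 bits. */
static void fsm_model(const int16_t *ar0, const int16_t *ar1, int32_t *ar2,
                      unsigned n, unsigned m, unsigned l)
{
    assert(n >= 1u && m >= 1u && l >= 1u);
    for (unsigned row = 0; row < n; row++) {
        for (unsigned col = 0; col < l; col++) {
            int32_t acc = 0;                       /* multiplier-accumulator */
            for (unsigned i = 0; i < m; i++) {
                unsigned address_a = i + row * m;  /* Address_A */
                unsigned address_b = i * l + col;  /* Address_B */
                acc += (int32_t)ar0[address_a] * (int32_t)ar1[address_b];
            }
            ar2[row * l + col] = acc;              /* Add_out, with WE = 1 */
        }
    }
}
```

With AR0 = {3, -1, 2, 5}, AR1 = {4, 6} and n = 2, m = 2, l = 1, the model leaves 6 and 38 in AR2.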

13 EXAMPLE (continued): In our example the FSM will do the following. AR0: 0x0 = 3, 0x1 = -1, 0x2 = 2, 0x3 = 5. AR1: 0x0 = 4, 0x1 = 6. The Xilinx multiplier-accumulator computes 3*4 + (-1)*6 = 12 - 6 = 6 and 2*4 + 5*6 = 8 + 30 = 38. AR2: 0x0 = 6, 0x1 = 38.

14 EXAMPLE (continued): And indeed: [3 -1; 2 5] x [4; 6] = [6; 38] (matrix rows separated by semicolons).

15 Implementation (continued): The result matrix is saved in the IP's third address range. The IP informs the processor that the task is complete by asserting the finish bit, which the CPU polls. After the CPU reads finish bit = 1, it can read the result matrix from the IP.
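From the CPU's side, the whole sequence (write A and B, write the sizes with the start bit, poll the finish bit, read the result) might look like the sketch below. Several things here are assumptions, not facts from the slides: the struct matmul_ip and the function matmul_run are hypothetical names, the bus addresses behind the pointers are not given in the deck, the finish bit is assumed to be bit 0 of R1, and each input element is assumed to be written as a sign-extended 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU-side view of the IP. The pointers must be set to the
 * addresses assigned in the real system. */
typedef struct {
    volatile uint32_t *ar0;        /* matrix A, CPU writes only */
    volatile uint32_t *ar1;        /* matrix B, CPU writes only */
    const volatile uint32_t *ar2;  /* result matrix, CPU reads only */
    volatile uint32_t *r0;         /* start bit and sizes */
    const volatile uint32_t *r1;   /* finish bit (bit 0 is an assumption) */
} matmul_ip;

/* Runs one multiplication and returns how many times the finish bit was
 * polled while still 0. a is n x m, b is m x l, c is n x l, all row-major. */
static unsigned long matmul_run(const matmul_ip *ip,
                                const int16_t *a, const int16_t *b, int32_t *c,
                                uint32_t n, uint32_t m, uint32_t l)
{
    assert(n >= 1u && n <= 127u);
    assert(m >= 1u && m <= 127u);
    assert(l >= 1u && l <= 127u);

    for (uint32_t k = 0; k < n * m; k++)
        ip->ar0[k] = (uint32_t)(int32_t)a[k];   /* sign-extend to 32 bits */
    for (uint32_t k = 0; k < m * l; k++)
        ip->ar1[k] = (uint32_t)(int32_t)b[k];

    /* Start = bit 21, N = bits 14-20, M = bits 7-13, L = bits 0-6. */
    *ip->r0 = (1u << 21) | (n << 14) | (m << 7) | l;

    unsigned long polls = 0;
    while ((*ip->r1 & 1u) == 0u)                /* busy-wait on finish bit */
        polls++;

    for (uint32_t k = 0; k < n * l; k++)
        c[k] = (int32_t)ip->ar2[k];
    return polls;
}
```

The returned poll count is the kind of figure the performance comparison later in the deck relies on. How the finish bit is cleared between runs is not described in the slides, so the sketch does not attempt it.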

16 The Verification Process: For sizes of up to 16x16, we validated the IP by allocating memory for matrices A and B and filling them with random values. The validation was simply a comparison between matrix C (the result) and matrix D (the expected result). When dealing with larger sizes we ran into a problem allocating large memories in software, so we did not allocate memory and instead used A[i][j] = i + j; B[i][j] = i - j; and compared the IP's output with the known result.
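With A[i][j] = i + j and B[i][j] = i - j, every expected result element can be computed on the fly, so no expected matrix has to be stored. The sketch below shows one way to do that; the function names are hypothetical, and the closed form is derived here rather than taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Expected C[i][j] for A[i][k] = i + k and B[k][j] = k - j, k = 0..m-1.
 * Computed term by term, without storing A, B or C. */
static int32_t expected_entry(int32_t i, int32_t j, int32_t m)
{
    assert(m >= 1);
    int32_t sum = 0;
    for (int32_t k = 0; k < m; k++)
        sum += (i + k) * (k - j);
    return sum;
}

/* Same value in closed form. Since (i+k)(k-j) = (i-j)*k + k*k - i*j,
 * the sum is (i-j)*S1 + S2 - i*j*m, where S1 = 0 + 1 + ... + (m-1)
 * and S2 = 0^2 + 1^2 + ... + (m-1)^2. */
static int32_t expected_entry_closed(int32_t i, int32_t j, int32_t m)
{
    assert(m >= 1);
    int32_t s1 = m * (m - 1) / 2;
    int32_t s2 = (m - 1) * m * (2 * m - 1) / 6;
    return (i - j) * s1 + s2 - i * j * m;
}
```

A test loop can then read each C[i][j] back from the IP and compare it with expected_entry(i, j, M), using memory only for the value under test.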

17 Performance analysis: The number of clock cycles taken by the state machine is { [ (3*M + 2) x L ] + 2 } x N + 3 = 3*M*L*N + 2*L*N + 2*N + 3 = O(N*M*L). Total: O(N*M*L) clock cycles. Since we found it difficult to determine how many clock cycles the computation takes in software, we ran a software comparison that gives a good indication of how our hardware performs.
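The cycle-count formula is easy to evaluate for concrete sizes. The helper below is a direct transcription of the expression on the slide; the name fsm_cycles is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles taken by the state machine, per the slide's formula:
 * { [ (3*M + 2) * L ] + 2 } * N + 3. Sizes are limited to 1..127,
 * so the result always fits in 32 bits. */
static uint32_t fsm_cycles(uint32_t n, uint32_t m, uint32_t l)
{
    assert(n >= 1u && n <= 127u);
    assert(m >= 1u && m <= 127u);
    assert(l >= 1u && l <= 127u);
    return ((3u * m + 2u) * l + 2u) * n + 3u;
}
```

For the 2x2 by 2x1 example this gives 23 cycles, for 2x2 by 2x2 it gives 39, and for the largest case (N = M = L = 127) it gives 6,177,664.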

18 Performance analysis (continued): In software the calculation is: for (i = 0; i < N; i++) for (j = 0; j < L; j++) for (k = 0; k < M; k++) C[i][j] += A[i][k] * B[k][j]; In this implementation the CPU executes the loop body N*M*L times (loop iterations, not clock cycles)!

19 Performance analysis (continued): To compare this with our IP's performance, we counted the number of iterations of the while() loop in which the CPU waits for the finish signal. The following graph compares the number of CPU operations for square matrices of sizes 2x2 to 15x15.

20 Performance analysis: Comparison results. Conclusion: Our IP provides an excellent solution for applications that require many multiplications of large matrices!

21 Improvement suggestions: For better performance, additional multipliers can be added to the design, so that more numbers are multiplied in each cycle and the calculation finishes sooner. Using an interrupt instead of polling would also save valuable CPU time.

22 Thank you!

