
The DARPA Data Transposition Benchmark on a Reconfigurable Computer
Sreesa Akella, Duncan A. Buell, Luis E. Cordova, and Jeff Hammes
Department of Computer Science and Engineering, University of South Carolina

DARPA Data Transposition Benchmark
Let {A_i} be a stream of n-bit integers of length L. Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits such that bit b_ij is interchanged with bit b_ji.

    n       L       ITER
    32      10^7    400
    64      10^7    230
    1024    10^7    12

Software Implementation
- Written in C, using a two-loop structure.

Timing Results
    Benchmark       Total User Time   No. of Iterations   Time per Iteration
    n = 32-bit      5609              400                 14.02
    n = 64-bit      8160              230                 35.47
    n = 1024-bit    2004              12                  187.67

SRC-6 Reconfigurable Computer

SRC-6 Implementations
- The SRC implementation was done in two ways:
  - Transposition function in C: C Map.
  - Transposition function in Verilog: Verilog Map.

SRC-6 C Map Implementation
- The main program calls a C map function.
- The parameters passed are A and E: A holds the input values, E holds the output values.
- The two-loop structure was used for transposition.
- The implementation was slower than software.

    // Assigning values
    for (i = 0; i < m; i++) {
        fscanf(in, "%lld", &temp);
        A[i] = temp;
        E[i] = 0;
    }
    for (j = 0; j < 230; j++) {
        for (k = 0; k < nblocks; k++)
            // assign values in blocks of half
            // the bank capacity
            // call map function
            dt(A, E, m, &time, 0);
        ….
    }

Modifications to the C Map Implementation
- Parallel sections for computation and data transfer.
- Unrolled the inner loop.
- In n cycles we get all n outputs.
- In n cycles we read these n values back to memory.
- All benchmarks were implemented.

Timing Results
    Benchmark       Total User Time   No. of Iterations   Time per Iteration
    n = 32-bit      244               400                 0.61
    n = 64-bit      129               230                 0.56
    n = 1024-bit    97                12                  8.08

SRC-6 Verilog Map Implementation
- The main program calls the map function.
- The map function calls a Verilog macro.
- The Verilog macro implements the transposition.
- Performance was better than the C Map implementation.

Timing Results
    Benchmark       Total User Time   No. of Iterations   Time per Iteration
    n = 32-bit      179               400                 0.44
    n = 64-bit      98                230                 0.42
    n = 1024-bit    72.7              12                  6.05

Parallel 3-unit Implementation
- Utilizes all 6 available memory banks: 3 for input and 3 for output.
- Only one macro call from the map function.
- The Verilog macro has 3 units working in parallel.
- Theoretically a 3 times computational speedup; overall a twice speedup.

[Figure: OBMs A, B, C; FPGA units U1, U2, U3; OBMs D, E, F]

Timing Results
    Benchmark       Total User Time   No. of Iterations   Time per Iteration
    n = 32-bit      95                400                 0.23
    n = 64-bit      60                230                 0.26
    n = 1024-bit    44                12                  3.66

128-bit Data Transfer Implementation
- 128-bit word transfers to 4 OBMs.
- Effectively a 2 words per cycle transfer.
- Transposition:
  - 2 units for 32-bit and 64-bit; 4 units for 1024-bit.
  - 32-bit: read 8 words from 4 banks and use 4-bit shifts.
  - 64-bit: read 4 words from 4 banks and use 2-bit shifts.
  - 1024-bit: read 4 words and use 4 units in parallel.
- 4 OBMs for input and 2 for output.
- 2 memory loop dependency cycles added to latency.

[Figure: OBMs A, B, C, D; FPGA units U1, U2 (each performs 2 or 4-bit shifts); OBMs E, F]

Timing Results
    Benchmark       Total User Time   No. of Iterations   Time per Iteration
    n = 32-bit      55                400                 0.13
    n = 64-bit      54                230                 0.23
    n = 1024-bit    30                12                  2.50

Performance Analysis
Speedup over software*
    Benchmark    A     B     C     D     E
    32-bit       15    21    41    46    68
    64-bit       25    33    55    52    61
    1024-bit     23    31    51    75

* A: C Map, B: Verilog Map, C: Parallel 3-unit, D: 128-bit, E: Parallel 2-unit 128-bit

Analysis: Parallelism
- Parallel 3-unit: 32-bit: 30%, 64-bit: 53%, 1024-bit: 47%
- Parallel 2-unit 128-bit: 32-bit: 26%, 64-bit: 40%, 1024-bit: 59%
- Can have more parallel units:
  - Will lead to bank conflicts.
  - More memory banks: run out of I/O pins on the FPGA.

Conclusions
- The SRC-6 computer provides great speedup: 75 times for the 1024-bit benchmark.
- Parallelism was exploited to a certain degree.
- Could explore:
  - Highly parallel multi-PE architectures
  - Distributed memory architecture

COMPUTER SCIENCE & ENGINEERING MAPLD 2005/243
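
The benchmark definition and the two-loop software approach mentioned in the slides can be illustrated with a minimal C sketch. This is not the authors' code: the function names `transpose_block` and `transpose_stream` are hypothetical, the sketch is limited to n <= 64 so that one row fits in a `uint64_t`, it assumes bit 0 (the least significant bit) is column 0, and it assumes the stream length is a multiple of n (any trailing partial block is simply left unprocessed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Transpose one n x n bit matrix held as n words (assumption: n <= 64).
 * Row i is in[i]; column j is bit j of a word, with bit 0 as the LSB.
 * After the call, bit i of out[j] equals bit j of in[i]. */
static void transpose_block(const uint64_t *in, uint64_t *out, unsigned n)
{
    assert(n >= 1 && n <= 64);
    for (unsigned j = 0; j < n; j++) {          /* outer loop: output rows  */
        uint64_t row = 0;
        for (unsigned i = 0; i < n; i++)        /* inner loop: gather bit j */
            row |= ((in[i] >> j) & UINT64_C(1)) << i;
        out[j] = row;
    }
}

/* Apply transpose_block to each successive block of n words in a stream
 * of len words. Returns the number of full blocks processed. in and out
 * must not overlap. */
static size_t transpose_stream(const uint64_t *in, uint64_t *out,
                               size_t len, unsigned n)
{
    size_t nblocks = len / n;
    for (size_t k = 0; k < nblocks; k++)
        transpose_block(in + k * n, out + k * n, n);
    return nblocks;
}
```

For the 32-bit and 64-bit benchmarks, a caller would pass n = 32 or n = 64 and repeat the stream pass ITER times. The 1024-bit case would need multi-word rows, which this sketch does not cover.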

