Resource Saving in Micro-Computer Software & FPGA Firmware Designs Wu, Jinyuan Fermilab Nov. 2006.

1 Resource Saving in Micro-Computer Software & FPGA Firmware Designs Wu, Jinyuan Fermilab Nov. 2006

2 Resource Saving in FPGA From: “CompactFPGAdesign.pdf” Glue Logic Digitization –TDC, (ADC), etc. Communication –C5, Digital Phase Follower, etc. Data Organization –Zero-Suppression, Parasitic Event Building, etc. Reconfigurable Computing –Hash Sorter, TTF, ELMS, etc. Software -- Firmware

3 Computer Is Fast This is the first impression of many beginners. “FPGA is big.” Program Creation Time > Execution Time

4 How to Slow Down Computers? Square Wave Generator. CPU: Z80 at 4 MHz. “LD A,A” = “NOOP”. 1 NOOP takes 1 µs; 1,000,000 NOOPs take 1 s.
Single Layer Loop: –256 x 3 x 4 x 0.25 µs = 0.75 ms
        LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA
Nested Loops: –256 x 0.75 ms = 0.19 s
        LD B,#255
BACKB:  LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA
        LD A,B
        DEC B
        DEC A
        JP NZ, BACKB
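The loop timings on this slide can be reproduced with a few lines of C. This is only a sketch of the slide's arithmetic: it assumes, as the slide does, that each of the 3 loop instructions costs 4 clocks of 0.25 µs and that the loop runs 256 times. The constant and function names are made up for illustration.

```c
/* Delay estimate for the Z80 loops, in nanoseconds.
   Assumption taken from the slide: every instruction costs
   4 clocks, and one clock at 4 MHz lasts 250 ns. */
enum {
    CLOCK_NS = 250,
    CLOCKS_PER_INSTR = 4,
    INSTR_PER_PASS = 3,   /* NOOP, DEC A, JP NZ */
    PASSES = 256
};

long single_loop_ns(void)
{
    return (long)PASSES * INSTR_PER_PASS * CLOCKS_PER_INSTR * CLOCK_NS;
}

long nested_loop_ns(void)
{
    /* The outer loop repeats the inner loop 256 times; the few
       outer-loop instructions are ignored, as on the slide. */
    return (long)PASSES * single_loop_ns();
}
```

Under these assumptions single_loop_ns() gives 768,000 ns (about 0.75 ms) and nested_loop_ns() gives 196,608,000 ns (about 0.19 s).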

5 Knowing Slow, Knowing Fast Where Resources Can Be Saved For micro-computer software: –Pay attention to loops and frequently called subroutines, –Especially inner-most nested loops. For FPGA firmware: –Algorithms rooted in micro-computer software. –Reusable blocks. –Occasionally used functions.

6 Example: Inner-Product
Avoid using conditional branch for loop control: ELMS –Saves 25% execution time in this case.
Multiplier-less algorithms.
Reuse computations: Using fast algorithms like FFT.
Avoid entering the loop: Using early constraints.
        LD   R1, #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1

7 Computing Module in Micro-processor & FPGA Micro-processors use a fully sequenced approach: one operation is performed in each clock cycle. In an FPGA, flattened logic is allowed and is fast, but takes a large silicon area. Example: (100+3-4)*5+7 = ? Data: 100, 3, 4, 5, 7. Control: LD (-) (+) (*) (+)
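To make the sequenced approach concrete, the sketch below evaluates (100+3-4)*5+7 one operation per step on a single accumulator, driven by a control stream and a data stream. The opcode names and the order of the control stream (load, add, subtract, multiply, add) are assumptions chosen to match the expression, not taken from the slide's figure.

```c
/* Hypothetical opcodes for a single-accumulator sequencer. */
typedef enum { OP_LD, OP_ADD, OP_SUB, OP_MUL } Op;

/* Apply one operation per step, the way a fully sequenced
   design performs one operation per clock cycle. */
long run_sequence(const Op *ctrl, const long *data, int n)
{
    long acc = 0;
    int i;
    for (i = 0; i < n; i++) {
        switch (ctrl[i]) {
        case OP_LD:  acc = data[i];  break;
        case OP_ADD: acc += data[i]; break;
        case OP_SUB: acc -= data[i]; break;
        case OP_MUL: acc *= data[i]; break;
        }
    }
    return acc;
}

/* The slide's example: (100+3-4)*5+7. */
long slide_example(void)
{
    static const Op ctrl[] = { OP_LD, OP_ADD, OP_SUB, OP_MUL, OP_ADD };
    static const long data[] = { 100, 3, 4, 5, 7 };
    return run_sequence(ctrl, data, 5);
}
```

One arithmetic unit is reused for all five steps, which is the trade the slide describes: five clock cycles instead of one, in exchange for much less logic than a flattened circuit with separate adders and a multiplier.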

8 Sequencing in FPGA for Resource Control Sequencing is a very efficient means of resource control in FPGA. Reuse processing resources for similar functions and/or different channels. Pay attention to occasionally-used functions like initialization. [Figure: Sum1 to Sum4 processing steps for channels CH0 to CH3, shown together with the initialization steps.]

9 Suggestion (1) Use partially flattened and partially sequential logic to balance speed and size.

10 ELMS– Enclosed Loop Micro-Sequencer A PC+ROM structure can be a very good sequencer in FPGA. The Conditional Branch Logic is added to support regular conditional branches as in micro-processors. The Loop & Return Logic + Stack are added to support FOR loops with pre-defined iterations at machine code level. The resource usage of ELMS in FPGA is very small. [Figure: a plain sequencer (Program Counter, ROM 128 x 36 bits, Reset, CLK, Control Signals) next to the ELMS, which adds Conditional Branch Logic and Loop & Return Logic + Stack.]
        FOR  BckA1 EndA1 #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6

11 ELMS– Detailed Block Diagram

12 FOR Loops at Machine Code Level The looping sequence is known in this example before entering the loop. Regular micro-processors treat the sequence as unknown. ELMS supports FOR loops with pre-defined iterations at machine code level.
Regular micro-processor:
        LD   R1, #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1
ELMS:
        FOR  BckA1 EndA1 #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
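The saving from dropping DEC and BRNZ can be estimated by counting instruction slots. The sketch below assumes one instruction per clock cycle, which is a simplification and not a property stated for any particular processor. The function names are illustrative.

```c
/* Instruction slots for the inner-product loop with software
   loop control. Assumes one instruction per clock cycle. */
long cycles_branch_loop(long n)
{
    const long setup = 4;  /* LD R1, LD R2, LD R3, LD R7 */
    const long body  = 8;  /* LD, INC, LD, INC, MUL, ADD, DEC, BRNZ */
    return setup + n * body;
}

/* Same loop with the iteration count handled by the FOR
   instruction, so DEC and BRNZ disappear from the body. */
long cycles_for_loop(long n)
{
    const long setup = 4;  /* FOR, LD R2, LD R3, LD R7 */
    const long body  = 6;  /* LD, INC, LD, INC, MUL, ADD */
    return setup + n * body;
}
```

The loop body shrinks from 8 to 6 instructions, so for large n the execution time drops by 2/8 = 25%, consistent with the figure quoted for the inner-product example.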

13 Suggestion (2) Eliminate unnecessary instructions, functions, time slots, etc. whenever possible.

14 Do You SUDOKU? Fill in 1-9 so that: –Each column contains 1-9 without repeating. –Each row contains 1-9 without repeating. –Each 3x3 box contains 1-9 without repeating. It is fun to solve by hand. It is also fun to write a solver program, or read a good one. [Figure: example SUDOKU puzzle grid.]

15 A Possible SUDOKU Solver? For all empty boxes, assign 1-9 to each. Check whether the result is correct. If not, repeat. [Figure: example SUDOKU puzzle grid.] 81-28=53 empty boxes. 9 possibilities for each box. Total possibilities: $9^{53}$. Assume a computer checks $10^{10}$ possibilities/sec. A year = $3\times10^{7}$ sec. Total time to solve: $9^{53}/(10^{10}\times3\times10^{7}) \gg 1000$ years
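The brute-force estimate can be checked numerically. The sketch below uses the slide's assumed rate of $10^{10}$ checks per second and $3\times10^{7}$ seconds per year. The function name is hypothetical.

```c
/* Years needed to try every assignment of 1-9 to the empty
   boxes, at the slide's assumed rate of 1e10 checks per second
   and 3e7 seconds per year. */
double brute_force_years(int empty_boxes)
{
    double possibilities = 1.0;
    int i;
    for (i = 0; i < empty_boxes; i++)
        possibilities *= 9.0;          /* 9 choices per empty box */
    return possibilities / (1e10 * 3e7);
}
```

For 53 empty boxes the result is on the order of $10^{33}$ years, far beyond the 1000-year mark used on the slide.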

16 A Real SUDOKU Solver Eliminate impossible values for each empty box. Assign a possible value to the box. Repeat. [Figure: example SUDOKU puzzle grid.] Total time to solve: < 1 sec

17 sudoku.c

#include <stdio.h>

int solve(int b[9][9]);

/* show_board() -- print the board; empty boxes are shown as blanks */
void show_board(int b[9][9])
{
    int i, j;

    printf("+-------+-------+-------+\n");
    for (i = 0; i < 9; i++) {
        printf("|");
        for (j = 0; j < 9; j++) {
            if (b[i][j] == 0)
                printf("  ");
            else
                printf(" %d", b[i][j]);
            if (j % 3 == 2)
                printf(" |");
        }
        printf("\n");
        if (i % 3 == 2)
            printf("+-------+-------+-------+\n");
    }
}

/* init_board() -- initialize the board with all 0 */
void init_board(int b[9][9])
{
    int i, j;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b[i][j] = 0;
}

/* read_board() -- read the board from input file;
   digits 1-9 are givens, any other character is an empty box */
void read_board(FILE *fp, int b[9][9])
{
    int i = 0, j = 0, c;

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            i++;
            j = 0;
        } else {
            /* ignore anything outside the 9x9 grid */
            if (i < 9 && j < 9 && c >= '1' && c <= '9')
                b[i][j] = c - '0';
            j++;
        }
    }
}

/* check_row() -- check the row */
int check_row(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != y)
            if (b[x][i] == v)
                return 0;
    return v;
}

/* check_column() -- check the column */
int check_column(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != x)
            if (b[i][y] == v)
                return 0;
    return v;
}

/* check_square() -- check the square */
int check_square(int b[9][9], int x, int y, int v)
{
    int i, j, x0, y0;

    x0 = x / 3;
    y0 = y / 3;
    for (i = x0 * 3; i < x0 * 3 + 3; i++)
        for (j = y0 * 3; j < y0 * 3 + 3; j++)
            if (!((x == i) && (y == j)))
                if (b[i][j] == v)
                    return 0;
    return v;
}

/* unique_solution() -- find the unique solution for [x, y];
   returns 0 unless exactly one value is possible */
int unique_solution(int b[9][9], int x, int y)
{
    int s = 0, n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s = v;
            n++;
        }
    }
    if (n == 1)
        return s;
    else
        return 0;
}

/* possible_solutions() -- find the possible solutions for [x, y] */
int possible_solutions(int b[9][9], int x, int y, int s[])
{
    int n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s[n++] = v;
        }
    }
    return n;
}

int main(int argc, char **argv)
{
    int board[9][9];
    FILE *fp;

    if (argc > 1) {
        fp = fopen(argv[1], "r");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }
    } else {
        fp = stdin;
    }
    init_board(board);
    read_board(fp, board);
    show_board(board);
    solve(board);
    return 0;
}

/* solve1() -- one pass to solve the puzzle;
   returns the number of boxes filled in this pass */
int solve1(int b[9][9])
{
    int i, j;
    int solved = 0;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            if (b[i][j] == 0) {
                b[i][j] = unique_solution(b, i, j);
                if (b[i][j])
                    solved++;
            }
    return (solved);
}

/* solve() -- fill forced boxes, then guess in the box with the
   fewest candidates and recurse; returns 1 if solved */
int solve(int b[9][9])
{
    int b2[9][9], i, j, k, n;
    int ps[9], s[9], pn, x = 0, y = 0;

    /* copy the board for recursion */
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b2[i][j] = b[i][j];
    while (solve1(b2)) {
        show_board(b2);
    }
    /* figure out possible solutions for the unknowns */
    pn = 10;
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++) {
            if (b2[i][j] == 0) {
                for (k = 0; k < 9; k++)
                    s[k] = 0;
                n = possible_solutions(b2, i, j, s);
                if (n < pn) {
                    pn = n;
                    for (k = 0; k < n; k++)
                        ps[k] = s[k];
                    x = i;
                    y = j;
                }
            }
        }
    if (pn == 10) { /* that's it: no empty box is left */
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b2[i][j] == 0)
                    return 0;
        return 1;
    }
    /* try each candidate; pn == 0 means a dead end */
    for (i = 0; i < pn; i++) {
        b2[x][y] = ps[i];
        show_board(b2);
        if (solve(b2)) {
            return 1;
        }
    }
    return 0;
}

18 A Possible Track Finder? Choose a hit for each layer. Fit and calculate $\chi^2$. Cut on $\chi^2$. 10 layers: $O(n^{10})$. 100 hits/layer. Total possibilities: $10^{20}$. Assume a computer checks $10^{10}$ possibilities/sec. A year = $3\times10^{7}$ sec. Total time to check all possibilities: $10^{20}/(10^{10}\times3\times10^{7}) > 300$ years

19 A Better Track Finder Choose a hit for each of layers 1 and 2. Choose only compatible hits on layers 3 to 10. Calculate $\chi^2$. Cut on $\chi^2$. First constraint at layer 3: $O(n^{3})$. 100 hits/layer. Total possibilities: $10^{6}$+. Assume a computer checks $10^{10}$ possibilities/sec. Total time to check all possibilities: $10^{6}/10^{10}$ sec, i.e. > 0.1 ms
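A minimal sketch of the early-constraint idea follows. It assumes straight tracks, equally spaced layers and integer hit positions, none of which is specified on the slide. The names LAYERS, MAX_HITS, has_hit_near and count_tracks are invented for the example. A seed made of one hit on each of the first two layers predicts a position on every later layer, and the seed is dropped at the first layer that has no compatible hit.

```c
#include <stdlib.h>

#define LAYERS   10
#define MAX_HITS 100

/* Returns 1 if the layer has a hit within tol of position x. */
static int has_hit_near(const int *hits, int n, int x, int tol)
{
    int i;
    for (i = 0; i < n; i++)
        if (abs(hits[i] - x) <= tol)
            return 1;
    return 0;
}

/* Count seeds (one hit on layer 0, one on layer 1) that are
   confirmed by a compatible hit on every remaining layer.
   Assumes straight tracks and equally spaced layers. */
int count_tracks(int hits[LAYERS][MAX_HITS], const int nhits[LAYERS], int tol)
{
    int i, j, k, found = 0;

    for (i = 0; i < nhits[0]; i++)
        for (j = 0; j < nhits[1]; j++) {
            int slope = hits[1][j] - hits[0][i];
            for (k = 2; k < LAYERS; k++) {
                int x = hits[0][i] + k * slope;  /* predicted position */
                if (!has_hit_near(hits[k], nhits[k], x, tol))
                    break;                        /* early constraint */
            }
            if (k == LAYERS)
                found++;
        }
    return found;
}
```

Most wrong seeds are rejected at the third layer, so the work is dominated by the seed loops plus one layer scan rather than by all ten nested layers. A full fit and $\chi^2$ cut would then run only on the few surviving candidates.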

20 Suggestion (2) Use early constraints to reduce number of iterations. Evaluate the first constraint as simply as possible. (e.g. Offset, rather than $\chi^2$) Apply the first constraint as early as possible. (e.g. At layer 3, not until 10)

21 Triplets Triplet: –Data item with 2 free parameters. –# of measurements - # of constraints = 2. –A triplet is not necessarily a straight track segment. –A triplet may have more than 3 measurements. A circular track with a known interaction point is a triplet since it has 2 free parameters. (Otherwise it has 3 parameters.)

22 Triplet Finding Triplet finding can be done in software or in firmware. Tiny Triplet Finder (TTF) is a firmware implementation developed for BTeV at Fermilab. Tiny = small silicon usage. For more info on TTF, see handout. Triplet finding: –Software processes: $O(n^{3})$. –FPGA firmware functions: $O(n)$. –$O(N^{2})$ implementations: CAM, Hough Trans., etc. –$O(N\log N)$ implementation: Tiny Triplet Finder.

23 DFT and FFT Why log(N)? –Information propagation –Multiplication reuse of rotational factors DFT: $O(N^{2})$. FFT: $O(N\log N)$.

24 FFT for Arbitrary Precision Multiplications Multiplication of two very long integers consumes $O(N^{2})$ computation. It can be viewed as a convolution. Convolutions can be computed using FFT with $O(N\log N)$ computation.
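The convolution view can be seen in a plain schoolbook multiplier. The sketch below is not an FFT implementation. It only separates the $O(N^{2})$ convolution step, which an FFT-based method would replace, from the cheap carry propagation. The digit layout (base 10, least significant digit first) and the function names are choices made for this example.

```c
/* Multiply two numbers stored as base-10 digits, least
   significant digit first. out must hold na + nb digits. */
void multiply_digits(const int *a, int na, const int *b, int nb, int *out)
{
    int i, j, carry = 0;

    for (i = 0; i < na + nb; i++)
        out[i] = 0;

    /* Step 1: linear convolution of the two digit sequences,
       O(na * nb). This is the step an FFT would speed up. */
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];

    /* Step 2: carry propagation, O(na + nb). */
    for (i = 0; i < na + nb; i++) {
        out[i] += carry;
        carry = out[i] / 10;
        out[i] %= 10;
    }
}

/* Helper for small checks: convert digits back to a long. */
long digits_to_long(const int *d, int n)
{
    long v = 0;
    int i;
    for (i = n - 1; i >= 0; i--)
        v = v * 10 + d[i];
    return v;
}
```

For example, multiplying the digit arrays {3,2,1} and {6,5,4} (123 and 456) yields digits that convert back to 56088.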

25 Suggestion (3) Take advantage of fast (like FFT) or tiny (like Tiny Triplet Finder) algorithms.

26 Multiplier-less (ML) Approaches Canonic signed digit (CSD) and sum of powers of two (SOPOT) representations: –5xA = 4xA + A, 248xA = 256xA - 8xA Recursive implementation of finite impulse response (FIR) filter: –Sliding sum, sinc2, etc. CORDIC or similar algorithms: –ML FFT, rotators, etc. Distributed Arithmetic (DA) designs: –Look-up tables. Single-bit sinc3 FIR decimation filter: –In delta-sigma ADC
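Two of the approaches listed above are easy to sketch in C: constant multiplication by shifts and adds, using the slide's own examples 5xA and 248xA, and a sliding sum computed recursively so that each output costs one addition and one subtraction regardless of window length. The function names are illustrative, and unsigned arithmetic is used to keep the shifts well defined.

```c
#include <stdint.h>

/* 5*A = 4*A + A: one shift and one addition. */
uint32_t times5(uint32_t a)
{
    return (a << 2) + a;
}

/* 248*A = 256*A - 8*A: two shifts and one subtraction,
   instead of the five terms of 128+64+32+16+8. */
uint32_t times248(uint32_t a)
{
    return (a << 8) - (a << 3);
}

/* Recursive sliding sum over the last len samples:
   y[i] = y[i-1] + x[i] - x[i-len]. */
void sliding_sum(const int *x, int n, int len, int *y)
{
    int i, acc = 0;

    for (i = 0; i < n; i++) {
        acc += x[i];
        if (i >= len)
            acc -= x[i - len];   /* drop the sample leaving the window */
        y[i] = acc;
    }
}
```

In an FPGA the shifts cost only wiring, so each constant multiplication reduces to a single adder or subtractor.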

27 Least-Square (LS) Track Fitter Standard least-squares fitting uses a large number of multiplications and possibly divisions.

28 Multiplier-less (ML) Track Fitter The coefficients are scaled to avoid using dividers. The coefficients for the ML approximate fitting algorithm are “two-bit” integers. The full multiplications are replaced by two integer shift-additions.

29 Errors of LS and ML Track Fitters The errors of the ML approximate fitting algorithm are only slightly larger than the LS fitting errors.

30 Errors of Several Track Fitters Generally speaking, more computation yields better results. However, after a certain point, the quality of the results no longer improves as rapidly as before. It is common that a large amount of extra computation brings only a small improvement in the mathematically perfect algorithms.

31 Suggestion (4) Consider resource/power friendly algorithms such as multiplier-less, divider-less algorithms.

32 Why Save Resources?

33 Moore’s Law Number of transistors in a package: x2 every 18 months. Taken from www.intel.com

34 The Fever of Moore’s Law vs. Maxwell Equations During the fever of Moore’s law, saving computing resources became non-critical, if not impossible. From basic principles like the Maxwell equations, it was known that the fever would not last. [Figure: Op/sec versus year, 1998 to 2010. MIT, 2002.]

35 Moore’s Law Today # of transistors –Yes, via multi-core. Clock Speed –? Taken from www.intel.com

36 Total Useful Work = (Clock Frequency) x (Silicon Size) x (Efficiency) There is big room for improvement in computation efficiency in both micro-computer software and FPGA firmware. Resource saving helps today, when technology stalls. Resource saving helps in the future, as technology progresses.

37 Resource Saving Helps Future Where Resources Can Be Saved Today’s subroutines or FPGA blocks are to be reused thousands of times in the future: –If today’s design is slightly too slow, too big… Today’s students as well as old people gain experience from today’s work and become bosses, reviewers, etc. in the future: –The “experience” (?) –E. g.: Is a wedding with $20K budget possible? (Given the “experience” of $1000/pizza?).

38 The End Thanks

39 Triplet Finding Three layers of nested loops are needed if the process is implemented in software. A total of $n^{3}$ combinations must be checked (e.g. 5x5x5=125). In FPGA, to “unroll” 2 layers of loops, large silicon resources may be needed without careful planning: $O(N^{2})$. Planes: Plane A, Plane B, Plane C.
for (i=0; i<N_A; i++){
  for (j=0; j<N_B; j++){
    for (k=0; k<N_C; k++){
    }
  }
}
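A filled-in version of the three nested loops might look like the sketch below. The coincidence condition is an assumption made for illustration: straight tracks through three equally spaced planes with integer hit positions, so a valid triplet satisfies a + c = 2b. The function name is hypothetical.

```c
/* Brute-force triplet finding over three planes.
   Assumes straight tracks and equally spaced planes, so hits
   a, b, c on planes A, B, C line up when a + c == 2 * b. */
int count_triplets_bruteforce(const int *ha, int na,
                              const int *hb, int nb,
                              const int *hc, int nc)
{
    int i, j, k, found = 0;

    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            for (k = 0; k < nc; k++)
                if (ha[i] + hc[k] == 2 * hb[j])
                    found++;    /* all na*nb*nc combinations are tested */
    return found;
}
```

With hits {3, 10} on plane A, {5, 12} on plane B and {7, 14, 20} on plane C, all 12 combinations are tested and the triplets (3, 5, 7) and (10, 12, 14) are found.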

40 Circular Tracks from Collision Point on Cylindrical Detectors For a given hit on layer 3, the coincidence between a layer 2 hit and a layer 1 hit satisfying the coincidence map signifies a valid circular track. A track segment has 2 free parameters, i.e., it is a triplet. The coincidence map is invariant under rotation. [Figure: coincidence map with axes $(\phi_1-\phi_3)+64$ and $(\phi_2-\phi_3)+64$.]

41 Tiny Triplet Finder Reuse Coincident Logic via Shifting Hit Patterns One set of coincident logic is implemented. For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence. [Figure: detector layers C1, C2, C3.]

42 Tiny Triplet Finder for Circular Tracks 1. Fill the C1 and C2 bit arrays. (n1 clock cycles) 2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles) Also works with more than 3 layers. [Figure: two Bit Array Shifters, scaled by *R1/R3 and *R2/R3, feeding Bit-wise Coincident Logic with a Triplet Map; output goes to a Decoder.]
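The shift-and-coincide idea can be imitated in software with machine words standing in for the bit arrays. The sketch below is a simplified illustration, not the circular-track design from the slides. It assumes straight tracks through three equally spaced planes A, B, C, with hit positions 0 to 31, so a triplet satisfies a + c = 2b. The type and function names are invented. Plane A is stored bit-reversed, so that one shift per B hit lines up every (a, c) pair that is symmetric about b, and a single AND tests all slopes at once.

```c
#include <stdint.h>

#define W 32   /* number of positions per plane (assumption) */

typedef struct { int a, b, c; } Triplet;

/* Find triplets with a + c == 2*b using shifted bit arrays.
   Hit positions must lie in 0..W-1. Returns the number found
   and stores at most max of them in out. */
int find_triplets(const int *ha, int na, const int *hb, int nb,
                  const int *hc, int nc, Triplet *out, int max)
{
    uint64_t arev = 0, cbits = 0;
    int i, t, found = 0;

    /* Step 1: fill the bit arrays (plane A stored reversed). */
    for (i = 0; i < na; i++)
        arev |= (uint64_t)1 << (W - 1 - ha[i]);
    for (i = 0; i < nc; i++)
        cbits |= (uint64_t)1 << hc[i];

    /* Step 2: loop over B hits, shift, and AND. Bit t of the
       result is set when A has a hit at b - s and C has a hit
       at b + s, where s = t - (W - 1) is the slope. */
    for (i = 0; i < nb; i++) {
        int b = hb[i];
        uint64_t m = (arev << b) & (cbits << (W - 1 - b));

        /* Decoder: turn set bits back into hit positions. */
        for (t = 0; t < 2 * W - 1; t++) {
            if ((m >> t) & 1) {
                int s = t - (W - 1);
                if (found < max) {
                    out[found].a = b - s;
                    out[found].b = b;
                    out[found].c = b + s;
                }
                found++;
            }
        }
    }
    return found;
}
```

The outer work is one pass to fill the arrays and one pass over the B hits, mirroring the two numbered steps on the slide. The per-hit AND replaces the two inner loops of a brute-force search.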

