Resource Saving in Micro-Computer Software & FPGA Firmware Designs Wu, Jinyuan Fermilab Nov. 2006.



Resource Saving in FPGA
From: “CompactFPGAdesign.pdf”
Glue Logic
Digitization
–TDC, (ADC), etc.
Communication
–C5, Digital Phase Follower, etc.
Data Organization
–Zero-Suppression, Parasitic Event Building, etc.
Reconfigurable Computing
–Hash Sorter, TTF, ELMS, etc.
[Diagram: spectrum from Software to Firmware]

Computer Is Fast
This is the first impression of many beginners.
“FPGA is big.”
Program Creation Time > Execution Time

How to Slow Down Computers?
Square Wave Generator
CPU: Z80, 4 MHz. “LD A,A” = “NOOP”.
1 NOOP spends 1 µs; 1,000,000 NOOPs spend 1 s.
Single Layer Loop:
–256 x 3 x 4 x 0.25 µs ≈ 0.77 ms
           LD A,#255
    BACKA: NOOP
           DEC A
           JP NZ, BACKA
Nested Loops:
–256 x 0.77 ms ≈ 0.2 s
           LD B,#255
    BACKB: LD A,#255
    BACKA: NOOP
           DEC A
           JP NZ, BACKA
           LD A,B
           DEC B
           DEC A
           JP NZ, BACKB
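The loop-delay arithmetic on this slide can be checked with a few lines of C. This is only a sketch of the estimate as the slide states it (iterations x instructions per iteration x clock cycles per instruction x clock period); the function name `loop_delay_us` and its parameters are made up for illustration, and real Z80 instructions do not all take the same number of cycles.

```c
/* Estimated delay of a counted loop, in microseconds.
 * Assumes every instruction takes the same number of clock cycles,
 * which is the simplification used on the slide. */
double loop_delay_us(long iterations, int instr_per_iter,
                     int cycles_per_instr, double clock_mhz)
{
    return (double)iterations * instr_per_iter * cycles_per_instr / clock_mhz;
}
```

For example, `loop_delay_us(256, 3, 4, 4.0)` gives the single-loop delay in microseconds, and multiplying that result by 256 gives the nested-loop estimate.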

Knowing Slow, Knowing Fast: Where Resources Can Be Saved
For micro-computer software:
–Pay attention to loops and frequently called subroutines,
–Especially inner-most nested loops.
For FPGA firmware:
–Algorithms rooted in micro-computer software.
–Reusable blocks.
–Occasionally used functions.

Example: Inner-Product
Avoid using conditional branch for loop control: ELMS
–Saves 25% execution time in this case.
           LD   R1, #n
           LD   R2, #addr_a
           LD   R3, #addr_X
           LD   R7, #0
    BckA1  LD   R4, (R2)
           INC  R2
           LD   R5, (R3)
           INC  R3
           MUL  R6, R4, R5
    EndA1  ADD  R7, R7, R6
           DEC  R1
           BRNZ BckA1
[Diagram: inner-product data path with registers R1 to R7, a multiplier and an adder]
Multiplier-less algorithms.
Reuse computations: Using fast algorithms like FFT.
Avoid entering the loop: Using early constraints.
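For readers more at home in C than in assembly, the listing on this slide corresponds roughly to the loop below. It is a sketch, not the author's code: the function name `inner_product` and the choice of `int` inputs with a `long` accumulator are assumptions. The comments map each part back to the registers in the listing.

```c
#include <stddef.h>

/* Inner product of a[0..n-1] and x[0..n-1]. */
long inner_product(const int *a, const int *x, size_t n)
{
    long sum = 0;                       /* R7: accumulator               */
    size_t i;

    for (i = 0; i < n; i++)             /* DEC R1 / BRNZ: loop control   */
        sum += (long)a[i] * x[i];       /* LD, LD, MUL, ADD: useful work */
    return sum;
}
```

In the assembly version, 2 of the 8 instructions in each iteration (DEC and BRNZ) only serve loop control, which is where the 25% figure comes from.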

Computing Module in Micro-processor & FPGA
Micro-processors use a full sequencing approach: one operation is performed in each clock cycle.
In FPGA, flattened logic is allowed and is fast, but it takes a large silicon area.
Example: (100-3+4)*5+7 = ?
Control: LD, (-), (+), (*), (+)
Data: 100, 3, 4, 5, 7

Sequencing in FPGA for Resource Control
Sequencing is a very efficient means of resource control in FPGA.
Reuse processing resources for similar functions and/or different channels.
Pay attention to occasionally-used functions like initialization.
[Diagram: Initialization followed by Sum1, Sum2, Sum3, Sum4 for each of CH0, CH1, CH2, CH3, shown for two layouts labeled Initialization1 and Initialization2]

Suggestion (1) Use partially flattened and partially sequential logic to balance speed and size.

ELMS: Enclosed Loop Micro-Sequencer
A PC+ROM structure can be a very good sequencer in FPGA.
The Conditional Branch Logic is added to support regular conditional branches as in micro-processors.
The Loop & Return Logic + Stack are added to support FOR loops with pre-defined iterations at machine code level.
The resource usage of ELMS in FPGA is very small.
[Diagram: a basic sequencer (Program Counter, ROM 128 x 36 bits, Reset, CLK, Control Signals) and the ELMS, which adds the Loop & Return Logic + Stack and the Conditional Branch Logic]
           FOR  BckA1 EndA1 #n
           LD   R2, #addr_a
           LD   R3, #addr_X
           LD   R7, #0
    BckA1  LD   R4, (R2)
           INC  R2
           LD   R5, (R3)
           INC  R3
           MUL  R6, R4, R5
    EndA1  ADD  R7, R7, R6

ELMS– Detailed Block Diagram

FOR Loops at Machine Code Level
The looping sequence is known in this example before entering the loop.
Regular micro-processors treat the sequence as unknown.
ELMS supports FOR loops with pre-defined iterations at machine code level.
Regular micro-processor:
           LD   R1, #n
           LD   R2, #addr_a
           LD   R3, #addr_X
           LD   R7, #0
    BckA1  LD   R4, (R2)
           INC  R2
           LD   R5, (R3)
           INC  R3
           MUL  R6, R4, R5
    EndA1  ADD  R7, R7, R6
           DEC  R1
           BRNZ BckA1
ELMS:
           FOR  BckA1 EndA1 #n
           LD   R2, #addr_a
           LD   R3, #addr_X
           LD   R7, #0
    BckA1  LD   R4, (R2)
           INC  R2
           LD   R5, (R3)
           INC  R3
           MUL  R6, R4, R5
    EndA1  ADD  R7, R7, R6
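To make the idea of a FOR loop handled by the sequencer itself more concrete, here is a toy C model of a program counter with a loop stack. Everything in it is an assumption made for illustration: the opcodes, the `struct` layouts, the stack depth and the function names are hypothetical and do not describe the real ELMS encoding or hardware. It only counts clock cycles, assuming one instruction per cycle.

```c
#define ELMS_STACK_DEPTH 4          /* hypothetical loop-stack depth */

/* Hypothetical opcodes: OP_WORK stands for any data-path instruction. */
enum elms_op { OP_WORK, OP_FOR, OP_HALT };

struct elms_instr {
    enum elms_op op;
    int start;                      /* OP_FOR: first address of loop body */
    int end;                        /* OP_FOR: last address of loop body  */
    int count;                      /* OP_FOR: iterations, must be >= 1   */
};

struct loop_frame { int start, end, remaining; };

/* Run the ROM and return the cycles used (HALT itself is not counted).
 * Returns -1 if loops nest deeper than the stack allows.
 * Limitation of this sketch: nested loops must not end on the same address. */
long elms_run(const struct elms_instr *rom, int rom_size)
{
    struct loop_frame stack[ELMS_STACK_DEPTH];
    int sp = 0, pc = 0;
    long cycles = 0;

    while (pc < rom_size && rom[pc].op != OP_HALT) {
        cycles++;
        if (rom[pc].op == OP_FOR) {
            if (sp == ELMS_STACK_DEPTH)
                return -1;
            stack[sp].start = rom[pc].start;
            stack[sp].end = rom[pc].end;
            stack[sp].remaining = rom[pc].count;
            sp++;
            pc++;
        } else if (sp > 0 && pc == stack[sp - 1].end) {
            /* End of loop body: jump back with no extra instruction. */
            if (--stack[sp - 1].remaining > 0) {
                pc = stack[sp - 1].start;
            } else {
                sp--;
                pc++;
            }
        } else {
            pc++;
        }
    }
    return cycles;
}

/* Cycle count for the inner-product program laid out as in the listing:
 * address 0 = FOR, 1..3 = setup, 4..9 = loop body, 10 = halt. */
long elms_inner_product_cycles(int n)
{
    struct elms_instr rom[11];
    int i;

    for (i = 0; i < 11; i++) {
        rom[i].op = OP_WORK;
        rom[i].start = rom[i].end = rom[i].count = 0;
    }
    rom[0].op = OP_FOR;
    rom[0].start = 4;
    rom[0].end = 9;
    rom[0].count = n;
    rom[10].op = OP_HALT;
    return elms_run(rom, 11);
}
```

In this model the loop costs 6 cycles per iteration plus 4 cycles of setup, compared with 8 instructions per iteration for the DEC/BRNZ version.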

Suggestion (2) Eliminate unnecessary instructions, functions, time slots, etc. whenever possible.

Do You SUDOKU?
Fill in 1-9 so that:
–Each column contains 1-9 without repeating.
–Each row contains 1-9 without repeating.
–Each 3x3 box contains 1-9 without repeating.
It is fun to solve by hand.
It is also fun to write a solver program, or read a good one.

A Possible SUDOKU Solver?
For all empty boxes, assign 1-9 to each.
Check correct or not. If not, repeat.
53 empty boxes, 9 possibilities for each box.
Total possibilities: 9^53.
Assume a computer checks 10^10 possibilities/sec.
A year = 3x10^7 sec.
Total time to solve: 9^53/(10^10 x 3x10^7) years >> 1000 years
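The size of this estimate is easy to reproduce. The helper below is a sketch with a made-up name (`brute_force_years`); it simply raises the number of choices to the number of open slots and divides by the checking rate and by the 3x10^7 seconds per year used on the slide.

```c
/* Years needed to try every combination of `choices` values in `slots`
 * positions at `checks_per_sec`, using the slide's 3e7 seconds per year. */
double brute_force_years(int slots, double choices, double checks_per_sec)
{
    const double seconds_per_year = 3.0e7;
    double combinations = 1.0;
    int i;

    for (i = 0; i < slots; i++)
        combinations *= choices;
    return combinations / (checks_per_sec * seconds_per_year);
}
```

`brute_force_years(53, 9.0, 1e10)` comes out on the order of 10^33 years, far beyond the slide's ">> 1000 years".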

A Real SUDOKU Solver
Eliminate impossible values for each empty box.
Assign a possible value to the box.
Repeat.
Total time to solve: < 1 sec

sudoku.c

    #include <stdio.h>

    int solve(int b[9][9]);

    /* show_board() -- print the board, one 3x3 box per column group */
    void show_board(int b[9][9])
    {
        int i, j;

        printf("+-------+-------+-------+\n");
        for (i = 0; i < 9; i++) {
            printf("|");
            for (j = 0; j < 9; j++) {
                if (b[i][j] == 0)
                    printf("  ");
                else
                    printf(" %d", b[i][j]);
                if (j % 3 == 2)
                    printf(" |");
            }
            printf("\n");
            if (i % 3 == 2)
                printf("+-------+-------+-------+\n");
        }
    }

    /* init_board() -- initialize the board with all 0 */
    void init_board(int b[9][9])
    {
        int i, j;

        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                b[i][j] = 0;
    }

    /* read_board() -- read the board from input file;
       one row per line, a space marks an empty box */
    void read_board(FILE *fp, int b[9][9])
    {
        int i = 0, j = 0, c;

        while ((c = fgetc(fp)) != EOF) {
            if (c == '\n') {
                i++;
                j = 0;
            } else {
                /* ignore anything outside the 9x9 board or not a digit 1-9 */
                if (i < 9 && j < 9 && c >= '1' && c <= '9')
                    b[i][j] = c - '0';
                j++;
            }
        }
    }

    /* check_row() -- check the row: 0 if v already appears, else v */
    int check_row(int b[9][9], int x, int y, int v)
    {
        int i;

        for (i = 0; i < 9; i++)
            if (i != y)
                if (b[x][i] == v)
                    return 0;
        return v;
    }

    /* check_column() -- check the column */
    int check_column(int b[9][9], int x, int y, int v)
    {
        int i;

        for (i = 0; i < 9; i++)
            if (i != x)
                if (b[i][y] == v)
                    return 0;
        return v;
    }

    /* check_square() -- check the 3x3 square containing [x, y] */
    int check_square(int b[9][9], int x, int y, int v)
    {
        int i, j, x0, y0;

        x0 = x / 3;
        y0 = y / 3;
        for (i = x0 * 3; i < x0 * 3 + 3; i++)
            for (j = y0 * 3; j < y0 * 3 + 3; j++)
                if (!((x == i) && (y == j)))
                    if (b[i][j] == v)
                        return 0;
        return v;
    }

    /* unique_solution() -- find the unique solution for [x, y];
       returns 0 if there is none or more than one */
    int unique_solution(int b[9][9], int x, int y)
    {
        int s = 0, n = 0, v;

        for (v = 1; v < 10; v++) {
            if (check_row(b, x, y, v) &&
                check_column(b, x, y, v) &&
                check_square(b, x, y, v)) {
                s = v;
                n++;
            }
        }
        if (n == 1)
            return s;
        else
            return 0;
    }

    /* possible_solutions() -- find the possible solutions for [x, y];
       stores them in s[] and returns how many there are */
    int possible_solutions(int b[9][9], int x, int y, int s[])
    {
        int n = 0, v;

        for (v = 1; v < 10; v++) {
            if (check_row(b, x, y, v) &&
                check_column(b, x, y, v) &&
                check_square(b, x, y, v)) {
                s[n++] = v;
            }
        }
        return n;
    }

    int main(int argc, char **argv)
    {
        int board[9][9];
        FILE *fp;

        if (argc > 1) {
            fp = fopen(argv[1], "r");
            if (fp == NULL) {
                perror(argv[1]);
                return 1;
            }
        } else {
            fp = stdin;
        }
        init_board(board);
        read_board(fp, board);
        show_board(board);
        solve(board);
        return 0;
    }

    /* solve1() -- one pass to solve the puzzle;
       returns the number of boxes filled in this pass */
    int solve1(int b[9][9])
    {
        int i, j;
        int solved = 0;

        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b[i][j] == 0) {
                    b[i][j] = unique_solution(b, i, j);
                    if (b[i][j])
                        solved++;
                }
        return (solved);
    }

    /* solve() -- fill forced boxes, then guess on the box with the
       fewest candidates and recurse; returns 1 if solved */
    int solve(int b[9][9])
    {
        int b2[9][9], i, j, k, n;
        int ps[9], s[9], pn, x = 0, y = 0;

        /* copy the board for recursion */
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                b2[i][j] = b[i][j];

        while (solve1(b2)) {
            show_board(b2);
        }

        /* figure out possible solution for unknown */
        pn = 10;
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++) {
                if (b2[i][j] == 0) {
                    for (k = 0; k < 9; k++)
                        s[k] = 0;
                    n = possible_solutions(b2, i, j, s);
                    if (n < pn) {
                        pn = n;
                        for (k = 0; k < n; k++)
                            ps[k] = s[k];
                        x = i;
                        y = j;
                    }
                }
            }

        if (pn == 10) { /* that's it */
            for (i = 0; i < 9; i++)
                for (j = 0; j < 9; j++)
                    if (b2[i][j] == 0)
                        return 0;
            return 1;
        }

        for (i = 0; i < pn; i++) {
            b2[x][y] = ps[i];
            show_board(b2);
            if (solve(b2)) {
                return 1;
            }
        }
        return 0;
    }

A Possible Track Finder?
Choose a hit for each layer.
Fit and calculate χ².
Cut on χ².
10 layers: O(n^10).
100 hits/layer. Total possibilities: 100^10 = 10^20.
Assume a computer checks 10^10 possibilities/sec.
A year = 3x10^7 sec.
Total time to check all possibilities: 10^20/(10^10 x 3x10^7) years > 300 years

A Better Track Finder
Choose a hit for each of layers 1 and 2.
Choose only compatible hits on layers 3 to 10.
Calculate χ².
Cut on χ².
First constraint at layer 3: O(n^3).
100 hits/layer. Total possibilities: 100^3 = 10^6.
Assume a computer checks 10^10 possibilities/sec.
Total time to check all possibilities: 10^6/(10^10) sec = 0.1 ms
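A minimal C sketch of this idea follows. It is not the author's track finder: it assumes straight tracks, equally spaced layers, one coordinate per hit, and a simple offset window (`road`) as the early constraint; the function name and parameters are hypothetical. Each pair of hits on the first two layers defines a line, and a pair is dropped at the first later layer that has no hit near the predicted position.

```c
/* Count hit pairs on layers 0 and 1 whose straight-line extrapolation finds
 * a hit within `road` on every later layer. hits[l] holds n_hits[l] hit
 * positions for layer l. Assumes n_layers >= 2 and equally spaced layers. */
long count_track_candidates(const double *const *hits, const int *n_hits,
                            int n_layers, double road)
{
    long candidates = 0;
    int i, j, l, k;

    for (i = 0; i < n_hits[0]; i++) {
        for (j = 0; j < n_hits[1]; j++) {
            double x0 = hits[0][i];
            double slope = hits[1][j] - x0;   /* position change per layer */
            int ok = 1;

            /* Early constraint: stop at the first layer with no match. */
            for (l = 2; l < n_layers && ok; l++) {
                double predicted = x0 + slope * l;

                ok = 0;
                for (k = 0; k < n_hits[l]; k++) {
                    double offset = hits[l][k] - predicted;

                    if (offset < 0)
                        offset = -offset;
                    if (offset <= road) {
                        ok = 1;
                        break;
                    }
                }
            }
            if (ok)
                candidates++;   /* only these would go on to a full fit */
        }
    }
    return candidates;
}
```

Most wrong pairs are rejected at the third layer after about n comparisons, so the work grows roughly as n^3 rather than n^10, and the constraint itself is a subtraction and a comparison rather than a fit.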

Suggestion (2) Use early constraints to reduce the number of iterations.
Evaluate the first constraint as simply as possible (e.g. offset, rather than χ²).
Apply the first constraint as early as possible (e.g. at layer 3, not waiting until layer 10).

Triplets
Triplet:
–Data item with 2 free parameters.
–# of measurements - # of constraints = 2.
–A triplet is not necessarily a straight track segment.
–A triplet may have more than 3 measurements.
A circular track with a known interaction point is a triplet since it has 2 free parameters. (Otherwise it has 3 parameters.)

Triplet Finding
Triplet finding can be done in software or in firmware.
Tiny Triplet Finder (TTF) is a firmware implementation developed at Fermilab for BTeV.
Tiny = small silicon usage.
For more info on TTF, see handout.
[Diagram: Triplet Finding]
–O(n^3) Software Processes
–O(n) FPGA Firmware Functions
  –O(N^2) Implementations: CAM, Hough Transform, etc.
  –O(N*log(N)) Implementation: Tiny Triplet Finder

DFT and FFT
Why log(N)?
–Information propagation
–Reuse of multiplications by rotational factors
DFT: O(N^2)
FFT: O(N*log(N))

FFT for Arbitrary Precision Multiplications
Multiplication of two very long integers consumes O(N^2) computation.
It can be viewed as a convolution.
Convolutions can be computed using FFT with O(N*log(N)) computation.
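The "multiplication is a convolution" view can be seen in a short C sketch. Note that this is the direct O(N^2) convolution, not an FFT: it is here only to show the quantity an FFT-based method would compute faster. The function name, the base-10 digits and the little-endian digit order are choices made for this illustration.

```c
#include <stddef.h>

/* Multiply two numbers given as little-endian base-10 digit arrays.
 * a has na digits, b has nb digits (na, nb >= 1); out receives na + nb digits.
 * conv(k) = sum over i + j = k of a[i] * b[j] is the convolution of the
 * digit sequences; carries are then propagated to get proper digits. */
void multiply_digits(const unsigned char *a, size_t na,
                     const unsigned char *b, size_t nb,
                     unsigned char *out)
{
    unsigned long carry = 0;
    size_t k;

    for (k = 0; k + 1 < na + nb; k++) {
        unsigned long conv = 0;
        size_t i_min = (k >= nb) ? k - nb + 1 : 0;
        size_t i_max = (k < na) ? k : na - 1;
        size_t i;

        for (i = i_min; i <= i_max; i++)
            conv += (unsigned long)a[i] * b[k - i];
        conv += carry;
        out[k] = (unsigned char)(conv % 10);
        carry = conv / 10;
    }
    out[na + nb - 1] = (unsigned char)carry;
}
```

An FFT-based multiplier would replace the inner loop with a transform, a point-wise product and an inverse transform, while the carry propagation stays the same.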

Suggestion (3) Take advantage of fast (like FFT) or tiny (like Tiny Triplet Finder) algorithms.

Multiplier-less (ML) Approaches
Canonic signed digit (CSD) and sum of powers of two (SOPOT) representations:
–5xA = 4xA + A, 248xA = 256xA - 8xA
Recursive implementation of finite impulse response (FIR) filters:
–Sliding sum, sinc2, etc.
CORDIC or similar algorithms:
–ML FFT, rotators, etc.
Distributed Arithmetic (DA) designs:
–Look-up tables.
Single-bit sinc3 FIR decimation filter:
–In delta-sigma ADC.
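Two of these approaches are small enough to sketch in C: the shift-and-add products from the CSD/SOPOT bullet, and a sliding sum computed recursively so that each new output costs one addition and one subtraction. The function names, the `struct sliding_sum` type and the window length of 4 are assumptions made for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* 5 x A = 4 x A + A: one shift and one addition. */
uint32_t times5(uint32_t a)
{
    return (a << 2) + a;
}

/* 248 x A = 256 x A - 8 x A: two shifts and one subtraction. */
uint32_t times248(uint32_t a)
{
    return (a << 8) - (a << 3);
}

#define WINDOW 4   /* hypothetical window length */

/* Recursive sliding sum: y[n] = y[n-1] + x[n] - x[n-WINDOW]. */
struct sliding_sum {
    int32_t history[WINDOW];   /* last WINDOW inputs, circular buffer */
    size_t pos;                /* slot holding the oldest input       */
    int32_t sum;               /* current sum over the window         */
};

void sliding_sum_init(struct sliding_sum *s)
{
    size_t i;

    for (i = 0; i < WINDOW; i++)
        s->history[i] = 0;
    s->pos = 0;
    s->sum = 0;
}

/* Push one sample and return the sum of the last WINDOW samples. */
int32_t sliding_sum_push(struct sliding_sum *s, int32_t x)
{
    s->sum += x - s->history[s->pos];
    s->history[s->pos] = x;
    s->pos = (s->pos + 1) % WINDOW;
    return s->sum;
}
```

The sliding sum is an FIR filter with all coefficients equal to 1, so the recursive form needs no multiplier at all, and its cost does not grow with the window length.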

Least-Square (LS) Track Fitter
Standard least-square fitting uses a large number of multiplications and possibly divisions.

Multiplier-less (ML) Track Fitter
The coefficients are scaled to avoid using dividers.
The coefficients for the ML approximate fitting algorithm are “two-bit” integers.
Each full multiplication is replaced by two integer shift-additions.

Errors of LS and ML Track Fitters
The errors of the ML approximate fitting algorithm are only slightly larger than the LS fitting errors.

Errors of Several Track Fitters
Generally speaking, more computation yields better quality results.
However, after a certain point, the quality of the results does not improve as rapidly as before.
It is common for a large amount of extra computation to bring only a small improvement in mathematically perfect algorithms.

Suggestion (4) Consider resource/power friendly algorithms such as multiplier-less, divider-less algorithms.

Why Save Resources?

Moore’s Law
Number of transistors in a package: x2 every 18 months.
Taken from

The Fever of Moore’s Law vs. Maxwell Equations
During the fever of Moore’s law, saving computing resources became non-critical, if not impossible.
From basic principles like the Maxwell Equations, it was known that the fever would not last.
[Figure: Op/sec; source: MIT, 2002]

Moore’s Law Today
# of transistors
–Yes, via multi-core.
Clock Speed
–?
Taken from

Total Useful Work = (Clock Frequency) x (Silicon Size) x (Efficiency)
There is big room for improvement in computation efficiency in both micro-computer software and FPGA firmware.
Resource saving helps today, when technology stalls.
Resource saving helps in the future, as technology progresses.
[Diagram: axes labeled E, F and S]

Resource Saving Helps Future
Today’s subroutines or FPGA blocks are to be reused thousands of times in the future:
–If today’s design is slightly too slow, too big…
Today’s students, as well as old people, gain experience from today’s work and become bosses, reviewers, etc. in the future:
–The “experience” (?)
–E.g.: Is a wedding with a $20K budget possible? (Given the “experience” of $1000/pizza?)

The End Thanks

Triplet Finding
Three layers of nested loops are needed if the process is implemented in software.
A total of n^3 combinations must be checked (e.g. 5x5x5=125).
In FPGA, “unrolling” 2 layers of loops may need large silicon resources without careful planning: O(N^2).
[Diagram: hits on Plane A, Plane B, Plane C]
    for (i=0; i<N_A; i++){
      for (j=0; j<N_B; j++){
        for (k=0; k<N_C; k++){
          /* check hits A[i], B[j], C[k] for coincidence */
        }
      }
    }

Circular Tracks from Collision Point on Cylindrical Detectors
For a given hit on layer 3, a coincidence between a layer 2 hit and a layer 1 hit that satisfies the coincidence map signifies a valid circular track.
A track segment has 2 free parameters, i.e., it is a triplet.
The coincidence map is invariant under rotation.
[Plot: coincidence map, axes (φ1 - φ3)+64 and (φ2 - φ3)+64]

Tiny Triplet Finder: Reuse Coincidence Logic via Shifting Hit Patterns
One set of coincidence logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
[Diagram: cylindrical layers C1, C2, C3]

Tiny Triplet Finder for Circular Tracks
1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
Also works with more than 3 layers.
[Diagram: two Bit Array Shifters (shift amounts scaled by *R1/R3 and *R2/R3) feeding the Bit-wise Coincidence Logic; Triplet Map Output to Decoder]
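The two steps above can be mimicked in software to show why one set of coincidence logic is enough. The sketch below is a simplified model, not the TTF firmware: it assumes 64 angular bins shared by all three layers (so the radius scaling of the shift is ignored), stores each layer's hits as a 64-bit pattern, and uses a made-up coincidence map in which a C1 offset of d bins pairs with a C2 offset of 2d bins. All names are hypothetical.

```c
#include <stdint.h>

#define BINS 64

/* Rotate a 64-bin circular hit pattern right by s bins. */
static uint64_t rotate_right(uint64_t v, unsigned s)
{
    s &= BINS - 1;
    return s ? (v >> s) | (v << (BINS - s)) : v;
}

static int popcount64(uint64_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Demo map only: in the frame where the C3 hit sits at bin 0, a C1 hit at
 * offset d is taken to match a C2 hit at offset 2*d (mod BINS). */
void build_demo_map(uint64_t map[BINS])
{
    unsigned d;

    for (d = 0; d < BINS; d++)
        map[d] = (uint64_t)1 << ((2 * d) % BINS);
}

/* Count (C1, C2, C3) hit combinations that satisfy the map.
 * map[d] is the mask of C2 offsets allowed for a C1 hit at offset d. */
int count_triplets(uint64_t c1, uint64_t c2, uint64_t c3,
                   const uint64_t map[BINS])
{
    int found = 0;
    unsigned p, d;

    for (p = 0; p < BINS; p++) {            /* loop over C3 hits */
        uint64_t r1, r2;

        if (!((c3 >> p) & 1))
            continue;
        r1 = rotate_right(c1, p);           /* shift so the C3 hit is at 0 */
        r2 = rotate_right(c2, p);
        for (d = 0; d < BINS; d++)          /* parallel logic in hardware */
            if ((r1 >> d) & 1)
                found += popcount64(map[d] & r2);
    }
    return found;
}
```

Because the map is applied in the frame of the C3 hit, the same `map` serves every C3 position; in an FPGA the inner loop would be fixed bit-wise logic evaluated in one clock cycle, which matches the n3 clock cycles quoted for step 2.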