May. 2009 Wu Jinyuan, Fermilab 1 FPGA and Reconfigurable Computing Wu, Jinyuan Fermilab ICT May, 2009.

Slide 1: FPGA and Reconfigurable Computing
Wu, Jinyuan (jywu168@fnal.gov), Fermilab
ICT, May 2009

Slide 2: Outline
Electronic Aspect of FPGA:
• LED Flashing
• Logic Elements in a Nutshell
• TDC and ADC
FPGA as a Computing Fabric:
• Moore’s Law Forever?
• Space Charge Computing with FPGA Cores
• Doublet Matching & Hash Sorter
• Triplet Matching & Tiny Triplet Finder
• Enclosed Loop Micro-Sequencer (ELMS)

Slide 3: Moore’s Law
Number of transistors in a package: x2 every 18 months.
[Figure taken from www.intel.com]

Slide 4: Status of Moore’s Law: an Inconvenient Truth
• Number of transistors: yes, via multi-core.
• Clock speed: ?
[Figure taken from www.intel.com]

Slide 5: The Execution & Non-Execution Cycles
In current microprocessors:
• Each instruction takes one clock cycle to execute.
• It takes many clock cycles to prepare for executing an instruction.
• Pipelined? Yes. But the non-execution pipeline stages consume silicon area, power, etc.
• Executing an instruction is not the same as doing useful calculation.
Can we do something different?
[Figure from the MIT 6.823 open course site]

Slide 7: The Space Charge Computing
Each electron sees the sum of the Coulomb forces from the other N-1 electrons. The total number of calculations is about N^2, and each calculation of the Coulomb force requires a square root, a division and several multiplications. Regular sequential computers are not fast enough.

Number of Electrons | Calculations/Iteration | Computing Time/1000 Iterations @ 10^7 Calculations/s
10^3                | ~10^6                  | 100 s
10^4                | ~10^8                  | 2.7 hours
10^5                | ~10^10                 | 11.6 days
10^6                | ~10^12                 | 3.2 years
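The N^2 scaling on this slide can be illustrated with a minimal software sketch. It is not the author's code: the function names `coulomb_forces` and `estimated_seconds`, the unit charges and the constant `k = 1` are assumptions made only for illustration. Each pair costs one square root, one division and several multiplications, as the slide states, and the time estimate reproduces the first column of the table.

```python
import math

def coulomb_forces(positions, k=1.0):
    """Direct-sum Coulomb force on every particle (unit charges assumed).

    positions: list of (x, y, z) tuples. Returns a list of [fx, fy, fz].
    The double loop performs about N^2 pair calculations.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz          # sum of three squares
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))       # square root + division
            forces[i][0] += k * dx * inv_r3           # like charges repel
            forces[i][1] += k * dy * inv_r3
            forces[i][2] += k * dz * inv_r3
    return forces

def estimated_seconds(n_electrons, iterations=1000, rate=1e7):
    """Time for `iterations` steps at `rate` force calculations per second."""
    return n_electrons ** 2 * iterations / rate

if __name__ == "__main__":
    print(coulomb_forces([(0, 0, 0), (2, 0, 0)]))
    for exp in (3, 4, 5, 6):
        print(f"10^{exp} electrons: {estimated_seconds(10 ** exp):.3g} s")
```

With 10^3 electrons the estimate is 100 s, and with 10^4 electrons it is 10^4 s, which is the 2.7 hours quoted in the table.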

Slide 8: The FPGA Board
Up to 16 FPGA devices ($32 each) can be installed on each board. Each FPGA hosts one core.

Slide 9: The 16-bit Demo Core

Slide 10: A Double-Layer + Single-Layer Sequencer
A double-layer loop is followed by a single-layer loop.
[Figure: sequencer diagram; labels: Inner Loop, Outer Loop, State Control]

Slide 11: The Lookup Table
The LUT (10 bits in, 16 bits out) replaces:
• A square root
• Two multiplications
• A reciprocal operation
[Figure: three squared terms (x^2) are summed and fed to the LUT]

Slide 12: Number of Bits for Input to LUT
• 16-bit coordinates (x_i, y_i, z_i), 32-bit sum of squares, 32-bit forces.
• A 32-bit input LUT is too big: 2^32 = 4G words.
• Shifters are used before and after the LUT.
• Leading zeros are eliminated: 00000001010110 -> 0101011000
[Figure: datapath with coordinate subtractors, three squarers (x^2), an adder, the LUT (10 bits in, 16 bits out) and three output multipliers]

Slide 13: Bit Evolution Before LUT
Stages: x1, x2 -> (x1-x2) -> (x1-x2)^2 -> sum of 3 squares -> LUT.
Selection rule: if ((High Bits) != 0) choose (High Bits), else choose (Low Bits).

Slide 14: Bit Evolution After LUT
Shift 2n before LUT; shift 3n after LUT.
[Figure labels: (x1-x2), LUT]
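The "shift 2n before, shift 3n after" rule can be checked with a small sketch. The slides do not say exactly which function the LUT stores; the code below assumes it holds s^(-3/2) for the sum of squares s, which would account for the square root, the two multiplications and the reciprocal. Under that assumption, scaling s by 4^n (a 2n-bit shift) scales s^(-3/2) by 8^n (a 3n-bit shift). The names `LUT`, `LUT_SCALE`, `normalize` and `inv_r_cubed` are hypothetical.

```python
LUT_IN_BITS = 10          # 10-bit LUT input, as on the slides
LUT_SCALE = 1 << 27       # assumed scaling so that entries fit in 16 bits

# Assumed LUT contents: m**-1.5 for normalized inputs 256..1023 only.
LUT = [round(LUT_SCALE * m ** -1.5) if m >= 256 else 0
       for m in range(1 << LUT_IN_BITS)]

def normalize(sum_sq):
    """Shift sum_sq by an even number of bits (2n) into the range 256..1023.

    Returns (m, n) with m ~= sum_sq / 4**n; n is negative for left shifts.
    """
    if sum_sq <= 0:
        raise ValueError("sum of squares must be positive")
    m, n = sum_sq, 0
    while m >= (1 << LUT_IN_BITS):        # too wide: drop two low bits
        m >>= 2
        n += 1
    while m < (1 << (LUT_IN_BITS - 2)):   # leading zeros: shift left by two
        m <<= 2
        n -= 1
    return m, n

def inv_r_cubed(sum_sq):
    """Approximate sum_sq**-1.5 using the LUT and a 3n-bit shift afterwards."""
    m, n = normalize(sum_sq)
    # sum_sq ~= m * 4**n, hence sum_sq**-1.5 ~= m**-1.5 / 8**n
    return LUT[m] * 2.0 ** (-3 * n) / LUT_SCALE

if __name__ == "__main__":
    print(normalize(0b00000001010110))    # the example pattern from slide 12
    for s in (1, 86, 123456):
        print(s, inv_r_cubed(s), s ** -1.5)
```

For the bit pattern on slide 12, 00000001010110 (decimal 86), `normalize` returns m = 344, which is 0101011000 in binary, matching the slide. Because only 10 bits reach the LUT, the result is approximate; with this normalization the relative error stays below about one percent.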

Slide 15: Two Electrons with Natural Scales
[Figure: two electrons (e, e); scale labels: 256 nm, 28 ps]

Slide 16: 256 Charged Particles, Iteration 0

Slide 25: Speed Comparison with Regular CPU
The FPGA core is about 10 times faster than a typical 2.2 GHz CPU core. The FPGA core runs at 200 MHz, i.e. 200 M Coulomb force calculations/s. The CPU core appears to need 80-100 clock cycles for each Coulomb force calculation.

Slide 26: One Board: 8 FPGA Cores
• One board has the calculation capacity of 40 dual-core CPUs.
• The power consumption of one board is < 4.5 W.
• Newer FPGAs capable of hosting 4 cores/FPGA are available.
• One core/FPGA = 5 dual-core CPUs; 8 cores/board = 40 dual-core CPUs.

Slide 28: Example of Doublet Match, PET
Positrons and electrons annihilate to produce pairs of photons. The back-to-back photons hit the detector at nearly the same time. Detector hits are digitized, and hits at nearly the same time are to be matched together. The process takes O(n^2) clock cycles.
[Figure: each hit in Group 1 and Group 2 carries a time T and data D; the time difference ΔT is tested against a window: ΔT < A? ΔT > (-A)?]
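The O(n^2) matching described on this slide corresponds to a double loop in software. The sketch below assumes each hit is a (time, data) pair and applies the window test from the figure to every Group 1 / Group 2 combination. The function name `match_doublets` and the sample hits are made up for illustration.

```python
def match_doublets(group1, group2, window):
    """Pair hits whose time difference lies strictly inside (-window, window).

    group1, group2: lists of (time, data) tuples.
    Every combination is tested, so the cost grows as n^2.
    """
    pairs = []
    for t1, d1 in group1:
        for t2, d2 in group2:
            dt = t1 - t2
            if dt < window and dt > -window:   # the two comparisons in the figure
                pairs.append((d1, d2))
    return pairs

if __name__ == "__main__":
    g1 = [(100, "a"), (250, "b")]
    g2 = [(103, "x"), (400, "y"), (248, "z")]
    print(match_doublets(g1, g2, window=5))
```

With the sample hits, "a" pairs with "x" and "b" pairs with "z", while "y" has no partner inside the window.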

Slide 29: Hash Sorter
Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.
Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.
The entire pairing process takes 2n clock cycles, rather than n^2 clock cycles.
[Figure labels: K, D, Group 1, Group 2]

Slide 30: Linked List Structure of Hash Sorter
[Figure labels: DIN, DOUT, K, Index RAM, Pointer RAM, DATA RAM]
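A software sketch of the two-pass hash sorter may help connect the three RAM blocks to the linked-list idea. The slides only name the blocks, so the roles assigned below are an assumption: the index RAM holds, per key K, the address of the most recent item; the pointer RAM links each item to the previous item with the same key; the data RAM stores the items themselves. The names `hash_sort`, `match_pairs` and `EMPTY` are hypothetical.

```python
EMPTY = -1  # assumed marker for "no entry"

def hash_sort(group1, n_keys):
    """Pass 1: store Group 1 items as per-key linked lists.

    group1: list of (key, data) tuples with 0 <= key < n_keys.
    """
    index_ram = [EMPTY] * n_keys   # key -> address of newest item with that key
    pointer_ram = []               # address -> address of previous item, same key
    data_ram = []                  # address -> data item
    for key, data in group1:
        addr = len(data_ram)
        data_ram.append(data)
        pointer_ram.append(index_ram[key])
        index_ram[key] = addr
    return index_ram, pointer_ram, data_ram

def match_pairs(group1, group2, n_keys):
    """Pass 2: walk the list for each Group 2 key and emit the pairs."""
    index_ram, pointer_ram, data_ram = hash_sort(group1, n_keys)
    pairs = []
    for key, data in group2:
        addr = index_ram[key]
        while addr != EMPTY:
            pairs.append((data_ram[addr], data))
            addr = pointer_ram[addr]
    return pairs

if __name__ == "__main__":
    g1 = [(3, "a"), (5, "b"), (3, "c")]
    g2 = [(5, "x"), (3, "y"), (7, "z")]
    print(match_pairs(g1, g2, n_keys=8))
```

Each item is written once in pass 1 and each Group 2 item is looked up once in pass 2, which is where the 2n figure comes from when bins hold about one item each; bins with several items add one step per extra item.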

Slide 31: Hash Sorter
Using the hash sorter, matching pairs can be grouped together in 2n, rather than n^2, clock cycles.

Slide 33: Hits, Hit Data & Triplets
Hit data come out of the detector planes in random order. Hit data from 3 planes generated by the same particle track are organized together to form triplets.

Slide 34: Triplet Finding
• Three data items must satisfy the condition: x_A + x_C = 2 x_B.
• A total of n^3 combinations must be checked (e.g. 5x5x5 = 125).
• Three layers of loops are needed if the process is implemented in software.
• Large silicon resources may be needed without careful planning: O(N^2).
[Figure: Plane A, Plane B, Plane C]

Slide 35: Tiny Triplet Finder Operations, Pass I: Filling Bit Arrays
For any hit, fill a corresponding logic cell.
x_A + x_C = 2 x_B, i.e. x_A = -x_C + constant (for a given x_B).
Note: flipped bit order.
[Figure: physical planes and bit arrays/shifters]

Slide 36: Tiny Triplet Finder Operations, Pass II: Making Match
For any center plane hit, logically shift the bit array and perform a bit-wise AND in this range. Where the result is nonzero, a triplet is found.
[Figure: physical planes and bit arrays/shifters]
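The two passes can be mimicked in software with Python integers standing in for the bit arrays. This is a sketch under assumptions, not the FPGA design: hit positions are taken to be integers in 0..n_bins-1, plane C is stored in flipped bit order (bit n_bins-1-x_C), and the names `fill_bit_arrays` and `find_triplets` are hypothetical. With the flipped order, the condition x_A + x_C = 2 x_B turns into a pure shift of one array against the other, followed by a bit-wise AND.

```python
def fill_bit_arrays(hits_a, hits_c, n_bins):
    """Pass I: set one bit per hit; plane C uses flipped bit order."""
    a_bits = 0
    c_flipped = 0
    for x in hits_a:
        a_bits |= 1 << x
    for x in hits_c:
        c_flipped |= 1 << (n_bins - 1 - x)
    return a_bits, c_flipped

def find_triplets(hits_a, hits_b, hits_c, n_bins):
    """Pass II: for each center-plane hit, shift, AND, and read out matches."""
    a_bits, c_flipped = fill_bit_arrays(hits_a, hits_c, n_bins)
    mask = (1 << n_bins) - 1
    triplets = []
    for x_b in hits_b:
        # After this shift, bit i of the C array corresponds to x_C = 2*x_b - i.
        shift = 2 * x_b - (n_bins - 1)
        shifted = c_flipped << shift if shift >= 0 else c_flipped >> -shift
        match = a_bits & shifted & mask
        x_a = 0
        while match:
            if match & 1:
                triplets.append((x_a, x_b, 2 * x_b - x_a))
            match >>= 1
            x_a += 1
    return triplets

if __name__ == "__main__":
    print(find_triplets([1, 4], [3, 5], [5, 6, 2], n_bins=8))
```

In hardware the AND over the whole array happens in one clock cycle per center-plane hit; the inner `while` loop here only decodes the result and plays the role of the decoder stage.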

Slide 37: Tiny? Yes, Tiny!
Logic cell usage:
• AM, CAM, Hough Transform etc.: O(N^2)
• Tiny Triplet Finder: O(N*logN)
The triplet finding process for FPGA schemes takes 2n clock cycles. The Tiny Triplet Finder uses far fewer logic elements.

Slide 38: Tiny Triplet Finder: Reuse Coincidence Logic via Shifting Hit Patterns
One set of coincidence logic is implemented. For an arbitrary hit on C3, rotate, i.e. shift, the hit patterns for C1 and C2 to search for coincidence.
[Figure labels: C1, C2, C3]

Slide 39: Tiny Triplet Finder for Circular Tracks
1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
Also works with more than 3 layers.
[Figure labels: *R1/R3, *R2/R3, Bit Array Shifter (x2), Bit-wise Coincident Logic, Triplet Map Output, To Decoder]

Slide 40: Hit Matching

Software                                      | FPGA Typical                      | FPGA Resource Saving Approaches
O(n^2): for(){ for(){…} }                     | O(n)*O(N): Comparator Array       | Hash Sorter, O(n)*O(N) in RAM
O(n^3): for(){ for(){ for(){…} } }            | O(n)*O(N^2): CAM, Hough Transform | Tiny Triplet Finder, O(n)*O(N*logN)
O(n^4): for(){ for(){ for(){ for(){…} } } }   |                                   |

Slide 42: Micro-computing vs. Reconfigurable Computing
• In a microprocessor, users specify a program that runs on fixed logic circuits.
• In an FPGA, users specify the logic circuits (as well as the program).
• FPGA computing need not follow microprocessor architectures. (But useful experience can be borrowed.)
• The usefulness of FPGA reconfigurable computing is still to be fully appreciated.
[Figure: evaluating (100+3-4)*5+7 = ? on a CPU and on an FPGA. CPU: data 100, 3, 4, 5, 7 with control LD, (-), (+), (*), (+), i.e. data plus program. FPGA: data 100, 3, 4, 5, 7 with configuration, data and program.]

Slide 43: FPGA Process Sequencing Options

Option                               | Program Type          | Program Length (CLK cycles) | Reprogram | Resource Usage
Finite State Machine (FSM)           | Fixed Wired           | 10                          | Hard      | Small
Enclosed Loop Micro-Sequencer (ELMS) | Memory Stored Program | 10-1000                     | Easy      | Small
Microprocessor (MP)                  | Memory Stored Program | >1000                       | Easy      | Large

Slide 44: The Between Counter
[Figure: the Between Counter drives a ROM (PC0: instr0 through PCD: instrD) that outputs the control signals. Counter labels: SLOAD, D[], SCLR, N, Q[], M-1, ==, A[], B[], T. Example count sequences: 0,1,2,3,4,5,6,7,8,9,A; 5,6,7,8,9,A; 5,6,7,8,9,A; 5,6,7,8,9,A,B,C,D,E,F…]

Slide 45: ELMS – Enclosed Loop Micro-Sequencer
• PC+ROM is a good sequencer in FPGA.
• Adding Conditional Branch Logic allows the program to loop back, i.e. it allows jumping back as in microprocessors.
• Loop & Return Logic + Stack is a special feature in ELMS that supports FOR loops at machine code level.
[Block diagram labels: Loop & Return Logic + Stack, Conditional Branch Logic, Program Counter, ROM 128x36 bits, A, Reset, CLK, Control Signals]

PC | Control Signals | Label | Operation
00 | 000000000000000 |       |
01 | 001000100011010 |       | LD R1, #n
02 | 000010001000000 |       | LD R2, #addr_a
03 | 000000000000100 |       | LD R3, #addr_X
04 | 000000010001000 |       | LD R7, #0
05 | 000000000100001 | BckA1 | LD R4, (R2)
06 | 000100000010000 |       | INC R2
07 | 000001000100000 |       | LD R5, (R3)
08 | 000100010000001 |       | INC R3
09 | 001001000100000 |       | MUL R6, R4, R5
0a | 000000010001000 | EndA1 | ADD R7, R7, R6
0b | 000010000010000 |       | DEC R1
0c | 000000100000100 |       | BRNZ BckA1

Slide 46: ELMS – Detailed Block Diagram
The Stack supports nested loops and subroutine calls up to 128 layers.
[Figure label: User Control Signals]

Program listing shown on the slide:
         FOR  BckA1 EndA1 #n
         LD   R2, #addr_a
         LD   R3, #addr_X
         LD   R7, #0
BckA1    LD   R4, (R2)
         INC  R2
         LD   R5, (R3)
         INC  R3
         MUL  R6, R4, R5
EndA1    ADD  R7, R7, R6
         LD   R8, R7

Slide 47: Software: Using Spreadsheet as Compiler

Slide 48: What’s Good About ELMS: FOR Loops at Machine Code Level w/ Zero Overhead
• The looping sequence is known in this example before entering the loop.
• Regular microprocessors treat the sequence as unknown.
• ELMS supports FOR loops with pre-defined iterations at machine code level.
• Execution time is saved, and micro-complexities (branch penalty, pipeline bubble, etc.) associated with conditional branches are avoided.

Microprocessor (conditional branch):
         LD   R1, #n
         LD   R2, #addr_a
         LD   R3, #addr_X
         LD   R7, #0
BckA1    LD   R4, (R2)
         INC  R2
         LD   R5, (R3)
         INC  R3
         MUL  R6, R4, R5
EndA1    ADD  R7, R7, R6
         DEC  R1
         BRNZ BckA1

The ELMS:
         FOR  BckA1 EndA1 #n
         LD   R2, #addr_a
         LD   R3, #addr_X
         LD   R7, #0
BckA1    LD   R4, (R2)
         INC  R2
         LD   R5, (R3)
         INC  R3
         MUL  R6, R4, R5
EndA1    ADD  R7, R7, R6

In the microprocessor version, DEC and BRNZ take 2 of the 8 instructions in each loop pass, which is presumably the 25% marked on the slide.
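The cycle counts behind this comparison can be reproduced with a toy model of the two sequencers. The model is a sketch built on assumptions: it tracks only the program counter (one instruction per clock, no pipeline effects), and the ELMS loop logic is modelled as a stack entry holding the back address, the end address and the remaining count, which is a guess at the behaviour described on the slides, not the actual hardware. All function and variable names are hypothetical.

```python
BRANCH_PROGRAM = [
    (None, "LD R1, #n"), (None, "LD R2, #addr_a"), (None, "LD R3, #addr_X"),
    (None, "LD R7, #0"), ("BckA1", "LD R4, (R2)"), (None, "INC R2"),
    (None, "LD R5, (R3)"), (None, "INC R3"), (None, "MUL R6, R4, R5"),
    ("EndA1", "ADD R7, R7, R6"), (None, "DEC R1"), (None, "BRNZ BckA1"),
]

ELMS_PROGRAM = [
    (None, "FOR BckA1 EndA1 #n"), (None, "LD R2, #addr_a"),
    (None, "LD R3, #addr_X"), (None, "LD R7, #0"),
    ("BckA1", "LD R4, (R2)"), (None, "INC R2"), (None, "LD R5, (R3)"),
    (None, "INC R3"), (None, "MUL R6, R4, R5"), ("EndA1", "ADD R7, R7, R6"),
]

def label_table(program):
    return {label: pc for pc, (label, _) in enumerate(program) if label}

def run_branch_sequencer(program, n):
    """Conventional style: DEC and BRNZ instructions close the loop."""
    labels = label_table(program)
    r1, pc, trace = 0, 0, []
    while pc < len(program):
        op = program[pc][1]
        trace.append(pc)                  # one clock cycle per instruction
        if op == "LD R1, #n":
            r1 = n
        elif op == "DEC R1":
            r1 -= 1
        elif op.startswith("BRNZ") and r1 != 0:
            pc = labels[op.split()[1]]
            continue
        pc += 1
    return trace

def run_elms_sequencer(program, n):
    """ELMS style: FOR pushes (back, end, count); no loop instructions in the body."""
    labels = label_table(program)
    stack, pc, trace = [], 0, []
    while pc < len(program):
        op = program[pc][1]
        trace.append(pc)
        if op.startswith("FOR"):
            _, back, end, _count = op.split()
            stack.append([labels[back], labels[end], n])   # n >= 1 assumed
        if stack and pc == stack[-1][1]:  # reached the end of the loop body
            stack[-1][2] -= 1
            if stack[-1][2] > 0:
                pc = stack[-1][0]         # jump back without a branch instruction
                continue
            stack.pop()
        pc += 1
    return trace

if __name__ == "__main__":
    n = 100
    print(len(run_branch_sequencer(BRANCH_PROGRAM, n)),
          len(run_elms_sequencer(ELMS_PROGRAM, n)))
```

In this model the conventional program takes 4 + 8n cycles and the ELMS program takes 4 + 6n, so each loop pass drops from 8 to 6 cycles, a saving of 25%.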

Slide 49: ELMS as a Hardware Loop Sequencer
• There are DSP devices that support hardware loops for zero-overhead loop implementation.
• The emphasis of ELMS is that FOR loops and subroutine calls/returns are treated the same. Any program passage can be used as a subroutine without needing a return instruction.
• The ELMS uses as few resources as possible for FPGA implementation.
[Block diagram labels: Loop & Return Logic + Stack, Conditional Branch Logic, Program Counter, ROM 128x36 bits, A, Reset, CLK, Control Signals. Figure from http://www.analog.com/]

Slide 50: No ALU => Small Resource Usage
• The Princeton Architecture is more suitable at the system level, while the Harvard Architecture is better suited at the micro-structure level.
• Regular microprocessors cannot run looped programs without an ALU.
• The ALU takes a large amount of resources, yet may not be efficiently utilized for data processing tasks in an FPGA.
• The ELMS can run nested-loop programs without an ALU. Further separation of program and data is therefore possible.
• The ELMS is kept small.
[Figure: three block diagrams. Princeton (von Neumann) Architecture: Program Control, ALU, Program/DATA Memory. Harvard Architecture: Program Control, ALU, Program Memory, DATA Memory. Fermilab (?) Architecture: Sequencer (ELMS), Program Memory, Data Processor, DATA Memory.]

Slide 51: The Fever of Moore’s Law vs. Maxwell’s Equations
During the hot days of Moore’s Law, the rules of thumb are:
• BRB – Buy Rather than Build
• URU – Use Rather than Understand
• WRW – Wait Rather than Work
From fundamental principles like Maxwell’s Equations, it is known that limits to Moore’s Law exist. Technology advances come from hard work.
[Figure: Op/sec vs. year, 1998 to 2010 (MIT, 2002)]

Slide 52: Indirect Cost of Complexity
If something like this can do the job… why do these?

Slide 53: The Winning Line of FPGA Computing
We commonly hear:
• FPGA devices contain millions of gates.
• High parallelism can be implemented in FPGA.
• FPGA cost drops by half every 18 months.
We want to emphasize, especially to our young students: 1. Creativity, 2. Creativity, 3. Creativity, on Arithmetic ops, on Algorithms, on Architectures & on All Aspects.
"O Freunde, nicht diese Töne!" ("O friends, not these tones!")

Slide 54: The End
Thanks
