Program Development Environments: Languages & Tools. Kris Gaj, George Mason University.

SRC Programming Model MicroprocessorFPGA main.c function_1() function_2() ANSI C function_1 function_2 macro_1(a, b, c) macro_2(b, d) macro_2(c, e) macro_3(s, t) macro_1(n, b) macro_4(t, k) FPGA Macro_1 Macro_2 a b c de MAP C (subset of ANSI C) I/O Libraries of macros VHDL macro_1 macro_2 macro_3 macro_4 ……………………….

C function for  P C function for MAP VHDL macro SRC Program Partitioning  P system FPGA system HLL HDL

SRC Compilation Process Object files Application sources Macro sources MAP Compiler  PCompiler Logic synthesis Place & Route Linker.v files.bin files.ngofiles.o files Application executable Configuration bitstreams HDL sources Netlists.c or.f files. vhdor.v files Logic synthesis Place & Route Linker.v files.bin files.ngofiles HDL sources. or.mc or.mf files

SRC Libraries of Hardware Macros
Vendor libraries of hardware macros:
– basic integer and floating-point arithmetic
– digital signal processing
User libraries of hardware macros developed by GWU/GMU/USC, 2002-2006:
– Secret-key cipher encryption & breaking
– Binary Galois Field arithmetic (polynomial basis & normal basis representation)
– Elliptic Curve Arithmetic
– Long integer modular arithmetic (RSA)
– Sorting
– Image processing
– Bioinformatics
See http://hpc.gwu.edu/library

Star Bridge Programming Environment - Viva [Figure: Viva screenshot with panes labeled Library, Object, Sheets]

Place & Route.bin files.ngo files Application executable Configuration bitstreams Netlists Star Bridge Compilation Process VIVA Graphical User Interface User input Xilinx

Cray XD1 Programming Flows Source: [Cray, MAPLD05] Synthesis process (a, m)is begin z <= aand m; end process; intmask(a, m) { return (a & m); } VHDL/Verilog Synthesis Mitrion-C VHDL, Verilog Mentor Graphics Synopsys Synplicity Xilinx a m z 01001011010101 01010110101001 01000101011010 10100101010101 MATLAB/ Simulink The MathWorks Standard Flow Mitrion High-level Flow System Generator Xilinx Place & Route Gate-level EDIF VHDL or Verilog

Xtreme DSP Design Flow

HDL-based SGI Altix Programming Flow IA-32 Linux Machine Design iterations Design Entry (Verilog, VHDL) Design Synthesis (Synplify Pro, Amplify) Design Implementation (ISE) Design Verification Behavioral Simulation (VCS, Modelsim) Static Timing Analysis (ISE Timing Analyzer).v,.vhd.edf.ncd,.pcf.bin Metadata Processing (Python).v,.vhd.cfg Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) Real-time Verification (gdb).c

IA-32 Linux Machine RTL Generation and Integration with Core Services Design Synthesis (Synplify Pro, Amplify) Design Verification Behavioral Simulation (VCS, Modelsim) Static Timing Analysis (ISE Timing Analyzer).v,.vhd.edf.ncd,.pcf.bin Metadata Processing (Python).v,.vhd.cfg Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) Real-time Verification (gdb).c Design Implementation (ISE) HLL Design Entry (Handel-C, Mitrion C, Viva) HLL-based SGI Altix Programming Flow

Mitrion-C Programming Model for Cray & SGI MicroprocessorFPGA main.c function_1(in1) start_fpga() ANSI C based on Mitrion API FPGA I/O RAM Application code (platform independent) Mitrion Distributed Processor Architecture (platform dependent) Mitrion Compiler & Configurator application on the distributed processor Input & output Mitrion-C VHDL function_1(in2) start_fpga()

Compiling A Mitrion Program Processor Configurator Processor Architecture Mitrion-C Source code Processor HW-Design (VHDL IP Core) FPGA Mitrion Software Development Kit Simulator & Debugger Processor Machine-code Compiler

The Mitrion Platform 1) The Mitrion Virtual Processor –A fine-grain massively parallel, configurable soft-core processor –10-30 times faster than traditional CPUs 2) The Mitrion-C programming language –An intrinsically parallel C-family language 3) The Mitrion Software Development Kit –Compiler –Debugger/Simulator –Processor configurator

A New Processor Architecture Specifically For FPGAs int:48 main() { int:48 prev = 1; int:48 fib = 1; int:48 fibonnacci = for(i in ) { fib = fib+prev; prev = fib; } <>fib; } fibonnacci; ? Architecture design goal: High silicon utilization Take advantage of FPGA re-configurability Goal achieved by: Allow processor to be massively parallel Allow processor to be fully adapted to algorithm

Processor Architecture: A Cluster-On-A-Chip Non-Von Neumann architecture Processor architecture more like a cluster Very Fine-Grain Parallelism –Normal clusters run a block of code on each PE 1 –Mitrion runs a single instruction on each PE –Each PE adapted to optimally run its instruction Network topology specific for algorithm No Instruction Stream, instead Data Stream 1) PE = Processing Element

A C-family Language
Basic syntax is the same as for other C-family languages. Examples:
– Blocks are surrounded by { }
– Assignment with =
– Statements end with ;
– if, for, while
– Most of the usual C operators
– C-style comments (though nestable)

Types
Basic types:
– int/uint: signed/unsigned integer
– boolean: boolean value (true / false)
– float: floating-point real value
– bits: bit vector format
Free bit width:
– int:24 is a 24-bit signed integer
– uint:19 is a 19-bit unsigned integer
– float:24.8 is an IEEE-754 single-precision float
Collections:
– int:24[100] Vector (indexable collection)
– int:14 List (no index)

Language constructs Operators if(a>b)... while(i<10)... for(i in )... foreach (e in vector)... int:8 function(int:8 a)...

A C-family Language
Important differences:
– No pointers
– No dynamic allocation
– Static general recursion only (though loop structures may be dynamic)

Compiler, Simulator And Debugger

Program Entry for FPGA Accelerator Boards [Figure: program entry methods (HDL, HLL, Graphical Data Flow Diagram) placed between Hardware and Software along two axes, "Increased productivity" and "Increased capability to describe parallel execution"; marked Traditional and Extended (e.g. Corefire)]

Program Entry for Reconfigurable Computers [Figure: same axes ("Increased productivity", "Increased capability to describe parallel execution") with HDL, HLL, and Graphical Data Flow Diagram between Hardware and Software; SRC: HDL macros; Star Bridge: porting of EDIF and COM objects]

Program Entry for Reconfigurable Computers [Figure: same axes ("Increased productivity", "Increased capability to describe parallel execution") with HDL, HLL, and Graphical Data Flow Diagram between Hardware and Software; SGI or Cray with Mitrion: Mitrion-C, Mitrion Processor; Cray XD1 with Simulink: Simulink, Xilinx System Generator]

General hierarchy of library files suggested by SRC Computers Inc.

Structure of the SRC macro repository [Figure: directory tree with nodes common, rev_d, rev_e, rev_f, macro1, macro2, macro3 and files hdlfile, InfoFile, BlkBoxFile, DebugCodeFile, DataSheet]

Files describing an SRC macro
Platform independent:
– HDL file: macro.v or macro.vh. Verilog or VHDL code defining the macro
– Debug Code File: macro.c. Provides the equivalent C functionality for the macro
– Data sheet file: datasheet. Contains the documentation for the macro
Platform dependent:
– Blk Box File: blackbox.v. Interface (black box) definition for the macro in Verilog
– Info File: info. Info file entry for this macro

32 Library Development - SRC HLL (C, Fortran) HDL (VHDL, Verilog)  P system FPGA system Application Programmer Library Developer HLL (C, Fortran) HLL (C, Fortran) LLL (ASM) HLL (C, Fortran)

33 Library Development - StarBridge GDF (Viva) HDL (VHDL, Verilog)  P system FPGA system Application Programmer Library Developer GDF (Viva) GDF (Viva) HLL, LLL (C++, ASM) GDF (Viva)

Software libraries and their role in the development of SRC libraries

Roles of software libraries
1. Source of test vectors for VHDL macros
2. Emulation of hardware during debugging
3. Performance comparison

How to approach porting your application to reconfigurable computers?
1. Identify the class of applications
2. Identify the basic operations required by your applications
3. Determine whether an RC library of such operations exists
4. Determine whether a microprocessor library of such operations exists
5. Determine the right granularity for the required library operations

Classes of applications
1. Input/output intensive applications
– bulk data encryption (DES, IDEA, and RC5 encryption)
2. Computationally intensive applications
– secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
– public-key cipher breaking based on factoring
3. Latency-critical applications
– cipher key agreement and signature (ECC schemes, RSA)

Example 1 Cryptography: High-throughput encryption

Cipher message ciphertext cryptographic key K bits

Secret-key ciphers key of Alice and Bob - K AB Alice Bob Network Encryption Decryption

High-Throughput Encryption Encryption MiMi M i+1 M i+2 CiCi C i+1 C i+2.. K0K0 Encryption algorithms: DES, 3DES, AES, RC5, IDEA, etc.

Fully Pipelined Architecture.. Loop unrolling Pipeline stages inside of cipher rounds New input & new output every clock cycle.. Round 1 Round 2 Round k...

Encryption on SRC-6 – No streaming encryption.mc (1) #include void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timein, uint64_t *hardware_timeprocess, uint64_t *hardware_timeout, int mapnum) { OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_C (S3OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_F (S6OBM, uint64_t, MAX_OBM_SIZE) uint32_t encrypt_decrypt; //0:encrypt 1:decrypt int i, nbytes; uint64_t t1,t2,t3,t4;

encrypt_decrypt = 0; nbytes = MAX_OBM_SIZE * 8*3; start_timer(); read_timer(&t1); DMA_CPU(CM2OBM, S1OBM, MAP_OBM_stripe(1,"A,B,C"), sdata, 1, nbytes, 0); wait_DMA(0); read_timer(&t2); for(i=0;i<MAX_OBM_SIZE;i++) { des (S1OBM[i], key, encrypt_decrypt, &S4OBM[i]); des (S2OBM[i], key, encrypt_decrypt, &S5OBM[i]); des (S3OBM[i], key, encrypt_decrypt, &S6OBM[i]); } read_timer(&t3); Encryption on SRC-6 – No streaming encryption.mc (2)

Encryption on SRC-6 – No streaming encryption.mc (3) DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E,F"), sdata, 1, nbytes, 5); wait_DMA(5); read_timer(&t4); *hardware_timein = t2-t1; *hardware_timeprocess = t3-t2; *hardware_timeout = t4-t3; }

Encryption on SRC-6 – No streaming des_blkbx.v module des ( desOut, desIn, keyin, decrypt, clk ) /* synthesis syn_black_box syn_noprune=1 */ ; output [63:0] desOut; input [63:0] desIn; input [63:0] keyin; input decrypt; input clk /* synthesis syn_noclockbuf=1 */ ; endmodule

Encryption on SRC-6 – No streaming des.info (1) BEGIN_DEF "des" MACRO = "des"; LATENCY = 17; STATEFUL = NO; EXTERNAL = NO; PIPELINED = YES; INPUTS = 3: I0 = INT 64 BITS (desIn[63:0]) I1 = INT 64 BITS (keyin[63:0]) I2 = INT 32 BITS (decrypt) ; OUTPUTS = 1: O0 = INT 64 BITS (desOut[63:0]) ; IN_SIGNAL : 1 BITS "clk" = "CLOCK";

Encryption on SRC-6 – No streaming des.info (2) DEBUG_HEADER = $ void des__dbg (long long desin, long long keyin, int decrypt, long long *desout); $; DEBUG_FUNC = $ #include void des__dbg(long long desin, long long keyin, int decrypt, long long *desout) { des_(desout, &desin, &keyin, &decrypt); } $; END_DEF

Encryption on SRC-6 - with streaming encryption.mc (1) #include void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timeprocess, uint64_t *hardware_timeout, int mapnum) { OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE) uint32_t encrypt_decrypt; //0:encrypt 1:decrypt int i, nbytes; uint64_t t1,t2,t3; Stream_64 S0, S1; uint64_t v0, v1; encrypt_decrypt = 0; nbytes = MAX_OBM_SIZE * 8*2;

start_timer(); read_timer(&t1); #pragma src parallel sections { #pragma src section { stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes); } #pragma src section { for (i=0; i<MAX_OBM_SIZE; i++) { get_stream (&S0, &v0); get_stream (&S1, &v1); des (v0, key, encrypt_decrypt, &S4OBM[i]); des (v1, key, encrypt_decrypt, &S5OBM[i]); }; } Encryption on SRC-6 – with streaming encryption.mc (2)

Encryption on SRC-6 – with streaming encryption.mc (3) read_timer(&t2); DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E"), sdata, 1, nbytes, 5); wait_DMA(5); read_timer(&t3); *hardware_timeprocess = t2-t1; *hardware_timeout = t3-t2; }

Results: SRC-6 without streaming (all throughputs in Mbits/s)
Application | Computational throughput | Data transfer in throughput | Data transfer out throughput | SRC-6 end-to-end throughput | Xeon 2.8GHz end-to-end throughput | Speed-up
3 DES ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 93 | 46
3 IDEA ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 113 | 38
3 RC5 ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 560 | 7.5
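As a rough plausibility check on the no-streaming figures (data transfer in 11,350, computation 19,200, data transfer out 10,760 Mbits/s), the sketch below assumes the three phases run strictly one after another on the same data, so their per-bit times add. This model and the function name `sequential_throughput` are assumptions for illustration, not something taken from the slides. Under that assumption the estimate comes out at roughly 4,290 Mbits/s, close to the measured end-to-end figure of 4,240 Mbits/s.

```c
#include <assert.h>
#include <stddef.h>

/* Effective throughput of phases that run one after another on the same
 * data: the time spent per Mbit adds up across phases, so the rates
 * combine harmonically. rates[] holds each phase's throughput in Mbits/s.
 * Assumption: no overlap between phases and no other overheads. */
double sequential_throughput(const double rates[], size_t n)
{
    double time_per_mbit = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(rates[i] > 0.0);
        time_per_mbit += 1.0 / rates[i];
    }
    return 1.0 / time_per_mbit;
}
```

Calling it with `{11350.0, 19200.0, 10760.0}` gives the estimate quoted above; overlapping the input transfer with computation (streaming) removes one term from the sum.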

Results: SRC-6 with streaming, 3 units (all throughputs in Mbits/s)
Application | Computational throughput | Data transfer in & processing throughput | Data transfer out throughput | SRC-6 end-to-end throughput | Xeon 2.8GHz end-to-end throughput | Speed-up
3 DES ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 93 | 52
3 IDEA ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 113 | 42.5
3 RC5 ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 560 | 8.5

Results: SRC-6 with streaming, 2 units (all throughputs in Mbits/s)
Application | Computational throughput | Data transfer in & processing throughput | Data transfer out throughput | SRC-6 end-to-end throughput | Xeon 2.8GHz end-to-end throughput | Speed-up
2 DES ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 93 | 58
2 IDEA ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 113 | 47.5
2 RC5 ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 560 | 9.5

Results: SGI Altix MOATB without streaming (all throughputs in Mbits/s)
Application | Computational throughput | Data transfer throughput | Altix end-to-end throughput | Xeon 2.8GHz end-to-end throughput | Speed-up
1 DES cipher (64-bit block) | 12,800 (200MHz) | NA | 2,430 | 93 | 26
1 IDEA cipher (64-bit block) | 6,400 (100MHz) | NA | 2,040 | 113 | 18
1 RC5 cipher (64-bit block) | 12,800 (200MHz) | NA | 2,430 | 560 | 4.5

Results: SGI Altix MOATB with streaming (all throughputs in Mbits/s)
Application | Computational throughput | Data transfer throughput | Altix end-to-end throughput | Xeon 2.8GHz end-to-end throughput | Speed-up
1 DES cipher (64-bit block) | 12,800 (200MHz) | NA | 3,080 | 93 | 33
1 IDEA cipher (64-bit block) | 6,400 (100MHz) | NA | 2,480 | 113 | 22
1 RC5 cipher (64-bit block) | 12,800 (200MHz) | NA | 3,080 | 560 | 5.5

Example 2 Cryptography: Cipher Breaking

Secret-key cipher breaking
Given: ciphertext, guessed fragment of the plaintext
Looked for: remaining plaintext or key
Method: exhaustive key search (brute-force) attack, trying successive keys with the cipher

Secret-key cipher breaking Cipher breaker M0M0 C0C0 … K1K1 K2K2 K3K3 KNKN Generated by the cipher breaker Negligibly small input/output Huge amount of computations Correct key Message – Ciphertext pair
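The exhaustive key search described above can be sketched in a few lines of C. The cipher here is a made-up 16-bit toy (`toy_encrypt` is hypothetical and is not DES, IDEA, or RC5); only the search structure is the point: keys are generated inside the breaker, so the only input is one message-ciphertext pair and the only output is the matching key.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 16-bit block cipher, purely illustrative (NOT DES, IDEA, or RC5). */
uint16_t toy_encrypt(uint16_t block, uint16_t key)
{
    uint16_t x = (uint16_t)(block ^ key);
    x = (uint16_t)((x << 3) | (x >> 13));   /* rotate left by 3 bits */
    return (uint16_t)(x + key);
}

/* Try successive keys until one maps the known plaintext to the known
 * ciphertext. Returns 1 and stores the key in *found on success,
 * 0 if no key in the 16-bit key space matches. */
int exhaustive_key_search(uint16_t plain, uint16_t cipher, uint16_t *found)
{
    assert(found != 0);
    for (uint32_t k = 0; k <= 0xFFFFu; k++) {
        if (toy_encrypt(plain, (uint16_t)k) == cipher) {
            *found = (uint16_t)k;
            return 1;
        }
    }
    return 0;
}
```

With a single pair, more than one key of this toy cipher may match, so the key returned is one that is consistent with the pair rather than necessarily the original one. In hardware, the loop body is what gets replicated as parallel cipher units.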

Cipher Breaking Results - SRC-6
Theoretical maximum computational throughput / measured end-to-end throughput (million keys/s)
Application | SRC-6 | Xeon 2.8GHz | Speed-up
DES cipher breaking (20 units working in parallel) | 2000 | 1.77 | 1130
IDEA cipher breaking (10 units working in parallel) | 1000 | 2.19 | 457
RC5 cipher breaking (2 units working in parallel) | 200 | 0.71 | 282

Cipher Breaking Results - SGI Altix MOATB
Theoretical maximum computational throughput / measured end-to-end throughput (million keys/s)
Application | SGI | Xeon 2.8GHz | Speed-up
DES cipher breaking (10 units working in parallel) | 2000 | 1.77 | 1130

Example 3: Cryptography: Key exchange using ECC

Secret-key ciphers key of Alice and Bob - K AB Alice Bob Network Encryption Decryption

Key Distribution Problem
N users require N · (N-1) / 2 keys.
Users | Keys
100 | about 5,000
1000 | about 500,000
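The key count follows from every pair of users needing its own shared secret key. A minimal sketch of the arithmetic (the function name `pairwise_keys` is chosen here for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct secret keys needed so that every pair among
 * n_users shares its own key: n * (n - 1) / 2. */
uint64_t pairwise_keys(uint64_t n_users)
{
    assert(n_users < (1ULL << 32));   /* keeps the product within 64 bits */
    return n_users * (n_users - 1) / 2;
}
```

For 100 users this gives 4,950 keys and for 1000 users 499,500, which the slide rounds to 5,000 and 500,000.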

Public Key (Asymmetric) Ciphers Public key of Bob - K B Private key of Bob - k B Alice Bob Network Encryption Decryption

Alice Bob session key (random secret-key) Bob’s public key Key exchange for secret-key ciphers Bob’s private key Network Session key encrypted using Bob’s public key Message encrypted using session key

Message Hash function Public key cipher Alice Signature Alice’s private key Bob Hash function Alice’s public key Digital Signature Hash value 1 Hash value 2 Hash value Public key cipher yes no Message Signature

Why is public-key cryptography a good application for reconfigurable computers?
– computationally intensive arithmetic operations
– unconventionally long operand sizes (160-2048 bits)
– multiple algorithms, parameters, key sizes, and architectures = need for reconfiguration

Elliptic Curve Cryptosystems (ECC) a family of cryptosystems, rather than a single cryptosystem = added security but need for reconfiguration public key (asymmetric) cryptosystems used for key agreement and digital signatures implementations must be optimized for minimum latency rather than maximum throughput = limited speed-up from parallel processing

Basic operations of ECC
Basic operations in Galois Field GF(2^m):
– addition and subtraction (XOR): x+y, x-y
– multiplication, squaring: x·y, x^2
– inversion: x^-1
Basic operations on points of an Elliptic Curve:
– addition of points: P + Q
– doubling a point: 2P
– projective to affine coordinate: P2A
Complex operations on points of an Elliptic Curve:
– scalar multiplication: k·P = P + P + … + P (k times)
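The field-level operations listed above can be illustrated in polynomial basis with a deliberately tiny field. The sketch assumes m = 8 and the reduction polynomial x^8 + x^4 + x^3 + x + 1; both are choices made here so the code fits in machine words, whereas ECC uses much larger m. Inversion is done by exponentiation to 2^m - 2, which is one of several possible methods and not necessarily the one used in the libraries discussed.

```c
#include <assert.h>
#include <stdint.h>

#define GF_M    8        /* field is GF(2^8); an assumption for illustration */
#define GF_POLY 0x11Bu   /* x^8 + x^4 + x^3 + x + 1 (assumed field polynomial) */

/* Addition and subtraction are the same operation: bitwise XOR. */
uint32_t gf_add(uint32_t x, uint32_t y) { return x ^ y; }

/* Shift-and-add multiplication with reduction after every shift. */
uint32_t gf_mul(uint32_t x, uint32_t y)
{
    uint32_t result = 0;

    assert(x < (1u << GF_M) && y < (1u << GF_M));
    while (y != 0) {
        if (y & 1u)
            result ^= x;            /* add current multiple of x */
        y >>= 1;
        x <<= 1;                    /* multiply x by the variable "x" */
        if (x & (1u << GF_M))
            x ^= GF_POLY;           /* reduce modulo the field polynomial */
    }
    return result;
}

uint32_t gf_sqr(uint32_t x) { return gf_mul(x, x); }

/* Inversion as x^(2^m - 2), computed by square-and-multiply. */
uint32_t gf_inv(uint32_t x)
{
    uint32_t result = 1, base = x;
    unsigned e = (1u << GF_M) - 2;

    assert(x != 0 && x < (1u << GF_M));
    while (e != 0) {
        if (e & 1u)
            result = gf_mul(result, base);
        base = gf_sqr(base);
        e >>= 1;
    }
    return result;
}
```

In hardware, the XOR-based addition costs almost nothing, which is why multiplication and inversion dominate the macro hierarchy.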

Hierarchy of ECC functions kP P+Q2P projective_to_affine (P2A) MUL INV High level Medium level Low level 2 ROT XOR Low level 1
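The top of this hierarchy, kP, is built from the two medium-level operations P+Q and 2P. A common way to do that is the double-and-add method, sketched below. To keep the example self-contained, elliptic-curve points are replaced by integers modulo a hypothetical constant `GROUP_ORDER` under addition; `group_add` and `group_double` stand in for the real point operations, and nothing here is taken from the implementations discussed in the slides.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_ORDER 1000003u   /* hypothetical modulus of the stand-in group */

/* Stand-ins for point addition P+Q and point doubling 2P. */
static uint32_t group_add(uint32_t p, uint32_t q)
{
    return (uint32_t)(((uint64_t)p + q) % GROUP_ORDER);
}

static uint32_t group_double(uint32_t p) { return group_add(p, p); }

/* Left-to-right double-and-add: one doubling per bit of k, plus one
 * addition for every bit of k that is set. */
uint32_t scalar_multiply(uint32_t k, uint32_t p)
{
    uint32_t result = 0;   /* identity element of the stand-in group */

    assert(p < GROUP_ORDER);
    for (int bit = 31; bit >= 0; bit--) {
        result = group_double(result);
        if ((k >> bit) & 1u)
            result = group_add(result, p);
    }
    return result;
}
```

Each iteration depends on the previous one, which is one reason this kind of computation is bound by latency rather than throughput.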

C function for  P C function for MAP VHDL macro SRC Program Partitioning  P system FPGA system HLL HDL

Investigated Partitioning Schemes

kP C function for  P C function for FPGA VHDL macro μP Software Only Based on public-domain code by Rosing M., Implementing Elliptic Curve Cryptography, Manning, 1999

MUL4 C function for FPGA VHDL macros ROT XOR C function for µ P 0 H L1 V_ ROT VAR ROT kP P2A kP P+Q2P MUL2 MUL 0HL1 Partitioning INV P2A P+Q2P

MUL4 C function for FPGA VHDL macros ROT XOR C function for µ P 0 H V_ ROT INV kP P2A kP P+Q2P MUL2 0HL2 Partitioning P2A P+Q2P L2

0HM Partitioning C function for FPGA VHDL macros C function for µ P 0 H M P+Q 2P P2A kP

0 0 H 00H Partitioning (VHDL only) C function for  P C function for FPGA VHDL macro

Timing Measurements MAP Alloc. MAP Free DMA DataOut DMA Data In FPGA Computation.c file.mc file End-to-End time (SW) MAP function MAP function FPGA Configure Configuration time MAP Allocation time MAP Release Time End-to-End time (HW)

Results (Latency)

Results (Area)

78 185 349 371 MAP C 15326010070HL1 153 Main C 1601744 2301291 36 Macro Wrapper 0HM 1960 VHDL VHDL macro 0HL2 Algorithm Partitioning Scheme Number of lines of code

Conclusions Assuming focus on: Timing Resources Ease of programming

Conclusions – cont.
The best implementation approach: the 0HL1 partitioning scheme, with an 893 times speed-up vs. software and only a 0.46 times slowdown versus pure VHDL, combined with ease of implementation.

Conclusions – cont. Elliptic Curve Cryptosystem implementation challenging for reconfigurable computers because of optimization for latency rather than throughput limited amount of parallelism First publication showing a 1000x speed-up for a reconfigurable computer application optimized for data latency

Summary of results
Type of application | End-to-end speed-up of SRC-6 vs. P4
Computationally intensive (cipher breaking) | 300-1100
Latency critical (ECC key exchange) | 890-1300
Input/output intensive (secret key encryption/decryption) | 10-60

GWU_GMU secret key cipher libraries 1.Secret key cipher encryption and decryption 2.Secret key cipher breaking DES IDEA RC5 DES IDEA RC5

GWU_GMU public key cipher libraries 1.Operations in the binary Galois Fields GF(2 m ) a. polynomial basis b. normal basis 2. Multiprecision integer arithmetic 3. Elliptic Curve Operations - addition - doubling - scalar multiplication

89 Example 4 Image Processing: Hyperspectral Dimension Reduction

High-Performance Reconfigurable Computing Application: Hyperspectral Dimension Reduction
– Multi-Spectral Imagery: 10’s of bands
– Hyperspectral Imagery: 100’s-1000’s of bands
– Challenges: Curse of Dimensionality
– Solution: On-Board Dimension Reduction
– Needs: higher performance, higher flexibility
[Figure: Multispectral / Hyperspectral Imagery Comparison]

Hyperspectral Dimension Reduction (Techniques)
Principal Component Analysis (PCA):
– Most common method of dimension reduction
– Complex and global computations: difficult for parallel processing and hardware implementations
– Does not preserve spectral signatures
Wavelet-Based Dimension Reduction*:
– Simple and local operations
– High-performance implementation
– Preserves spectral signatures
[Figure: Multi-Resolution Wavelet Decomposition of Each Pixel 1-D Spectral Signature (Preservation of Spectral Locality)]
* S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April, 2003, pp. 863-871.

Discrete Wavelet Transform (DWT) Decomposition (Mallat Algorithm)
– The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images L and H
– Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
– This decomposition results in four images: LL, LH, HL and HH
– The LL image is taken as the new input to perform the next level of decomposition
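One filtering-and-decimation step of the decomposition can be sketched in C as follows. The sketch works on a single 1-D signal (one image row, one column, or one pixel's spectral signature); applying it along rows and then along columns gives the four sub-images. The Daubechies-4 filter taps are written as numeric constants, the periodic boundary handling is an assumption made here, and the function name `dwt_level_1d` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Daubechies-4 low-pass taps: (1+sqrt3), (3+sqrt3), (3-sqrt3), (1-sqrt3),
 * each divided by 4*sqrt2, written out numerically. */
static const double LP[4] = {
    0.4829629131445341, 0.8365163037378079,
    0.2241438680420134, -0.1294095225512604
};

/* Matching high-pass taps: HP[k] = (-1)^k * LP[3-k]. */
static const double HP[4] = {
    -0.1294095225512604, -0.2241438680420134,
    0.8365163037378079, -0.4829629131445341
};

/* One decomposition level: convolve with both filters and keep every
 * second output. low[] and high[] each receive n/2 samples.
 * Assumption: the signal is extended periodically at the boundary. */
void dwt_level_1d(const double *in, size_t n, double *low, double *high)
{
    assert(in != 0 && low != 0 && high != 0);
    assert(n >= 4 && n % 2 == 0);

    for (size_t i = 0; i < n / 2; i++) {
        double l = 0.0, h = 0.0;
        for (size_t k = 0; k < 4; k++) {
            double sample = in[(2 * i + k) % n];
            l += LP[k] * sample;
            h += HP[k] * sample;
        }
        low[i] = l;
        high[i] = h;
    }
}
```

Feeding `low` back in as the next input repeats the decomposition at the next level, halving the number of samples each time.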

93 Wavelet-Based Dimension Reduction (Description)

DWT on SRC-6
[Flowchart, measurements scenario: MAP Alloc., Read Data, Write Data, MAP Free]
1. Transfer coefficients to OBM bank c
2. Transfer image data to OBM bank a
3. Load coefficients from bank c to on-chip registers
4. Read one pixel from bank a
5. Compute Wavelet
6. Store result into bank b
7. End of image? No: go back to step 4. Yes: continue
8. Transfer image data from bank b to the host

95 DWT on SRC-6 (cnt’d) (Main Program) int main (int argc, char *argv[]) {. /* initialize daubechies coefficients */ float coeff[8]; coeff[0]= coeff[1]= (1+sqrt(3))/(4*sqrt(2)); coeff[2]= coeff[3]= (3+sqrt(3))/(4*sqrt(2)); coeff[4]= coeff[5]= (3-sqrt(3))/(4*sqrt(2)); coeff[6]= coeff[7]= (1-sqrt(3))/(4*sqrt(2));.. /* allocate images */. map_allocate(1); gettimeofday(&time0, NULL); proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno); gettimeofday(&time1, NULL); /* print time difference */. map_free(1);. } Allocate the RP configure and start the Program execution on the FPGA passing the input image pointer and the output image buffer pointer to be used by DMA individual parameters can be passed to the MAP C function such as image dimensions large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer of the coefficients array Free the RP

96 DWT on SRC-6 (cnt’d) MAP C Function (FPGA.mc) void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[], int dx, int dy, long long *ht0, long long *ht1, long long *ht2, long long *ht3, int mapnum) { // coefficients float LP0, LP1, LP2, LP3, LP4; float HP0, HP1, HP2, HP3, HP4; // variables int i, j, k; int64_t in_pixel, out_pixel; // input image OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE) // output image OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE) // filter coefficients OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)

DWT on SRC-6 (cont'd): MAP C Function (FPGA.mc)
    start_timer();
    read_timer(ht0);
    /* transfer image data to an OBM bank */
    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
    wait_DMA (0);
    /* transfer coefficients to an OBM bank */
    DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
    wait_DMA (0);
    read_timer(ht1);
    /* load coefficients from the OBM bank to on-chip registers */
    for (i = 0; i < 4; i++) {
        LP0 = LP1; HP0 = HP1;
        LP1 = LP2; HP1 = HP2;
        LP2 = LP3; HP2 = HP3;
        split_64to32_flt_flt (CL[i], &HP3, &LP3);
    }

DWT on SRC-6 (cont'd): MAP C Function (FPGA.mc)
    for (i = 0; i < Image_size; i++) {
        /* read pixel value from the OBM bank */
        in_pixel = AL[i];
        /* compute Wavelet */
        { . }
        /* store results to the OBM bank */
        BL[i] = out_pixel;
    }
    read_timer(ht2);
    /* transfer image data to the host */
    DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
    wait_DMA (0);
    read_timer(ht3);
    }

Overlapping Data Transfer with Computation (SRC-6)
– Improve performance by overlapping algorithm computation with data loading and unloading
– Parallel sections: multiple parallel code blocks are active in parallel
    #pragma src parallel sections
    {
        #pragma src section
        {
            for (i = 0; i < MAX_OBM_SIZE; i++) {
                get_stream (&S0, &v0);
                /* DO COMPUTATION (current data block) */
            }
        } /* end of parallel section with compute loop */
        #pragma src section
        {
            /* Stream DMA_IN */
            stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
        } /* end of parallel section with DMA */
    } /* end of parallel sections */
Time | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5
Read DMA | 1 | 2 | 3 | X | X
Algorithm | X | 1 | 2 | 3 | X
Write DMA | X | X | 1 | 2 | 3

Streams (SRC-6)
A stream is a data structure that allows flexible communication between concurrent producer and consumer loops.
    Stream_64 S0;
    #pragma src parallel sections
    {
        #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                put_stream (&S0, AL[i]+42, 1);
        } /* end of parallel section */
        #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                get_stream (&S0, &BL[i]);
        } /* end of parallel section */
    } /* end of parallel sections */
[Figure: Streams and Conventional Data Flow. Conventional data flow: On-Board Memory or BRAM, Compute Loop 1, On-Board Memory or BRAM, Compute Loop 2, On-Board Memory or BRAM. Streams: On-Board Memory or BRAM, Compute Loop 1, Streams, Compute Loop 2, On-Board Memory or BRAM. Streams save access to on-board memory; data is flowing in the logic.]
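The producer/consumer behaviour of a stream can be modeled in plain C with a bounded FIFO. This is only a software model under stated assumptions: `fifo_t`, `fifo_put`, `fifo_get`, `run_pipeline`, and the depth of 16 are all hypothetical names and values, not the SRC stream API, and the two loops are interleaved in one thread instead of running concurrently as they would in the FPGA logic.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 16u   /* hypothetical depth of the model FIFO */

typedef struct {
    uint64_t data[FIFO_DEPTH];
    unsigned head, tail, count;
} fifo_t;

void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; }

/* Returns 1 if the value was queued, 0 if the FIFO is full. */
int fifo_put(fifo_t *f, uint64_t v)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Returns 1 and stores the oldest value in *v, or 0 if the FIFO is empty. */
int fifo_get(fifo_t *f, uint64_t *v)
{
    if (f->count == 0)
        return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* Producer adds 42 to each input word and pushes it; consumer pops into
 * out[]. The intermediate values never pass through a memory array. */
void run_pipeline(const uint64_t *in, uint64_t *out, unsigned n)
{
    fifo_t s;
    unsigned produced = 0, consumed = 0;

    assert(in != 0 && out != 0);
    fifo_init(&s);
    while (consumed < n) {
        if (produced < n && fifo_put(&s, in[produced] + 42))
            produced++;
        if (fifo_get(&s, &out[consumed]))
            consumed++;
    }
}
```

The intermediate results live only in the FIFO, which corresponds to the point made on the slide: the data stays in the logic and does not make a round trip through on-board memory between the two loops.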

101 Cray XD-1

DWT on Cray-XD1 (Main Program)
    /* Define the address space for user registers and QDR memory */
    #define APP_CFG_REG 0x08UL
    #define USR_REG1 0x40UL
    #define USR_REG2 0x48UL
    #define USR_REG3 0x50UL
    #define USR_REG4 0x58UL
    #define QDR1_OFFSET 0x100UL /* Offset in FPGA memory space. */
    int main (int argc, char *argv[])
    {
        int fp_id;
        err_e e;
        u_64 coeff[4];
        u_64 * dp_base;
        u_64 * image;
        /* Open the FPGA device */
        fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);
        /* Load the FPGA */
        fpga_load (fp_id, "top.bin.ufp", &e);
        /* ... */
        /* Read Image */
        /* initialize daubechies coefficients */
        /* ... */
        /* Transfer coefficients into the FPGA registers */
        fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);

DWT on Cray-XD1 (cont'd) (Main Program)
        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);
        /* Map the entire 4 Mbytes of QDR memory */
        dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);
        /* Transfer the image into the QDR */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            dp_base[i] = image[i];
        /* Start processing */
        fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
        /* ... */
        /* Read the FPGA status */
        fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
        /* ... */
        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);
        /* Read back the image */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            output_image[i] = dp_base[i];
        /* Close the FPGA device */
        fpga_close (fp_id, &e);
    }

Accessing µP memory from FPGA (Cray-XD1)
– The APIs support access to a region of the µP memory by the FPGA logic
– The program uses the fpga_set_ftrmem function to:
  – allocate an FTR
  – associate it with the address space of the µP
  – set up the FPGA to access it directly
– fpga_set_ftrmem does not automatically provide the address of this region to the FPGA application logic
– One way to do that is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register
    unsigned long order;
    void *ftr_mem;
    /* ... */
    ftr_mem = fpga_set_ftrmem (fp_id, order, &e);
    if (ftr_mem == NULL) {
        /* Handle error. */
    }
    fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
    /* ... */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
    /* ... */

Using MPI on Cray-XD1
– Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
  – 2 Opteron processors (total 12)
  – 1 Xilinx Virtex-II Pro 50 (total 6)
– Applications can be parallelized across the 6 FPGAs using MPI
– Data are distributed across the 6 FPGAs
    if (MYTHREAD == 0)
        read_image (image_file_name, image_buffer, &rows, &cols);
    MPI_Bcast (&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast (&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);
    local_size = rows*cols/THREADS;
    MPI_Scatter (image_buffer, local_size, MPI_UNSIGNED_LONG,
                 local_image_buffer, local_size, MPI_UNSIGNED_LONG,
                 0, MPI_COMM_WORLD);
    /* Execute the wavelet on the hardware */
    process_image (fp_id, local_image_buffer, local_output_buffer, rows, cols);
    MPI_Gather (local_output_buffer, local_size, MPI_UNSIGNED_LONG,
                output_image_buffer, local_size, MPI_UNSIGNED_LONG,
                0, MPI_COMM_WORLD);
    if (MYTHREAD == 0)
        write_image (output_file_name, output_image_buffer, rows, cols);

DWT on SGI-Altix (Main Program)
    rasclib_algorithm_request_t ar;
    int alg_id;
    int res;
    strcpy (ar.alg_id, "Wavelet");
    ar.num_devices = 1;
    /* ... */
    /* Read Image */
    /* initialize daubechies coefficients */
    /* ... */
    rasclib_resource_alloc (&ar, 1);
    alg_id = rasclib_algorithm_open ("Wavelet");
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff2", coeff[2]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff3", coeff[3]);
    res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);
Parameter Passing
– Small parameters
  – Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
  – Pass parameter mapping to software through an extractor directive of type REG_IN:
    -- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
    -- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
    -- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
    -- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]
– Large Arrays
  – Dedicate a portion of an SRAM bank for the parameter array
  – Pass parameter array mapping to software with an extractor comment of type SRAM:
    -- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u fixed

DWT on SGI-Altix (cont'd) (Main Program)
    rasclib_algorithm_go (alg_id);
    res = rasclib_algorithm_receive (alg_id, "d_out", out_Buff, SIZE);
    rasclib_algorithm_commit (alg_id);
    rasclib_algorithm_wait (alg_id);
    rasclib_algorithm_close (alg_id);
Results Read-Back
– Small parameters
  – Connect to Algorithm Defined Registers
  – Pass parameter mapping to software through an extractor directive of type REG_OUT
  – Use the API function rasclib_algorithm_reg_read
– Large Arrays
  – Dedicate a portion of an SRAM bank for the parameter array
  – Pass parameter array mapping to software with an extractor comment of type SRAM:
    -- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u fixed

Streaming (SGI-Altix)
– Improve performance by overlapping algorithm computation with data loading and unloading
Time | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5
Read DMA | 0 | 1 | 2 | X | X
Algorithm | X | 0 | 1 | 2 | X
Write DMA | X | X | 0 | 1 | 2
– Extractor directives are used to tell software:
  – where input/output data arrays are located (SRAM bank + starting index)
  – the sizes of the input/output data arrays
  – which arrays have been enabled for streaming
– Extractor directive type used: SRAM with attribute stream, e.g.:
    -- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u stream
    -- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u stream

Example 5 Image processing: Thin Plate Splines

The application: Thin Plate Splines - image analysis of protein gels Image morphing based on natural logarithm computations Essential for comparing protein content Speedup per FPGA: 10-30x. Reduces analysis runtime from days to hours.

Host Program - running on Opteron CPU, calling FPGA subroutine Transfer parameter data to QDRAM Start Mitrion program and wait until finished Retrieve computed image data u_64 fpga_mem, i; my_fpga = fpga_open(args); // Use normal XD1 API for most operations... if (!fpga_is_loaded(args)) rtn = fpga_load(args);... // memory map QDRAMs into host address space fpga_mem = fpga_memmap(args); // Upload data to QDRAM memcpy(fpga_mem, parameter_data, sizeof_parameter_data); // Control of mitrion processor is internally handled // with a number of memory mapped registers in the FPGA // Controlling running/stepping/reset etc. mitrion_start(my_fpga); // Start mitrion block mitrion_wait(my_fpga); // wait for block to finish // Fetch results from QDRAM memcpy(image_coordinates, fpga_mem, sizeof_image_data);

FPGA program (1/3) - accelerated subroutine in Mitrion-C

    // Options: -cpp
    #define RAMType mem uint:64 [ 0x100000 ]
    #include "grint_lib.lqd"
    #include "logarithm_rwhile.lqd"

    (Fix, RAMType) readFix(RAMType m, uint:24 basicOffset, uint:24 fixOffset)
    {
        uint:32 memOffset = basicOffset + fixOffset;
        (result, m2) = _memread(m, memOffset);
    } (result, m2);

    (RAMType, RAMType, RAMType, RAMType) main (RAMType Am, RAMType Bm, RAMType Cm, RAMType Dm)
    {
        Fix py; // parameter vectors
        Fix px;
        Fix koeffx;
        Fix koeffy;
        // read parameters from external RAM
        (px, py, koeffx, koeffy, Aml) = foreach(index in )
        {
            (x, Am2) = readFix(Am, PX_OFF, index) ;
            (y, Am3) = readFix(Am2, PY_OFF, index) ;
            (kx, Am4) = readFix(Am3, KOEFFX_OFF, index);
            (ky, Am5) = readFix(Am4, KOEFFY_OFF, index);
        } (x, y, kx, ky, Am5);
        Aut = _wait(Aml);
        Cut = grintpolc(Cm, px, py, koeffx, koeffy);
    } (Aut, Bm, Cut, Dm);

Annotations:
- #define RAMType: definition of RAM type
- readFix fetches input data from QDRAM
- main: start of program. Matches the external RAM interface of the XD1: 4 banks of 1M words each

FPGA program (2/3) - accelerated subroutine in Mitrion-C

    RAMType grintpolc ( RAMType coords, // out
                        Fix px, Fix py, Fix koeffx, Fix koeffy )
    {
        imDonel = foreach(y in )
        {
            uint:32 lineoff = y*XSIZE;
            imDone2l = foreach(x in )
            {
                (distx, disty) = foreach(px, py, koeffx, koeffy in px, py, koeffx, koeffy)
                {
                    Fix dx = px - int2fix(x);
                    Fix dy = py - int2fix(y);
                    Fix r2 = fixmul(dx,dx) + fixmul(dy,dy);
                    Fix ext = if(r2 == 0) 0
                              else
                              {
                                  Fix ln = fixln(r2);
                                  ext = fixmul(r2,ln);
                              } ext;

Annotations:
- Input arguments (the image) for Thin Plate Splines transform
- Major compute intensive part: high precision ln computation

FPGA program (3/3) - accelerated subroutine in Mitrion-C

                    Fix rx = fixmul(ext, koeffx);
                    Fix ry = fixmul(ext, koeffy);
                } (rx, ry);
                Fix distcoordx = sum(distx);
                Fix distcoordy = sum(disty);
                // distcoordx and distcoordy are the coordinates
                // of the pixels to be fetched from the distorted image
                uint:32 index = x + lineoff;
                int:32 x32 = (distcoordx >>> 8); // convert into Fix16.16
                int:32 y32 = (distcoordy >>> 8); // convert into Fix16.16
                watch x32;
                watch y32;
                bits:64 word = [x32, y32];
                imDone3 = _memwrite(coords, index, word);
            } imDone3;
            imDone2 = _wait(imDone2l);
        } imDone2;
        imDone = _wait(imDonel);
    } imDone;

Annotations:
- Output argument is the distorted image
- Output arguments (distorted image coordinates) are written to QDRAM
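For readers who do not know Mitrion-C, the per-pixel computation of the three slides above can be written as a plain C reference. This is a sketch under stated assumptions, not the Mitrion implementation: it uses `double` instead of the `Fix` type, the names `ln_approx` and `tps_offset` are hypothetical, and `ln_approx` is only a software stand-in for `fixln` (the algorithm behind `fixln` is not shown in the slides).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for fixln(): natural logarithm of a finite x > 0.
 * Scales x into [1, 2) by powers of two, then uses the series
 * ln(x) = 2*atanh(y) with y = (x-1)/(x+1). */
static double ln_approx(double x)
{
    const double LN2 = 0.6931471805599453;
    int k = 0;
    assert(x > 0.0);
    while (x >= 2.0) { x *= 0.5; k++; }
    while (x < 1.0)  { x *= 2.0; k--; }
    double y = (x - 1.0) / (x + 1.0);      /* 0 <= y < 1/3 */
    double y2 = y * y, term = y, sum = 0.0;
    for (int n = 1; n <= 41; n += 2) {     /* y + y^3/3 + y^5/5 + ... */
        sum += term / n;
        term *= y2;
    }
    return k * LN2 + 2.0 * sum;
}

/* Thin-plate-spline offset for pixel (x, y): the sum over all control
 * points of koeff * r^2 * ln(r^2), where r^2 is the squared distance
 * between the pixel and the control point (px[i], py[i]). */
void tps_offset(double x, double y,
                const double *px, const double *py,
                const double *koeffx, const double *koeffy,
                size_t n, double *out_x, double *out_y)
{
    double sx = 0.0, sy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = px[i] - x;
        double dy = py[i] - y;
        double r2 = dx * dx + dy * dy;
        /* r^2 * ln(r^2) is defined as 0 at r = 0, as in the slide code */
        double ext = (r2 == 0.0) ? 0.0 : r2 * ln_approx(r2);
        sx += ext * koeffx[i];
        sy += ext * koeffy[i];
    }
    *out_x = sx;
    *out_y = sy;
}
```

The inner loop over control points corresponds to the innermost `foreach`, and the two accumulators correspond to `sum(distx)` and `sum(disty)`. The logarithm is evaluated once per pixel per control point, which is consistent with the slides naming it the compute-intensive part.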

115 Program Development Environments Challenges

116 Application Development for Reconfigurable Computers Program Entry Compilation Execution Platform mapping Debugging & Verification

117 Tasks Addressed in This Presentation Program Entry Compilation Execution Platform mapping Debugging & Verification

118 Program Program Entry

119 Platform Mapping SW/HW Partitioning Software (executed in the microprocessor system) Hardware (executed in the reconfigurable processor system) Program

120 SW/HW Partitioning & Coding: Traditional Approach
Specification → SW/HW Partitioning → SW Coding, HW Coding → SW Compilation, HW Compilation → SW Profiling, HW Profiling

121 SW/HW Partitioning & Coding: New Approach
Specification → SW/HW Coding → SW Compilation, HW Compilation → SW Profiling, HW Profiling → SW/HW Partitioning

122 Platform Mapping FPGA mapping Software Hardware Program FPGA 1 FPGA 2 FPGA 3 FPGA 4

123 Example of FPGA Mapping
[Figure: a program consisting of add, multiply, and divide operations, shown with alternative assignments of the operations to FPGA 1 and FPGA 2]

124 FPGA Mapping in SRC
[Figure: add, multiply, and divide operations split between FPGA 1 (FPGA1.mc) and FPGA 2 (FPGA2.mc); inputs a and b, output sum]

FPGA1.mc:
    void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
    {
        int64_t c, temp;
        send_to_bridge(b);         // pass b to FPGA 2
        c = a * const1;            // multiply on FPGA 1
        recv_from_bridge(&temp);   // result returned by FPGA 2
        *sum = temp + c;           // add on FPGA 1
    }

FPGA2.mc:
    void fpga2()
    {
        int64_t a, d;
        recv_from_bridge(&a);
        d = a/const2;              // divide on FPGA 2
        send_to_bridge(d);
    }

Makefile:
    MAPFILES  = FPGA1.mc FPGA2.mc
    PRIMARY   = FPGA1.mc
    SECONDARY = FPGA2.mc
    CHIP2     = FPGA2.mc
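The division of work between the two FPGAs can be tried out on an ordinary host with a small C model. This is a sketch only: `bridge_t`, `bridge_send`, `bridge_recv`, `fpga2_step`, and `mapped_sum` are hypothetical names, the one-word mailbox merely stands in for the SRC bridge, and the model assumes the final addition combines the product computed on FPGA 1 with the quotient returned by FPGA 2.

```c
#include <assert.h>
#include <stdint.h>

/* One-word mailbox standing in for one direction of the bridge
 * (hypothetical; the real bridge is a hardware port). */
typedef struct {
    int64_t word;
    int full;
} bridge_t;

static void bridge_send(bridge_t *b, int64_t v)
{
    assert(!b->full);          /* model: sender may not overwrite unread data */
    b->word = v;
    b->full = 1;
}

static int64_t bridge_recv(bridge_t *b)
{
    assert(b->full);           /* model: receiver needs data to be present */
    b->full = 0;
    return b->word;
}

/* Work assigned to FPGA 2: receive a value, divide, send the result back. */
static void fpga2_step(bridge_t *to_fpga2, bridge_t *to_fpga1, int64_t const2)
{
    int64_t a = bridge_recv(to_fpga2);
    bridge_send(to_fpga1, a / const2);
}

/* Work assigned to FPGA 1: send b away, multiply locally, then add the
 * value that comes back. On real hardware the multiply and the remote
 * divide can proceed concurrently; this model runs them one after another. */
int64_t mapped_sum(int64_t a, int64_t b, int64_t const1, int64_t const2)
{
    bridge_t to_fpga2 = { 0, 0 };
    bridge_t to_fpga1 = { 0, 0 };

    assert(const2 != 0);
    bridge_send(&to_fpga2, b);
    int64_t c = a * const1;
    fpga2_step(&to_fpga2, &to_fpga1, const2);
    int64_t temp = bridge_recv(&to_fpga1);
    return temp + c;
}
```

For example, `mapped_sum(3, 20, 4, 5)` models `3*4 + 20/5`.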

125 FPGA Mapping in VIVA™
By changing an object's attributes, one can specify where the object is to be located.

126 Platform Mapping FPGA-FPGA data transfer & synchronization Software Hardware Program FPGA 1 FPGA 2 FPGA 3 FPGA 4

127 FPGA-FPGA Data Transfer in SRC
[Figure: FPGA 1 (computation 1) and FPGA 2 (computation 2) connected by a 64-bit bridge; a, b, c travel from FPGA 1 to FPGA 2, d travels back]

FPGA1.mc:
    void fpga1(int64_t a, int64_t b, int64_t c, int64_t *d)
    {
        send_to_bridge(a, b, c);
        /* computation 1 */
        recv_from_bridge(d);
    }

FPGA2.mc:
    void fpga2()
    {
        int64_t a, b, c, d;
        recv_from_bridge(&a, &b, &c);
        /* computation 2 */
        send_to_bridge(d);
    }

128 FPGA-FPGA Data Transfer in SRC: Bridge Port
[Figure: 64-bit-wide bridge port between the FPGAs, buffered by 32-word FIFOs]
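A FIFO with the dimensions named on this slide (32 words of 64 bits) can be modelled in C as a ring buffer. This is a sketch, not SRC code: `bridge_fifo`, `fifo_init`, `fifo_push`, and `fifo_pop` are hypothetical names, and the choice to reject a push when the FIFO is full is an assumption, since the slide does not say how the bridge handles back-pressure.

```c
#include <assert.h>
#include <stdint.h>

#define BRIDGE_FIFO_WORDS 32   /* depth taken from the slide */

typedef struct {
    uint64_t words[BRIDGE_FIFO_WORDS];  /* 64-bit words, as on the slide */
    unsigned head;                      /* index of the oldest word */
    unsigned count;                     /* number of words currently stored */
} bridge_fifo;

void fifo_init(bridge_fifo *f)
{
    assert(f);
    f->head = 0;
    f->count = 0;
}

/* Returns 1 on success, 0 if the FIFO is full (the sender has to wait). */
int fifo_push(bridge_fifo *f, uint64_t w)
{
    assert(f);
    if (f->count == BRIDGE_FIFO_WORDS)
        return 0;
    f->words[(f->head + f->count) % BRIDGE_FIFO_WORDS] = w;
    f->count++;
    return 1;
}

/* Returns 1 and stores the oldest word in *w, or 0 if the FIFO is empty. */
int fifo_pop(bridge_fifo *f, uint64_t *w)
{
    assert(f && w);
    if (f->count == 0)
        return 0;
    *w = f->words[f->head];
    f->head = (f->head + 1) % BRIDGE_FIFO_WORDS;
    f->count--;
    return 1;
}
```

In this model a producer can run at most 32 words ahead of the consumer before `fifo_push` starts returning 0.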

129 FPGA-FPGA Data Transfer in VIVA™
For designs mapped over several FPGAs:
- The system description must include those FPGAs over which the design is to be mapped.
- Special partitioning objects, placed between the modules to be synthesized, automatically map the relevant lines between the FPGAs.

130 Platform Mapping: Use of Internal and External Memories
[Figure: Program split into Software and Hardware; FPGA 1 to FPGA 4 with OCM, LM, and SM]
OCM – On-Chip Memory
LM – Local Memory
SM – Shared Memory

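131 Using On-Chip Memory (OCM) in SRC
[Figure: AL[] in SM (OBM) feeds ocm_a[] in the FPGA's OCM, which feeds the computations producing c; data path labels 64 and 32]

    void sum(int64_t a[], int *c, int mapno)
    {
        BANK_A_ALLOC(AL, int64_t, SIZE);   // AL[] lives in shared memory (OBM)
        int64_t ocm_a[SIZE];               // ocm_a[] lives in on-chip memory
        int64_t tmp = 0;
        int i;

        cm2obm_0(AL, a, byteLength);       // transfer a[] into AL[]
        wait_server_0();
        for (i = 0; i < SIZE; i++) { ocm_a[i] = AL[i]; }      // OBM -> OCM
        for (i = 0; i < SIZE; i++) { tmp = ocm_a[i] + tmp; }  // compute from OCM
        *c = tmp;                          // return the result through c
    }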
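The three-step data movement on this slide (host memory to OBM, OBM to OCM, computation from OCM) can be mimicked on a host in plain C. This is a sketch with hypothetical names: `sum_via_ocm`, `obm`, and `ocm` are ordinary C objects standing in for the memories, `memcpy` stands in for the DMA transfer, and the fixed `SIZE` of 8 is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIZE 8   /* illustrative array length, not from the slide */

/* Host-side model of the staging pattern. */
int64_t sum_via_ocm(const int64_t a[SIZE])
{
    int64_t obm[SIZE];   /* stands in for AL[] in on-board memory */
    int64_t ocm[SIZE];   /* stands in for ocm_a[] in on-chip memory */
    int64_t total = 0;

    assert(a);

    /* Step 1: bulk transfer from host memory to the OBM bank. */
    memcpy(obm, a, sizeof obm);

    /* Step 2: copy from the OBM bank into on-chip memory. */
    for (int i = 0; i < SIZE; i++)
        ocm[i] = obm[i];

    /* Step 3: compute using only the on-chip copy. */
    for (int i = 0; i < SIZE; i++)
        total += ocm[i];

    return total;
}
```

On a host the extra copy gains nothing; the model only makes the order of the transfers explicit.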

132 Using On-Chip Memory (OCM) in VIVA™
Special objects under the Memory Subsystem of the library allow the programmer to use the on-chip memory of the Xilinx Virtex-II chip.

133 Platform Mapping: I/O
[Figure: Program split into Software and Hardware; FPGA 1 to FPGA 4 with SM, LM, and OCM; SRC, StarBridge]

134 Run Time Reconfiguration in SRC
Program in C or Fortran:
    Main program
        Function_1(a, d, e)
        Function_2(d, e, f)
    Function_1
        Macro_1(a, b, c)
        Macro_2(b, d)
        Macro_2(c, e)
    Function_2
        Macro_3(s, t)
        Macro_1(n, b)
        Macro_4(t, k)
[Figure: FPGA contents after the Function_1 call: Macro_1 and Macro_2 connected through signals a, b, c, d, e]
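The idea that the FPGA contents change with the function being called can be modelled in a few lines of C. This is a sketch, not the SRC runtime: `fpga_state`, `ensure_config`, and the `CONFIG_*` constants are hypothetical, and the policy of reloading only when a different configuration is requested is an assumption made for this example.

```c
#include <assert.h>

/* Identifiers for the configurations that implement each function
 * (hypothetical; one configuration per MAP function). */
enum { CONFIG_NONE = 0, CONFIG_FUNCTION_1 = 1, CONFIG_FUNCTION_2 = 2 };

typedef struct {
    int loaded;             /* configuration currently on the FPGA */
    int reconfigurations;   /* number of configuration loads so far */
} fpga_state;

void fpga_state_init(fpga_state *f)
{
    assert(f);
    f->loaded = CONFIG_NONE;
    f->reconfigurations = 0;
}

/* Make sure `config` is on the chip before a call; reload only when the
 * requested configuration differs from the one already loaded. */
void ensure_config(fpga_state *f, int config)
{
    assert(f && config != CONFIG_NONE);
    if (f->loaded != config) {
        f->loaded = config;
        f->reconfigurations++;
    }
}

/* Calls made by the main program. The macros of each function would
 * execute on the FPGA after the configuration check. */
void call_function_1(fpga_state *f) { ensure_config(f, CONFIG_FUNCTION_1); }
void call_function_2(fpga_state *f) { ensure_config(f, CONFIG_FUNCTION_2); }
```

After `call_function_1`, the model holds the configuration containing Macro_1 and Macro_2; a following `call_function_2` replaces it, and the counter shows how many times the chip was reloaded.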

135 Run-time Reconfiguration in VIVA™
Reconfiguration is possible by using the spawn object. By specifying the FileName attribute, a VIVA executable (.vex file) or a VIVA project can be loaded onto the same or a different FPGA.

136 Ideal Program Entry Program Entry Function

137 Actual Program Entry SW/HW Partitioning Data Transfers & Synchronization Use of Internal and External Memories Sequence of Run-time Reconfigurations Use of FPGA Resources (multipliers, μP cores) Preferred Architectures Program Entry Function FPGA Mapping SW/HW Interface

138 Evolution and the current status of tools
Legend: Not implemented / Manual Entry / Compiler Automated
Columns: SRC, Star Bridge, and other vendors...
Rows:
- FPGA-FPGA Partitioning
- μP-FPGA Partitioning
- FPGA-FPGA Data Transfer
- μP-FPGA Data Transfer
- Computation-Data transfer Overlapping
- Choosing component version

139 Debugging & Verification

140 MAP Board Execution
[Figure: Application; MAP Runtime Library; Subroutine for MAP (ComList Code, Wrapper Code, User Logic); MAP board with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs (User Logic, Registers & Flags, four Logic Macros)]

141 MAP Emulator + DFG Simulator
[Figure: Application; MAP Runtime Library; Subroutine for MAP (ComList Code, Wrapper Code, User Logic); Emulator with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs (User Logic, Registers & Flags, four C Code Macros)]

142 MAP Emulator + Verilog Simulator
[Figure: Application; MAP Runtime Library; Subroutine for MAP (ComList Code, Wrapper Code, User Logic); Emulator with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs (User Logic, Registers & Flags, four Verilog Macros simulated in VCS)]

143 X86 System in VIVA™
The FileIn object as it appears when the x86 system is loaded.

144 X86 System in VIVA™
The FileIn object as it appears when the FPGA system description is loaded.

145 Debugging in VIVA™
Data can be viewed with widgets, which are input and output ‘horns’ placed in a worksheet. Several display options are available: the viewer can choose the kind of view, and the displayed data can be switched between HEX and INT.

146 Debugging in the SGI Environment
IA-32 Linux Machine:
- HLL Design Entry (Handel-C, Mitrion C, Viva)
- RTL Generation and Integration with Core Services
- Design Synthesis (Synplify Pro, Amplify)
- Design Implementation (ISE)
- Metadata Processing (Python)
- Design Verification: Behavioral Simulation (VCS, Modelsim), Static Timing Analysis (ISE Timing Analyzer)
Altix:
- Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
- Real-time Verification (gdb)
Files passed between the steps: .v, .vhd; .edf; .ncd, .pcf; .bin; .cfg; .c

Compiler, Simulator And Debugger

148 Programming Environments Summary

149 SRC Programming Environment
+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ good support for debugging
+ vendor and user libraries
+ very well integrated environment
+ good use of 3rd party tools
+ in production use for over 3 years with constant improvements
- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of codes to HW platforms other than SRC

150 Star Bridge Programming Environment Viva
+ drag-and-drop program entry
+ standard and user libraries
+ separation of designs/programs from system/platform descriptions = portability of codes
+ support for multiple platforms under development
- does not follow any established standards
- no textual description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- no clear mechanism to call HW functions from SW

151 Cray Programming Environment based on Simulink/System Generator
+ drag-and-drop program entry
+ extensive libraries of DSP components
+ good support for debugging
- graphical description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- limited library support for applications other than DSP

152 Cray Programming Environment based on DSPLogic
+ graphical programming language (drag-and-drop program entry)
+ extensive libraries of DSP components
+ single environment (MATLAB™/Simulink™) to analyze, visualize, implement, debug, verify
+ efficient resource usage
- graphical description = limited scalability of codes
- limited library support for applications other than DSP

153 Cray XD1 and SGI Environments based on Mitrion-C
+ high-level C-like language, easy to learn for an HPC programmer
+ ease of describing parallelism and non-standard (variable size) data types
+ a small amount of Mitrion-C generates a large number of lines of HDL code
+ suitable for describing classical complex HPC problems
+ Mitrion-C code portable between Cray XD1 and SGI
- new and yet untested
- non-standard, no support for legacy codes
- language describes only what happens in a single FPGA
- currently, no mechanisms to use HDL macros
