Published by Sybil Byrd. Modified over 9 years ago.
1
1 Program Development Environments Languages & Tools Kris Gaj George Mason University
2
2 Acknowledgements AMI Cray Mitrion NCSA SGI SRC Star Bridge DoD/LUCITE Companies, centers, and sponsors
3
3 Esmail Chitalwala (GWU/Star Bridge) Hatim Diab (GWU) Esam El-Araby (GWU) Miaoqing Huang (GWU) Hoang Le (GMU) Allen Michalski (GMU/USC) Nandkishore Sastry (GMU) Chang Shu (GMU) Mohamed Taher (GWU) Proshanta Saha (GWU) Acknowledgements GWU/GMU students
4
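SRC Programming Model
- Microprocessor: main.c with function_1() and function_2(), written in ANSI C
- FPGA: function_1 and function_2, written in MAP C (a subset of ANSI C); between them they call macro_1(a, b, c), macro_2(b, d), macro_2(c, e), macro_3(s, t), macro_1(n, b), macro_4(t, k)
- Libraries of macros: macro_1, macro_2, macro_3, macro_4, ... written in VHDL, plus I/O
[Diagram: Macro_1 and Macro_2 instantiated on the FPGA and connected through signals a, b, c, d, e]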
5
SRC Program Partitioning
- µP system: C function for µP (HLL)
- FPGA system: C function for MAP (HLL), VHDL macro (HDL)
6
SRC Compilation Process
- Application sources for the µP (.c or .f files) → µP Compiler → object files (.o files)
- Application sources for the MAP (.mc or .mf files) → MAP Compiler → .v files → Logic synthesis → .ngo files (netlists) → Place & Route → .bin files (configuration bitstreams)
- Macro sources (HDL sources, .vhd or .v files) → Logic synthesis → .ngo files (netlists)
- Linker → application executable
7
SRC Libraries of Hardware Macros
User libraries of hardware macros developed by GWU/GMU/USC, 2002-2006
- Secret-key cipher encryption & breaking
- Binary Galois Field arithmetic (polynomial basis & normal basis representation)
- Elliptic Curve Arithmetic
- Long integer modular arithmetic (RSA)
- Sorting
- Image processing
- Bioinformatics
See http://hpc.gwu.edu/library
Vendor libraries of hardware macros
- basic integer and floating-point arithmetic
- digital signal processing
8
Star Bridge Programming Environment - Viva
[Screenshot of the Viva environment, with the Library, Object and Sheets panes labelled]
9
Place & Route.bin files.ngo files Application executable Configuration bitstreams Netlists Star Bridge Compilation Process VIVA Graphical User Interface User input Xilinx
10
Cray XD1 Programming Flows
Source: [Cray, MAPLD05]
- Standard Flow: VHDL/Verilog, e.g. process (a, m) is begin z <= a and m; end process;
- Mitrion High-level Flow: Mitrion-C, e.g. int mask(a, m) { return (a & m); }, translated to VHDL
- MATLAB/Simulink (The MathWorks) with System Generator (Xilinx), producing VHDL or Verilog
- Synthesis (Mentor Graphics, Synopsys, Synplicity, Xilinx) to gate-level EDIF, followed by Place & Route (Xilinx)
[Waveform diagram for signals a, m, z: 01001011010101 01010110101001 01000101011010 10100101010101]
11
Xtreme DSP Design Flow
12
HDL-based SGI Altix Programming Flow IA-32 Linux Machine Design iterations Design Entry (Verilog, VHDL) Design Synthesis (Synplify Pro, Amplify) Design Implementation (ISE) Design Verification Behavioral Simulation (VCS, Modelsim) Static Timing Analysis (ISE Timing Analyzer).v,.vhd.edf.ncd,.pcf.bin Metadata Processing (Python).v,.vhd.cfg Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) Real-time Verification (gdb).c
13
IA-32 Linux Machine RTL Generation and Integration with Core Services Design Synthesis (Synplify Pro, Amplify) Design Verification Behavioral Simulation (VCS, Modelsim) Static Timing Analysis (ISE Timing Analyzer).v,.vhd.edf.ncd,.pcf.bin Metadata Processing (Python).v,.vhd.cfg Altix Device Programming (RASC Abstraction Layer, Device Manager, Device Driver) Real-time Verification (gdb).c Design Implementation (ISE) HLL Design Entry (Handel-C, Mitrion C, Viva) HLL-based SGI Altix Programming Flow
14
Mitrion-C Programming Model for Cray & SGI MicroprocessorFPGA main.c function_1(in1) start_fpga() ANSI C based on Mitrion API FPGA I/O RAM Application code (platform independent) Mitrion Distributed Processor Architecture (platform dependent) Mitrion Compiler & Configurator application on the distributed processor Input & output Mitrion-C VHDL function_1(in2) start_fpga()
15
Compiling A Mitrion Program Processor Configurator Processor Architecture Mitrion-C Source code Processor HW-Design (VHDL IP Core) FPGA Mitrion Software Development Kit Simulator & Debugger Processor Machine-code Compiler
16
The Mitrion Platform 1) The Mitrion Virtual Processor –A fine-grain massively parallel, configurable soft-core processor –10-30 times faster than traditional CPUs 2) The Mitrion-C programming language –An intrinsically parallel C-family language 3) The Mitrion Software Development Kit –Compiler –Debugger/Simulator –Processor configurator
17
A New Processor Architecture Specifically For FPGAs int:48 main() { int:48 prev = 1; int:48 fib = 1; int:48 fibonnacci = for(i in ) { fib = fib+prev; prev = fib; } <>fib; } fibonnacci; ? Architecture design goal: High silicon utilization Take advantage of FPGA re-configurability Goal achieved by: Allow processor to be massively parallel Allow processor to be fully adapted to algorithm
18
Processor Architecture: A Cluster-On-A-Chip Non-Von Neumann architecture Processor architecture more like a cluster Very Fine-Grain Parallelism –Normal clusters run a block of code on each PE 1 –Mitrion runs a single instruction on each PE –Each PE adapted to optimally run its instruction Network topology specific for algorithm No Instruction Stream, instead Data Stream 1) PE = Processing Element
20
A C-family Language
Basic syntax is the same as for other C-family languages. Examples:
–Blocks are surrounded by { }
–Assignment with =
–Statements end with ;
–if, for, while
–Most of the usual C operators
–C-style comments (though nestable)
21
Types Basic types int/uint signed/unsigned integer boolean boolean value ( true / false ) float Floating point real value bits Bit vector format Free bit width int:24 24 bit signed integer uint:19 19 bit unsigned integer float:24.8 IEEE-754 single precision float Collections int:24[100] Vector (indexable collection) int:14 List (no index)
22
Language constructs Operators if(a>b)... while(i<10)... for(i in )... foreach (e in vector)... int:8 function(int:8 a)...
23
A C-family Language Important differences –No pointers –No dynamic allocation –Static general recursion only Though loop structures may be dynamic
24
Compiler, Simulator And Debugger
26
26 Hardware Software Graphical Data Flow Diagram HLLHDL Increased productivity Increased capability to describe parallel execution Program Entry for FPGA Accelerator Boards Traditional Extended (e.g. Corefire) Hardware Software
27
27 Increased productivity Increased capability to describe parallel execution Star Bridge Hardware Software porting EDIF COM objects Program Entry for Reconfigurable Computers Hardware Software SRC HLLHDL Graphical Data Flow Diagram HDL macros
28
28 Increased productivity Increased capability to describe parallel execution Cray XD1 with Simulink Hardware Software Program Entry for Reconfigurable Computers Hardware Software SGI or Cray with Mitrion HLLHDL Graphical Data Flow Diagram Mitrion Processor Mitrion-C Xilinx System Generator Simulink
29
29 General hierarchy of library files suggested by SRC Computers Inc.
30
30 Structure of the SRC macro repository common rev_drev_e hdlfile InfoFileBlkBoxFile macro1 macro2macro3 rev_f DebugCodeFile DataSheet
31
Files describing an SRC macro
Platform independent
–HDL file: macro.v or macro.vh. Verilog or VHDL code defining the macro
–Debug Code File: macro.c. Provides the equivalent C functionality for the macro
–Data sheet file: datasheet. Contains the documentation for the macro
Platform dependent
–Blk Box File: blackbox.v. Interface (black box) definition for the macro in Verilog
–Info File: info. Info file entry for this macro
32
32 Library Development - SRC HLL (C, Fortran) HDL (VHDL, Verilog) µP system FPGA system Application Programmer Library Developer HLL (C, Fortran) HLL (C, Fortran) LLL (ASM) HLL (C, Fortran)
33
33 Library Development - StarBridge GDF (Viva) HDL (VHDL, Verilog) µP system FPGA system Application Programmer Library Developer GDF (Viva) GDF (Viva) HLL, LLL (C++, ASM) GDF (Viva)
34
34 Software libraries and their role in the development of SRC libraries
35
Roles of software libraries
1. source of test vectors for VHDL macros
2. emulation of hardware during debugging
3. performance comparison
36
How to approach porting your application to reconfigurable computers?
1. Identify class of applications
2. Identify basic operations required by your applications
3. Determine the existence of the RC library of such operations
4. Determine the existence of the microprocessor library of such operations
5. Determine the right granularity for the required library operations
37
Classes of applications
1. input/output intensive applications
   - bulk data encryption (DES, IDEA, and RC5 encryption)
2. computationally intensive applications
   - secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
   - public-key cipher breaking based on factoring
3. latency-critical applications
   - cipher key agreement and signature (ECC schemes, RSA)
38
Example 1 Cryptography: High-throughput encryption
39
Cipher message ciphertext cryptographic key K bits
40
Secret-key ciphers key of Alice and Bob - K AB Alice Bob Network Encryption Decryption
41
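High-Throughput Encryption
[Diagram: message blocks M_i, M_{i+1}, M_{i+2}, ... enter the Encryption unit, keyed with K_0, and leave as ciphertext blocks C_i, C_{i+1}, C_{i+2}, ...]
Encryption algorithms: DES, 3DES, AES, RC5, IDEA, etc.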
42
Fully Pipelined Architecture.. Loop unrolling Pipeline stages inside of cipher rounds New input & new output every clock cycle.. Round 1 Round 2 Round k...
43
Encryption on SRC-6 – No streaming encryption.mc (1) #include void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timein, uint64_t *hardware_timeprocess, uint64_t *hardware_timeout, int mapnum) { OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_C (S3OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_F (S6OBM, uint64_t, MAX_OBM_SIZE) uint32_t encrypt_decrypt; //0:encrypt 1:decrypt int i, nbytes; uint64_t t1,t2,t3,t4;
44
encrypt_decrypt = 0; nbytes = MAX_OBM_SIZE * 8*3; start_timer(); read_timer(&t1); DMA_CPU(CM2OBM, S1OBM, MAP_OBM_stripe(1,"A,B,C"), sdata, 1, nbytes, 0); wait_DMA(0); read_timer(&t2); for(i=0;i<MAX_OBM_SIZE;i++) { des (S1OBM[i], key, encrypt_decrypt, &S4OBM[i]); des (S2OBM[i], key, encrypt_decrypt, &S5OBM[i]); des (S3OBM[i], key, encrypt_decrypt, &S6OBM[i]); } read_timer(&t3); Encryption on SRC-6 – No streaming encryption.mc (2)
45
Encryption on SRC-6 – No streaming encryption.mc (3) DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E,F"), sdata, 1, nbytes, 5); wait_DMA(5); read_timer(&t4); *hardware_timein = t2-t1; *hardware_timeprocess = t3-t2; *hardware_timeout = t4-t3; }
46
Encryption on SRC-6 – No streaming des_blkbx.v module des ( desOut, desIn, keyin, decrypt, clk ) /* synthesis syn_black_box syn_noprune=1 */ ; output [63:0] desOut; input [63:0] desIn; input [63:0] keyin; input decrypt; input clk /* synthesis syn_noclockbuf=1 */ ; endmodule
47
Encryption on SRC-6 – No streaming des.info (1) BEGIN_DEF "des" MACRO = "des"; LATENCY = 17; STATEFUL = NO; EXTERNAL = NO; PIPELINED = YES; INPUTS = 3: I0 = INT 64 BITS (desIn[63:0]) I1 = INT 64 BITS (keyin[63:0]) I2 = INT 32 BITS (decrypt) ; OUTPUTS = 1: O0 = INT 64 BITS (desOut[63:0]) ; IN_SIGNAL : 1 BITS "clk" = "CLOCK";
48
Encryption on SRC-6 – No streaming des.info (2) DEBUG_HEADER = $ void des__dbg (long long desin, long long keyin, int decrypt, long long *desout); $; DEBUG_FUNC = $ #include void des__dbg(long long desin, long long keyin, int decrypt, long long *desout) { des_(desout, &desin, &keyin, &decrypt); } $; END_DEF
49
Encryption on SRC-6 - with streaming encryption.mc (1) #include void encryption (uint64_t sdata[], uint64_t key, uint64_t *hardware_timeprocess, uint64_t *hardware_timeout, int mapnum) { OBM_BANK_A (S1OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_B (S2OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_D (S4OBM, uint64_t, MAX_OBM_SIZE) OBM_BANK_E (S5OBM, uint64_t, MAX_OBM_SIZE) uint32_t encrypt_decrypt; //0:encrypt 1:decrypt int i, nbytes; uint64_t t1,t2,t3; Stream_64 S0, S1; uint64_t v0, v1; encrypt_decrypt = 0; nbytes = MAX_OBM_SIZE * 8*2;
50
Encryption on SRC-6 – with streaming
encryption.mc (2)

    start_timer();
    read_timer(&t1);

    #pragma src parallel sections
    {
        #pragma src section
        {
            /* stream the input data from the CPU memory into streams S0 and S1 */
            stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
        }
        #pragma src section
        {
            /* consume one word from each stream per iteration */
            for (i=0; i<MAX_OBM_SIZE; i++) {
                get_stream (&S0, &v0);
                get_stream (&S1, &v1);
                des (v0, key, encrypt_decrypt, &S4OBM[i]);
                des (v1, key, encrypt_decrypt, &S5OBM[i]);
            }
        }
    }
51
Encryption on SRC-6 – with streaming encryption.mc (3) read_timer(&t2); DMA_CPU(OBM2CM, S4OBM, MAP_OBM_stripe(1,"D,E"), sdata, 1, nbytes, 5); wait_DMA(5); read_timer(&t3); *hardware_timeprocess = t2-t1; *hardware_timeout = t3-t2; }
52
Results: SRC-6 without streaming (throughput in Mbits/s)
Application | Computational Throughput | Data Transfer In Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC-6 | End-to-End Throughput, Xeon 2.8GHz | Speed-up
3 DES Ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 93 | 46
3 IDEA Ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 113 | 38
3 RC5 Ciphers (64-bit block) | 19,200 | 11,350 | 10,760 | 4,240 | 560 | 7.5
53
Results: SRC-6 with streaming (3 units) (throughput in Mbits/s)
Application | Computational Throughput | Data Transfer In & processing Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC-6 | End-to-End Throughput, Xeon 2.8GHz | Speed-up
3 DES Ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 93 | 52
3 IDEA Ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 113 | 42.5
3 RC5 Ciphers (64-bit block) | NA | 9,000 | 10,760 | 4,800 | 560 | 8.5
54
Results: SRC-6 with streaming (2 units) (throughput in Mbits/s)
Application | Computational Throughput | Data Transfer In & processing Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC-6 | End-to-End Throughput, Xeon 2.8GHz | Speed-up
2 DES Ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 93 | 58
2 IDEA Ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 113 | 47.5
2 RC5 Ciphers (64-bit block) | NA | 11,350 | 10,760 | 5,400 | 560 | 9.5
55
Results: SGI Altix MOATB without streaming (throughput in Mbits/s)
Application | Computational Throughput | Data Transfer In / Out Throughput | End-to-End Throughput, Altix | End-to-End Throughput, Xeon 2.8GHz | Speed-up
1 DES Cipher (64-bit block) | 12,800 (200MHz) | NA | 2,430 | 93 | 26
1 IDEA Cipher (64-bit block) | 6,400 (100MHz) | NA | 2,040 | 113 | 18
1 RC5 Cipher (64-bit block) | 12,800 (200MHz) | NA | 2,430 | 560 | 4.5
56
Results: SGI Altix MOATB with streaming (throughput in Mbits/s)
Application | Computational Throughput | Data Transfer In / Out Throughput | End-to-End Throughput, Altix | End-to-End Throughput, Xeon 2.8GHz | Speed-up
1 DES Cipher (64-bit block) | 12,800 (200MHz) | NA | 3,080 | 93 | 33
1 IDEA Cipher (64-bit block) | 6,400 (100MHz) | NA | 2,480 | 113 | 22
1 RC5 Cipher (64-bit block) | 12,800 (200MHz) | NA | 3,080 | 560 | 5.5
57
Example 2 Cryptography: Cipher Breaking
58
Secret-key cipher breaking
Given: ciphertext, guessed fragment of the plaintext
Looked for: remaining plaintext or key
Method: exhaustive key search (brute-force) attack
[Diagram: successive keys are fed to the cipher]
59
Secret-key cipher breaking
[Diagram: the cipher breaker receives a Message-Ciphertext pair (M_0, C_0), tries the keys K_1, K_2, K_3, ..., K_N, and outputs the correct key]
- Keys are generated by the cipher breaker
- Negligibly small input/output
- Huge amount of computations
60
Cipher Breaking Results - SRC-6
Theoretical Maximum Computational Throughput / Measured End-to-End Throughput (million keys/s)
Application | SRC 6 | Xeon 2.8GHz | Speed-up
DES Cipher Breaking (20 units working in parallel) | 2000 | 1.77 | 1130
IDEA Cipher Breaking (10 units working in parallel) | 1000 | 2.19 | 457
RC5 Cipher Breaking (2 units working in parallel) | 200 | 0.71 | 282
61
Cipher Breaking Results - SGI Altix MOATB
Theoretical Maximum Computational Throughput / Measured End-to-End Throughput (million keys/s)
Application | SGI | Xeon 2.8GHz | Speed-up
DES Cipher Breaking (10 units working in parallel) | 2000 | 1.77 | 1130
62
Example 3: Cryptography: Key exchange using ECC
63
Secret-key ciphers key of Alice and Bob - K AB Alice Bob Network Encryption Decryption
64
Key Distribution Problem
N users require N·(N-1)/2 keys
Users | Keys
100 | about 5,000
1,000 | about 500,000
65
Public Key (Asymmetric) Ciphers Public key of Bob - K B Private key of Bob - k B Alice Bob Network Encryption Decryption
66
Alice Bob session key (random secret-key) Bob’s public key Key exchange for secret-key ciphers Bob’s private key Network Session key encrypted using Bob’s public key Message encrypted using session key
67
Message Hash function Public key cipher Alice Signature Alice’s private key Bob Hash function Alice’s public key Digital Signature Hash value 1 Hash value 2 Hash value Public key cipher yes no Message Signature
68
Why is public-key cryptography a good application for reconfigurable computers?
- computationally intensive arithmetic operations
- unconventionally long operand sizes (160-2048 bits)
- multiple algorithms, parameters, key sizes, and architectures = need for reconfiguration
69
Elliptic Curve Cryptosystems (ECC)
- a family of cryptosystems, rather than a single cryptosystem = added security but need for reconfiguration
- public key (asymmetric) cryptosystems used for key agreement and digital signatures
- implementations must be optimized for minimum latency rather than maximum throughput = limited speed-up from parallel processing
70
Basic operations of ECC
Basic operations in Galois Field GF(2^m)
- addition and subtraction (XOR): x+y, x-y
- multiplication, squaring: x·y, x^2
- inversion: x^-1
Basic operations on points of an Elliptic Curve
- addition of points: P + Q
- doubling a point: 2P
- projective to affine coordinate: P2A
Complex operations on points of an Elliptic Curve
- scalar multiplication: k·P = P + P + … + P (k times)
71
Hierarchy of ECC functions
- High level: kP
- Medium level: P+Q, 2P, projective_to_affine (P2A)
- Low level 2: MUL, INV
- Low level 1: ROT, XOR
72
SRC Program Partitioning
- µP system: C function for µP (HLL)
- FPGA system: C function for MAP (HLL), VHDL macro (HDL)
73
Investigated Partitioning Schemes
74
kP C function for P C function for FPGA VHDL macro μP Software Only Based on public-domain code by Rosing M., Implementing Elliptic Curve Cryptography, Manning, 1999
75
MUL4 C function for FPGA VHDL macros ROT XOR C function for µ P 0 H L1 V_ ROT VAR ROT kP P2A kP P+Q2P MUL2 MUL 0HL1 Partitioning INV P2A P+Q2P
76
MUL4 C function for FPGA VHDL macros ROT XOR C function for µ P 0 H V_ ROT INV kP P2A kP P+Q2P MUL2 0HL2 Partitioning P2A P+Q2P L2
77
0HM Partitioning C function for FPGA VHDL macros C function for µ P 0 H M P+Q 2P P2A kP
78
0 0 H 00H Partitioning (VHDL only) C function for P C function for FPGA VHDL macro
79
Timing Measurements MAP Alloc. MAP Free DMA DataOut DMA Data In FPGA Computation.c file.mc file End-to-End time (SW) MAP function MAP function FPGA Configure Configuration time MAP Allocation time MAP Release Time End-to-End time (HW)
80
Results (Latency)
81
Results (Area)
82
78 185 349 371 MAP C 15326010070HL1 153 Main C 1601744 2301291 36 Macro Wrapper 0HM 1960 VHDL VHDL macro 0HL2 Algorithm Partitioning Scheme Number of lines of code
83
Conclusions Assuming focus on: Timing Resources Ease of programming
84
Conclusions – cont.
The best implementation approach: 0HL1 partitioning scheme
- 893x speed-up vs. software
- only 0.46 times slowdown versus pure VHDL
- ease of implementation
85
Conclusions – cont. Elliptic Curve Cryptosystem implementation challenging for reconfigurable computers because of optimization for latency rather than throughput limited amount of parallelism First publication showing a 1000x speed-up for a reconfigurable computer application optimized for data latency
86
Summary of results
Type of application | End-to-end speed-up of SRC-6 vs. P4
Computationally intensive (cipher breaking) | 300-1100
Latency critical (ECC key exchange) | 890-1300
Input/output intensive (secret key encryption/decryption) | 10-60
87
GWU_GMU secret key cipher libraries 1.Secret key cipher encryption and decryption 2.Secret key cipher breaking DES IDEA RC5 DES IDEA RC5
88
GWU_GMU public key cipher libraries
1. Operations in the binary Galois Fields GF(2^m)
   a. polynomial basis
   b. normal basis
2. Multiprecision integer arithmetic
3. Elliptic Curve Operations
   - addition
   - doubling
   - scalar multiplication
89
89 Example 4 Image Processing: Hyperspectral Dimension Reduction
90
90 Multi-Spectral Imagery 10’s of bands Hyperspectral Imagery 100’s-1000’s of bands Challenges - Curse of Dimensionality Solution On-Board Dimension Reduction Needs Higher performance Higher flexibility Multispectral / Hyperspectral Imagery Comparison High-Performance Reconfigurable Computing Application: Hyperspectral Dimension Reduction
91
91 Hyperspectral Dimension Reduction (Techniques) Principal Component Analysis (PCA): Most Common Method Dimension Reduction Complex and Global computations: difficult for parallel processing and hardware implementations Does Not Preserve Spectral Signatures Wavelet-Based Dimension Reduction*: Simple and Local Operations High-Performance Implementation Preserves Spectral Signatures Multi-Resolution Wavelet Decomposition of Each Pixel 1-D Spectral Signature (Preservation of Spectral Locality) * S. Kaewpijit, J. Le Moigne, T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis”, IEEE Transactions on Geoscience and Remote Sensing, Vol. 41, No. 4, April, 2003, pp. 863-871.
92
Discrete Wavelet Transform (DWT) Decomposition (Mallat Algorithm)
- The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H.
- Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two.
- This decomposition results in four images: LL, LH, HL and HH.
- The LL image is taken as the new input to perform the next level of decomposition.
93
93 Wavelet-Based Dimension Reduction (Description)
94
DWT on SRC-6
1. transfer image data to OBM bank a
2. transfer coefficients to OBM bank c
3. load coefficients from bank c to on-chip registers
4. read one pixel from bank a
5. compute Wavelet
6. store result into bank b
7. End of Image? No: return to step 4. Yes: continue
8. transfer image data from bank b to the host
Measurements Scenario: MAP Alloc., Read Data, Write Data, MAP Free
95
95 DWT on SRC-6 (cnt’d) (Main Program) int main (int argc, char *argv[]) {. /* initialize daubechies coefficients */ float coeff[8]; coeff[0]= coeff[1]= (1+sqrt(3))/(4*sqrt(2)); coeff[2]= coeff[3]= (3+sqrt(3))/(4*sqrt(2)); coeff[4]= coeff[5]= (3-sqrt(3))/(4*sqrt(2)); coeff[6]= coeff[7]= (1-sqrt(3))/(4*sqrt(2));.. /* allocate images */. map_allocate(1); gettimeofday(&time0, NULL); proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno); gettimeofday(&time1, NULL); /* print time difference */. map_free(1);. } Allocate the RP configure and start the Program execution on the FPGA passing the input image pointer and the output image buffer pointer to be used by DMA individual parameters can be passed to the MAP C function such as image dimensions large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer of the coefficients array Free the RP
96
96 DWT on SRC-6 (cnt’d) MAP C Function (FPGA.mc) void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[], int dx, int dy, long long *ht0, long long *ht1, long long *ht2, long long *ht3, int mapnum) { // coefficients float LP0, LP1, LP2, LP3, LP4; float HP0, HP1, HP2, HP3, HP4; // variables int i, j, k; int64_t in_pixel, out_pixel; // input image OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE) // output image OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE) // filter coefficients OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)
97
DWT on SRC-6 (cnt’d)
MAP C Function (FPGA.mc)

    start_timer();
    read_timer(ht0);

    // DMA Input Image transfer: transfer image data to an OBM bank
    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
    wait_DMA (0);

    // DMA coefficients transfer: transfer coefficients to an OBM bank
    DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
    wait_DMA (0);
    read_timer(ht1);

    // load coefficients from the OBM bank to on-chip registers
    for (i = 0; i < 4; i++) {
        LP0 = LP1; HP0 = HP1;
        LP1 = LP2; HP1 = HP2;
        LP2 = LP3; HP2 = HP3;
        split_64to32_flt_flt (CL[i], &HP3, &LP3);
    }
98
DWT on SRC-6 (cnt’d)
MAP C Function (FPGA.mc)

    for (i = 0; i < Image_size; i++) {
        // read pixel value from the OBM bank
        in_pixel = AL[i];
        {
            .   // compute Wavelet
        }
        // store results to the OBM bank
        BL[i] = out_pixel;
    }
    read_timer(ht2);

    // transfer image data to the host
    DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
    wait_DMA (0);
    read_timer(ht3);
}
99
Overlapping Data Transfer with Computation (SRC-6)
Improve performance by overlapping algorithm computation and data loading and unloading
Parallel sections: multiple parallel code blocks are active in parallel

    #pragma src parallel sections
    {
        #pragma src section
        {
            for (i = 0; i < MAX_OBM_SIZE; i++) {
                get_stream (&S0, &v0);
                DO COMPUTATION (Current Data Block)
            }
        } /* end of parallel section with compute loop */
        #pragma src section
        {
            /* Stream DMA_IN */
            stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, S1OBM, DMA_A_B, sdata, 1, nbytes);
        } /* end of parallel section with DMA */
    } /* end of parallel sections */

Time | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5
Read DMA | 1 | 2 | 3 | X | X
Algorithm | X | 1 | 2 | 3 | X
Write DMA | X | X | 1 | 2 | 3
100
Streams (SRC-6)
A stream is a data structure that allows flexible communication between concurrent producer and consumer loops

    Stream_64 S0;
    #pragma src parallel sections
    {
        #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                put_stream (&S0, AL[i]+42, 1);
        } /* end of parallel section */
        #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                get_stream (&S0, &BL[i]);
        } /* end of parallel section */
    } /* end of parallel sections */

Streams and Conventional Data Flow
- Conventional Data Flow: On-Board Memory or BRAM → Compute Loop 1 → On-Board Memory or BRAM → Compute Loop 2 → On-Board Memory or BRAM
- Streams: On-Board Memory or BRAM → Compute Loop 1 → Stream → Compute Loop 2 → On-Board Memory or BRAM
- Streams save time and accesses to On-Board Memory: data is flowing in the logic
101
101 Cray XD-1
102
DWT on Cray-XD1
(Main Program)

    /* Define the address space for user registers and QDR memory */
    #define APP_CFG_REG 0x08UL
    #define USR_REG1 0x40UL
    #define USR_REG2 0x48UL
    #define USR_REG3 0x50UL
    #define USR_REG4 0x58UL
    #define QDR1_OFFSET 0x100UL /* Offset in FPGA memory space.*/

    int main (int argc, char *argv[])
    {
        int fp_id;
        err_e e;
        u_64 coeff[4];
        u_64 * dp_base;
        u_64 * image;

        /* Open the FPGA Device */
        fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);
        /* Load the FPGA */
        fpga_load (fp_id, "top.bin.ufp", &e);
        .
        . /* Read Image */
        . /* initialize daubechies coefficients */
        .
        /* Transfer coefficients into the FPGA registers */
        fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);
103
DWT on Cray-XD1 (cnt’d)
(Main Program)

        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

        /* Map the entire 4 Mbytes of QDR Memory */
        dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);

        /* Transfer the Image into the QDR */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            dp_base[i] = image[i];

        /* Start Processing */
        fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
        /*... */

        /* Read the FPGA status */
        fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
        /*... */

        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

        /* Read back the Image */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            output_image[i] = dp_base[i];

        /* Close the FPGA device */
        fpga_close (fp_id, &e);
    }
104
Accessing µP memory from FPGA (Cray-XD1)
The APIs support access to a region of the µP memory by the FPGA logic
The program uses the fpga_set_ftrmem function to:
- allocate an FTR
- associate it with the address space of the µP
- set up the FPGA to access it directly
The function does not automatically provide the address of this region to the FPGA application logic. One way to do so is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register

    unsigned long order;
    void *ftr_mem;
    /*... */
    ftr_mem = fpga_set_ftrmem (fp_id, order, &e);
    if (ftr_mem == NULL) {
        /* Handle error. */
    }
    /* pass the address of the region to the FPGA through a register */
    fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
    /*... */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
    /*... */
105
Using MPI on Cray-XD1
Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
- 2 Opteron processors (Total 12)
- 1 Xilinx Virtex-II Pro 50 (Total 6)
Applications can be parallelized across the 6 FPGAs using MPI
Data are distributed across the 6 FPGAs

    if (MYTHREAD==0)
        read_image (image_file_name, image_buffer, &rows, &cols);
    MPI_Bcast (&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast (&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);
    local_size = rows*cols/THREADS;
    MPI_Scatter (image_buffer, local_size, MPI_UNSIGNED_LONG,
                 local_image_buffer, local_size, MPI_UNSIGNED_LONG,
                 0, MPI_COMM_WORLD);
    /* Execute the wavelet on the Hardware */
    process_image (fp_id, local_image_buffer, local_output_buffer, rows, cols);
    MPI_Gather (local_output_buffer, local_size, MPI_UNSIGNED_LONG,
                output_image_buffer, local_size, MPI_UNSIGNED_LONG,
                0, MPI_COMM_WORLD);
    if (MYTHREAD==0)
        write_image (output_file_name, output_image_buffer, rows, cols);
106
DWT on SGI-Altix
(Main Program)

    rasclib_algorithm_request_t ar;
    int alg_id;
    int res;
    strcpy (ar.alg_id, "Wavelet");
    ar.num_devices = 1;
    .
    . /* Read Image */
    . /* initialize daubechies coefficients */
    .
    rasclib_resource_alloc (&ar, 1);
    alg_id = rasclib_algorithm_open ("Wavelet");
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff2", coeff[2]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff3", coeff[3]);
    res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);

Parameter Passing
Small parameters
- Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
- Pass parameter mapping to software through an extractor directive, type REG_IN:
    -- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
    -- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
    -- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
    -- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]
Large Arrays
- Dedicate a portion of an SRAM bank for the parameter array
- Pass parameter array mapping to software with an extractor comment of type SRAM:
    -- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u fixed
107
107 rasclib_algorithm_go (alg_id); res = rasclib_algorithm_receive (alg_id, "d_out", out_Buff, SIZE); rasclib_algorithm_commit (alg_id); rasclib_algorithm_wait (alg_id); rasclib_algorithm_close (alg_id); Results Read-Back Small parameters Connect to Algorithm Defined Registers Pass parameter mapping to software through an extractor directive, type REG_OUT Use the API function rasclib_algorithm_reg_read Large Arrays Dedicate a portion of an SRAM bank for the parameter array Pass parameter array mapping to software with an extractor comment of type SRAM: -- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u fixed DWT on SGI-Altix (cnt’d) (Main Program)
108
Streaming (SGI-Altix)
Improve performance by overlapping algorithm computation and data loading and unloading

Time | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5
Read DMA | 0 | 1 | 2 | X | X
Algorithm | X | 0 | 1 | 2 | X
Write DMA | X | X | 0 | 1 | 2

Extractor directives are used to tell software:
- where input/output data arrays are located (SRAM bank + starting index)
- the sizes of the input/output data arrays
- which arrays have been enabled for streaming
Extractor directive type used: SRAM with attribute stream, e.g.:
    -- extractor SRAM:a_in 2048 64 sram[0] 0x00 in u stream
    -- extractor SRAM:d_out 2048 64 sram[1] 0x00 out u stream
109
Example 5 Image processing: Thin Plate Splines
110
The application: Thin Plate Splines - image analysis of protein gels
  Image morphing based on natural logarithm computations
  Essential for comparing protein content
  Speedup per FPGA: 10-30x
  Reduces analysis runtime from days to hours
111
Host Program - running on Opteron CPU, calling FPGA subroutine
  Transfer parameter data to QDRAM
  Start Mitrion program and wait until finished
  Retrieve computed image data

    u_64 fpga_mem, i;
    my_fpga = fpga_open(args);
    // Use normal XD1 API for most operations
    ...
    if (!fpga_is_loaded(args))
        rtn = fpga_load(args);
    ...
    // memory map QDRAMs into host address space
    fpga_mem = fpga_memmap(args);
    // Upload data to QDRAM
    memcpy(fpga_mem, parameter_data, sizeof_parameter_data);
    // Control of mitrion processor is internally handled
    // with a number of memory mapped registers in the FPGA
    // Controlling running/stepping/reset etc.
    mitrion_start(my_fpga);   // Start mitrion block
    mitrion_wait(my_fpga);    // wait for block to finish
    // Fetch results from QDRAM
    memcpy(image_coordinates, fpga_mem, sizeof_image_data);
112
FPGA program (1/3) - accelerated subroutine in Mitrion-C

    // Options: -cpp
    // Definition of RAM type
    #define RAMType mem uint:64 [ 0x100000 ]
    #include "grint_lib.lqd"
    #include "logarithm_rwhile.lqd"

    // readFix fetches input data from QDRAM
    (Fix, RAMType) readFix(RAMType m, uint:24 basicOffset, uint:24 fixOffset)
    {
        uint:32 memOffset = basicOffset + fixOffset;
        (result, m2) = _memread(m, memOffset);
    } (result, m2);

    // Start of program. Matches external RAM interface of the XD1:
    // 4 banks of 1M word each
    (RAMType, RAMType, RAMType, RAMType) main (RAMType Am, RAMType Bm, RAMType Cm, RAMType Dm)
    {
        Fix py; // parameter vectors
        Fix px;
        Fix koeffx;
        Fix koeffy;
        // read parameters from external RAM
        (px, py, koeffx, koeffy, Aml) = foreach(index in )
        {
            (x, Am2) = readFix(Am, PX_OFF, index);
            (y, Am3) = readFix(Am2, PY_OFF, index);
            (kx, Am4) = readFix(Am3, KOEFFX_OFF, index);
            (ky, Am5) = readFix(Am4, KOEFFY_OFF, index);
        } (x, y, kx, ky, Am5);
        Aut = _wait(Aml);
        Cut = grintpolc(Cm, px, py, koeffx, koeffy);
    } (Aut, Bm, Cut, Dm);
113
FPGA program (2/3) - accelerated subroutine in Mitrion-C

    // Input arguments (the image) for Thin Plate Splines transform
    RAMType grintpolc ( RAMType coords, // out
                        Fix px, Fix py, Fix koeffx, Fix koeffy )
    {
        imDonel = foreach(y in )
        {
            uint:32 lineoff = y*XSIZE;
            imDone2l = foreach(x in )
            {
                (distx, disty) = foreach(px, py, koeffx, koeffy in px, py, koeffx, koeffy)
                {
                    Fix dx = px - int2fix(x);
                    Fix dy = py - int2fix(y);
                    Fix r2 = fixmul(dx,dx) + fixmul(dy,dy);
                    // Major compute intensive part: high precision ln computation
                    Fix ext = if(r2 == 0) 0
                              else
                              {
                                  Fix ln = fixln(r2);
                                  ext = fixmul(r2,ln);
                              } ext;
114
FPGA program (3/3) - accelerated subroutine in Mitrion-C

                    Fix rx = fixmul(ext, koeffx);
                    Fix ry = fixmul(ext, koeffy);
                } (rx, ry);
                Fix distcoordx = sum(distx);
                Fix distcoordy = sum(disty);
                // distcoordx and distcoordy are the coordinates
                // of the pixels to be fetched from the distorted image
                uint:32 index = x + lineoff;
                int:32 x32 = (distcoordx >>> 8); // convert into Fix16.16
                int:32 y32 = (distcoordy >>> 8); // convert into Fix16.16
                watch x32;
                watch y32;
                bits:64 word = [x32, y32];
                // Output arguments (distorted image coordinates) are written to QDRAM
                imDone3 = _memwrite(coords, index, word);
            } imDone3;
            imDone2 = _wait(imDone2l);
        } imDone2;
        imDone = _wait(imDonel);
    } imDone; // Output argument is the distorted image
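As a reading aid, the arithmetic of the Mitrion-C kernel on the three FPGA program slides can be restated in plain C with doubles instead of the Fix type. This is a sketch, not the authors' code: `ln_approx`, `tps_kernel`, and `tps_displace` are hypothetical names, the logarithm is a simple series stand-in for `fixln`, and the fixed-point scaling and QDRAM layout are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Natural logarithm for x > 0 without libm (stand-in for fixln).
 * x is scaled by powers of two into [0.75, 1.5), then
 * ln(m) = 2 * (t + t^3/3 + t^5/5 + ...) with t = (m - 1) / (m + 1). */
double ln_approx(double x)
{
    const double LN2 = 0.69314718055994530942;
    int k = 0;
    assert(x > 0.0);
    while (x >= 1.5)  { x *= 0.5; k++; }
    while (x < 0.75)  { x *= 2.0; k--; }
    double t = (x - 1.0) / (x + 1.0);
    double t2 = t * t, term = t, sum = 0.0;
    for (int n = 1; n <= 23; n += 2) {   /* 12 odd-power terms; |t| <= 0.2 */
        sum += term / n;
        term *= t2;
    }
    return k * LN2 + 2.0 * sum;
}

/* The `ext` expression of the kernel: r2 * ln(r2), defined as 0 at r2 == 0. */
double tps_kernel(double r2)
{
    return (r2 == 0.0) ? 0.0 : r2 * ln_approx(r2);
}

/* Distorted coordinates for pixel (x, y): sum over all n control points of
 * kernel(distance^2) weighted by koeffx / koeffy, mirroring sum(distx) and
 * sum(disty) in the Mitrion-C code. */
void tps_displace(double x, double y,
                  const double *px, const double *py,
                  const double *koeffx, const double *koeffy, size_t n,
                  double *distcoordx, double *distcoordy)
{
    double sx = 0.0, sy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = px[i] - x;
        double dy = py[i] - y;
        double ext = tps_kernel(dx * dx + dy * dy);
        sx += ext * koeffx[i];
        sy += ext * koeffy[i];
    }
    *distcoordx = sx;
    *distcoordy = sy;
}
```

Calling `tps_displace` once per pixel corresponds to the two outer `foreach` loops over `y` and `x`. On the FPGA the inner loop over control points is what the Mitrion compiler can unroll and pipeline, which is why the logarithm dominates the cost.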
115
115 Program Development Environments Challenges
116
116 Application Development for Reconfigurable Computers
Program Entry, Compilation, Execution, Platform mapping, Debugging & Verification
117
117 Tasks Addressed in This Presentation
Program Entry, Compilation, Execution, Platform mapping, Debugging & Verification
118
118 Program Program Entry
119
119 Platform Mapping: SW/HW Partitioning
The Program is split into:
  Software (executed in the microprocessor system)
  Hardware (executed in the reconfigurable processor system)
120
120 SW/HW Partitioning & Coding: Traditional Approach
Specification -> SW/HW Partitioning -> SW Coding, HW Coding -> SW Compilation, HW Compilation -> SW Profiling, HW Profiling
121
121 SW/HW Partitioning & Coding: New Approach
Specification -> SW/HW Coding -> SW Compilation, HW Compilation -> SW Profiling, HW Profiling -> SW/HW Partitioning
122
122 Platform Mapping: FPGA mapping
[Figure: Program split into Software and Hardware; Hardware spans FPGA 1, FPGA 2, FPGA 3, and FPGA 4]
123
123 Example of FPGA Mapping
[Figure: add, multiply, and divide operations placed on a single FPGA, and two alternative placements split across FPGA 1 and FPGA 2]
124
124 FPGA Mapping in SRC
[Figure: add and multiply on FPGA 1 (FPGA1.mc), divide on FPGA 2 (FPGA2.mc); inputs a and b, output sum]

FPGA1.mc:
    void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
    {
        int64_t c, temp;
        send_to_bridge(b);
        c = a * const1;
        recv_from_bridge(&temp);
        *sum = temp + c;
    }

FPGA2.mc:
    void fpga2()
    {
        int64_t a, d;
        recv_from_bridge(&a);
        d = a / const2;
        send_to_bridge(d);
    }

Makefile:
    MAPFILES  = FPGA1.mc FPGA2.mc
    PRIMARY   = FPGA1.mc
    SECONDARY = FPGA2.mc
    CHIP2     = FPGA2.mc
125
125 FPGA Mapping in VIVA™
By changing the attributes, one can specify where an object is to be located.
126
126 Platform Mapping: FPGA-FPGA data transfer & synchronization
[Figure: Program split into Software and Hardware; Hardware spans FPGA 1, FPGA 2, FPGA 3, and FPGA 4]
127
127 FPGA-FPGA Data Transfer in SRC
[Figure: FPGA 1 (computation 1) and FPGA 2 (computation 2) joined by a 64-bit connection carrying a, b, c, and d]

FPGA1.mc:
    void fpga1(int64_t a, int64_t b, int64_t c, int64_t *d)
    {
        send_to_bridge(a, b, c);
        /* computation 1 */
        recv_from_bridge(d);
    }

FPGA2.mc:
    void fpga2()
    {
        int64_t a, b, c, d;
        recv_from_bridge(&a, &b, &c);
        /* computation 2 */
        send_to_bridge(d);
    }
128
128 FPGA-FPGA Data Transfer in SRC: Bridge Port
[Figure: 64-bit bridge port between the FPGAs, buffered by FIFOs of 32 words of 64 bits]
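A software model can make the bridge port's buffering concrete. The sketch below is a plain C ring buffer with the dimensions given on the slide (32 words of 64 bits). The type and function names are hypothetical and are not SRC's API, and the choice to report "full" and "empty" through a return value is an assumption; the slides do not say how the hardware behaves in those cases.

```c
#include <assert.h>
#include <stdint.h>

#define BRIDGE_FIFO_DEPTH 32u   /* 32 words, as on the slide */

typedef struct {
    uint64_t words[BRIDGE_FIFO_DEPTH];  /* each word is 64 bits wide */
    unsigned head;                      /* index of the oldest word */
    unsigned count;                     /* words currently stored */
} bridge_fifo_t;

void fifo_init(bridge_fifo_t *f)
{
    assert(f != 0);
    f->head = 0;
    f->count = 0;
}

/* Queue one word. Returns 1 on success, 0 if the FIFO is full
 * (a sender would have to wait and retry). */
int fifo_push(bridge_fifo_t *f, uint64_t word)
{
    if (f->count == BRIDGE_FIFO_DEPTH)
        return 0;
    f->words[(f->head + f->count) % BRIDGE_FIFO_DEPTH] = word;
    f->count++;
    return 1;
}

/* Dequeue the oldest word. Returns 1 on success, 0 if the FIFO is empty
 * (a receiver would have to wait and retry). */
int fifo_pop(bridge_fifo_t *f, uint64_t *word)
{
    if (f->count == 0)
        return 0;
    *word = f->words[f->head];
    f->head = (f->head + 1) % BRIDGE_FIFO_DEPTH;
    f->count--;
    return 1;
}
```

In this model a `send_to_bridge` on one FPGA would map to `fifo_push` and the matching `recv_from_bridge` on the other FPGA to `fifo_pop`, so the sender can run up to 32 words ahead of the receiver before it has to wait.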
129
129 FPGA-FPGA Data Transfer in VIVA™
For designs mapped over several FPGAs:
  The system description must include those FPGAs over which the design is to be mapped.
  Special partitioning objects, placed between the modules to be synthesized, automatically map the relevant lines between the FPGAs.
130
130 Platform Mapping: Use of Internal and External Memories
[Figure: Program split into Software and Hardware; FPGA 1 to FPGA 4 with OCM, LM, and SM]
OCM - On-Chip Memory
LM - Local Memory
SM - Shared Memory
131
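131 Using On-Chip Memory (OCM) in SRC
[Figure: array AL[] in SM (OBM) is copied into ocm_a[] in the FPGA's OCM; the computations read ocm_a[] and produce c; data widths 64 and 32 are marked]

    void sum(int64_t a[], int *c, int mapno)
    {
        BANK_A_ALLOC(AL, int64_t, SIZE);   /* array in on-board memory (OBM) */
        int64_t ocm_a[SIZE];               /* array in on-chip memory (OCM) */
        int64_t tmp = 0;
        int i;

        cm2obm_0(AL, a, byteLength);       /* copy input from common memory to OBM */
        wait_server_0();
        for (i = 0; i < SIZE; i++) {       /* stage the data in OCM */
            ocm_a[i] = AL[i];
        }
        for (i = 0; i < SIZE; i++) {       /* compute on the OCM copy */
            tmp = ocm_a[i] + tmp;
        }
        *c = tmp;                          /* return the result */
    }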
132
132 Using On-Chip Memory (OCM) in VIVA™
Special objects under the Memory Subsystem of the library allow the programmer to use the on-chip memory of the Xilinx Virtex II chip.
133
133 Platform Mapping: I/O
[Figure: Program split into Software and Hardware; FPGA 1 to FPGA 4 with SM, LM, and OCM; labels: SRC, StarBridge]
134
134 Run Time Reconfiguration in SRC
Program in C or Fortran:
  Main program
    Function_1(a, d, e)
    Function_2(d, e, f)
  Function_1
    Macro_1(a, b, c)
    Macro_2(b, d)
    Macro_2(c, e)
  Function_2
    Macro_3(s, t)
    Macro_1(n, b)
    Macro_4(t, k)
[Figure: FPGA contents after the Function_1 call: Macro_1 and Macro_2 instances connected through a, b, c, d, and e]
135
135 Run-time Reconfiguration in VIVA™
Reconfiguration is possible by using the spawn object. By specifying the FileName attribute, a VIVA executable (.vex file) or a VIVA project can be loaded onto the same or a different FPGA.
136
136 Ideal Program Entry Program Entry Function
137
137 Actual Program Entry
[Figure: Program Entry receives the Function together with:]
  FPGA Mapping
  SW/HW Interface
  SW/HW Partitioning
  Data Transfers & Synchronization
  Use of Internal and External Memories
  Sequence of Run-time Reconfigurations
  Use of FPGA Resources (multipliers, μP cores)
  Preferred Architectures
138
138 Evolution and the current status of tools
[Table: rows are FPGA-FPGA Partitioning, μP-FPGA Partitioning, FPGA-FPGA Data Transfer, μP-FPGA Data Transfer, Computation-Data transfer Overlapping, and Choosing component version; columns are SRC, Star Bridge, and other vendors; each cell is marked Not implemented, Manual Entry, or Compiler Automated; cell values not recoverable]
139
139 Debugging & Verification
140
140 MAP Board Execution
[Figure: Application, MAP Runtime Library, and Subroutine for MAP (ComList Code, Wrapper Code, User Logic) on the host side; MAP Board with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs (User Logic, Registers & Flags, four Logic Macros)]
141
141 MAP Emulator + DFG Simulator
[Figure: Application, MAP Runtime Library, and Subroutine for MAP (ComList Code, Wrapper Code, User Logic); Emulator with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs (User Logic, Registers & Flags, four C Code Macros)]
142
142 MAP Emulator + Verilog Simulator
[Figure: Application, MAP Runtime Library, and Subroutine for MAP (ComList Code, Wrapper Code, User Logic); Emulator with Control Processor (ComList Processor, DMA Engine), On-board Memory (Data & Flags), and User FPGAs simulated in VCS (User Logic, Registers & Flags, four Verilog Macros)]
143
143 X86 System in VIVA™
The FileIn object as it appears when the x86 system is loaded.
144
144 X86 System in VIVA™
The FileIn object as it appears when the FPGA system description is loaded.
145
145 Debugging in VIVA™
Data can be viewed with the help of widgets, which are essentially input and output ‘horns’ placed in a worksheet. Several display options let the viewer choose the kind of view, and the displayed data can be switched between HEX and INT.
146
146 Debugging in the SGI Environment
IA-32 Linux Machine:
  HLL Design Entry (Handel-C, Mitrion C, Viva)
  RTL Generation and Integration with Core Services
  Design Synthesis (Synplify Pro, Amplify)
  Design Implementation (ISE)
  Design Verification
    Behavioral Simulation (VCS, Modelsim)
    Static Timing Analysis (ISE Timing Analyzer)
  Metadata Processing (Python)
Altix:
  Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)
  Real-time Verification (gdb)
Files passed between the steps: .v, .vhd, .edf, .ncd, .pcf, .bin, .cfg, .c
147
Compiler, Simulator And Debugger
148
148 Programming Environments Summary
149
149 SRC Programming Environment
+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ good support for debugging
+ vendor and user libraries
+ very well integrated environment
+ good use of 3rd party tools
+ in production use for over 3 years with constant improvements
- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of codes to HW platforms other than SRC
150
150 Star Bridge Programming Environment Viva
+ drag-and-drop program entry
+ standard and user libraries
+ separation of designs/programs from system/platform descriptions = portability of codes
+ support for multiple platforms under development
- does not follow any established standards
- no textual description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- no clear mechanism to call HW functions from SW
151
151 Cray Programming Environment based on Simulink/System Generator
+ drag-and-drop program entry
+ extensive libraries of DSP components
+ good support for debugging
- graphical description = limited scalability of codes
- control signals (e.g., handshaking between the adjacent library cells) must be specified explicitly
- limited library support for applications other than DSP
152
152 Cray Programming Environment based on DSPLogic
+ graphical programming language (drag-and-drop program entry)
+ extensive libraries of DSP components
+ single environment (MATLAB™/Simulink™) to analyze, visualize, implement, debug, verify
+ efficient resource usage
- graphical description = limited scalability of codes
- limited library support for applications other than DSP
153
153 Cray XD1 and SGI Environments based on Mitrion-C
+ high-level C-like language, easy for an HPC programmer to learn
+ ease of describing parallelism and non-standard (variable size) data types
+ a small amount of Mitrion-C generates a large number of lines of HDL code
+ suitable for describing classical complex HPC problems
+ Mitrion-C code portable between Cray XD1 and SGI
- new and as yet untested
- non-standard, no support for legacy codes
- language describes only what happens in a single FPGA
- currently, no mechanisms to use HDL macros