Presentation is loading. Please wait.

Presentation is loading. Please wait.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #8 – Reconfigurable.

Similar presentations


Presentation on theme: "CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #8 – Reconfigurable."— Presentation transcript:

1 CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #8 – Reconfigurable Networking

2 Lect-08.2CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Recap – Genetic Pattern Matching Comparing strings by edit distance Motivation: The Human Genome Project Do two genetic strings match? How are they related? When biologists characterize a new sequence, they want to compare it to the (growing) database of known sequences Abstraction: What is the cost of transforming s into t Given – costs for insertion, deletion, substitution

3 Lect-08.3CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Alphabet and Costs Alphabet Letters in the string. For DNA, there are four: A (Adenine) C (Cytosine) T (Thymine) G (Guanine) Transformation Costs Insert:1, Delete:1, Substitute:2, match:0 Type of comparison One target to many sources One target to one source

4 Lect-08.4CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Substitution Example baboon CostMoveWord bab|on bobon boubon bourbon Delete ‘o’ Substitute ‘o’ Insert ‘u’ Insert ‘r’ Match? Total cost: 1 2 1 1 0 5bourbon

5 Lect-08.5CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Dynamic Programming Solution Source sequence: s 1, s 2, … s m Target sequence: t 1, t 2, … t n d i,j = distance between subsequence s 1, s 2, …s i and subsequence t 1, t 2, …t j, where d 0,0 = 0 d i,0 = d i-1,0 + Delete(s i ) d 0,j = d 0,j-1 + Insert(t j ) d i-1,j + Delete(s i ) d i,j = min d i,j-1 + Insert(t j ) d i-1,j-1 + Substitute(s i, t j ) Distance(Source, Target) = d m,n

6 Lect-08.6CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Dynamic Programming Example b a b o o n 0 1 2 3 4 5 6 b 1 0 1 2 2 3 4 o 2 1 2 2 2 2 3 u 3 2 3 3 3 3 4 r 4 3 4 4 3 4 5 b 5 3 4 4 4 5 5 o 6 4 5 5 4 5 6 n 7 5 6 6 5 6 5

7 Lect-08.7CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Parallelism on the Anti-Diagonal b a b o o n 0 1 2 3 4 5 6 b 1 0 1 2 2 3 4 o 2 1 2 2 2 2 3 u 3 2 3 3 3 3 4 r 4 3 4 4 3 4 5 b 5 3 4 4 4 5 5 o 6 4 5 5 4 5 6 n 7 5 6 6 5 6 5 di,jdi,j d i-1, j d i, j-1 d i-1, j-1

8 Lect-08.8CprE 583 – Reconfigurable ComputingSeptember 13, 2007 SCin SDin TCout TDout If (SCin != 0) and (TCin != 0) PEDist + Substitute(SCin, TCin) PEDist  min TDin + Delete(SCin) SDin + Insert(TCin) else-if (SCin != 0) PEDist  SDin else-if (TCin != 0) PEDist  TDin endif SCout  SCin TCout  TCin SDout  PEDist TDout  PEDist Bidirectional PE Example Bidir PE PEDist Bidir PE PEDist Bidir PE PEDist SCout SDout TCin TDin

9 Lect-08.9CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Bidirectional Summary 16 CLBs/PE 384 PEs/Board 2,100 Million Cells/sec Requires 2*(m+n) PEs Uses only half the processors at any one time Must stream both source and target for each comparison Makes comparison against large DB impractical

10 Lect-08.10CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Genetic Search Performance Nearly linear scaling in cell updates per second (CUPS) Need to reuse array for large patterns HardwareCUPS Splash 2 x1643,000M Splash 23,000M Splash 1370M P-NAC (34)500M λ 0.60μ 2.0μ CM-2 (64K)150M CM-5 (32)33M SPARC 101.2M SPARC 10.87M ? ? 0.40μ 0.75μ Area 500Mλ 2 x17x16 500Mλ 2 x16 420Mλ 2 x32 7.8Mλ 2 x34 1.6GMλ 2 273Mλ 2 CUP/λ 2 s 0.32 0.38 0.028 1.9 0.00075 0.0032

11 Lect-08.11CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Outline Recap – Pattern Matching on Splash-2 The Field-Programmable Port Extender (FPX) FPX Architecture FPX Programming Model FPX Applications Pattern Matching Packet Classification Rule Processing

12 Lect-08.12CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Application – Network Processing Networking applications well-suited for reconfigurable hardware Target signatures change often Massive quantities of stream-based data Repetitive operations Connecting up to a realistic networking environment is hard Washington University experimental setup one of the best Shows importance of both memory and processing capability Numerous experiments performed over the past five years

13 Lect-08.13CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Network Routing with the FPX FPX Modules distributed across each port of a switch IP packets (over ATM) enter and depart line card Packet fragments processed by modules Advantages: New protocols implemented directly in silicon Easy to upgrade in the field IP Packets

14 Lect-08.14CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPX Hardware Device

15 Lect-08.15CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPX Hardware in a WUGS-20 Switch

16 Lect-08.16CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPGA-based Router FPX module contains two FPGAs NID – network interface device Performs data queuing RAD – reprogrammable application device Specialized control sequences

17 Lect-08.17CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Reprogrammable Application Device Spatial Re-use of FPGA Resources Modules implemented using FPGA logic Module logic can be individually reprogrammed Shared Access to off-chip resources Memory Interfaces to SRAM and SDRAM Common Datapath to send and receive data

18 Lect-08.18CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Architecture of the FPX RAD Large Xilinx FPGA Attaches to SRAM and SDRAM Reprogrammable over network Provides two user-defined Module Interfaces NID Provides Utopia Interfaces between switch & line card Forwards cells to RAD Programs RAD

19 Lect-08.19CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Architecture of the FPX (cont.)

20 Lect-08.20CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPX SRAM Provide low latency for fast table-lookups Zero Bus Turnaround (ZBT) allows back-to-back read / write operations every 10ns Dual, Independent Memories 36-bit wide bus

21 Lect-08.21CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPX SDRAM Dual, independent SDRAM memories 64-bit wide, 100 MHz 64Mb / Module : 128 Mb total [expandable] Burst-based transactions [1-8 word transfers] Latency of 14 cycles to Read/Write 8-word burst

22 Lect-08.22CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Routing Traffic Flows Functions Check packets for errors Process commands Control, status, & reprogramming Implement per-flow forwarding NID LineCardSwitch EC VC ccp Traffic flows routed among Switch Line Card RAD.Switch RAD.Linecard

23 Lect-08.23CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Typical Flow Configurations VC EC VC ccp EC VC RAD Switch RAD LineCard Switch Default Flow Action (Bypass) VC EC VC ccp EC VC RAD Switch RAD LineCard Switch Ingress Processing (IP Routing) VC EC VC ccp EC VC RAD Switch RAD LineCard Switch Full RAD Processing (Packet Routing and Reassembly) VC EC VC ccp EC VC RAD Switch RAD LineCard Switch (Per-flow Output Queueing) Egress Processing VC EC VC ccp EC VC LineCardSwitch RAD Switch RAD LineCard Full Loopback Testing (System Test) VC EC VC ccp EC VC RAD Switch RAD LineCard Switch Partial Loopback Testing (Egress Flow Processing Test)

24 Lect-08.24CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Reprogramming Logic NID programs at boot from EPROM Switch Controller writes RAD configuration memory to NID Configuration file for RAD arrives transmitted over network via control cells Switch Controller issues {Full/Partial} reconfigure command NID reads RAD config memory to program RAD Performs complete or partial reprogramming of RAD

25 Lect-08.25CprE 583 – Reconfigurable ComputingSeptember 13, 2007 FPX Interfaces Provides Well defined Interface Utopia-like 32-bit fast data interface Flow control allows back-pressure Flow Routing Arbitrary permutations of packet flows through ports Dynamically Reprogrammable Other modules continue to operate even while new module is being reprogrammed Memory Access Shared access to SRAM and SDRAM Request/Grant protocol

26 Lect-08.26CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Pattern Matching using the FPX Use Hardware to detect a pattern in data Modify packet based on match Pipeline operation to maximize throughput

27 Lect-08.27CprE 583 – Reconfigurable ComputingSeptember 13, 2007 “Hello, World” Module Function

28 Lect-08.28CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Logical Implementation VCI Match Append “WORLD” to payload New Cell

29 Lect-08.29CprE 583 – Reconfigurable ComputingSeptember 13, 2007 The Wrapper Concept App Wrapper

30 Lect-08.30CprE 583 – Reconfigurable ComputingSeptember 13, 2007 AAL5 Encapsulation Payload is packed in cells Padding may be added 64 bit Trailer at end of cell Trailer contains CRC-32 Last Cell indication bit (last bit of PTI field)

31 Lect-08.31CprE 583 – Reconfigurable ComputingSeptember 13, 2007 HelloBob Module UDP Processor Echo UDPHello Bob IP Processor Cell Processor Frame Processor SRAM Interface OutputInput

32 Lect-08.32CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Results: Performance Operating Frequency: 119 MHz. 8.4ns critical path Well within the 10ns period RAD's clock. Targeted to RAD’s V1000E-FG680-7 Maximum packet processing rate: 7.1 Million packets per second. (100 MHz)/(14 Clocks/Cell) Circuit handles back-to-back packets Slice utilization: 0.4% (49/12,288 slices) Less than one half of one percent of chip resources Search technique can be adapted for other types of data matching and modification Regular expressions Parsing image content …

33 Lect-08.33CprE 583 – Reconfigurable ComputingSeptember 13, 2007 CAM-based Packet Matching Sample Packet: Source Address = 128.252.5.5 (dotted.decimal) Destination Address = 141.142.2.2 (dotted.decimal) Source Port = 4096 (decimal) Destination Port = 50 (decimal) Protocol = TCP (6) Payload = “Consolidate your loans. CALL NOW” Payload Lists = { General SPAM (0), Save Money SPAM (1) } Content Vector = “00000011” (binary) = x”03” (hex) 7 103 3971 Src IP (hex) = 80FC0505 Dest IP (hex) = 8D8E0202 Src Port = 1000 Dest Port = 0050 Proto = 06 08 40 72 Con- tent = 03 111104

34 Lect-08.34CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Sample Filter Source Address = 128.252.0.0 / 16 Destination Address = 141.142.0.0 / 16 Source Port = Don’t Care Destination Port = 50 Protocol = TCP (6) Payload includes general SPAM (List 0) 71033971 Src IP (hex) = 80FC0505 Dest IP (hex) = 8D8E0202 Src Port = 1000 Dest Port = 0050 Proto = 06 0 8 40 72 Src IP value = 80FC0000 Dest IP (hex) = 8D8E0000 Src Port = 0000 Dest Port = 50 Proto = 06 Src IP (hex) = FFFF0000 Dest IP (hex) = FFFF0000 Src Port = 0000 Dest Port = FFFF Proto = FF Value Mask: 1=care 0=don’t care IP Packet Con- ten t= 01 Con- ten t= 01 Con- tent= 03 DROP the packet : It matches the filter

35 Lect-08.35CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Packet Classifier with FlowID CAM MASK [1] CAM VALUE [1] CAM MASK [2] CAM VALUE [2] CAM MASK [3] CAM VALUE [3] CAM MASK [N] CAM VALUE [N] Flow ID [1] 112 bits Flow ID [2] Flow ID [3] Flow ID [N] Flow ID... 16 bits Value Comparators Mask Matchers Priority Encoder Resulting Flow Identifier Flow List Source Address Destination Address 16 bits Payload Match Bits Source Port Dest. Port Protocol - - CAM Table - - Bits in IP Header

36 Lect-08.36CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Fast IP Lookup Algorithm Function Search for best matching prefix using Trie algorithm PrefixNext Hop * 01* 4 7 10*2 110*9 0001*1 1011*0 00110*5 01011*3 0 1 0 00 0 0 0 11 1 11 1 1 1 1

37 Lect-08.37CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Hardware Implementation in the FPX SRAM 1 SRAM 2 IP Lookup Engine counter On-Chip Cell Store SRAM 1 Interface Control Cell Processor Packet Reassembler RAD FPGA NID FPGA Extract IP Headers Remap VCIs for IP packets LC SW Request Grant 0 1 0 00 0 0 0 11 1 11 1 1 1 1

38 Lect-08.38CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Time (cycles) Generate Address Pipelined FIPL Operations Throughput : Optimized by interleaving memory accesses Operate 5 parallel lookups t_pipelined_lookup = 550ns / 5 = 110 ns Throughput = 9.1 Million packets / second Latch ADDR into SRAM SRAM D < M[A] Latch Data into FPGA Compute Time (cycles) Space (Parallel lookup units on FPGA) Generate Address

39 Lect-08.39CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Other Modules Implemented IPv4 CAM Filter 104 Bit header matching Fast IP Lookup (FIPL) Longest Prefix Match MAE-West at 10M pkts/second Packet Content Scanner Reg. Expression Search Data Queueing Per-flow queue in SDRAM IPv6 Tunneling Module Tunnels IPv6 over IPv4 Statistics Module Event counter Traffic Generator Per-flow mixing Video Recoder Motion JPEG Embedded Processor KCPSM

40 Lect-08.40CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Summary Field Programmable Port Extender (FPX) Network-accessible Hardware Reprogrammable Application Device Module Deployment Modules implement fast processing on data flow Network allows Arbitrary Topologies of distributed systems Project Website http://www.arl.wustl.edu/arl/projects/fpx/


Download ppt "CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #8 – Reconfigurable."

Similar presentations


Ads by Google