Project Introduction
- 4-Port Layer-2/3 Output Queued Switch Design
- Ethernet (Layer-2), IPv4, ICMP, and ARP
- Programmable Routing Tables: Longest Prefix Match, Exact Match
- Register support for Switch Fwd On/Off, Statistics, Queue Status, etc.
- Layer-2 Broadcast and limited Layer-3 Multicast support
- Limited support for Access Control
- Highly Modular Design for future expandability

Bandwidth Analysis
Available Data Bandwidth
- Memory bandwidth: 32 bits * 25 MHz = 800 Mbits/sec
- CFPGA to Ingress FIFO/Control Block bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec
- Packet Queue to Egress bandwidth: 32 bits * 25 MHz / 4 = 200 Mbits/sec
Packet Processing Requirements
- 4 ports operating at 10 Mbits/sec => 40 Mbits/sec
- Minimum size packet 64 Bytes => 512 bits
- 512 bits / 40 Mbits/sec = 12.8 us
- Internal clock is 25 MHz
- 12.8 us * 25 MHz = 320 clocks to process one packet

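The arithmetic on this slide can be checked with a short sketch. The function name and parameter names are illustrative only; the numbers are the ones quoted above (4 ports, 10 Mbits/sec each, 64-byte minimum packet, 25 MHz clock, 32-bit datapath shared 4 ways).

```python
def cycle_budget(ports=4, port_rate_bps=10_000_000,
                 min_packet_bytes=64, clock_hz=25_000_000):
    """Return (clocks available per minimum-size packet, packets per second)."""
    aggregate_bps = ports * port_rate_bps          # 40 Mbits/sec across all ports
    packet_bits = min_packet_bytes * 8             # 512 bits
    # Integer arithmetic keeps the result exact (no float rounding).
    clocks_per_packet = packet_bits * clock_hz // aggregate_bps
    packets_per_sec = aggregate_bps // packet_bits
    return clocks_per_packet, packets_per_sec


def bus_bandwidth_bps(width_bits=32, clock_hz=25_000_000, sharers=1):
    """Raw bus bandwidth, divided evenly among `sharers` time slots."""
    return width_bits * clock_hz // sharers


clocks, pps = cycle_budget()
print(clocks, pps)                       # clocks per packet, packets per second
print(bus_bandwidth_bps())               # memory bandwidth in bits/sec
print(bus_bandwidth_bps(sharers=4))      # per-interface share in bits/sec
```

With the default values this yields a budget of 320 clocks per packet and a worst-case rate of 78,125 minimum-size packets per second.
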
Data Flow Diagram
- Output Queued Shared Memory Switch
- Round Robin Scheduling
- Packet Processing Engine provides L2/L3 functionality
- Coarse Pipelined Arch. at the Block Level

Master Arbiter
- Round Robin Scheduling of service to each Input and Output
- Interfaces the rest of the design with the Control FPGA
- Co-ordinates activities of all high level blocks
- Maintains Queue Status for each Output

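Round-robin service can be modeled in software as a rotating priority pointer. The sketch below is only an illustration of the scheduling policy, not the arbiter's actual logic; the class name, method name, and the choice to start with requester 0 are assumptions.

```python
class RoundRobinArbiter:
    """Grant one requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        # Start so that requester 0 has the highest priority on the first call
        # (an arbitrary choice made for this sketch).
        self.last_granted = num_requesters - 1

    def grant(self, requests):
        """requests: sequence of booleans, one per requester.

        Returns the index granted, or None if nobody is requesting.
        """
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_granted + offset) % self.num_requesters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
pending = [True, False, True, True]
print([arbiter.grant(pending) for _ in range(4)])
```

Because the search always starts one position past the last grant, a busy port cannot starve the others.
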
Ingress FIFO Control Block
- Interfaces three blocks
  - Control FPGA
  - Forwarding Engine
  - Packet Buffer Controller
- Dual Packet Memories for coarse pipelining
- Responsible for Packet Replication for Broadcast

Packet Processing Engine Overview
Goals
- Features: L3/L2/ICMP/ARP Processing
- Performance Requirements: 78 Kpps
- Fit within 60% of Single User FPGA
- Block Modularity / Scalability
- Verification / Design Ease
Actual
- Support for all required features + L2 broadcast, L3 multicast, LPM, Statistics and Policing (coarse access control)
- Performance Achieved: 234 Kpps (worst case 69 Kpps for 1500-byte ICMP echo requests)
- Requires only 12% of Single UFPGA resources
- Highly Modular Design for design/verification/scalability ease

Pkt Processing Engine Block Diagram
[Block diagram. Labels shown: Forwarding Master State Machine, First Level Parsing, Packet Memory0, Packet Memory1, ARP Processing, L3 Processing, ICMP Processing, L2 Processing, Statistics and Policing, Native Packet, From CFPGA, To Packet Buffer]

Forwarding Master State Machine
- Responsible for controlling individual processing blocks
- Request/Grant Scheme for future expandability
- Requests a packet from the Ingress FIFO, then assigns it to the responsible agent based on packet contents
- Replication of MSM to provide more throughput

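Assigning a packet to an agent based on its contents can be sketched as a small classifier. This is a software illustration under stated assumptions, not the state machine itself: the function name and the returned labels are hypothetical, frames are assumed to be untagged Ethernet II (no VLAN header), and the ICMP rule follows the condition given on the L3 slide (destination is one of the router's IP addresses and the protocol is ICMP). The EtherType and protocol numbers are the standard values.

```python
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
IPPROTO_ICMP = 1

ETH_HEADER_LEN = 14          # dest MAC (6) + src MAC (6) + EtherType (2)
IPV4_MIN_HEADER_LEN = 20


def classify(frame, router_ips):
    """Pick the processing agent for a raw Ethernet frame.

    frame: bytes of an untagged Ethernet II frame (assumption of this sketch).
    router_ips: set of 4-byte IPv4 addresses owned by the router.
    Returns one of "ARP", "ICMP", "L3", "L2".
    """
    if len(frame) < ETH_HEADER_LEN:
        return "L2"
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype == ETHERTYPE_ARP:
        return "ARP"
    if ethertype == ETHERTYPE_IPV4 and len(frame) >= ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        protocol = frame[ETH_HEADER_LEN + 9]                       # IPv4 protocol field
        dest_ip = frame[ETH_HEADER_LEN + 16:ETH_HEADER_LEN + 20]   # IPv4 destination
        if protocol == IPPROTO_ICMP and dest_ip in router_ips:
            return "ICMP"
        return "L3"
    # Anything unrecognized falls back to plain L2 switching.
    return "L2"
```

The final fallback mirrors the deck's rule that L2 switching is used whenever the other agents cannot handle the packet.
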
L3 Processing Engine
- Parsing of the L3 Information: Src/Dest Addr, Protocol Type, Checksum, Length, TTL
- Longest Prefix Match Engine
  - Mask Bits to represent the prefix
  - Lookup Key is Dest Addr
- Associated Info Table (AIT)
  - Indexed using the entry hit
  - AIT provides Destination Port Map, Destination L2 Addr, Statistics Bucket Index
- Request/Done scheme to allow for expandability (e.g. future m-way Trie implementation project)
- ICMP Support Engine Request (if Dest Addr is the Router's IP Address and Protocol Type is ICMP)
- Total 85 cycles for Packet Processing, with 80% of the cycles spent on Table Lookup
- If using a 4-way trie, total processing time can be reduced to less than 30 cycles

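The combination of mask bits, a destination-address key, and an AIT indexed by the hit entry can be sketched as follows. This is a minimal software model, assuming a linear scan over the table and contiguous prefix masks; the class name, method names, and the AIT field names are hypothetical, and the deck does not say how the hardware searches the table.

```python
import ipaddress


def ip_to_int(text):
    """Convert dotted-quad IPv4 text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


class LpmTable:
    """Masked-match table with a parallel Associated Info Table (AIT)."""

    def __init__(self):
        self.entries = []   # list of (prefix, mask), both 32-bit integers
        self.ait = []       # AIT entry at the same index as its match entry

    def add(self, prefix, prefix_len, info):
        # Mask bits represent the prefix: the top prefix_len bits are 1.
        mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
        self.entries.append((ip_to_int(prefix) & mask, mask))
        self.ait.append(info)

    def lookup(self, dest_addr):
        """Return the AIT entry of the longest matching prefix, or None."""
        key = ip_to_int(dest_addr)
        best_index, best_mask = None, -1
        for index, (prefix, mask) in enumerate(self.entries):
            # For contiguous masks, a numerically larger mask is a longer prefix.
            if key & mask == prefix and mask > best_mask:
                best_index, best_mask = index, mask
        return None if best_index is None else self.ait[best_index]


table = LpmTable()
table.add("10.0.0.0", 8, {"port_map": 0b0001, "dest_mac": "00:00:00:00:00:01", "stats_bucket": 0})
table.add("10.1.0.0", 16, {"port_map": 0b0010, "dest_mac": "00:00:00:00:00:02", "stats_bucket": 1})
print(table.lookup("10.1.2.3"))
```

A scan of this kind touches every entry on each lookup, which is consistent with table lookup dominating the cycle count; a trie would replace the scan without changing the AIT side.
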
L2 Processing Engine
- If there are any processing problems with ARP, ICMP, and/or L3, then L2 switching is done
- Exact Match Engine
  - Re-use of the LPM match engine, but with Mask Bits set to all 1’s
- Associated Info Table (AIT)
  - Indexed using the entry hit
  - AIT provides Destination Port Map and Statistics Bucket Index
- Request/Done scheme to allow for expandability (e.g. future Hash implementation project)
- Learning Engine removed because of Switch/Router Hardware Verification problems (HP Switch bug)
- Total 76 cycles for Packet Processing, with over 80% of the cycles spent on Table Lookup
- If using a Hashing Function, total processing time can be reduced to less than 20 cycles

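Exact match falls out of a masked-match engine once every mask bit is 1, and a hash table gives the same answers without scanning. The sketch below illustrates both with 48-bit MAC keys; the function names and the AIT field names are hypothetical, and a Python dict stands in for the hashing function mentioned on the slide.

```python
MAC_ALL_ONES = (1 << 48) - 1     # mask with all 48 bits set


def mac_to_int(text):
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(text.replace(":", ""), 16)


def masked_lookup(entries, key):
    """Scan (value, mask, info) entries; return info of the first match or None."""
    for value, mask, info in entries:
        if key & mask == value & mask:
            return info
    return None


def build_hash_table(entries):
    """Hash-based alternative: valid only when every mask is all 1's."""
    return {value: info for value, mask, info in entries if mask == MAC_ALL_ONES}


l2_entries = [
    (mac_to_int("00:11:22:33:44:55"), MAC_ALL_ONES, {"port_map": 0b0001, "stats_bucket": 0}),
    (mac_to_int("00:11:22:33:44:66"), MAC_ALL_ONES, {"port_map": 0b0100, "stats_bucket": 1}),
]
l2_hash = build_hash_table(l2_entries)

key = mac_to_int("00:11:22:33:44:66")
print(masked_lookup(l2_entries, key) == l2_hash.get(key))
```

The scan costs time proportional to the table size, while the dict lookup is constant time on average, which is the kind of saving the slide attributes to a hashing function.
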
Packet Buffer Interface
- Interfaces with Master Arbiter and Forwarding Engine
- Output Queued Switch
- Statically Assigned Single Queue per port
- Off-chip ZBT SRAM on NetFPGA board

Control Block
- Typical Register Rd/Wr Functionality
  - Status Register
  - Control Register (forwarding disable, reset)
  - Router’s IP Addresses (port 1-4)
  - Queue Size Registers
  - Statistics Registers
  - Layer-2 Table Programming Registers
  - Layer-3 Table Programming Registers

Verification
- Three Levels of Verification Performed
- Simulations
  - Module Level: to verify the module design intent and bus functional model
  - System Level: using the NetFPGA verification environment for packet level simulations
- Hardware Verification
  - Ported System Level tests to create tcpdump files for NetFPGA traffic server
  - Very good success on Hardware with all System Level tests passing
  - Only one modification required (reset generation) after Hardware Porting
- Demo: Greg can provide lab access to anyone interested

Synthesis Overview
- Design was ported to Altera EP20K400 Device
- Logic Elements Utilized: 5833 (35% of Total LEs)
- RAM ESBs Used: 46848 (21% of Total ESBs)
- Max Design Clock Frequency ~ 31 MHz
- No Timing Violations

Design Block Name             Flip-flops (Actual)  Ram bits (Actual)  Gates (Actual)
Main Arbiter                                   71                  0            1500
Memory Controller                             109                  0            2000
Control Block                                 608                  0            5000
Ingress FIFO Controller                        60              64000            1200
Switching and Routing Engine                  925              14000           14000
Total                                        1773              78000           23700

Conclusion
- Easy to achieve “required” performance in an OQ Shared Memory Switch in NetFPGA
- Modularity of the design allows more interesting and challenging future projects
- Design/Verification Environment was essential to meet schedule
- NetFPGA is an excellent design exploration platform