
1 NetFPGA Workshop Day 1
Presented by: Jad Naous (Stanford University) and Andrew W. Moore (Cambridge University)
Hosted by: Manolis Katevenis at FORTH, Crete, September 2010

2 Tutorial Outline
Background
  Introduction
  The NetFPGA Platform
The Stanford Base Reference Router
  Motivation: Basic IP review
  Demo 1: Reference Router running on the NetFPGA
The Enhanced Reference Router
  Motivation: Understanding buffer size requirements in a router
  Demo 2: Observing and controlling the queue size
How does the NetFPGA work
  Utilities
  Reference Designs
Inside the NetFPGA Hardware
  The Life of a Packet Through the NetFPGA Hardware Datapath
  Interface to software: Exceptions and Host I/O
Exercise: Drop Nth Packet
Concluding Remarks
  Using NetFPGA for research and teaching

3 Section I: Motivation

4 What is the NetFPGA? A line-rate, flexible, open networking platform for teaching and research

5 NetFPGA consists of…
Four elements:
  NetFPGA board
  Tools + reference designs
  Contributed projects
  Community

6 NetFPGA board
PC with NetFPGA:
  Networking software running on a standard PC (CPU, memory, PCI)
  A hardware accelerator built with a Field Programmable Gate Array driving Gigabit network links (FPGA, memory, 1GE ports)

7 Tools + reference designs
Tools to:
  Compile designs
  Verify designs
  Interact with hardware
Reference designs:
  Router (HW)
  Switch (HW)
  Network Interface Card (HW)
  Router Kit (SW)
  SCONE (SW)

8 Example Contributed Projects
Project                 Contributor
OpenFlow switch         Stanford University
Packet generator        Stanford University
NetFlow Probe           Brno University
NetThreads              University of Toronto
zFilter (Sp)router      Ericsson
Traffic Monitor         University of Catania
DFA                     UMass Lowell
More projects:

9 Community
Wiki
  Documentation (slowly growing)
  Encourage users to contribute
Forums
  Support by users, for users
  Active community – 10s to 100s of posts per week

10 NetFPGA’s Defining Characteristics
Line-Rate
  Processes back-to-back packets without dropping packets, at the full rate of Gigabit Ethernet links
  Operates on packet headers, for switching, routing, and firewall rules
  And on packet payloads, for content processing and intrusion prevention
Open-source Hardware
  Similar to open-source software: full source code available, BSD-style license
  But harder, because:
    Hardware modules must meet timing
    Verilog & VHDL components have more complex interfaces
    Hardware designers need high confidence in the specification of modules

11 Test-Driven Design
Regression tests
  Have repeatable results
  Define the supported features
  Provide clear expectation on functionality
Example: Internet Router
  Drops packets with bad IP checksum
  Performs Longest Prefix Matching on destination address
  Forwards IPv4 packets of length … bytes
  Generates ICMP message for packets with TTL <= 1
  Defines handling of packets with IP options or non-IPv4 packets
  … and dozens more …
Every feature is defined by a regression test

12 Who, How, Why
Who uses the NetFPGA?
  Teachers, students, researchers
How do they use the NetFPGA?
  To run the Router Kit
  To build modular reference designs: IPv4 router, 4-port NIC, Ethernet switch, …
Why do they use the NetFPGA?
  To measure performance of Internet systems
  To prototype new networking systems

13 What you will learn
Overall picture of NetFPGA
  How reference designs work
  How you can work on a project
NetFPGA Design Flow
  Directory structure, library modules and projects
  How to utilize contributed projects
  Interface/Registers
  How to verify a design (simulation and regression tests)
  Things to do when you get stuck
AND… you can start your own projects!

14 Section II: Demo Basic Use

15 Basic Uses of NetFPGA
Recap Internet Protocol and routing
Demonstrate:
  How you can use the NetFPGA as a router
  See routing in action

16 What is IP?
IP (Internet Protocol)
  Protocol used for communicating data across packet-switched networks
  Divides data into a number of packets (IP packets)
IP Packet
  Header (IP header) including:
    Source IP address
    Destination IP address

17 IP Header
[Figure: IPv4 header layout, 32 bits per row, 20-byte fixed header]
  Ver | HLen | T.Service | Total Packet Length
  Fragment ID | Flags | Fragment Offset
  TTL | Protocol | Header Checksum
  Source Address
  Destination Address
  Options (if any)
  Data
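As a concrete (hypothetical) illustration of this layout in the document's hardware language, the sketch below views the 20-byte fixed header as a 160-bit vector and pulls each field out as a bit slice; the module name and port names are made up for illustration, not taken from the NetFPGA source.

  // Hypothetical illustration: the 20-byte fixed IPv4 header as a 160-bit
  // vector, with bit 159 being the first bit on the wire. Field order
  // follows the layout listed above.
  module ipv4_header_fields (
      input  wire [159:0] ip_hdr,
      output wire [3:0]   version,
      output wire [3:0]   hlen,
      output wire [7:0]   tos,
      output wire [15:0]  total_length,
      output wire [15:0]  fragment_id,
      output wire [2:0]   flags,
      output wire [12:0]  fragment_offset,
      output wire [7:0]   ttl,
      output wire [7:0]   protocol,
      output wire [15:0]  header_checksum,
      output wire [31:0]  src_addr,
      output wire [31:0]  dst_addr
  );
      assign version         = ip_hdr[159:156];
      assign hlen            = ip_hdr[155:152];
      assign tos             = ip_hdr[151:144];
      assign total_length    = ip_hdr[143:128];
      assign fragment_id     = ip_hdr[127:112];
      assign flags           = ip_hdr[111:109];
      assign fragment_offset = ip_hdr[108:96];
      assign ttl             = ip_hdr[95:88];
      assign protocol        = ip_hdr[87:80];
      assign header_checksum = ip_hdr[79:64];
      assign src_addr        = ip_hdr[63:32];
      assign dst_addr        = ip_hdr[31:0];
  endmodule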

18 IP Address
  Used to uniquely identify a device (such as a computer) among all other devices on a network
  Two parts:
    Identifier of a particular network on the Internet
    Identifier of a particular device within that network
  Packets destined for a different network first go to the gateway (router) and are forwarded to the destination via routers.

19 Basic Operation of an IP Router
[Figure: example topology with hosts D, E, F and routers including R3 and R5; the router keeps a forwarding table with columns Destination and Next Hop]

20 What does a router do?
[Figure: topology of hosts A–F and routers R1–R5, shown together with the IPv4 header fields from slide 17 and the forwarding table mapping Destination to Next Hop]

21 What does a router do?
[Figure: topology of hosts A–F connected through routers R1–R5]

22 Basic Components of an IP Router
Software (Control Plane):
  Management & CLI
  Routing Protocols
  Routing Table
Hardware (Datapath, per-packet processing):
  Forwarding Table
  Switching

23 Per-packet processing in an IP Router
1. Accept packet arriving on an incoming link.
2. Lookup packet destination address in the forwarding table to identify outgoing port(s).
3. Manipulate IP header: e.g., decrement TTL, update header checksum.
4. Send packet to the outgoing port(s).
5. Buffer packet in the output queue.
6. Transmit packet onto outgoing link.

24 Generic Datapath Architecture
[Figure: header processing looks up the destination IP address in the forwarding table (IP Address → Next Hop) and updates the header; the packet is then queued in buffer memory before transmission]

25 CIDR and Longest Prefix Matches
  The IP address space is broken into line segments.
  Each line segment is described by a prefix.
  A prefix is of the form x/y, where x indicates the prefix shared by all addresses in the line segment and y is the prefix length in bits.
  e.g., the prefix 128.9/16 represents the line segment containing addresses in the range 128.9.0.0 … 128.9.255.255 (2^16 addresses).
[Figure: prefixes 65/8, 128.9/16, and 142.12/19 shown as segments on the address line from 0 to 2^32 - 1]

26 Classless Interdomain Routing (CIDR)
[Figure: two /24 prefixes and two /20 prefixes nested inside 128.9/16 on the address line from 0 to 2^32 - 1]
Most specific route = “longest matching prefix”

27 Techniques for LPM in hardware
Linear search
  Slow
Direct lookup
  Currently requires too much memory
  Updating a prefix leads to many changes
Tries
  Deterministic lookup time
  Easily pipelined, but require multiple memories/references
TCAM (Ternary CAM)
  Simple and widely used, but lower density than RAM and needs more power
  Gradually being replaced by algorithmic methods
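To make longest prefix matching concrete, here is a minimal Verilog sketch of a TCAM-style lookup over a handful of hard-coded prefixes, evaluated as a priority chain from most to least specific. It is illustrative only; the entries, port numbers, and module name are hypothetical and do not correspond to the reference router's actual lookup module.

  // Hypothetical illustration of longest prefix matching. Entries are
  // checked from most specific to least specific, so the first hit is
  // the LPM result.
  module lpm_example (
      input  wire [31:0] dst_ip,        // destination IP address
      output reg  [7:0]  next_hop_port,
      output reg         hit
  );
      always @* begin
          hit = 1'b1;
          // 128.9.16.0/20 -> port 2   (hypothetical entry)
          if (dst_ip[31:12] == {8'd128, 8'd9, 4'd1})
              next_hop_port = 8'd2;
          // 128.9.0.0/16  -> port 1   (hypothetical entry)
          else if (dst_ip[31:16] == {8'd128, 8'd9})
              next_hop_port = 8'd1;
          // 65.0.0.0/8    -> port 0   (hypothetical entry)
          else if (dst_ip[31:24] == 8'd65)
              next_hop_port = 8'd0;
          else begin
              next_hop_port = 8'd0;
              hit = 1'b0;              // no matching prefix
          end
      end
  endmodule

A real lookup would read its entries from a TCAM or a trie in memory rather than hard-coding them, but the priority-from-longest-prefix behavior is the same.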

28 An IP Router on NetFPGA
Software (Linux user-level processes):
  Management & CLI
  Routing Protocols
  Exception Processing
  Routing Table
Hardware (Verilog on the NetFPGA PCI board):
  Forwarding Table
  Switching

29 Demo 1: Reference Router running on the NetFPGA

30 Hardware Setup for Demo #1
[Figure: a video server delivers streaming HD video through a chain of NetFPGA Internet routers to a video display; each PC has 2 CPUs, a NIC (PCI-e), and a NetFPGA board (PCI) with four GE ports; one machine runs the CAD tools]

31 Topology
Demo 1
[Figure: ring topology of NetFPGA routers with interface addresses .1.1 through .30.2; the video server, video client, and the shortest path between them are marked]

32

33 Working IP Router
Demo 1
Objectives:
  Become familiar with the Stanford Reference Router
  Observe PW-OSPF re-routing traffic around a failure

34 Step 1 – Observe the Routing Tables
Demo 1
  The router is already configured and running on your machines
  The routing table has converged to the routing decisions with the minimum number of hops
  Next, break a link …

35 Step 2 – Dynamic Re-routing
Demo 1
  Any PC can stream traffic through multiple NetFPGA routers in the ring topology to any other PC
  Example: to stream mplayer video from server 4.1, type: ./mp
  Key (figure): eth1 of Host PC; X.Y = NetFPGA Router #
[Figure: ring of host PCs and NetFPGA routers 1–9 with their interface addresses (1.1, 4.1, 7.1, 10.1, 13.1, 16.1, 19.1, 22.1, 25.1, 28.1)]

36 Step 3 – Dynamic Re-routing
Demo 1
  Break the link between video server and video client
  Routers re-route traffic around the broken link and the video continues playing
[Figure: ring topology with interface addresses; traffic is re-routed the long way around the ring after the link break]

37 Section III: Demo Advanced Use

38 Advanced Uses of NetFPGA
  Introduction to TCP and buffer sizes
  Demonstrate NetFPGA used for real-time measurement
  See the TCP sawtooth in real time

39 Buffer Requirements in a Router
Buffer size matters:
  Small queues reduce delay
  Large buffers are expensive
Theoretical tools predict requirements:
  Queueing theory
  Large deviations theory
  Mean field theory
Yet there is no direct answer:
  Flows have a closed-loop nature
  The question arises whether the focus should be on the equilibrium state or the transient state
Having said that, one might think the buffer sizing problem must be very well understood. After all, we are equipped with tools like queueing theory, large deviations theory, and mean field theory, which are focused on solving exactly this type of problem. You would think this is simply a matter of understanding the random process that describes the queue occupancy over time. Unfortunately, this is not the case. The closed-loop nature of the flows, and the fact that flows react to the state of the system, makes it necessary to use control-theoretic tools, but those tools emphasize the equilibrium state of the system and fail to describe transient delays.

40 Rule-of-thumb
[Figure: source → router → destination; bottleneck link of capacity C; round-trip propagation delay 2T]
Universally applied rule-of-thumb:
  A router needs a buffer size B ≥ 2T × C
  2T is the two-way propagation delay (or just 250 ms)
  C is the capacity of the bottleneck link
Context:
  Mandated in backbone and edge routers
  Appears in RFPs and IETF architectural guidelines
  Already known by the inventors of TCP [Van Jacobson, 1988]
  Has major consequences for router design
So if the problem is not easy, what do people do in practice? Buffer sizes in today’s Internet routers are set based on a rule-of-thumb which says that if we want the core routers to have 100% utilization, the buffer size should be greater than or equal to 2T×C, where 2T is the two-way propagation delay of packets going through the router and C is the capacity of the target link. This rule is mandated in backbone and edge routers and appears in RFPs and IETF architectural guidelines. It has been known almost since the time TCP was invented. Note that if the capacity of the network is increased, this rule says we need to increase the buffer size linearly with capacity. We don’t expect the propagation delay to change much over time, but we do expect capacity to grow very rapidly; therefore this rule has major consequences for router design, and that’s exactly why today’s routers have so much buffering.

41 The Story So Far
Three rules for sizing router buffers (# packets needed at 10 Gb/s):
  Rule-of-thumb, B = 2T × C (assume: 100% utilization): ~1,000,000 packets
  B = 2T × C / sqrt(N) (assume: large number of desynchronized flows; 100% utilization): ~10,000 packets
  B = O(log W) (assume: large number of desynchronized flows; <100% utilization): ~20 packets
After this relatively long introduction, let me give an overview of the rest of my presentation. I’ll talk about three different rules for sizing router buffers. The first rule is the rule-of-thumb which I just described; it is based on the assumption that we want 100% link utilization on the core links. The second rule is a more recent result proposed by Appenzeller, Keslassy, and McKeown which challenges the original rule-of-thumb: if we have N flows going through the router, we can reduce the buffer size by a factor of sqrt(N). The underlying assumption is that we have a large number of flows and the flows are desynchronized. Finally, the third rule, which I’ll talk about today, says that if we are willing to sacrifice a very small amount of throughput, i.e., if a throughput of less than 100% is acceptable, we might be able to reduce buffer sizes significantly, to just O(log W) packets, where W is the maximum congestion window size. If we apply each of these rules to a 10 Gb/s link, we will need to buffer 1,000,000 packets based on the first rule, about 10,000 packets based on the second one, and only 20 packets based on the third rule. For the rest of this presentation I’ll show you the intuition behind each rule and provide some evidence that validates it. Let’s start with the rule-of-thumb.

42 Exploring Buffer Sizes
  Need to reduce buffer size and measure occupancy
  Not possible in commercial routers
  So, we will use the NetFPGA instead
Objective: Use the NetFPGA to understand how large a buffer we need for a single TCP flow.

43 Why 2T×C for a single TCP Flow?
Rule for adjusting W:
  If an ACK is received: W ← W + 1/W
  If a packet is lost: W ← W/2
Only W packets may be outstanding
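The following short derivation is not on the original slide, but it is the standard single-flow argument behind the rule-of-thumb and is consistent with the window rule above (here W_max is the window just before a loss, B the buffer size, C the bottleneck capacity, and 2T the round-trip propagation delay):

  % Sketch of the standard single-flow argument for B = 2T*C.
  \begin{align*}
    \text{At the moment of loss (buffer full):} \quad & W_{\max} = 2T\cdot C + B \\
    \text{After the loss the window halves:} \quad & W = \tfrac{W_{\max}}{2} \\
    \text{The link stays busy only if } \tfrac{W_{\max}}{2} \ge 2T\cdot C
      \;\Rightarrow\; & B \ge 2T\cdot C
  \end{align*}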

44 Time Evolution of a Single TCP Flow
  Time evolution of a single TCP flow through a router, buffer = 2T·C
  Time evolution of a single TCP flow through a router, buffer < 2T·C
Here is a simulation of a single TCP flow with a buffer size equal to the bandwidth-delay product. The congestion window follows a sawtooth shape and varies between 140 and 280. The bottom graph shows the queue occupancy: when the congestion window is halved, the buffer occupancy drops to zero, and the two curves change similarly. Since the pipe is full at all times, link utilization remains at 100%. When the buffer size is less than the bandwidth-delay product, on the other hand, the queue occupancy goes to zero when the congestion window is halved and remains at zero for a while, until the congestion window grows enough to fill the pipe again. During this time we see a reduction in link utilization.

45 Demo 2 Buffer Sizing Experiments using the NetFPGA Router

46 Hardware Setup for Demo #2
[Figure: each server (2 CPUs, NIC on PCI-e, NetFPGA board on PCI with four GE ports) delivers streaming HD video to the adjacent client through its NetFPGA Internet router]

47 Topology
Demo 2
  eth1 connects your host to your NetFPGA Router
  nf2c2 routes to nf2c1 (your adjacent server)
  eth2 serves web and video traffic to your neighbor
  nf2c0 & nf2c3 (the network ring) are unused
  This configuration allows you to modify and test your router without affecting others
[Figure: ring topology with per-interface addresses]

48 Enhanced Router
Demo 2
Objectives:
  Observe the router with new modules
  New modules: rate limiting, event capture
Execution:
  Run the event capture router
  Look at the routing tables
  Explore the details pane
  Start a TCP transfer, look at queue occupancy
  Change the rate, look at queue occupancy

49 Step 1 – Run Pre-made Enhanced Router
Demo 2
  Start a terminal and cd to “netfpga/projects/tutorial_router/sw/”
  Type “./tut_adv_router_gui.pl”
  A familiar GUI should start

50 Step 2 – Explore Enhanced Router
Demo 2
  Click on the Details tab
  A pipeline similar to the one seen previously is shown, with some additions

51 Enhanced Router Pipeline
Demo 2
Pipeline: MAC/CPU RxQ → Input Arbiter → Output Port Lookup → Output Queues → Rate Limiter → MAC/CPU TxQ, with Event Capture observing the output queues
Two modules added:
  Event Capture, to capture output queue events (writes, reads, drops)
  Rate Limiter, to create a bottleneck

52 Step 3 – Decrease the Link Rate
Demo 2
  To create a bottleneck and show the TCP “sawtooth,” the link rate is decreased
  In the Details tab, click the “Rate Limit” module
  Check Enabled
  Set the link rate to 1.953 Mbps

53 Step 4 – Decrease Queue Size
Demo 2
  Go back to the Details panel and click on “Output Queues”
  Select the “Output Queue 2” tab
  Change the “output queue size in packets” slider to 16

54 Step 5 – Start Event Capture
Demo 2
  Click on the Event Capture module under the Details tab
  This should start the configuration page

55 Step 6 – Configure Event Capture
Demo 2
  Check “Send to local host” to receive events on the local host
  Check “Monitor Queue 2” to monitor the output queue of MAC port 1
  Check “Enable Capture” to start event capture

56 Step 7 – Start TCP Transfer
Demo 2
  We will use iperf to run a large TCP transfer and look at the queue evolution
  Start a terminal and cd to “netfpga/projects/tutorial_router/sw”
  Type “./iperf.sh”

57 Step 8 – Look at Event Capture Results
Demo 2
  Click on the Event Capture module under the Details tab
  The sawtooth pattern should now be visible

58 Queue Occupancy Charts
Demo 2
  Observe the TCP/IP sawtooth
  Leave the control windows open

59 Section IV: How does the NetFPGA Work

60 Integrated Circuit Technology
  Full-custom design: Complementary Metal Oxide Semiconductor (CMOS)
  Semi-custom ASIC design: gate array, standard cell
  Programmable Logic Devices: Programmable Array Logic, Field Programmable Gate Arrays
  Processors

61 Look-Up Tables
  Combinatorial logic is stored in Look-Up Tables (LUTs)
  Also called Function Generators (FGs)
  Capacity is limited only by the number of inputs, not complexity
  Delay through the LUT is constant
[Figure: a 4-input truth table (A, B, C, D → Z) implemented either as combinatorial logic or as a LUT; diagram from Xilinx, Inc.]
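To illustrate the point about capacity and delay, here is a small hypothetical Verilog function of four inputs; however convoluted the Boolean expression, synthesis maps it onto a single 4-input LUT, so it costs one LUT and one constant LUT delay. The module and expression are invented for illustration.

  // Hypothetical example: any Boolean function of 4 inputs fits in one
  // 4-input LUT, regardless of how complex the expression looks.
  module lut_example (
      input  wire A, B, C, D,
      output wire Z
  );
      // Still just a single 16-entry truth table after synthesis.
      assign Z = (A & ~B & C) | (B ^ D) | (~A & ~C & ~D);
  endmodule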

62 Xilinx CLB Structure
  Each slice has four outputs: two registered outputs, two non-registered outputs
  Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
  Carry logic runs vertically; signals run upward
  Two independent carry chains per CLB
The major parts of a slice include two look-up tables (LUTs), two sequential elements, and carry logic. The LUTs are known as the F LUT and the G LUT. The sequential elements can be programmed to be either registers or latches. Diagram from Xilinx, Inc.

63 Field Programmable Gate Arrays
  CLB: primitive element of the FPGA
  Routing: global routing, local interconnect
  Macro blocks: block memories, microprocessor
  I/O blocks

64 NetFPGA Package
  Utilities: simulation, synthesis, registers
  Verilog libraries (shared modules)
  Projects (reference and contributed)

65 Simulation and Synthesis
Simulation (nf_run_test.pl)
  Allows simulation from the command line or GUI
  Uses backend libraries (Perl and Python) to create packets for simulation
Synthesis (make)
  Run in the project’s synth directory
  Automatically includes Xilinx Coregen components from the shared libraries
  Includes all Xilinx Coregen components from the project’s synth directory (.xco)

66 Shared Verilog Libraries (modules)
  Located at netfpga/lib/verilog
  Specify shared libraries in project.xml
  Any project can use any module
  A local module in a project’s src directory overrides a shared library
  e.g., if arp_reply is found both in the shared library and in the project’s src directory, only the project’s src version is used

67 Register System
Project XML (project.xml)
  Found in the project/include directory
  Specifies shared libraries and the location of registers in the pipeline
Each module with registers has an XML file
  Specifies the register names and widths
Register files are automatically created using nf_register_gen.pl:
  Perl header files
  C header files
  Verilog file defining registers

68 Reference Projects
Easily extend and add modules
Currently:
  Reference NIC
  Reference Router
  Reference Switch
  Router Kit
  Router Buffer Sizing

69 Full System Components
[Figure: software stack (SCONE with PW-OSPF, Java GUI, driver exposing nf2c0–nf2c3, ioctl register access) connected over the PCI bus (DMA and registers) to the NetFPGA; on the board, CPU and MAC RxQs feed the user data path, which feeds CPU and MAC TxQs, with nf2_reg_grp handling register access and the MACs connecting to Ethernet]

70 Reference Router Pipeline
Pipeline: MAC/CPU RxQ → Input Arbiter → Output Port Lookup → Output Queues → MAC/CPU TxQ
Five stages:
  Input
  Input arbitration
  Routing decision and packet modification
  Output queuing
  Output
Packet-based module interface
Pluggable design

71 Section V: Life of a Packet Through Hardware

72 Life of a Packet through the Hardware
[Figure: an IP packet travels from host x to host y, entering the NetFPGA on port 0 and leaving on port 2]

73 Inter-Module Communication
Using “Module Headers”:
  Ctrl word (8 bits) + Data word (64 bits)
  Module headers contain information such as packet length, input port, output port, …
[Figure: packet layout as it moves between modules: module header(s), last module header, Eth Hdr, IP Hdr, payload; ctrl = 0x10 marks the last word of the packet]

74 Inter-Module Communication
[Figure: Module i connects to Module i+1 through data and ctrl buses in the forward direction, a wr (write) signal from Module i, and a rdy (ready) signal back from Module i+1]
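The following Verilog skeleton is a rough, hypothetical sketch of what such a stage interface looks like, assuming the 64-bit data and 8-bit ctrl words and the wr/rdy handshake described above; the port names and widths are illustrative, not the exact reference pipeline declarations.

  // Hypothetical skeleton of a pipeline stage using the data/ctrl/wr/rdy
  // handshake shown on this slide.
  module example_stage #(
      parameter DATA_WIDTH = 64,
      parameter CTRL_WIDTH = 8
  ) (
      // From Module i-1
      input  wire [DATA_WIDTH-1:0] in_data,
      input  wire [CTRL_WIDTH-1:0] in_ctrl,
      input  wire                  in_wr,    // upstream asserts wr to push a word
      output wire                  in_rdy,   // we assert rdy when we can accept a word
      // To Module i+1
      output wire [DATA_WIDTH-1:0] out_data,
      output wire [CTRL_WIDTH-1:0] out_ctrl,
      output wire                  out_wr,
      input  wire                  out_rdy,
      input  wire                  clk,
      input  wire                  reset
  );
      // A pass-through stage: accept a word whenever downstream is ready and
      // forward it unchanged. A real module would inspect or rewrite the
      // module headers and packet data here (usually behind a small FIFO).
      assign in_rdy   = out_rdy;
      assign out_data = in_data;
      assign out_ctrl = in_ctrl;
      assign out_wr   = in_wr & out_rdy;
  endmodule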

75 MAC Rx Queue
[Figure: the packet as received in the MAC Rx Queue]
  Eth Hdr: Dst MAC = port 0, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 64, Csum: 0x3ab4
  Data

76 Rx Queue
[Figure: the Rx Queue prepends a module header]
  Module Hdr (ctrl 0xff): pkt length, input port = 0
  Eth Hdr: Dst MAC = port 0, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 64, Csum: 0x3ab4
  Data

77 Input Arbiter
[Figure: the Input Arbiter selects packets from Rx Queues 0–7 and forwards them, one at a time, into the pipeline]

78 Output Port Lookup
[Figure: the packet as it enters the Output Port Lookup]
  Module Hdr (ctrl 0xff): pkt length, input port = 0
  Eth Hdr: Dst MAC = 0, Src MAC = x, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 64, Csum: 0x3ab4
  Data

79 Output Port Lookup
Processing steps:
  1- Check input port matches Dst MAC
  2- Check TTL, checksum
  3- Lookup next hop IP & output port (LPM)
  4- Lookup next hop MAC address (ARP)
  5- Add output port to the module header: pkt length, input port = 0, output port = 4
  6- Modify MAC Dst and Src addresses: Dst MAC = nextHop, Src MAC = port 4
  7- Decrement TTL and update checksum: TTL 64 → 63, Csum 0x3ab4 → 0x3ac2
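Step 7 is worth a quick sketch: the IPv4 checksum can be patched incrementally when TTL is decremented (one's-complement arithmetic in the spirit of RFC 1624) instead of being recomputed over the whole header. The snippet below is a hypothetical illustration of that idea, not the reference router's actual code.

  // Hypothetical sketch: decrement TTL and patch the IPv4 header checksum
  // incrementally rather than recomputing it over all header words.
  module ttl_decrement (
      input  wire [7:0]  ttl_in,
      input  wire [15:0] csum_in,   // IPv4 header checksum before the change
      output wire [7:0]  ttl_out,
      output wire [15:0] csum_out
  );
      assign ttl_out = ttl_in - 8'd1;

      // TTL is the high byte of one 16-bit header word, so that word drops
      // by 0x0100; the stored checksum (its complement) rises by 0x0100,
      // with the end-around carry folded back in.
      wire [16:0] patched = {1'b0, csum_in} + 17'h00100;
      assign csum_out = patched[15:0] + {15'b0, patched[16]};
  endmodule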

80 Output Queues
[Figure: the packet is written into output queue OQ4 (one of OQ0–OQ7), corresponding to output port 4]

81 MAC Tx Queue
[Figure: the packet as it arrives at the MAC Tx Queue]
  Module Hdr (ctrl 0xff): pkt length, input port = 0, output port = 4
  Eth Hdr: Dst MAC = nextHop, Src MAC = port 4, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 63, Csum: 0x3ac2 (was TTL: 64, Csum: 0x3ab4)
  Data

82 MAC Tx Queue
[Figure: the packet in the MAC Tx Queue, ready for transmission]
  Module Hdr (ctrl 0xff): pkt length, input port = 0, output port = 4
  Eth Hdr: Dst MAC = nextHop, Src MAC = port 4, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 63, Csum: 0x3ac2
  Data

83 Exception Packet
  Example: TTL = 0 or TTL = 1
  The packet has to be sent to the CPU, which will generate an ICMP packet as a response
  The difference starts at the Output Port Lookup stage

84 Exception Packet Path
[Figure: the exception packet travels from the MAC Rx queue through the user data path to a CPU Tx queue, across the PCI bus by DMA, through the driver (nf2c0–nf2c3), and up to SCONE (PW-OSPF) and the Java GUI]

85 Output Port Lookup
Processing steps for the exception packet:
  1- Check input port matches Dst MAC
  2- Check TTL, checksum – EXCEPTION!
  3- Add output port to the module header: pkt length, input port = 0, output port = 1 (a CPU queue)
  Eth Hdr: Dst MAC = 0, Src MAC = x, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 1, Csum: 0x3ab4
  Data

86 Output Queues
[Figure: the exception packet is written into output queue OQ1 (one of OQ0–OQ7), the CPU queue]

87 CPU Tx Queue
[Figure: the exception packet as it arrives at the CPU Tx Queue]
  Module Hdr (ctrl 0xff): pkt length, input port = 0, output port = 1
  Eth Hdr: Dst MAC = 0, Src MAC = x, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 1, Csum: 0x3ab4
  Data

88 CPU Tx Queue
[Figure: the packet in the CPU Tx Queue, ready to be transferred to the host]
  Module Hdr (ctrl 0xff): pkt length, input port = 0, output port = 1
  Eth Hdr: Dst MAC = 0, Src MAC = x, Ethertype = IP
  IP Hdr: IP Dst: , TTL: 1, Csum: 0x3ab4
  Data

89 ICMP Packet
  For the ICMP packet, the packet arrives at the CPU Rx Queue from the PCI Bus
  It follows the same path as a packet from the MAC until it reaches the Output Port Lookup
  The OPL module sees the packet is from CPU Rx Queue 1 and sets the output port directly to 0
  The packet then continues on the same path as the non-exception packet, to the Output Queues and then MAC Tx queue 0

90 ICMP Packet Path
[Figure: the ICMP packet generated by SCONE (PW-OSPF) travels down through the driver, across the PCI bus by DMA to a CPU Rx queue, through the user data path, and out through a MAC Tx queue to Ethernet]

91 NetFPGA-Host Interaction
  The Linux driver interfaces with the hardware
  Packet interface via the standard Linux network stack
  Register reads/writes via the ioctl system call, with wrapper functions:
    readReg(nf2device *dev, int address, unsigned *rd_data);
    writeReg(nf2device *dev, int address, unsigned *wr_data);
    e.g., readReg(&nf2, OQ_NUM_PKTS_STORED_0, &val);

92 NetFPGA-Host Interaction
NetFPGA-to-host packet transfer:
  1. Packet arrives – the forwarding table sends it to a CPU queue
  2. Interrupt notifies the driver of packet arrival
  3. Driver sets up and initiates a DMA transfer over the PCI bus

93 NetFPGA-Host Interaction
NetFPGA-to-host packet transfer (cont.):
  4. NetFPGA transfers the packet via DMA over the PCI bus
  5. Interrupt signals completion of DMA
  6. Driver passes the packet to the network stack

94 NetFPGA-Host Interaction
Host-to-NetFPGA packet transfer:
  1. Software sends a packet via network sockets; the packet is delivered to the driver
  2. Driver sets up and initiates a DMA transfer over the PCI bus
  3. Interrupt signals completion of DMA

95 NetFPGA-Host Interaction
Register access:
  1. Software makes an ioctl call on a network socket; the ioctl is passed to the driver
  2. Driver performs a PCI memory read/write

96 NetFPGA-Host Interaction
  Packet transfers shown using the DMA interface
  Alternative: use programmed I/O to transfer packets via register reads/writes
  Slower, but eliminates the need to deal with network sockets

97 Section VI: Exercise

98 Drop 1 in N Packets
Exercise 1
Objectives:
  Add a counter and FSM to the code
  Synthesize and test the router
Execution:
  Open drop_nth_packet.v
  Insert the counter code
  Synthesize
  After synthesis, test the new system

99 New Reference Router Pipeline
Exercise 1
Pipeline: MAC/CPU RxQ → Input Arbiter → Output Port Lookup → Drop Nth Packet → Output Queues → Rate Limiter → MAC/CPU TxQ, with Event Capture observing the output queues
One module added:
  Drop Nth Packet, to drop every Nth packet from the reference router pipeline

100 Step 1 – Open the Source
Exercise 1
  We will modify the Verilog source code to add a counter to the drop_nth_packet module
  Open a terminal
  Type “xemacs netfpga/projects/tutorial_router/src/drop_nth_packet.v”

101 Step 2 – Add Counter to Module
Exercise 1
  Add a counter using the following signals:
    counter – 16-bit output signal that you should increment on each packet pulse
    rst_counter – reset signal (a pulse input)
    inc_counter – increment signal (a pulse input)
  Search for “insert counter” (ctrl+s insert counter, Enter)
  Insert the counter and save (ctrl+x ctrl+s)
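One possible way to fill in the counter is sketched below. It assumes the surrounding module already declares counter as a 16-bit register, drives rst_counter and inc_counter as single-cycle pulses, and provides clk and reset; it is a guess at the missing block, not the tutorial's reference solution.

  // Hypothetical fill-in for the "insert counter" section of drop_nth_packet.v.
  // Assumes: reg [15:0] counter; rst_counter and inc_counter are one-cycle
  // pulses; clk and reset exist in the enclosing module.
  always @(posedge clk) begin
     if (reset || rst_counter)
        counter <= 16'd0;             // clear on reset or when the count restarts
     else if (inc_counter)
        counter <= counter + 16'd1;   // count one more packet
  end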

102 Step 3 – Build the Hardware
Exercise 1
  Start a terminal, cd to “netfpga/projects/tutorial_router/synth”
  Run “make clean”
  Start synthesis with “make”

103 Step 4 – Test your Router
Exercise 1
  You can watch the number of received and sent packets to see the module drop every Nth packet
  Ping a local machine and watch for missing pings
To run your router:
  1- Enter the directory by typing: cd netfpga/projects/tutorial_router/sw
  2- Run the router by typing: ./tut_adv_router_gui.pl --use_bin ../../../bitfiles/tutorial_router.bit
To set the value of N (which packet to drop), type: regwrite 0x… N – replace N with a number (such as 100)
To enable packet dropping, type: regwrite 0x… 0x1
To disable packet dropping, type: regwrite 0x… 0x0

104 Step 5 – Measurements
Exercise 1
  Determine the iperf TCP throughput to your neighbor’s server for each of several values of N (similar to Demo 2, Step 8):
    cd netfpga/projects/tutorial_router/sw
    ./iperf.sh
  Ping x.2 (where x is your neighbor’s server)
  TCP throughput with:
    Drop circuit disabled: TCP throughput = ________ Mbps
    Drop one in N = 1,000 packets
    Drop one in N = 100 packets
    Drop one in N = 10 packets
  Explain why TCP’s throughput is so low given that only a tiny fraction of packets are lost

105 Section VII: Concluding Remarks

106 NetFPGAs are used:
  To run laboratory courses on network routing – professors teach courses (Stanford, Cambridge, Rice, ...)
  To teach students how to build real Internet routers – train students to build routers (Cisco, Juniper, Huawei, ...)
  To research new features in the network – build network services for data centers (Google, UCSD, ...)
  To prototype systems with live traffic – that measure buffers (while maintaining throughput, ...)
  To help hardware vendors understand device requirements – use of hardware (Xilinx, Micron, Cypress, Broadcom, ...)

107 Running the Router Kit
Usage #1: user-space development, 4x1GE line-rate forwarding
[Figure: OSPF, BGP, and “My Protocol” run in user space and populate the kernel routing table; the routing table is “mirrored” into the forwarding table of the IPv4 router on the FPGA (with packet buffer memory and four 1GE ports), so forwarding happens at line rate in hardware]

108 Enhancing Modular Reference Designs
Usage #2
[Figure: design flow – design, simulate, synthesize, download using Verilog and EDA tools (Xilinx, Mentor, etc.); the FPGA pipeline (In Q Mgmt → L2 Parse → IP Lookup → L3 → Out Q, plus “My Block”) is built from Verilog modules interconnected by FIFO interfaces; the host runs the NetFPGA driver, PW-OSPF, and an extensible Java GUI front panel]

109 Creating new systems
Usage #3
[Figure: same design flow (design, simulate, synthesize, download with Verilog and EDA tools), but the FPGA is filled entirely with “My Design”; the 1GE MAC is soft and replaceable]

110 NetFPGA Platform Major Components
Interfaces
  4 Gigabit Ethernet ports
  PCI host interface
Memories
  36 Mbits Static RAM
  512 Mbits DDR2 Dynamic RAM
FPGA Resources
  Block RAMs
  Configurable Logic Blocks (CLBs)
  Memory-mapped registers

111 NetFPGA Cube Systems
  PCs assembled from parts: Stanford University, Cambridge University
  Pre-built systems available: Accent Technology Inc.
  Details are in the Guide

112 Rackmount NetFPGA Servers
  NetFPGA inserts in a PCI or PCI-X slot
  2U Server (Dell 2950)
  1U Server (Accent Technology Inc.)
  Thanks: Brian Cashman for providing the machine

113 Stanford NetFPGA Cluster
Statistics
  Rack of 40 1U PCs with NetFPGAs
  Managed power
  Console LANs
  Provides 4 × 40 = 160 Gbps of full line-rate processing bandwidth

114 Acknowledgments NetFPGA Team at Stanford University (Past and Present): Nick McKeown, Glen Gibb, Jad Naous, David Erickson, G. Adam Covington, John W. Lockwood, Jianying Luo, Brandon Heller, Paul Hartke, Neda Beheshti, Sara Bolouki, James Zeng, Jonathan Ellithorpe, Sachidanandan Sambandan

115 Special thanks to our Partners:
  Patrick Lysaght, Veena Kumar, Paul Hartke, Anna Acevedo – Xilinx University Program (XUP)
Past NetFPGA tutorials presented at: SIGMETRICS
See:

116 Thanks to our Sponsors:
Support for the NetFPGA project has been provided by the following companies and institutions.
Disclaimer: Any opinions, findings, conclusions, or recommendations expressed in these materials do not necessarily reflect the views of the National Science Foundation or of any other sponsors supporting this project.

