Sundar Iyer Winter 2012 Lecture 8a Packet Buffers with Latency EE384 Packet Switch Architectures.

Reducing the size of the SRAM
Intuition: if we use a lookahead buffer to peek at the requests "in advance", we can replenish the SRAM cache only when needed. This increases the latency from when a request is made until the byte is available; but because it is a pipeline, the issue rate stays the same.

Algorithm
1. Lookahead: the next Q(b – 1) + 1 arbiter requests are known.
2. Compute: determine which queue will run into "trouble" soonest (green!).
3. Replenish: fetch b bytes for the "troubled" queue.
[Diagram: Q queues, each with b – 1 bytes of SRAM cache, and a lookahead buffer holding Q(b – 1) + 1 requests.]
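As a concreteness check, the "Compute" step can be sketched in a few lines of Python: scan the known lookahead window and report the first queue whose cached bytes run out. This is a minimal sketch, not the slides' exact implementation; the function and variable names are ours.

```python
def soonest_trouble_queue(lookahead, sram_bytes):
    """Return the queue that will exhaust its SRAM cache first.

    lookahead: the next Q(b-1)+1 known arbiter requests, one queue id
               per future timeslot (each request consumes one byte).
    sram_bytes: dict queue id -> bytes currently cached in SRAM.
    Returns the first queue to go "critical", or None if no queue does
    within the window.
    """
    remaining = dict(sram_bytes)      # don't mutate the caller's state
    for q in lookahead:
        remaining[q] -= 1             # this request consumes one byte
        if remaining[q] < 0:
            return q                  # q runs into "trouble" here
    return None
```

The replenish step would then fetch b bytes from DRAM for the returned queue; if no queue is critical inside the window, the replenishment opportunity can be spent on any queue (or skipped).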

Example: Q = 4, b = 4. [Animation: snapshots of the queues and the requests in the lookahead buffer at t = 0 through t = 8; the green queue is critical at t = 0, blue at t = 4, and red at t = 8.]

Claim (Patient Arbiter): with an SRAM of size Q(b – 1), a requested byte is available no later than Q(b – 1) + 1 timeslots after it is requested. Example: 160 Gb/s linecard, b = 2560, Q = 512: SRAM = 1.3 MBytes, and the delay bound is 65 µs (equivalent to about 13 km of fiber).
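The example's numbers can be checked directly. The sketch below assumes one byte-slot per request at the 160 Gb/s line rate and a propagation speed of roughly 2×10⁸ m/s in fiber (both assumptions of ours, not stated on the slide):

```python
Q, b = 512, 2560
line_rate = 160e9                      # bits per second
sram_bytes = Q * (b - 1)               # patient-arbiter SRAM size, bytes
delay_slots = Q * (b - 1) + 1          # worst-case delay, in byte-slots
delay = delay_slots * 8 / line_rate    # seconds: one byte per slot at line rate
fiber_km = delay * 2e8 / 1e3           # assumed ~2*10^8 m/s in fiber
print(round(sram_bytes / 1e6, 2), round(delay * 1e6, 1), round(fiber_km, 1))
# prints: 1.31 65.5 13.1
```

So the SRAM is about 1.31 MB and the delay bound about 65.5 µs, matching the slide's 1.3 MBytes and 65 µs.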

Controlling the latency
What if the application can only tolerate a latency l < Q(b – 1) + 1 timeslots? Algorithm: serve a queue once every b timeslots, in the following order:
1. If there is an earliest critical queue, replenish it.
2. If not, replenish the queue that will have the largest deficit l timeslots in the future.
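The two-rule selection above can be sketched as a single function: rule 1 scans the l known requests for an earliest critical queue; rule 2 falls back to the largest-deficit queue. Again a sketch under our own naming, not the slides' code:

```python
def queue_to_replenish(sram_bytes, lookahead):
    """Pick the queue to serve at this replenishment opportunity.

    sram_bytes: dict queue id -> bytes currently cached in SRAM.
    lookahead: the next l known requests (queue ids), where l is the
               latency the application can tolerate.
    """
    # Rule 1: earliest critical queue, i.e. the first queue whose cache
    # would underflow within the lookahead window.
    remaining = dict(sram_bytes)
    for q in lookahead:
        remaining[q] -= 1
        if remaining[q] < 0:
            return q
    # Rule 2: the queue with the largest deficit l timeslots from now
    # (demand within the window minus bytes currently cached).
    demand = {q: 0 for q in sram_bytes}
    for q in lookahead:
        demand[q] += 1
    return max(sram_bytes, key=lambda q: demand[q] - sram_bytes[q])
```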

[Plot: queue length vs. pipeline depth for Q = 1000, b = 10; SRAM size falls as the pipeline latency x grows, from the queue length needed for zero latency down to the queue length needed at maximum latency.]

Some Real Implementations
Cache size can be further optimized based on:
– Using embedded DRAM
– Dividing queues across ingress and egress
– Grouping queues into aggregates (e.g., 10 × 1G as one 10G stream)
– Knowledge of the scheduler pattern
– Granularity of the read scheduler
Most implementations:
– Q < 1024 (original buffer cache), Q < 16K (modified schemes)
– 15+ unique designs
– 40G to 480G streams
– Cache sizes < 32 Mb (~8 mm² in a 28 nm implementation)

Sundar Iyer Winter 2012 Lecture 8b Statistics Counters EE384 Packet Switch Architectures

What is a Statistics Counter?
When a packet arrives, it is first classified to determine its type. Depending on the type(s), one or more counters corresponding to the matched criteria are updated (incremented).

Examples of Statistics Counters
Packet switches maintain counters for:
– Route prefixes, ACLs, TCP connections
– SNMP MIBs
Counters are required for:
– Traffic engineering: policing & shaping
– Intrusion detection
– Performance monitoring (RMON)
– Network tracing

Motivation: build high-speed counters. Determine and analyze techniques for building very high speed (>100 Gb/s) statistics counters that support an extremely large number of counters.
Example: OC192c = 10 Gb/s; counters/pkt (C) = 10; R_eff = 100 Gb/s; total counters (N) = 1 million; counter size (M) = 64 bits; counter memory = N*M = 64 Mb; memory ops = 150M reads + 150M writes per second.
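The sizing arithmetic on this slide is easy to reproduce (a trivial check, using only the slide's own figures):

```python
R = 10e9           # OC192c line rate, bits per second
C = 10             # counters updated per packet
N = 1_000_000      # number of counters
M = 64             # bits per counter

R_eff = R * C                  # effective update bandwidth: 100 Gb/s
counter_memory_bits = N * M    # total counter state: 64 Mb
print(R_eff / 1e9, counter_memory_bits / 1e6)   # prints: 100.0 64.0
```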

Question
Can we use the same packet-buffer hierarchy to build statistics counters? Answer: yes.
– Modify the MDQF algorithm; number of counters = N
– Consider counters in base 1 (mimics cells of size 1 bit)
– Create the Longest Counter First (LCF) algorithm
– Maximum size of a counter (base 1): ~ b(2 + ln N)
– Maximum size of a counter (base 2): ~ log2[b(2 + ln N)]

Optimality of LCF-CMA
Can we do better? Theorem: LCF-CMA is optimal in the sense that it minimizes the size of the counter maintained in SRAM.

Minimum Size of the SRAM Counter
Theorem (speed-down factor b > 1): LCF-CMA minimizes the size of the counter (in bits) in the SRAM to:
f(b, N) = log2( ln(bN) / ln[b/(b – 1)] ) = Θ(log log N)

Implementation Numbers
Example: OC192 line card:
– Line rate R = 10 Gb/s
– Counters per packet C = 10
– R_eff = R*C = 100 Gb/s
– Cell size = 64 bytes
– T_eff = 5.12/2 = 2.56 ns
– DRAM random access time T = 51.2 ns
– Speed-down factor b >= T/T_eff = 20
– Total number of counters: 1000 or 1 million
– Size of counters: 64 bits or 8 bits

Implementation Examples (10 Gb/s line card)

| Configuration          | f(b,N) | Brute force             | LCF-CMA                   |
| N = 1000, M = 8        | 8      | SRAM = 8 Kb, DRAM = 0   | SRAM = 8 Kb, DRAM = 8 Kb  |
| N = 1000, M = 64       | 8      | SRAM = 64 Kb, DRAM = 0  | SRAM = 8 Kb, DRAM = 64 Kb |
| N = 1,000,000, M = 64  | 9      | SRAM = 64 Mb, DRAM = 0  | SRAM = 9 Mb, DRAM = 64 Mb |
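The f(b,N) column can be reproduced from the theorem's formula, using the speed-down factor b = T/T_eff = 20 from the line-card numbers above (ceiling taken because the counter must be a whole number of bits):

```python
import math

def f(b, N):
    """SRAM bits per counter needed by LCF-CMA (formula from the theorem)."""
    return math.log2(math.log(b * N) / math.log(b / (b - 1)))

T_dram, T_eff = 51.2, 2.56           # ns, from the OC192 line-card example
b = T_dram / T_eff                   # speed-down factor, ~20
print(math.ceil(f(b, 1000)), math.ceil(f(b, 1_000_000)))   # prints: 8 9
```

These match the table: 8 bits of SRAM per counter for N = 1000 and 9 bits for N = 1 million, regardless of the full counter width M kept in DRAM.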

Conclusion The SRAM size required by LCF-CMA is a very slow growing function of N There is a minimal tradeoff in SRAM memory size The LCF-CMA technique allows a designer to arbitrarily scale the speed of a statistics counter architecture using existing memory technology

Sundar Iyer Winter 2012 Lecture 8c QoS and PIFO EE384 Packet Switch Architectures

Contents
1. Parallel routers: work-conserving (recap)
2. Parallel routers: delay guarantees

The Parallel Shared Memory Router
[Diagram: arriving packets (A1, A3, A4, A5, B1, B3, C1, C3) enter at rate R, are written into one of K = 8 central memories, and are read out at rate R when they depart. Each memory performs at most one operation, a write or a read, per time slot.]

Problem: how can we design a parallel output-queued, work-conserving router from slower parallel memories?
Theorem 1 (sufficiency): a parallel output-queued router is work-conserving with 3N – 1 memories that can perform at most one memory operation per time slot.

Re-stating the Problem
There are K cages, each able to hold an infinite number of pigeons. Time is slotted, and in any one time slot:
– at most N pigeons can arrive and at most N can depart;
– at most 1 pigeon can enter or leave a cage via its pigeonhole;
– the time slot at which each arriving pigeon will depart is known.
For any router: what is the minimum K such that all N pigeons can be immediately placed in a cage when they arrive, and can depart at the right time?

Intuition for Theorem 1
[Diagram: at time t, a packet arrives with departure time DT = t + X while another departs with DT = t. Only one packet can enter a memory at time t, and only one packet can enter or leave a memory at any time.]

Proof of Theorem 1
When a packet arrives in a time slot, it must choose a memory not chosen by:
1. the N – 1 other packets that arrive in that timeslot;
2. the N packets that depart in that timeslot;
3. the N – 1 other packets that will depart at the same time as this packet departs (in the future).
By the pigeonhole principle, 3N – 1 memories that can each perform at most one memory operation per time slot are sufficient for the router to be work-conserving.
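The counting argument can be exercised with a small randomized sanity check (not a proof): emulate FIFO output queues, give each arriving packet a greedy memory choice among K = 3N – 1 single-port memories, and assert that a choice always exists. The model and names are ours; departure times follow the FIFO discipline assumed in this section.

```python
import random
from collections import defaultdict

def simulate_fifo(N=4, T=500, seed=1):
    """Check that K = 3N-1 single-port memories always leave a free choice
    when emulating an output-queued router with FIFO outputs."""
    K = 3 * N - 1
    booked = defaultdict(set)   # departure slot -> memories read at that slot
    next_free = [0] * N         # next free departure slot per output (FIFO)
    rng = random.Random(seed)
    for t in range(T):
        busy = set(booked.pop(t, set()))        # memories doing a read now
        # up to N packets arrive, at most one per output
        for out in rng.sample(range(N), rng.randint(0, N)):
            d = max(next_free[out], t + 1)      # FIFO departure time
            next_free[out] = d + 1
            # a memory must be idle this slot and free at the departure slot
            choices = [m for m in range(K)
                       if m not in busy and m not in booked[d]]
            assert choices, "fewer than 3N-1 memories would not suffice"
            m = choices[0]
            busy.add(m)                          # written this slot
            booked[d].add(m)                     # read at its departure slot
    return True
```

At most N reads, N – 1 other writes, and N – 1 other departures at slot d can block a packet, so at most 3N – 2 memories are excluded and the assertion never fires.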

The Parallel Shared Memory Router
[Same diagram: K = 8 memories, at most one operation, a write or a read, per time slot.]
From Theorem 1, k = 7 memories don't suffice, but 8 memories do.

Summary: routers which give 100% throughput

|                     | Fabric   | # Mem. | Mem. BW per Mem.¹ | Total Mem. BW | Switch BW | Switch Algorithm |
| Output-Queued       | Bus      | N      | (N+1)R            | N(N+1)R       | NR        | None             |
| C. Shared Mem.      | Bus      | 1      | 2NR               | 2NR           | 2NR       | None             |
| Input Queued        | Crossbar | N      | 2R                | 2NR           | NR        | MWM              |
| CIOQ (Cisco)        | Crossbar | 2N     | 3R                | 6NR           | 2NR       | Maximal          |
| CIOQ (Cisco)        | Crossbar | 2N     | 3R                | 6NR           | 3NR       | Time Reserve*    |
| P. Shared Mem.      | Bus      | Nk     | 2NR/k             | 2NR           | 2NR       | None             |
| DSM (Juniper)       | Xbar     | N      | 4R                | 4NR           | 4NR       | C. Sets          |
| DSM (Juniper)       | Xbar     | N      | 3R                | 3NR           | 6NR       | C. Sets          |
| DSM (Juniper)       | Xbar     | N      | 3R                | 3NR           | 4NR       | Edge Color       |
| DSM (Juniper)       | Bus      | k      | 3NR/k             | 3NR           | 2NR       | C. Sets          |
| PPS – OQ            | Clos     | Nk     | 2R(N+1)/k         | 2N(N+1)R      | 4NR       | C. Sets          |
| PPS – Shared Memory | Clos     | Nk     | 4NR/k             | 4NR           | 4NR       | C. Sets          |

¹ Note that lower memory bandwidth per memory permits memories with higher random access time, which is better.

Contents
1. Parallel routers: work-conserving
2. Parallel routers: delay guarantees

Delay Guarantees
[Diagram: one output with many logical FIFO queues (1 … m), where weighted fair queueing sorts packets of constrained traffic; equivalently, one output with a single Push-In First-Out (PIFO) queue, into which each arriving packet is pushed at an arbitrary position.]
PIFO models: weighted fair queueing, weighted round robin, strict priority, etc.

Delay Guarantees
Problem: how can we design a parallel output-queued router from slower parallel memories and give delay guarantees?
This is difficult because the counting technique depends on being able to predict the departure time and schedule it, and the departure time of a cell is not fixed in policies such as strict priority, weighted fair queueing, etc.

Theorem 2 (sufficiency): a parallel output-queued router can give delay guarantees with 4N – 2 memories that can perform at most one memory operation per time slot.

Intuition for Theorem 2 (N = 3 port router)
[Diagram of the departure order (DT = 1, 2, 3): the cell being inserted has up to N – 1 packets before it and N – 1 packets after it in the departure order at the time of insertion. FIFO: one window of N – 1 memories that can't be used. PIFO: two windows of N – 1 memories each that can't be used.]

Proof of Theorem 2
A packet cannot use the memories:
1. used to write the N – 1 other arriving cells at time t;
2. used to read the N departing cells at time t;
3. that will be used to read the N – 1 cells that depart before it;
4. that will be used to read the N – 1 cells that depart after it.
[Diagram: cell C arriving at time t with departure time DT = t + T, between the cells before C and after C in the departure order.]
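The four exclusion sets above add up directly; a tiny helper (ours, for illustration) makes the 4N – 2 bound explicit:

```python
def pifo_sufficient_memories(N):
    """Memories needed so an arriving cell always finds one free
    (counting from the proof of Theorem 2)."""
    other_writes = N - 1   # other cells written in the same slot
    reads_now    = N       # cells departing in the same slot
    before       = N - 1   # cells that will depart just before it
    after        = N - 1   # cells that will depart just after it
    blocked = other_writes + reads_now + before + after   # = 4N - 3
    return blocked + 1                                    # = 4N - 2

print(pifo_sufficient_memories(3))   # prints: 10 (for the N = 3 example)
```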

Summary: routers which give delay guarantees

|                     | Fabric   | # Mem. | Mem. BW per Mem. | Total Memory BW | Switch BW | Switch Algorithm |
| Output-Queued       | Bus      | N      | (N+1)R           | N(N+1)R         | NR        | None             |
| C. Shared Mem.      | Bus      | 1      | 2NR              | 2NR             | 2NR       | None             |
| Input Queued        | Crossbar | N      | 2R               | 2NR             | NR        | –                |
| CIOQ (Cisco)        | Crossbar | 2N     | 3R               | 6NR             | 2NR       | Marriage         |
| CIOQ (Cisco)        | Crossbar | 2N     | 3R               | 6NR             | 3NR       | Time Reserve     |
| P. Shared Mem.      | Bus      | Nk     | 2NR/k            | 2NR             | 2NR       | –                |
| DSM (Juniper)       | Xbar     | N      | 6R               | 6NR             | 6NR       | C. Sets          |
| DSM (Juniper)       | Xbar     | N      | 4R               | 4NR             | 8NR       | C. Sets          |
| DSM (Juniper)       | Xbar     | N      | 4R               | 4NR             | 5NR       | Edge Color       |
| DSM (Juniper)       | Bus      | k      | 4NR/k            | 4NR             | 2NR       | C. Sets          |
| PPS – OQ            | Clos     | Nk     | 3R(N+1)/k        | 3N(N+1)R        | 6NR       | C. Sets          |
| PPS – Shared Memory | Clos     | Nk     | 6NR/k            | 6NR             | 6NR       | C. Sets          |

Backups

Distributed Shared Memory Router
– The central memories are moved to distributed line cards and shared.
– Memory and line cards can be added incrementally.
From Theorem 1, 3N – 1 memories which can perform one operation per time slot, i.e. a total memory bandwidth of 3NR, suffice for the router to be work-conserving.
[Diagram: switch fabric connecting Line Card 1, Line Card 2, …, Line Card N, each at rate R, with the memories on the line cards.]

Corollary 1
Problem: what is the switch bandwidth for a work-conserving DSM router?
Corollary 1 (sufficiency): a switch bandwidth of 4NR is sufficient for a distributed shared memory router to be work-conserving.
Proof: there are a maximum of 3 memory accesses and 1 port access.

Corollary 2
Problem: what is the switching algorithm for a work-conserving DSM router?
– Bus: no algorithm needed, but impractical.
– Crossbar: an algorithm is needed because only permutations are allowed.
Corollary 2 (existence): an edge-coloring algorithm can switch packets for a work-conserving distributed shared memory router.
Proof: follows from König's theorem: any bipartite graph with maximum degree Δ has an edge coloring with Δ colors.
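König's theorem is constructive, which is what makes the corollary practical. Below is a sketch of the classical alternating-path edge-coloring for bipartite graphs (function and variable names are ours; the slides do not specify an implementation): each edge (input, output) gets one of Δ colors, and each color class is a partial permutation the crossbar can switch in one slot.

```python
from collections import defaultdict

def bipartite_edge_coloring(edges):
    """Color the edges of a bipartite multigraph with Delta colors.

    edges: list of (input, output) pairs; returns {edge index: color}.
    Uses the classical color-swap construction: if the free colors at the
    two endpoints differ, invert an alternating path to make them agree.
    """
    deg = defaultdict(int)
    for i, o in edges:
        deg[('in', i)] += 1
        deg[('out', o)] += 1
    Delta = max(deg.values())
    used = defaultdict(dict)      # node -> {color: (neighbour, edge index)}
    coloring = {}
    for idx, (i, o) in enumerate(edges):
        u, v = ('in', i), ('out', o)
        a = next(c for c in range(Delta) if c not in used[u])  # free at u
        b = next(c for c in range(Delta) if c not in used[v])  # free at v
        if a != b:
            # Invert colors along the (a,b)-alternating path from v; this
            # frees color a at v. The path can never reach u, since u has
            # no edge of color a, so a stays free at u.
            path, node, want = [], v, a
            while want in used[node]:
                nxt, e = used[node][want]
                path.append((node, nxt, want, e))
                node, want = nxt, (b if want == a else a)
            for x, y, c, e in path:       # remove old colors first…
                del used[x][c]
                del used[y][c]
            for x, y, c, e in path:       # …then install the swapped ones
                nc = b if c == a else a
                used[x][nc] = (y, e)
                used[y][nc] = (x, e)
                coloring[e] = nc
        used[u][a] = (v, idx)
        used[v][a] = (u, idx)
        coloring[idx] = a
    return coloring
```

In the DSM setting, the bipartite graph's two sides are the linecards writing and the memories being written in a slot; scheduling each color class in its own slot realizes the claimed switching.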