
1 Miscellaneous Topics EE122 Fall 2012 Scott Shenker Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxson and other colleagues at Princeton and UC Berkeley

Q/A on Project 3

Today’s Lecture: Dim Sum of Design Quality-of-Service, Multicast, Announcements…, Wireless addendum, Advanced CC addendum 3

4 Extending the Internet Service Model

Internet Service Model Best-Effort: everyone gets the same service –No guarantees Unicast: each packet goes to single destination 5

Extending the service model Better than best-effort: Quality of Service (QoS) More than one receiver: Multicast 6

7 Quality of Service (QoS)

Summary of Current QoS Mechanisms Blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, priority scheduling, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah, blah…. 8

9 Multicast

Motivating Example: Internet Radio Internet concert –More than 100,000 simultaneous online listeners –Could we do this with parallel unicast streams? Bandwidth usage –If each stream was 1Mbps, concert requires > 100Gbps Coordination –Hard to keep track of each listener as they come and go Multicast addresses both problems…. 10
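The slide's bandwidth figure follows from simple arithmetic; a quick sanity check with the values it gives (100,000 listeners, 1 Mbps per stream):

```python
# Back-of-the-envelope check of the Internet-radio example's numbers.
listeners = 100_000
stream_mbps = 1  # 1 Mbps per unicast stream

total_gbps = listeners * stream_mbps / 1000  # Mbps -> Gbps
print(total_gbps)  # 100.0 Gbps of unicast traffic for one concert
```

With multicast, the source sends each packet once regardless of the listener count.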

11 Unicast approach does not scale… [Figure: Broadcast Center sending duplicate unicast streams across a Backbone ISP]

12 Instead build data replication trees Copy data at routers At most one copy of a data packet per link LANs implement link layer multicast by broadcasting Routers keep track of groups in real-time Routers compute trees and forward packets along them [Figure: replication tree from the Broadcast Center across the Backbone ISP]

13 Multicast Service Model Receivers join multicast group with address G Sender(s) send data to address G Network routes data to each of the receivers Multicast is both a delivery and a rendezvous mechanism –Senders don’t know list of receivers –The latter is often more important than the former [Figure: sender S sends [G, data] into the network; receivers R0 … Rn each join group G]

14 Multicast and Layering Multicast can be implemented at different layers –Link layer  e.g. Ethernet multicast –Network layer  e.g. IP multicast –Application layer  e.g. End system multicast Each layer has advantages and disadvantages –Link: easy to implement, limited scope –IP: global scope, efficient, but hard to deploy –Application: less efficient, easier to deploy [not covered]

15 Multicast Implementation Issues How is join implemented? How is send implemented? How much state is kept and who keeps it?

Link Layer Multicast Join group at multicast address G –NIC normally only listens for packets sent to unicast address A and broadcast address B –After being instructed to join group G, NIC also listens for packets sent to multicast address G Send to group G –Packet is flooded on all LAN segments, like broadcast Scalability: –State: Only host NICs keep state about who has joined –Bandwidth: Requires broadcast on all LAN segments Limitation: just over single LAN 16

17 Network Layer (IP) Multicast Performs inter-network multicast routing –Relies on link layer multicast for intra-network routing Portion of IP address space reserved for multicast –2^28 addresses for entire Internet Open group membership –Anyone can join (sends IGMP message)  Internet Group Management Protocol –Privacy preserved at application layer (encryption) Anyone can send to group –Even nonmembers
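As a concrete illustration of how IP multicast leans on link-layer multicast, here is a sketch of the standard RFC 1112 mapping from an IPv4 group address (the 224.0.0.0/4 block, i.e. the 2^28 addresses above) to an Ethernet multicast MAC. Only the low 23 bits of the group are copied, so 32 IP groups share each MAC:

```python
import ipaddress

def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC
    (RFC 1112: prefix 01-00-5E plus the low 23 bits of the group)."""
    addr = int(ipaddress.IPv4Address(group))
    assert (addr >> 28) == 0xE, "not a 224.0.0.0/4 multicast address"
    low23 = addr & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)

print(multicast_mac("224.1.1.1"))    # 01:00:5e:01:01:01
print(multicast_mac("239.129.1.1"))  # 01:00:5e:01:01:01 (same MAC: 32:1 overlap)
```

The overlap means a NIC that joins one group may receive frames for 31 others; IP filters those out above the link layer.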

How Would YOU Design this? 5 Minutes…. 18

19 IP Multicast Routing Intra-domain (know the basics here) –Source Specific Tree: Distance Vector Multicast Routing Protocol (DVMRP) –Shared Tree: Core Based Tree (CBT) Inter-domain [not covered]

21 Distance Vector Multicast Routing Protocol Elegant extension to DV routing –Using reverse paths! Use shortest path DV routes to determine if link is on the source-rooted spanning tree Three steps in developing DVMRP –Reverse Path Flooding –Reverse Path Broadcasting –Truncated Reverse Path Broadcasting (pruning)

21 Reverse Path Flooding (RPF) If incoming link is shortest path to source: send on all links except incoming Otherwise: drop Issues (fixed with RPB): Some links (LANs) may receive multiple copies Every link receives each multicast packet [Figure: example topology with source s and receivers r, each router labeled with its distance to s]
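The RPF rule itself fits in a few lines. This sketch assumes a unicast routing table mapping each source to the shortest-path interface toward it; the interface and source names are illustrative:

```python
# Minimal RPF check: forward a multicast packet on all links except the
# incoming one, but only if it arrived on the shortest path to its source.

def rpf_forward(pkt_source, incoming_link, route_to, all_links):
    """route_to: source -> shortest-path interface toward that source."""
    if incoming_link != route_to[pkt_source]:
        return []  # fails the RPF check: drop (came the "wrong way")
    return [l for l in all_links if l != incoming_link]

route_to = {"S": "if0"}  # our shortest path back to S is via if0
print(rpf_forward("S", "if0", route_to, ["if0", "if1", "if2"]))  # ['if1', 'if2']
print(rpf_forward("S", "if1", route_to, ["if0", "if1", "if2"]))  # []
```

Dropping off-path copies is what keeps flooding from looping, at the cost of the duplicate-delivery issues the slide notes.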

22 Other Problems Flooding can cause a given packet to be sent multiple times over the same link Solution: Reverse Path Broadcasting [Figure: routers x, y, z below source S; both x and y forward onto the same link, producing a duplicate packet]

23 Reverse Path Broadcasting (RPB) Choose single parent for each link along reverse shortest path to source Only parent forwards to child link Identifying parent links –Distance –Lower address as tie-breaker [Figure: x is chosen as parent of z on the reverse path to S; x forwards only to its child link]
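Parent election for a link reduces to taking the minimum (distance, address) pair among the routers attached to it, with the lower address breaking distance ties. A sketch with made-up distances and addresses:

```python
# RPB parent election for one shared link, for a given source S.

def elect_parent(candidates):
    """candidates: list of (distance_to_source, router_address).
    Smallest distance wins; lower address breaks ties."""
    return min(candidates)[1]

# Routers with addresses 1 and 2, both at distance 5 from S: tie goes to 1.
print(elect_parent([(5, 1), (5, 2)]))  # 1
# If the router with address 3 is closer, it wins regardless of address.
print(elect_parent([(5, 1), (4, 3)]))  # 3
```

Because tuple comparison orders by distance first and address second, `min` implements exactly the two-step rule on the slide.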

Even after fixing this, not done This is still a broadcast algorithm – the traffic goes everywhere Need to “Prune” the tree when there are subtrees with no group members Networks know they have members based on IGMP messages Add the notion of “leaf” nodes in tree –They start the pruning process 24

25 Pruning Details Prune (Source,Group) at leaf if no members –Send Non-Membership Report (NMR) up tree If all children of router R send NMR, prune (S,G) –Propagate prune for (S,G) to parent R On timeout: –Prune dropped –Flow is reinstated –Downstream routers re-prune Note: a soft-state approach
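The soft-state idea can be sketched as a prune table whose entries simply expire: a prune for (S, G) silently lapses after its lifetime, reinstating the flow until downstream routers re-prune. The lifetime value and names here are illustrative:

```python
import time

PRUNE_LIFETIME = 3.0  # seconds (made-up value; real timers are much longer)
prunes = {}           # (source, group) -> expiry time

def prune(source, group, now=None):
    """Install a soft-state prune for (source, group)."""
    prunes[(source, group)] = (now or time.monotonic()) + PRUNE_LIFETIME

def should_forward(source, group, now=None):
    """Forward unless a still-live prune exists; expired prunes are dropped."""
    expiry = prunes.get((source, group))
    if expiry is None or (now or time.monotonic()) >= expiry:
        prunes.pop((source, group), None)  # prune dropped on timeout
        return True
    return False

prune("S", "G", now=100.0)
print(should_forward("S", "G", now=101.0))  # False: still pruned
print(should_forward("S", "G", now=104.0))  # True: prune timed out, flow reinstated
```

No explicit "unprune" message is needed; forgetting state on timeout is the whole mechanism.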

26 Distance Vector Multicast Scaling State requirements: –O(Sources × Groups) active state How to get better scaling? –Hierarchical Multicast –Core-based Trees

Core-Based Trees (CBT) Pick “rendezvous point” for the group (called core) Build tree from all members to that core –Shared tree More scalable: –Reduces routing table state from O(S x G) to O(G) 27

28 Use Shared Tree for Delivery Group members: M1, M2, M3 M1 sends data [Figure: shared tree rooted at the core; control (join) messages flow up from M1, M2, M3, data flows along the tree]

Core-Based Tree Approach Build tree from all members to core or root –Spanning tree of members Packets are broadcast on tree –We know how to broadcast on trees Requires knowing root per group 29

30 Barriers to Multicast Hard to change IP –Multicast means changes to IP –Details of multicast were very hard to get right Not always consistent with ISP economic model –Charging done at edge, but single packet from edge can explode into millions of packets within network

Multicast vs Caching If delivery need not be simultaneous, caching (as in CDNs) works well, and needs no change to IP 31

32 Announcements

Clarification on Homework 3 All links in the homework are full duplex. –Can send full line rate in both directions simultaneously 33

Announcements HW4 will just be a nongraded worksheet Review sessions (i.e., extended office hours) will be scheduled during class time of reading week –Will ask you to send in questions beforehand…. In response to several queries, sometime after the SDN lecture I will give a short (and very informal) talk on my experience with Nicira. 34

35 Wireless Review

History MACA proposal: basis for RTS/CTS in lecture –MACA proposed as replacement for carrier sense –Contention is at receiver, but CS detects sender! MACAW paper: extended and altered MACA –Implications of data ACKing –Introducing DS in exchange: RTS-CTS-DS-Data-ACK  Shut up when hear DS or CTS (DS sort of like carrier sense) –Other clever but unused extensions for fairness, etc. 802.11: uses carrier sense and RTS/CTS –RTS/CTS often turned off, just use carrier sense –When RTS/CTS turned on, shut up when hear either –RTS/CTS augments carrier sense 36

Why Isn’t MACA Enough? That’s what we now discuss, by repeating a few slides from previous lecture In what follows, we assume that basic reception is symmetric (if no interference): –If A is in range of B, then B is in range of A Any asymmetries in reception reflect interference from other senders 37

38 Hidden Terminals: Why MACA is Needed A, C can both send to B but can’t hear each other A is a hidden terminal for C and vice versa Carrier Sense by itself will be ineffective If A is already sending to B, C won’t know to defer… [Figure: A, B, C in a line; B lies within both A’s and C’s transmit range]

39 Exposed Terminals: Alleged Flaw in CS Exposed node: using carrier sense B sends a packet to A C hears this and decides not to send a packet to D Despite the fact that this will not cause interference! Carrier sense prevents successful transmission! [Figure: A, B, C, D in a line]

Key Points in MACA Rationale No concept of a global collision –Different receivers hear different signals –Different senders reach different receivers Collisions are at receiver, not sender –Only care if receiver can hear the sender clearly –It does not matter if sender can hear someone else –As long as that signal does not interfere with receiver Goal of protocol: –Detect if receiver can hear sender –Tell senders who might interfere with receiver to shut up 40

What Does This Analysis Ignore? Data should be ACKed! –In wireless settings, data can be easily lost –Need to give sender signal that data was received How does that change story? Connection is now two-way! –Collisions now matter at both sender and receiver –Carrier-sense is good for detecting the former –MACA is good for detecting the latter 41

42 Exposed Terminals Revisited Exposed node: with only MACA, both B and C send But when A or D send ACKs, they won’t be heard! Carrier-sense prevents B, C sending at same time [Figure: A, B, C, D in a line]

802.11 Overview (oversimplified) Uses carrier sense and RTS/CTS –RTS/CTS often turned off, just use carrier sense If RTS/CTS turned on, shut up when hear either –CTS –Or current transmission (carrier sense) What if hear only RTS, no CTS or transmission? –You can send (after waiting small period)  CTS probably wasn’t sent 43
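The defer rule on this slide condenses into a toy predicate (a simplification for intuition, not the real 802.11 NAV state machine): defer on hearing a CTS or an ongoing transmission; an RTS alone, with no CTS or data following, does not silence you.

```python
def may_send(heard_cts, carrier_busy):
    """True if the node may transmit (after a small wait).
    An overheard RTS with no CTS and no carrier does not defer us."""
    return not (heard_cts or carrier_busy)

print(may_send(heard_cts=True, carrier_busy=False))   # False: heard a CTS
print(may_send(heard_cts=False, carrier_busy=True))   # False: carrier sense fires
print(may_send(heard_cts=False, carrier_busy=False))  # True: RTS alone is OK
```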

What Will Be on the Final? General awareness of wireless (lecture, section) Reasoning about a given protocol –If we used the following algorithm, what would happen? You are not expected to know which algorithm to use; we will tell you explicitly: –RTS/CTS –Carrier Sense –Both 44

45 Back to Congestion Control

Quick Review of Advanced CC Full queues: RED Non-congestion losses: ECN High speeds: Alter constants: HSTCP Fairness: Need isolation Min. flow completion time: Need better adjustment Router-Assisted CC: Isolation: WFQ, AFD; Adjustment: RCP 46

47 Why is Scott a Moron? Or why does Bob Briscoe think so?

Giving equal shares to “flows” is silly What if you have 8 flows (to different destinations), and I have 4… –Why should you get twice the bandwidth? What if your flow goes over 4 congested hops, and mine only goes over 1? –Why not penalize you for using more scarce bandwidth? And what is a flow anyway? –TCP connection –Source-Destination pair? –Source? 48

flow rate fairness: dismantling a religion draft-briscoe-tsvarea-fair-01.pdf Bob Briscoe Chief Researcher, BT Group IETF-68 tsvwg Mar 2007 status: individual draft, final intent: informational, next: tsvwg WG item after (or at) next draft

Charge people for congestion! Use ECN as congestion markers Whenever I get ECN bit set, I have to pay $$$ No debate over what a flow is, or what fair is… –Just send your packets and pay your bill Can avoid charges by sending when no congestion Idea started by Frank Kelly, backed by much math –Great idea: simple, elegant, effective 50

Why isn’t this the obvious thing to do? Do you really want to sell links to highest bidder? –Will you ever get bandwidth when Bill Gates is sending? –He just sends as fast as he can, and pays the bill Can we embed economics at such a low level? –Charging mechanisms usually at higher levels Is this the main problem with congestion control? –Is this a problem worth solving? Bandwidth caps work… This approach is fundamentally right… –…but perhaps not practically relevant 51

52 Datacenter Networks

What makes them special? Huge scale: –100,000s of servers in one location Limited geographic scope: –High bandwidth (10Gbps) –Very low RTT Extreme latency requirements (especially the tail) –With real money on the line Single administrative domain –No need to follow standards, or play nice with others Often “green field” deployment –So can “start from scratch”… 53

This means…. Can design “from scratch” Can assume very low latencies Need to ensure very low queuing delays –Both the average, and the tail…. –TCP terrible at this, even with RED Can assume plentiful bandwidth internally As usual, flows can be elephants or mice –Most flows small, most bytes in large flows… 54

Deconstructing Datacenter Packet Transport Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker Stanford University U.C. Berkeley/ICSI HotNets

Transport in Datacenters Latency is King – Web app response time depends on completion of 100s of small RPCs – Tail of distribution is particularly important But, traffic also diverse – Mice AND Elephants – Often, elephants are the root cause of latency [Figure: large-scale Web application; one user request (“Who does she know? What has she done?”) fans out across many App Logic servers for Pics, Videos, Apps] HotNets

Transport in Datacenters Two fundamental requirements – High fabric utilization Good for all traffic, esp. the large flows – Low fabric latency (propagation + switching) Critical for latency-sensitive traffic Active area of research – DCTCP [SIGCOMM’10], D3 [SIGCOMM’11], HULL [NSDI’11], D2TCP [SIGCOMM’12], PDQ [SIGCOMM’12], DeTail [SIGCOMM’12] vastly improve performance, but very complex HotNets

Goal of pFabric Subtract complexity, not add to it Start from scratch and ask: – What components do we need for DC CC? – How little mechanism can we get away with? None of what we say here applies to general CC, it is limited to DC setting HotNets

pFabric in 1 Slide HotNets 2012 Packets carry a single priority # e.g., prio = remaining flow size pFabric Switches Very small buffers (e.g., 10-20KB) Send highest priority / drop lowest priority pkts – Give priority to packets with least remaining in flow pFabric Hosts Send/retransmit aggressively Minimal rate control: just prevent congestion collapse 59

DC Fabric: Just a Giant Switch! HotNets 2012 [Figure: hosts H1 … H9 attached to the fabric as if to a single big switch] 60

Objective?  Minimize avg FCT DC transport = Flow scheduling on giant switch ingress & egress capacity constraints [Figure: the fabric as one big switch, with TX hosts H1 … H9 on one side and RX hosts H1 … H9 on the other] HotNets 2012 64

Min. FCT with Simple Queue Just use shortest-job first….. But here we have many inputs and many outputs with capacity constraints at both HotNets

“Ideal” Flow Scheduling Problem is NP-hard  [Bar-Noy et al.] – Simple greedy algorithm: 2-approximation HotNets
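Under the giant-switch abstraction, the greedy algorithm amounts to shortest-job-first subject to per-port constraints: repeatedly start the smallest remaining flow whose ingress and egress ports are both free. A toy sketch (the flow tuples are hypothetical):

```python
# Greedy "big switch" flow scheduling: flows are (size, ingress, egress).
# Each ingress and each egress port can serve at most one flow at a time.

def greedy_schedule(flows):
    busy_in, busy_out, scheduled = set(), set(), []
    for size, ingress, egress in sorted(flows):  # shortest job first
        if ingress not in busy_in and egress not in busy_out:
            busy_in.add(ingress)
            busy_out.add(egress)
            scheduled.append((size, ingress, egress))
    return scheduled

flows = [(10, "H1", "H4"), (2, "H1", "H5"), (3, "H2", "H4"), (1, "H3", "H6")]
print(greedy_schedule(flows))
# [(1, 'H3', 'H6'), (2, 'H1', 'H5'), (3, 'H2', 'H4')]  (size-10 flow waits: H1, H4 busy)
```

This only shows one scheduling round; the 2-approximation result from [Bar-Noy et al.] concerns repeating this as flows complete.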

HotNets 2012 pFabric Design 67

pFabric Switch HotNets 2012 Switch Port: small “bag” of packets per-port, prio = remaining flow size  Priority Scheduling: send higher priority packets first  Priority Dropping: drop low priority packets first 68
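The port behavior above can be sketched as a tiny buffer that dequeues the highest-priority packet (smallest remaining flow size) and, when full, drops the lowest-priority one. The buffer capacity and packet names are illustrative:

```python
# Sketch of one pFabric switch port. Priority = remaining flow size,
# so smaller numbers are MORE urgent.

class PFabricPort:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buf = []  # list of (remaining_flow_size, payload)

    def enqueue(self, prio, payload):
        self.buf.append((prio, payload))
        if len(self.buf) > self.capacity:
            self.buf.remove(max(self.buf))  # drop lowest priority packet

    def dequeue(self):
        if not self.buf:
            return None
        pkt = min(self.buf)  # transmit highest priority packet
        self.buf.remove(pkt)
        return pkt

port = PFabricPort(capacity=2)
port.enqueue(10, "elephant")
port.enqueue(1, "mouse")
port.enqueue(5, "medium")   # buffer overflows: the size-10 packet is dropped
print(port.dequeue())  # (1, 'mouse')
print(port.dequeue())  # (5, 'medium')
```

A linear scan of a "bag" is fine here precisely because the buffer is so small; that is part of the pFabric argument for hardware feasibility.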

Near-Zero Buffers Easiest way to keep delays small? – Have small buffers! Buffers are very small (~1 BDP) – e.g., C=10Gbps, RTT=15µs → BDP = 18.75KB – Today’s switch buffers are 10-30x larger HotNets
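The BDP arithmetic on this slide, reproduced directly:

```python
# Bandwidth-delay product for the slide's example datacenter link.
C_bps = 10e9    # 10 Gbps link
RTT_s = 15e-6   # 15 microsecond RTT

bdp_bytes = C_bps * RTT_s / 8  # bits -> bytes
print(bdp_bytes / 1e3)  # 18.75 (KB), matching the slide
```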

pFabric Rate Control Priority scheduling & dropping in fabric also simplifies rate control – Queue backlog doesn’t matter – Priority packets get through One task: Prevent congestion collapse when elephants collide [Figure: hosts H1 … H9 all sending elephants into one port, 50% loss] HotNets 2012 70

pFabric Rate Control Minimal version of TCP 1.Start at line-rate Initial window larger than BDP 2.No retransmission timeout estimation Fix RTO near round-trip time 3.No fast retransmission on 3-dupacks Allow packet reordering HotNets

Why does this work? Key observation: Need the highest priority packet destined for a port available at the port at any given time. Priority scheduling  High priority packets traverse fabric as quickly as possible What about dropped packets?  Lowest priority → not needed till all other packets depart  Buffer larger than BDP → more than RTT to retransmit HotNets

Evaluation HotNets 2012 54 port fat-tree: 10Gbps links, RTT = ~12µs Realistic traffic workloads – Web search, Data mining * From Alizadeh et al. [SIGCOMM 2010] [Figure: flow size distribution; flows <100KB make up most flows but only 3% of bytes, while 5% of flows are >10MB and carry 35% of bytes] 73

Evaluation: Mice FCT (<100KB) HotNets 2012 [Figure: average and 99th percentile FCT across loads] Near-ideal: almost no jitter 74

Evaluation: Elephant FCT (>10MB) HotNets 2012 Congestion collapse at high load w/o rate control 75

Summary pFabric’s entire design: Near-ideal flow scheduling across DC fabric Switches – Locally schedule & drop based on priority Hosts – Aggressively send & retransmit – Minimal rate control to avoid congestion collapse HotNets