Improving Internet Availability
Availability of Other Services Carrier Airlines (2002 FAA Fact Book) –41 accidents, 6.7M departures –99.9993% availability 911 Phone service (1993 NRIC report +) –29 minutes per year per line –99.994% availability Std. Phone service (various sources) –53+ minutes per line per year –99.99+% availability Credit: David Andersen job talk
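The airline and 911 figures above can be reproduced with simple ratios. The sketch below assumes a 365-day year and treats each departure as one service attempt; both are readings of the slide, not something it states, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability_from_events(failures: int, attempts: int) -> float:
    """Fraction of attempts that did not fail."""
    return 1.0 - failures / attempts


def availability_from_downtime(minutes_down_per_year: float) -> float:
    """Fraction of the year a service is up, given its yearly downtime."""
    return 1.0 - minutes_down_per_year / MINUTES_PER_YEAR


# Inputs taken from the slide: 41 accidents in 6.7M departures, 29 minutes/year.
airline = availability_from_events(41, 6_700_000)
phone_911 = availability_from_downtime(29)
print(f"airlines: {airline:.5%}")
print(f"911:      {phone_911:.4%}")
```

Both results agree with the slide's percentages to the digits shown there.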
Internet Availability Various studies (Paxson, Andersen, etc.) show the Internet is at about 2.5 nines More critical (or at least availability-centric) applications on the Internet At the same time, the Internet is getting more difficult to debug –Increasing scale, complexity, disconnection, etc. Is it possible to get to 5 nines of availability? If so, how? What role should the network play?
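To make the gap between 2.5 nines and 5 nines concrete, here is a small conversion sketch. It assumes the usual reading that "n nines" means availability 1 - 10^(-n) and a 365-day year; the function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored (assumption)


def availability_from_nines(nines: float) -> float:
    """Availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def nines_from_availability(availability: float) -> float:
    """Inverse of availability_from_nines."""
    return -math.log10(1.0 - availability)


def downtime_minutes_per_year(nines: float) -> float:
    """Expected unavailable minutes per year at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


for n in (2.5, 5.0):
    print(f"{n} nines: {availability_from_nines(n):.4%} available, "
          f"{downtime_minutes_per_year(n):.1f} minutes down per year")
```

Under these assumptions 2.5 nines corresponds to roughly 27.7 hours of downtime per year, while 5 nines allows a little over 5 minutes.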
Inherent Availability vs. Reactive Diagnosis What happens when a failure occurs? (At least) three options –Nothing –Automatic masking/recovery –Diagnosis + Semi-manual intervention (Augustin, Renata) When is automatic recovery appropriate? What features for diagnosis should the network provide?
(How) should the network provide inherent availability? Idea: compute backup in advance –No dynamic routing, just dynamic forwarding –End systems (routers, hosts, proxies) detect failures and send hints to deflect packets –Kind of like fast reroute…but a bit more extreme Various proposals in this space –Multi-router configurations, e.g.
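One way to picture "dynamic forwarding without dynamic routing" is a forwarding table whose backups are computed in advance, with failure hints only changing which precomputed entry is used. This is a minimal sketch under that assumption; the `Node` class and its methods are hypothetical and not taken from any specific proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    name: str
    # destination -> next hops in preference order: primary first, then
    # backups that were computed ahead of time (never recomputed here).
    next_hops: Dict[str, List[str]] = field(default_factory=dict)
    # Neighbors currently believed unreachable, learned from failure hints.
    down: Set[str] = field(default_factory=set)

    def hint_failure(self, neighbor: str) -> None:
        self.down.add(neighbor)

    def hint_recovery(self, neighbor: str) -> None:
        self.down.discard(neighbor)

    def forward(self, dst: str) -> Optional[str]:
        """Return the first precomputed next hop that is not marked down."""
        for hop in self.next_hops.get(dst, []):
            if hop not in self.down:
                return hop
        return None  # no precomputed alternative is left


node = Node("r1", {"d": ["r2", "r3"]})
print(node.forward("d"))   # primary next hop
node.hint_failure("r2")
print(node.forward("d"))   # deflected to the precomputed backup
```

The routing computation never reruns after a failure; only the choice among stored entries changes, which is what keeps recovery fast.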
Path Splicing: Main Idea Step 1 (Perturbations): Run multiple instances of the routing protocol, each with a slightly perturbed version of the configuration Step 2 (Parallelization): Allow traffic to switch between instances at any node in the protocol Compute multiple forwarding trees per destination. Allow packets to switch slices midstream.
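The two steps can be sketched as follows. This is an illustration, not the authors' implementation: it assumes symmetric link weights, perturbs each weight by a random factor, and lets a per-hop list of slice choices stand in for whatever the packet header would carry. All function names and the `spread` parameter are hypothetical.

```python
import heapq
import random


def next_hops_toward(graph, dst):
    """Dijkstra rooted at dst; returns node -> next hop toward dst.

    graph maps node -> {neighbor: weight} and must be symmetric.
    """
    dist = {dst: 0.0}
    next_hop = {}
    heap = [(0.0, dst)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                next_hop[v] = u  # v reaches dst through u
                heapq.heappush(heap, (nd, v))
    return next_hop


def perturbed(graph, rng, spread=0.5):
    """Copy of graph with each link weight scaled by a factor in [1, 1 + spread]."""
    out = {u: {} for u in graph}
    for u in graph:
        for v, w in graph[u].items():
            if v not in out[u]:  # perturb each undirected link once
                w2 = w * (1.0 + spread * rng.random())
                out[u][v] = w2
                out[v][u] = w2
    return out


def build_slices(graph, dst, k, seed=0):
    """Step 1: k forwarding trees for dst; slice 0 uses the original weights."""
    rng = random.Random(seed)
    slices = [next_hops_toward(graph, dst)]
    for _ in range(k - 1):
        slices.append(next_hops_toward(perturbed(graph, rng), dst))
    return slices


def forward(slices, src, dst, slice_choices, failed=frozenset(), max_hops=32):
    """Step 2: walk src -> dst, picking a slice per hop (choices reused cyclically).

    If the chosen slice's link is in `failed`, the node tries its other slices.
    Returns the node path, or None if stuck or max_hops is exceeded.
    """
    path, node, k = [src], src, len(slices)
    for hop in range(max_hops):
        if node == dst:
            return path
        start = slice_choices[hop % len(slice_choices)] % k
        for offset in range(k):
            nxt = slices[(start + offset) % k].get(node)
            if nxt is not None and frozenset((node, nxt)) not in failed:
                break
        else:
            return None  # every slice points at a failed link
        path.append(nxt)
        node = nxt
    return path if node == dst else None


graph = {
    "s": {"a": 1.0, "b": 1.0},
    "a": {"s": 1.0, "t": 1.0},
    "b": {"s": 1.0, "t": 1.2},
    "t": {"a": 1.0, "b": 1.2},
}
slices = build_slices(graph, "t", k=3, seed=1)
print(forward(slices, "s", "t", slice_choices=[0]))
```

Because switching slices midstream can create loops, the sketch bounds the walk with `max_hops`; a real design would need its own loop-avoidance argument.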
Availability: Paths vs. Content What definitions of availability are appropriate? –Downtime Fraction of time that path exists between endpoints Fraction of time that endpoints can communicate on any path –Transfer time How long must I wait to get content? (Perhaps this makes more sense in delay-tolerant networks, bittorrent-style protocols, etc.) Some applications depend more on availability of content, rather than uptime/availability of any particular Internet path or host
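The two downtime definitions can give different numbers for the same measurements. The sketch below assumes availability is estimated from periodic samples, each recording which candidate paths were up; the sample data and path names are invented for illustration.

```python
def path_uptime(samples, path):
    """Fraction of samples in which one specific path was up."""
    return sum(1 for up in samples if up.get(path, False)) / len(samples)


def any_path_uptime(samples):
    """Fraction of samples in which the endpoints could communicate on any path."""
    return sum(1 for up in samples if any(up.values())) / len(samples)


# Hypothetical probe results: one dict per measurement interval.
samples = [
    {"direct": True, "via_relay": True},
    {"direct": False, "via_relay": True},
    {"direct": False, "via_relay": False},
    {"direct": True, "via_relay": False},
]
print(path_uptime(samples, "direct"))
print(any_path_uptime(samples))
```

The any-path figure can never be lower than the figure for a single path, which is why the choice of definition matters when comparing studies.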
Diagnosis User or operator takes over when the network doesn't fix things automatically Diagnosis will never be fully automatic –Task: put functions in place to make network (mal)functions as intuitive as possible –Make the operators (or users) more efficient…
(How) should the network support diagnosis? More network support means potentially more information to users and operators –…potentially at the cost of performance –Forwarding performance, filters, or measurement/monitoring? What functions should the router (or other on-path elements) provide?
Data-Plane Accountability Problem: Network elements drop packets, fail, and otherwise give rise to poor performance One Solution: In-Band Path Diagnosis Routers keep track of number of packets seen per flow Each router stamps each packet with current flow counter value If current counter value does not equal router's expected packet count for that flow, router marks packet High-level Overview (figure): IP Header, New Shim Header, Transport header
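A toy simulation can show how per-flow counters localize a drop. The details here are assumptions beyond the slide: each router remembers the last upstream stamp it saw, expects the next stamp to be one higher, marks the packet on a mismatch, and then overwrites the stamp with its own count. Packets are assumed to arrive in order, and the `Packet`, `Router`, and `send` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Packet:
    flow: str
    stamp: Optional[int] = None  # counter written by the previous hop (shim header)
    marked_by: List[str] = field(default_factory=list)


@dataclass
class Router:
    name: str
    seen: Dict[str, int] = field(default_factory=dict)        # flow -> packets seen here
    last_stamp: Dict[str, int] = field(default_factory=dict)  # flow -> last upstream stamp

    def process(self, pkt: Packet) -> Packet:
        if pkt.stamp is not None:
            expected = self.last_stamp.get(pkt.flow, 0) + 1
            if pkt.stamp != expected:
                # A gap means something was lost between the previous hop and here.
                pkt.marked_by.append(self.name)
            self.last_stamp[pkt.flow] = pkt.stamp
        self.seen[pkt.flow] = self.seen.get(pkt.flow, 0) + 1
        pkt.stamp = self.seen[pkt.flow]  # restamp with this router's own count
        return pkt


def send(path, flow, count, drops):
    """Send `count` packets along `path`.

    drops holds (packet_number, hop_index) pairs: that packet is lost just
    before reaching path[hop_index]. Returns (packet_number, marked_by) pairs
    for delivered packets.
    """
    delivered = []
    for number in range(1, count + 1):
        pkt = Packet(flow)
        lost = False
        for hop, router in enumerate(path):
            if (number, hop) in drops:
                lost = True
                break
            router.process(pkt)
        if not lost:
            delivered.append((number, list(pkt.marked_by)))
    return delivered


routers = [Router("r1"), Router("r2"), Router("r3")]
for number, marks in send(routers, "flow-a", 5, drops={(3, 1)}):
    print(number, marks)
```

Because each router restamps with its own count, only the first router downstream of the loss sees a gap, so the mark points at the link where the packet disappeared. Reordering would also trigger marks in this simplified version.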
Scalability vs. Reactivity Various ways to get more data –More frequent monitoring –More data types –More vantage points Advantages –More paths, links, services, etc. –Potentially faster reaction But…data reduction is key –Operators/users are not at a loss for data about the network. They need ways to process it. –More monitoring data means more overhead (storage, bandwidth, etc.)
Active vs. Passive Monitoring Active monitoring can provide more direct indicators of path quality, service availability, etc. –But…can't monitor all possible paths What combination of active and passive monitoring is appropriate?
What role should end systems/cooperation play? Various previous work in peer-to-peer troubleshooting –Tomography –NetProfiler / CoopNet (Padmanabhan) –Cooperative troubleshooting (Wang) –Sharing IDS logs In what contexts do these make sense? –Internet –Wireless settings
Problem: Insecurity Can't trust the control plane –BGP: Route hijacks (intentional and unintentional) –DNS: Insecure name resolution Can't trust the data plane –No guarantee for where packets will go No accountability or auditing capabilities No strong forms of identity
Security: To-Do Data plane security –No assurances about where traffic will actually go –Monitoring/stemming unwanted traffic is hard Control plane security –Defense against route hijacks, etc. Accountability (spoofing prevention, auditing, etc.) –For data-plane performance –For unwanted traffic
Problem: Manageability Too easy to misconfigure the network Correct operation depends on correct configuration –Can future networks be intrinsically robust?
Management: To-Do Automated provisioning Configuration, management, and maintenance at a higher layer of abstraction Fast, distributed fault detection Where possible, eliminate knobs without eliminating flexibility
Problem: Scale Increasing number of users, end hosts, etc. Network connectivity has become a commodity –At the same time, the network is becoming more difficult to manage –Network providers must keep adding customers –Cost of bandwidth, equipment is plummeting –Management costs begin to dominate
Scale: To-Do Scalable addressing that permits multihoming –Traffic engineering, fast updates, etc. –Related topic: mobility Scalable mechanisms for path diversity (path selection, etc.)
Designing for Selfishness: Goals Providers, producers and consumers must benefit from participating –Without eyeballs, content has no value –Without content, the eyeballs will bail out –Without a network, eyeballs can't meet content –Without content or eyeballs, no need for a network
Internet Wish-List Availability Accountability Mobility Manageability/Intrinsic Correctness Support for monitoring Assurances about traffic
What Has Worked? Packet switching Layering Congestion control
What Might We Revisit? Single-path routing Monitoring support –Better traffic sampling algorithms to cope with evolving requirements (it's no longer just about billing) Naming –Poor support for mobility –Poor support for naming content Addressing –Very poor correspondence to identity Business models/selfishness
Possible Outcome: Many Internets Run many different networks simultaneously on the same infrastructure –No clear distinction between architecture and services –Develop specialized architectures for specialized applications Application- or topology-specific routing protocols Virtualization of physical resources as a tool for building new networks –Virtual link establishment and virtual routers –Substrate for deploying overlays is the new waist –This substrate is the new Internet