Managing Services and Networks Using a Peer-to-peer Approach Henning Schulzrinne (with Vishal Singh and other IRT members) Dept. of Computer Science Columbia.

Presentation transcript:

Managing Services and Networks Using a Peer-to-peer Approach Henning Schulzrinne (with Vishal Singh and other IRT members) Dept. of Computer Science Columbia University July 2007

Overview The transition in IT cost metrics End-to-end application-visible reliability still poor (~ 99.5%) –even though network elements have gotten much more reliable –particular impact on interactive applications (e.g., VoIP) –transient problems Lots of voodoo network management Existing network management doesn’t work for VoIP and other modern applications Need user-centric rather than operator-centric management Proposal: peer-to-peer management –“Do You See What I See?” Using VoIP as running example -- most complex consumer application –but also applies to IPTV and other services Also use for reliability estimation and statistical fault characterization

Network management: transition in cost balance Total cost of ownership –Ethernet port cost ≈ $10 –about 80% of Columbia CS’s system support cost is staff cost: about $2500/person/year ≈ 2 new PCs/year; much of the rest is backup & license for spam filters Does not count hours of employee or son/daughter time PC, Ethernet port and router costs seem to have reached a plateau –just that the $10 now buys a 100 Mb/s port instead of 10 Mb/s All of our switches, routers and hosts are SNMP-enabled, but there is no suggestion that this would help at all

VoIP user experience Only % call attempt success –“Keynote was able to complete VoIP calls 96.9% of the time, compared with 99.9% for calls made over the public network. Voice quality for VoIP calls on average was rated at 3.5 out of 5, compared with 3.9 for public-network calls and 3.6 for cellular phone calls. And the amount of delay the audio signals experienced was 295 milliseconds for VoIP calls, compared with 139 milliseconds for public-network calls.” (InformationWeek, July 11, 2005) Mid-call disruptions common Lots of knobs to turn –Separate problem: manual configuration

Issues in automated VoIP diagnosis Increasingly complex and diverse network elements Complex interactions & relationships between different network elements Different run time bindings for each application usage instance –e.g., different calls may use different DNS, SIP proxy servers, media path Problem in one network element may manifest itself as user perceived failure of another element

Circle of blame OS VSP app vendor ISP “must be a Windows registry problem → re-install Windows” “probably packet loss in your Internet connection → reboot your DSL modem” “must be your software → upgrade” “probably a gateway fault → choose us as provider”

Diagnostic undecidability symptom: “cannot reach server” more precise: send packet, but no response causes: –NAT problem (return packet dropped)? –firewall problem? –path to server broken? –outdated server information (moved)? –server dead? 5 causes → very different remedies –no good way for non-technical user to tell Whom do you call?

VoIP diagnosis What is automated VoIP diagnosis? –Determining failures in network: when, where –Automatically finding the root cause of the failure: why Why VoIP diagnosis? –networks are complex --> difficult to troubleshoot problems –automatic fault diagnosis reduces human intervention Issues in VoIP diagnosis –Detecting failures/faults –Finding the cause of failure, determining dependency relationships among different components for diagnosis Solution steps and approaches

Traditional network management model SNMP X “management from the center”

Old assumptions, now wrong Single provider (enterprise, carrier) –has access to most path elements –professionally managed Problems are hard failures & elements operate correctly –element failures (“link dead”) –substantial packet loss Mostly L2 and L3 elements –switches, routers –rarely APs Problems are specific to a protocol –“IP is not working” Indirect detection –MIB variable vs. actual protocol performance End systems don’t need management –DMI & SNMP never succeeded –each application does its own updates

Management element inspection configuration fault location network understanding we’ve only succeeded here what causes the most trouble?

Managing the protocol stack RTP UDP/TCP IP SIP no route packet loss TCP neg. failure NAT time-out firewall policy protocol problem playout errors media echo gain problems VAD action protocol problem authorization asymmetric conn (NAT)

Types of failures Hard failures –connection attempt fails –no media connection –NAT time-out Soft failures (degradation) –packet loss (bursts) access network? backbone? remote access? –delay (bursts) OS? access networks? –acoustic problems (microphone gain, echo)

Examples of additional problems ping and traceroute no longer work reliably –WinXP SP 2 turns off ICMP –some networks filter all ICMP messages Early NAT binding time-out –initial packet exchange succeeds, but then TCP binding is removed (“web-only Internet”) Policy intent vs. failure –“broken by design” –“we don’t allow port 25” vs. “SMTP server temporarily unreachable”

Fault localization Fault classification – local vs. global –Does it affect only me or does it affect others also? Global failures –Server failure e.g. SIP proxy, DNS failure, database failures –Network failures Local failures –Specific source failure, e.g., node A cannot make call to anyone –Specific destination or participant failure, e.g., no one can make call to node B –Locally observed but global failures, e.g., DNS service failed, but only B observed it
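The local-versus-global question above ("does it affect only me or others also?") can be sketched as a small classifier. This is an illustrative sketch only, not the authors' implementation: the function name `classify_fault`, the category labels, and the report format (peer name mapped to whether that peer also observed the failure) are all assumptions.

```python
from typing import Mapping


def classify_fault(i_failed: bool, peer_failed: Mapping[str, bool]) -> str:
    """Classify a failure as local or global from peers' observations.

    peer_failed maps a (hypothetical) peer name to True if that peer
    also observed the same failure.
    """
    if not i_failed:
        return "no-failure"
    if not peer_failed:
        return "unknown"      # no peers answered, nothing to compare with
    failures = sum(peer_failed.values())
    if failures == 0:
        return "local"        # only this node is affected
    if failures == len(peer_failed):
        return "global"       # every peer asked sees the failure too
    return "partial"          # mixed views: more tests are needed
```

A "partial" result corresponds to cases such as a specific-destination failure, where further probes would be needed to narrow down the cause.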

Proposal: “Do You See What I See?” Each node has a set of active and passive measurement tools Use intercept (NDIS, pcap) –to detect problems automatically e.g., no response to HTTP or DNS request –gather performance statistics (packet jitter) –capture RTCP and similar measurement packets Nodes can ask others for their view –possibly also dedicated “weather stations” Iterative process, leading to: –user indication of cause of failure –in some cases, work-around (application-layer routing) → TURN server, use remote DNS servers Nodes collect statistical information on failures and their likely causes
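As one example of the passive statistics mentioned above, packet jitter can be tracked with the running interarrival-jitter estimate that RTP/RTCP defines in RFC 3550. The sketch below assumes send and receive timestamps are already available in the same units; the function name `interarrival_jitter` is an assumption, and the slides do not say which jitter measure the authors use.

```python
from typing import Sequence


def interarrival_jitter(send_times: Sequence[float],
                        recv_times: Sequence[float]) -> float:
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    For consecutive packets, D is the change in transit time, and the
    estimate is smoothed as J = J + (|D| - J) / 16.
    """
    jitter = 0.0
    for i in range(1, min(len(send_times), len(recv_times))):
        d = ((recv_times[i] - recv_times[i - 1])
             - (send_times[i] - send_times[i - 1]))
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A constant network delay yields zero jitter; only changes in transit time between consecutive packets contribute.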

Architecture “not working” (notification) inspect protocol requests (DNS, HTTP, RTCP, …) “DNS failure for 15m” orchestrate tests contact others ping can buddy reach our resolver? notify admin ( , IM, SIP events, …) request diagnostics

Solution approach Store context information of past failures experienced by each node –E.g., specific server that was acting as the proxy server (for my call which failed) Store location of past failures instances –LAN, domain, subnet –First hop at each layer e.g., switch (MAC), default gateway (IP), domain’s proxy (application layer), Failure count for each network element (statistical) Last failure timestamp for each network element Last known-good timestamp for each network element –why do I need to test the proxy for you, my call just went through Temporal correlation of past failures –“proxy seems to be failing after DNS fails” Each node has a runtime dependency list based on past failures and diagnostic tests
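The per-element history described above (failure count, last failure timestamp, last known-good timestamp) can be sketched as a small data structure. All names here (`ElementHistory`, `FailureHistory`, `recently_good`) are hypothetical, and the freshness rule is an assumption used to illustrate the "why test the proxy, my call just went through" shortcut.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElementHistory:
    failure_count: int = 0
    last_failure: Optional[float] = None
    last_known_good: Optional[float] = None


class FailureHistory:
    """Per-node record of what was seen for each network element."""

    def __init__(self) -> None:
        self._elements: Dict[str, ElementHistory] = {}

    def record(self, element: str, ok: bool, timestamp: float) -> None:
        h = self._elements.setdefault(element, ElementHistory())
        if ok:
            h.last_known_good = timestamp
        else:
            h.failure_count += 1
            h.last_failure = timestamp

    def failure_count(self, element: str) -> int:
        h = self._elements.get(element)
        return h.failure_count if h else 0

    def recently_good(self, element: str, now: float, max_age: float) -> bool:
        """True if the element worked recently and has not failed since."""
        h = self._elements.get(element)
        if h is None or h.last_known_good is None:
            return False
        if h.last_failure is not None and h.last_failure > h.last_known_good:
            return False
        return now - h.last_known_good <= max_age
```

A peer asked to test an element could answer from `recently_good` without running a new probe when its own traffic has just succeeded.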

Solution architecture DNS Server P2P Service Provider 1 Service Provider 2 P1 P2 P3 Domain A P5 P4 P6 P7 P8 DNS Test PESQ Test SIP Server SIP Test Call Failed at P1 Nodes in different domains cooperating to determine cause of failure

Solution architecture: logical view Dependencies encoded as decision tree, static and dynamic rules Admin input [dependency relationships and tests (XML)] Triggers to perform tests (peer selection and probe selection) Alerts Dependency graph generation [Bayesian network based, inference, other models] Failures in network Decision tree updates Test results The above figure shows logical entities and separation of dependency graph generation and distributed diagnostic infrastructure (enclosed in blue).

Solution requirements Request-response protocol between the node which experiences the failure and the peer nodes Nodes need to perform diagnostic tests (probes), probe selection based on cost/result Encoding the dependency relationship into a decision tree –from an expert e.g., as XML Peer node discovery, based on –Location (local network, domain) –Capability to perform tests (based on specific tests) Dependency graph generation and updating, based on –Network failure events –Diagnostic test results correlated with failures
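The slides say that expert-supplied dependencies could be encoded as XML but do not give a schema. The sketch below therefore invents one purely for illustration: the element and attribute names (`dependencies`, `service`, `depends-on`, `name`, `test`) and the test identifiers are all assumptions. It uses the standard library's `xml.etree.ElementTree` to load the rules into a lookup table.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file; the schema is an assumption, not from the slides.
RULES_XML = """
<dependencies>
  <service name="sip-call">
    <depends-on name="sip-proxy" test="sip-options"/>
    <depends-on name="dns" test="dns-lookup"/>
    <depends-on name="connectivity" test="ping"/>
  </service>
  <service name="sip-proxy">
    <depends-on name="dns" test="dns-lookup"/>
  </service>
</dependencies>
"""


def load_dependencies(xml_text):
    """Return {service: [(dependency, test-name), ...]} in document order."""
    root = ET.fromstring(xml_text)
    table = {}
    for service in root.findall("service"):
        table[service.get("name")] = [
            (dep.get("name"), dep.get("test"))
            for dep in service.findall("depends-on")
        ]
    return table
```

Keeping the order of `depends-on` entries lets the expert express which dependency should be tested first.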

Failure detection tools STUN server –what is your IP address? ping and traceroute Transport-level liveness and QoS –open TCP connection to port –send UDP ping to port –measure packet loss & jitter TBD: Need scriptable tools with dependency graph –initially, we’ll be using ‘make’ TBD: remote diagnostic –fixed set (“do DNS lookup”) or –applets (only remote access) media RTP UDP/TCP IP
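The "open TCP connection to port" liveness check listed above can be sketched with Python's standard `socket` module. The function name `tcp_alive` and the default timeout are assumptions; a real tool would also need to distinguish a refused connection from a silent drop (for example by a firewall), which this sketch collapses into a single False.

```python
import socket


def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host, connects, and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, `tcp_alive("sip.example.com", 5060)` would check a SIP-over-TCP port; the host name here is a placeholder.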

Test & probe selection Which diagnostic probe to run? Network layer or application layer, and for what kind of failures? A probe covering a broad range of failures can give faster but less accurate results, e.g., ping vs. TCP connect vs. SIP OPTIONS tests Cost of probing –messages –CPU overhead
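One way to trade probe cost against coverage is a greedy heuristic that repeatedly picks the probe covering the most still-unexplained suspects per unit cost. This is a swapped-in textbook technique (greedy weighted set cover), not a method stated in the slides; the `Probe` type, the cost numbers, and the failure labels in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Probe:
    name: str
    cost: float                 # assumed > 0 (messages, CPU, ...)
    covers: FrozenSet[str]      # failure types this probe can detect


def select_probes(probes: Iterable[Probe], suspects: Iterable[str]) -> List[Probe]:
    """Greedily order probes by suspects covered per unit cost."""
    remaining: Set[str] = set(suspects)
    candidates = list(probes)
    chosen: List[Probe] = []
    while remaining and candidates:
        best = max(candidates, key=lambda p: len(p.covers & remaining) / p.cost)
        if not (best.covers & remaining):
            break               # no probe explains what is left
        chosen.append(best)
        remaining -= best.covers
        candidates.remove(best)
    return chosen
```

With a cheap ping covering routing and host failures, a TCP connect also covering closed ports, and a costly SIP OPTIONS probe also covering application faults, the heuristic runs ping first and SIP OPTIONS last, unless only an application fault is suspected.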

Dependency classification Functional dependency –At generic service level e.g., SIP proxy depends on DB service, DNS service Structural dependency –Configuration time e.g., Columbia CS SIP proxy is configured to use mysql database on host metro-north Operational dependency –Runtime dependencies or run time bindings e.g., the call which failed was using failover SIP server obtained from DNS which was running on host a.b.c.d in IRT lab

Dependency classifications: layered approach Vertical and lateral dependencies –application depends on other application layer services e.g., SIP service depends on DB, DNS service as well as lower layer services –OSI layers as service dependency layers Application layer service also depends on transport layer service which in turn depends on network layer service –MAC layer: access point, switch –Network layer: router –Application layer: DNS, SIP, database Topology-based dependency –e.g., calls from CS domain depend on specific SIP server –calls from lab phones depend on specific switches and routers

Dependency Graph

Dependency Graph Encoded to Decision Tree A C B D A Failed, Use Decision Tree Yes Invokes Decision Tree for C No Yes Invokes Decision Tree for B Invokes Decision Tree for D Cause Not Known Report, Add new Dependency A B C D A = SIP Call C = SIP Proxy B = DNS Server D = Connectivity
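The idea of invoking a dependency's decision tree when that dependency fails can be sketched as a short recursive walk. The element names follow the legend above (A = SIP call, C = SIP proxy, B = DNS server, D = connectivity), but the edges below the SIP call, the test order, and the function name `diagnose` are assumptions; the per-element test is passed in as a callable.

```python
from typing import Callable, Dict, List

# Hypothetical dependency lists; only "sip-call depends on proxy, DNS and
# connectivity" is suggested by the slide, the other edges are assumed.
DEPENDS_ON: Dict[str, List[str]] = {
    "sip-call": ["sip-proxy", "dns", "connectivity"],
    "sip-proxy": ["dns", "connectivity"],
    "dns": ["connectivity"],
    "connectivity": [],
}


def diagnose(failed: str,
             depends_on: Dict[str, List[str]],
             test: Callable[[str], bool]) -> str:
    """Follow failing dependencies down to the deepest failing element.

    test(element) returns True if the element passes its diagnostic test.
    If every dependency passes, the failed element itself is returned,
    meaning the cause is not among its known dependencies.
    """
    for dep in depends_on.get(failed, []):
        if not test(dep):
            return diagnose(dep, depends_on, test)
    return failed
```

When the walk returns the original element (here the SIP call), that corresponds to the "cause not known" branch, where the failure would be reported and a new dependency added.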

Diagnostic tests - a bit more detail SIP proxy –Proxy server availability SIP PING –Call routing availability INVITE tests –Call path determination SIP TraceRoute Media path –Quality related Speech quality degradation - MOS Echo jitter- MOS, PESQ QoS – RTCP –NAT/Firewall Checking binding expiration. Firewall failure to open a port - One way media. –which firewall in the path? SIP signaling ?

VoIP example Call failure – possible causes –SIP proxy server database authentication –Media path failure Gateway –Specific call legs – ERL, authentication, etc. –DNS server failure –End station failure –Network failure, e.g., router, switch failure Different calls will have different run time dependencies

Diagnostic tests, cont’d DNS tests DHCP Switch/router –ARP/RARP/multicast –BGP failures Conference mixers Gateway –Echo return loss- readings- analysis DB XCAP server tests Presence service availability tests

Current work Building decision tree system Using JBoss Rules (Drools 3.0)

Future work Learning the dependency graph from failure events and diagnostic tests Learning using random or periodic testing to identify failures and determine relationships Self healing Predicting failures Protocols for labeling event failures --> enable automatically incorporating new devices/applications to the dependency system Decision tree (dependency graph) based event correlation

Failure statistics Which parts of the network are most likely to fail (or degrade) –access network –network interconnects –backbone network –infrastructure servers (DHCP, DNS) –application servers (SIP, RTSP, HTTP, …) –protocol failures/incompatibility Currently, mostly guesses End nodes can gather and accumulate statistics

Conclusion Hypothesis: network reliability as single largest open technical issue → prevents (some) new applications Existing management tools of limited use to most enterprises and end users Transition to “self-service” networks –support non-technical users, not just NOCs running HP OpenView or Tivoli Need better view of network reliability