Lessons from a SIP Wireless Deployment Jonathan Rosenberg Chief Scientist.


1 Lessons from a SIP Wireless Deployment Jonathan Rosenberg Chief Scientist

2 SIP2003 Lessons 2 Background
  Market: 2.5G Wireless Network
  Initial Applications: Instant Messaging (IM) and Presence
  Subscriber Sizing: 500k Initially, Scaling up to Several Million
  PoPs: 12 Regional PoPs, 2 Centralized Data Centers
  Servers: Between 100 and 200 Separate Server Processes

3 Lessons Summary
  Data Distribution and Management Is Hard
  Network-wide Diagnostics Are Essential
  UDP Non-INVITEs and Failover Interactions

4 The Data Distribution Problem
  SIP Applications Depend on Many Pieces of Data
    Provisioned data
      Buddy lists
      White/black lists
      Call forwarding numbers
    Soft-state data
      Presence
      Registrations
  Many Parties Interested in Writing the Data
    Wireless handset updates its buddy list
    Web application updates a buddy list
    Customer care updates a buddy list
  Many Parties Interested in Reading the Data
    Wireless handset, to get its current buddy list
    Web application, to display the current buddy list
    Customer care, to tell a customer who is on their buddy list
    Presence server, to support subscriptions to the buddy list
  Many Parties Interested in Finding Out Changes to the Data
    Handset: for buddy list synchronization
    Presence server: to send a SUBSCRIBE request to a new participant
    Other applications

5 Requirements for Data Distribution
  Network Element Requirements for Data
    “Close” to the element, for performance reasons
    Replicated and consistent across all elements in a cluster within a PoP
    Replicated to other PoPs to provide PoP failover
    Soft-state data replicated to backup servers for failover support
  Operator Requirements for Data
    Data survives crashes of any or all network elements
    Data can be read/written by provisioning and customer support systems
    Data can be accessed by provisioning and customer support from a single access point, independent of network scale and size
    Data writes are validated before being propagated
    Data propagation to elements survives network faults (IP router goes down), element failures, etc.
    Distribution of provisioned data has minimal to no impact on element performance (i.e., a bulk load cannot take down a running system)
    Recovery from data distribution failures needs to be possible

6 Key Lessons
  The Requirements for “Closeness” and “Performance” Conflict with Consistency Requirements
    Ultimately, the data gets replicated across a potentially large number of elements
    Large-scale replication with transactional integrity is very costly in terms of performance
    Seek compromise data distribution methods that provide good performance with reduced consistency
  The Data Distribution Piece Is at Least As Hard As, If Not Harder Than, Getting the SIP Pieces Right
  Try to Solve This Problem Generally, Not Separately for Each Application

7 Network Wide Diagnostics
  Problem Statement
    Joe calls customer service. He says his phone doesn’t work. When asked what the problem was, he reports that his IM never reached his intended target. He sent it yesterday or perhaps the day before.
  The Challenge
    Find the element which failed and identify the specific problem in the deployed production network, without affecting performance of the network.

8 Why Is This Challenging?
  There Are a Multitude of Elements at the “SIP Layer”
    A variety of proxies
    A variety of databases
    A variety of gateways
  There Are a Multitude of Elements at Other Layers
    A variety of routers
    A variety of GGSNs (Gateway GPRS Support Nodes) / PDSNs (Packet Data Serving Nodes)
    A variety of base stations
    A variety of Ethernet switches
  Continuous Logging Is Not Possible
    Performance implications
  You Cannot Replicate the State of the Network When the Failure Occurred
    Too many users and other variables

9 What Is the Solution?
  Design for Diagnostics
  Stimulate Your System
  Engineer for Evolution
  Know Your Network

10 Design for Diagnostics
  Extensive “Triggered” Logging
    Look for conditions that may indicate an error
      SIP transaction timeout
      SIP request failure
      Database timeout
      Corrupted database data
    On those conditions, produce mass amounts of trace data
      Execution stacks
      Message contents
  May Need to Store Trace Data in Memory in Sliding Window
    Sometimes an error in one place caused an error in another
  Careful Draining of Trace Data
    Cannot affect runtime performance
  Centralized Repository for Trace Data
    Don’t want to have to go to each of the machines
    Push it to a single place with well-identified correlation identifiers
  Don’t Forget the Handset!
    The handset is part of the network
    It should generate trace data too upon failure!
  Related To, but Not the Same As, Fault Management
    This is something the network operations guys can’t fix
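
Editor's sketch (not from the slides): the sliding-window and triggered-logging ideas on this slide could look roughly like the following. The class name TraceBuffer, the capacity of 3, and the list standing in for the centralized repository are all assumptions made for illustration.

```python
from collections import deque


class TraceBuffer:
    """Hypothetical in-memory sliding window of trace records.

    Records are only pushed to the sink when an error condition triggers it.
    """

    def __init__(self, capacity, sink):
        # deque(maxlen=...) silently discards the oldest record when full.
        self._records = deque(maxlen=capacity)
        # 'sink' is any callable; here it stands in for the central repository.
        self._sink = sink

    def trace(self, correlation_id, detail):
        # Normal path: a cheap in-memory append, nothing leaves the process.
        self._records.append((correlation_id, detail))

    def trigger(self, correlation_id, condition):
        # Error path (e.g. a SIP transaction timeout): drain the whole window,
        # tagged with a correlation identifier, to the central repository.
        window = list(self._records)
        self._records.clear()
        self._sink({
            "condition": condition,
            "correlation_id": correlation_id,
            "records": window,
        })


repository = []  # stand-in for the centralized trace repository
buf = TraceBuffer(capacity=3, sink=repository.append)
for step in range(5):
    buf.trace("call-1", f"step {step}")
buf.trigger("call-1", "SIP transaction timeout")
```

Only the last three records survive in the window, so the repository receives steps 2 to 4 along with the triggering condition. A real element would drain asynchronously so that the flush itself does not affect runtime performance.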

11 Stimulate Your System
  The Best Problems Are the Ones You Find Before Your Customers Do!
  Look for Problems Through Active “Probing” of the Network
    A usage which triggers the logging of data about how it was processed in each element
    Usage must be a normal one
  SIP “Probe” Extensions
    Headers that ask proxies and user agents to generate tracing information about message handling
    May also designate a destination for sending the data
    Alternatively, attach it to the message
      What if it’s lost?
  Security Issues
    Must carefully authenticate the sender of a probed message
    Otherwise, a great source of DoS and other attacks
  Continuously Send Probes
    For each use case of your network
    For each PoP or site
    Vary the transmission times and contents wherever possible
  IETF Work Just Begun
    Develop requirements for such probes

12 Engineer for Evolution
  Once You Find the Bug and Prepare a Fix, What Then?
  Need to Upgrade the Affected Servers
    Cannot affect run time performance
    Must be easy to do (so you can do it often!)
    Must be easy to undo
  Solution: Automated Software Upgrade
  Basic Process
    Vendor sends operator a new version
    Operator types “install version” at the centralized management console
    Console determines which servers are affected
    For each server, gracefully terminates it one at a time
      Remotely installs upgrade
        Old one not removed
      Updates configuration files if needed
      Remotely verifies upgrade
      Restarts server, and goes to the next one
  Old Server Versions and Configurations Are Kept, Rollback Is Allowed
  Process Must Be Automated and Easy
    Model: Quicken
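
Editor's sketch (not from the slides): the basic process above, one server at a time with the old version kept for rollback, might be modeled like this. The Server class, the verify callback, and the server names are hypothetical; installation and termination are stubbed as field updates.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical model of one managed server process.
    name: str
    version: str
    running: bool = True
    previous_versions: list = field(default_factory=list)


def rolling_upgrade(servers, new_version, verify):
    """Upgrade affected servers one at a time.

    Returns (names upgraded, name of the server that failed or None).
    Stops at the first server whose upgrade does not verify.
    """
    upgraded = []
    for server in servers:
        if server.version == new_version:
            continue  # console decides this server is not affected
        server.running = False  # graceful termination (stubbed)
        server.previous_versions.append(server.version)  # old one not removed
        server.version = new_version  # remote install (stubbed)
        if not verify(server):
            # Rollback: restore the kept version and bring the server back.
            server.version = server.previous_versions.pop()
            server.running = True
            return upgraded, server.name
        server.running = True  # restart, then go to the next one
        upgraded.append(server.name)
    return upgraded, None


servers = [Server("proxy-1", "1.0"), Server("proxy-2", "1.0"), Server("presence-1", "1.1")]
# Assumed verification hook: pretend the upgrade fails to verify on proxy-2.
upgraded, failed = rolling_upgrade(servers, "1.1", verify=lambda s: s.name != "proxy-2")
```

In this run proxy-1 ends up on 1.1, proxy-2 is rolled back to 1.0 and restarted, and the process stops there so the operator can investigate before more servers are touched.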

13 Know Your Network
  Experience Is Ultimately the Only Way to Find Problems
  The People Who Design Elements Are Usually Not the Ones Who Have Experience Running Networks of Them
  Put Processes in Place to Feed Back Experience to the Developers and Architects

14 Non-INVITE UDP Failover
  Problem
    A SIP non-INVITE request is sent through a chain of proxies
    The final proxy has failed
    Upon transaction timeout, each of them generates a 408
      The “winning” 408 depends on relative timing
    Would like to mark the server as failed so it is not tried again
    How does each proxy know if the failure was its own next hop, or some other server downstream?
      Timeout can occur first anywhere in the chain
      Downstream 408s are discarded because transaction has timed out
  [Diagram: chain of proxies (P); labels: MSG, Timeout, 408]
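
Editor's sketch (not from the slides): a toy timing model of the problem above. It assumes every proxy runs the same non-INVITE transaction timer (32 seconds, the RFC 3261 default of 64*T1 with T1 = 500 ms), that each timer starts when the proxy forwards the request, and that a 408 takes one hop delay to travel upstream. The function name and the numbers are illustrative only.

```python
def first_event_per_proxy(send_times, hop_delay, timer=32.0):
    """For each proxy in the chain, report what it observes first:
    its own transaction timeout or a 408 arriving from downstream.

    send_times[i] is when proxy i forwarded the request; the element after
    the last proxy is the one that has failed and never answers.
    """
    n = len(send_times)
    events = [None] * n
    respond_at = [0.0] * n  # when proxy i sends a 408 upstream
    for i in reversed(range(n)):
        own_fire = send_times[i] + timer
        if i == n - 1:
            # Last live proxy: nothing downstream will ever answer.
            events[i], respond_at[i] = "own timeout", own_fire
            continue
        downstream_408 = respond_at[i + 1] + hop_delay
        if own_fire <= downstream_408:
            events[i], respond_at[i] = "own timeout", own_fire
        else:
            events[i], respond_at[i] = "downstream 408", downstream_408
    return events


# Three proxies, each forwarding 50 ms after the previous one.
events = first_event_per_proxy([0.0, 0.05, 0.10], hop_delay=0.05)
```

Under these assumptions every proxy sees its own timeout first, because upstream timers started earlier. Only the last proxy actually has a dead next hop, yet all three observe exactly the same thing, which is the ambiguity the slide describes. With unequal timers or clock jitter the first timeout can land elsewhere in the chain.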

15 Solutions
  Use TCP
    TCP will provide a hop-by-hop acknowledgement for the data
    If next hop fails, your TCP connection reports errors
  Bring Back 100 Responses for Non-INVITE
    Tells a proxy that the next hop got the request
    Means proxy was alive at the beginning of the transaction
    Next hop considered dead if no 100 is received
    Con: extra message traffic
  Extended Transactions
    Two transaction timeouts
      Currently defined one
      Longer one used to wait for 408 responses from downstream nodes
    If 408 is received before second timeout, but after first, failure is not the next hop
    If no 408 is received before second timeout, downstream element has failed
    Con: additional memory requirements for holding on to state of the transaction
  Conclusion: Needs to Be Worked in IETF
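
Editor's sketch (not from the slides): the decision rule of the extended-transaction proposal, written as a small function. The timeout values of 32 and 40 seconds are placeholders, "downstream element has failed" is read here as "the next hop failed", and the first branch (a 408 arriving before the ordinary timeout) is an added assumption that the slide does not cover.

```python
def classify_timeout(first_timeout, second_timeout, t_408=None):
    """Classify a failed non-INVITE transaction from one proxy's point of view.

    first_timeout:  the currently defined transaction timeout
    second_timeout: the longer timeout used to wait for downstream 408s
    t_408:          arrival time of a 408 from downstream, or None if none came
    """
    if t_408 is not None and t_408 <= first_timeout:
        # Assumption: an in-time 408 is handled as an ordinary response.
        return "408 within the normal transaction"
    if t_408 is not None and t_408 <= second_timeout:
        # The next hop was alive enough to time out and report it.
        return "failure is beyond the next hop"
    # Nothing arrived before the second timeout.
    return "next hop failed"


late_408 = classify_timeout(32.0, 40.0, t_408=35.0)
no_408 = classify_timeout(32.0, 40.0)
```

Here late_408 is "failure is beyond the next hop" and no_408 is "next hop failed". The cost named on the slide shows up as the extra time (second_timeout minus first_timeout) that transaction state must be held.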

16 Summary
  Building a Large Scale Distributed SIP Network Is Hard
  Many of the Problems Are Not Specific to SIP, and Show up in Any Similar System
    IP networks
    Email networks
  Key General Lessons
    Data distribution is hard
    Worry about diagnostics
  SIP Lesson
    Non-INVITE failover problem

17 Information Resource Jonathan Rosenberg Chief Scientist +1 973.952.5000 jdrosen@dynamicsoft.com

