Self Healing Wide Area Network Services
Bhavjit S Walha, Ganesh Venkatesh

Layout
- Introduction
- Previous Work
- Issues
- Solution
- Preliminary results
- Problems & Future Extensions
- Conclusion

Motivation
- Companies may have servers distributed over a wide area network
  - Akamai Content Distribution Network
  - Distributed web servers
- Manual monitoring may not be feasible
- Centralized control may lead to problems in case of a network partition
- Typical server applications
  - May crash due to software bugs
  - Retain little state
  - A simple restart is thus sufficient

Motivation (continued)
- What if peers monitored each other's health?
  - If a crash is detected, try to restart the application
  - No central monitoring station involved
- Loosely based on a worm
  - Resilient to sporadic failures
  - Spreads to uninfected nodes
  - But no backdoor involved
  - May not always shift to new nodes

Medusa
- All nodes are part of a multicast group
  - Each node is thus in touch with all other nodes through Heartbeat messages
  - Nodes send regular updates to the multicast tree
- All communication goes through reliable multicast
- In case a node goes down
  - Other nodes try to restart it
- Request for service is sent to the multicast group

Medusa Problems
- Scalability
  - Assumptions of reliable packet delivery
  - State information shared with all nodes
- Reliable multicast
  - Assumes reliable delivery of packets to all nodes
  - No explicit ACKs
- The kill operations fail in case of a temporary break in the multicast tree
- Security
  - No way of authenticating packets

Proposed solution
- Nodes form peering relationships with only a subset of other nodes
  - Exchange Hello packets
  - Scalable as the degree is fixed
- No central control
- No dependence on reliable multicast
  - Distributed communication protocol
  - Explicit ACKs for packets
- Some super-nodes required to be up when booted
- Power of randomly connected graphs

Design
- Each node continually sends Hello packets to its peer nodes
  - Indicates everything is up and working
- A timeout indicates something is wrong
  - Application crash
  - Network partition
- Aim at application crashes
  - Application should be stateless
  - No code transfer
  - Remotely restartable
  - SSH needed: a login account and distributed keys
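
The Hello/timeout rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `PeerMonitor` and its methods are hypothetical, timestamps are passed in explicitly so the logic stays easy to test, and the 22 s default is the Hello timeout reported on the Performance slide.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Tracks the last Hello seen from each peer (hypothetical helper)."""

    hello_timeout: float = 22.0  # seconds; value from the Performance slide
    last_hello: dict = field(default_factory=dict)

    def record_hello(self, peer: str, now: float) -> None:
        """Remember when a Hello from `peer` arrived."""
        self.last_hello[peer] = now

    def timed_out(self, now: float) -> list:
        """Return peers whose last Hello is older than the timeout."""
        return sorted(
            peer
            for peer, seen in self.last_hello.items()
            if now - seen > self.hello_timeout
        )
```

A peer returned by `timed_out` is only a candidate for a remote restart: the timeout by itself cannot tell an application crash from a network partition.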

Initialization
- 3-5 super-nodes form a fully connected graph
  - Are expected to be up all the time
- All nodes have information about the super-nodes' IPs
- Super-nodes may be under manual supervision
- Super-nodes may have information about the topology
- Super-nodes are responsible for forwarding join requests to other nodes

Remote start
- SSH to a remote node to restart
  - Remote (re)start attempted after Hello timeout
  - Current implementation requires keys to be distributed beforehand
- Starts a small watchdog program which returns immediately
- Checks if there is another copy already running
  - Current implementation uses ps
- If the application start fails, do nothing: wait for the next retry to restart
- Possible extension: allow the service to spread
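
A rough Python sketch of the two pieces named on this slide, the duplicate check via `ps` and the SSH-based restart, might look as follows. All function names, the watchdog path, and the choice of `ps -A -o comm=` are assumptions made for illustration; the slides only say that `ps` and SSH with pre-distributed keys are used.

```python
import os
import subprocess


def already_running(ps_output: str, app_name: str) -> bool:
    """Return True if any command name in the ps output matches app_name."""
    names = (os.path.basename(line.strip()) for line in ps_output.splitlines())
    return app_name in names


def list_commands() -> str:
    """Ask ps for the command name of every process, without a header line."""
    result = subprocess.run(
        ["ps", "-A", "-o", "comm="], capture_output=True, text=True, check=True
    )
    return result.stdout


def restart_command(user: str, host: str, watchdog_path: str) -> list:
    """Build the ssh argument list that would start the watchdog remotely."""
    # BatchMode=yes makes ssh fail instead of prompting when no key is usable.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", watchdog_path]
```

On the remote side a watchdog could call `already_running(list_commands(), app_name)` and exit without starting anything when it returns `True`. The monitoring peer would pass the result of `restart_command` to `subprocess.run` after a Hello timeout.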

New node comes up…
- Waits for others to contact it
- After timeout:
  - Send JoinRequest to a super-node with the number of peers needed
  - Super-node forwards this request to other nodes
- AddRequest
  - Some node may ask the new node to become its peer
  - Add to neighbourList and send AddACK
- Hello
  - Can add to neighbourList if unsolicited Hello received
  - Beneficial in case of short temporary failures
- After Request-timeout:
  - Contact another super-node with another JoinRequest
  - Timeout can be dynamically specified in JoinRequestACK
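
The new node's side of this exchange can be sketched as a small message handler. The class and method names are hypothetical, packets are reduced to a type string plus a sender, and the sketch leaves out the Request-timeout retry and the Die reply to surplus AddRequests described on the next slide.

```python
class NewNode:
    """Join-side message handling for a freshly started node (hypothetical)."""

    def __init__(self, peers_needed: int):
        self.peers_needed = peers_needed
        self.neighbour_list = set()

    def join_request(self) -> dict:
        """Payload for a JoinRequest: how many more peers this node wants."""
        missing = self.peers_needed - len(self.neighbour_list)
        return {"type": "JoinRequest", "peers": missing}

    def on_message(self, kind: str, sender: str):
        """Handle one incoming packet and return the reply type, if any."""
        if kind == "AddRequest":
            self.neighbour_list.add(sender)
            return "AddACK"
        if kind == "Hello":
            # An unsolicited Hello may also be accepted as a new peer.
            self.neighbour_list.add(sender)
        return None
```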

New node comes up… Random Walk
- Request forwarded by super-node to 3 random nodes on behalf of the new node
- Each node forwards it to others
  - Decrease hop count by 1 each time
- If hop count = 0, check if it can support more nodes
  - YES: send AddRequest to the new node; add to neighbourList on receiving AddACK
  - NO: ignore the request
- New node may already have found neighbours
  - Due to duplicate JoinRequest or repair of a network partition
  - New node thus replies to AddRequest with a Die packet
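
The random walk can be simulated on an in-memory graph. The sketch below assumes that each node forwards the request to exactly one randomly chosen neighbour (the slide only says "to others"), that every node has at least one neighbour, and that "can support more nodes" means the node's degree is below some `max_degree`. Function and parameter names are illustrative.

```python
import random


def random_walk(neighbours: dict, start, hop_count: int, rng: random.Random):
    """Forward the request hop by hop and return the node where it stops."""
    node = start
    while hop_count > 0:
        # Assumes every node has at least one neighbour to forward to.
        node = rng.choice(sorted(neighbours[node]))
        hop_count -= 1
    return node


def nodes_offering_peering(neighbours, super_node, hop_count, max_degree, rng, fanout=3):
    """Return the nodes that would send an AddRequest to the new node."""
    first_hops = sorted(neighbours[super_node])
    starts = rng.sample(first_hops, min(fanout, len(first_hops)))
    offers = set()
    for start in starts:
        end = random_walk(neighbours, start, hop_count, rng)
        if len(neighbours[end]) < max_degree:  # room for one more peer?
            offers.add(end)
    return sorted(offers)
```

Because several walks can end at the same node, or at nodes that are already full, the new node may receive fewer offers than it asked for. The Request-timeout retry covers that case.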

Shutdown
- Critical to ensure that all nodes go down
- 3-way protocol
  - Send kill to target node
  - Target node replies with die
  - Send dieACK to target node
- kill
  - Used when multiple copies detected
  - Possibly to balance load
- die
  - Reply to unsolicited Hello
- No perfect solution in case of a network partition
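
The three-way exchange can be written as a tiny state machine. The state names and the rule that the target exits only once dieACK arrives are assumptions; the slide lists the three packets but not the exact point at which the target stops.

```python
class Target:
    """Target side of the kill / die / dieACK exchange (hypothetical states)."""

    def __init__(self):
        self.state = "running"

    def on_message(self, msg: str):
        """Process one packet and return the reply packet, if any."""
        if msg == "kill" and self.state == "running":
            self.state = "dying"
            return "die"
        if msg == "dieACK" and self.state == "dying":
            self.state = "dead"  # assumed: exit only after the final ACK
        return None


def initiator_reply(msg: str):
    """Initiator side: acknowledge the target's die packet."""
    return "dieACK" if msg == "die" else None
```

If the dieACK is lost, the target in this sketch stays in the `dying` state. That is one way to see why the slide calls shutdown under a partition unsolved.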

Global Shutdown…
- Secret killAll packet
  - Sent by an external program for complete system shutdown
  - Forwarded to all neighbours
- Node does not die until it receives a killACK from everyone
  - Stops sending hellos immediately
  - No further restart attempts
  - Reply only to die, kill and killAll
  - May send unnecessary traffic
- Eventually time out on seeing zero neighbours
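
Forwarding killAll to all neighbours amounts to flooding the peer graph. The sketch below models only which nodes the packet reaches, using a breadth-first traversal over a hypothetical neighbour map; killACK bookkeeping and timeouts are left out.

```python
from collections import deque


def flood_kill_all(neighbours: dict, entry) -> set:
    """Return every node reached when killAll is injected at `entry`."""
    reached = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours.get(node, ())):
            if peer not in reached:  # forward only to nodes not yet reached
                reached.add(peer)
                queue.append(peer)
    return reached
```

Any node outside the entry point's connected component is never reached. That is how stray copies can survive a global shutdown during a network partition.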

Performance
- Tested on 6 nodes in GradLab
  - Hello interval: 5s
  - Hello timeout: 22s
  - Wait before JoinRequest: 10s
  - JoinRequest timeout: 20s
  - Hop count: 2
  - Initial degree request: 3
  - Super-nodes: 3
- Preliminary tests on PlanetLab
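
For reference, the test parameters can be collected in one configuration object. The class and field names are invented for this sketch; only the values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabConfig:
    """Parameter values from the GradLab test (field names are hypothetical)."""

    hello_interval_s: int = 5
    hello_timeout_s: int = 22
    wait_before_join_s: int = 10
    join_request_timeout_s: int = 20
    hop_count: int = 2
    initial_degree: int = 3
    super_nodes: int = 3

    def hellos_per_timeout(self) -> int:
        """Whole Hello intervals that fit inside one Hello timeout."""
        return self.hello_timeout_s // self.hello_interval_s
```

With these values four Hello intervals (20 s) fit inside the 22 s timeout, so a peer sending on schedule is declared down only after roughly four consecutive Hellos go missing.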

Results
- LAN
  - No timeouts or packet losses observed
  - No duplicate copies
  - killAll works perfectly
  - Re-start latency: 22s (decreases after a number of restarts)
  - Join latency: 15s
- PlanetLab
  - Re-start latency: 27s
  - Join latency: 21s

Limitations
- Security
  - The packets are not authenticated
- Stray copies
  - After a killAll there may be stray copies
  - Harmless as they do not try to spread
  - But: prevents another copy from running
- No new nodes
  - Node discovery
  - Why should they be idle in the first place?
  - What to do when the original nodes come back up?
  - Solution
    - Send regular updates to super-nodes
    - Extra servers can be killed easily

Parameter tweaking
- Hop count for random walk
- Connectivity
  - Min-degree to ensure connectivity
  - Max-degree to spread the failure probability
- Timeouts
  - Request timeout
    - Depends on hop count
  - Hello timeout
    - Different for WAN and LAN
  - Global timeout
    - In case of network partition
    - Loss of killACK packets
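
One way to explore the min-degree question offline is to generate random peer graphs and test whether they are connected. The sketch below is an assumption-laden experiment, not part of the presented system: each node picks `min_degree` random peers, links are treated as symmetric, and no max-degree cap is enforced.

```python
import random
from collections import deque


def random_peer_graph(nodes: list, min_degree: int, rng: random.Random) -> dict:
    """Give every node at least `min_degree` randomly chosen peers."""
    neighbours = {n: set() for n in nodes}
    for n in nodes:
        others = [m for m in nodes if m != n]
        for m in rng.sample(others, min(min_degree, len(others))):
            neighbours[n].add(m)
            neighbours[m].add(n)  # peering is symmetric in this sketch
    return neighbours


def is_connected(neighbours: dict) -> bool:
    """Breadth-first search from an arbitrary node must reach every node."""
    if not neighbours:
        return True
    start = next(iter(neighbours))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in neighbours[node]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return len(seen) == len(neighbours)
```

Repeating this for many seeds and node counts gives an empirical feel for how small the degree can be before disconnected graphs start to appear.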

Conclusion
- Maintaining high availability does not always require central control
- Achieving a global shutdown is problematic
- Need to explore connectivity requirements to ensure a connected graph at all times

Thank You!